Stream: implementers
Topic: Modifiers for text searching on Tokens
Grahame Grieve (Jun 21 2021 at 20:15):
There's an inconsistency in the specification around the use of the :text modifier on token searches.
The specification says that the :text modifier "the match does a partial searches on the text portion of a CodeableConcept or the display portion of a Coding"
But it also says: "the search functions as a normal string search"
Those are not the same thing, since a normal string search is left-match only. (They were the same once, so whatever we do counts as at least a technical correction.) And we don't want left-match-only search in this case.
But what do we want in this case? That's not so obvious, and we didn't quickly come to resolution on the FHIR-I call where we just discussed it. Partly that's because we're not clear on the use case boundaries. For instance, I've only seen the :text modifier used where a human is driving the search interactively - that matches the use of the :text modifier on a normal string search: we've never said exactly how this works, and we say explicitly that we don't specify it, on the grounds that the use case is some human searching speculatively and adapting to the behavior of the system. But is that what we expect for this field?
Given that, we'd like some feedback, in two different ways:
- if you have implemented :text search on a token search in your server, how did you do it, and why?
- what would you like it to do, and why?
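For reference, a minimal sketch of the "left match only" behavior of a normal string search that the message above contrasts with partial matching. The function name `left_match` is hypothetical, and the sketch assumes case-insensitive comparison only (no accent folding).

```python
def left_match(query: str, text: str) -> bool:
    """Case-insensitive starts-with, i.e. a left-match-only string search."""
    return text.casefold().startswith(query.casefold())


# A left match finds "body" at the start of the display text,
# but never finds "weight", because it is not at the start.
print(left_match("body", "Body weight"))    # True
print(left_match("weight", "Body weight"))  # False
```

Under this behavior a search for "weight" would not find a Coding whose display is "Body weight", which is why left-match-only search is not wanted here.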
Daniel Venton (Jun 22 2021 at 17:36):
As a user I would have no idea whether the keyword I was looking for is the first word, the last word, or in some other position in any coding system. (What's a coding system?) My expectation would be to get back everything that might be what I'm looking for. For example, "Diabetes" should match Code.coding = "mellitus, Diabetes, Type 2" or "Diabetes, type", or .text = "Suspect Diabetes II".
In FHIR words it would be essentially :contains. Case independent, whitespace squashed, special characters removed, contains.
If I'm not a layperson but an actual clinical organization, then it's likely I'd be searching by the code(s).
When it comes time for my requirements around implementing :text, that is what I'll be specifying.
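A minimal sketch of the matching described above (case independent, whitespace squashed, special characters removed, contains). The names `normalize` and `contains_match` are hypothetical, and replacing special characters with a space rather than deleting them outright is an assumption made so that "mellitus,Diabetes" does not fuse into one word.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace special characters with spaces, squash whitespace."""
    lowered = text.lower()
    # Assumption: anything that is not a word character or whitespace
    # counts as a "special character" and becomes a space.
    cleaned = re.sub(r"[^\w\s]", " ", lowered)
    # split() with no arguments collapses any run of whitespace.
    return " ".join(cleaned.split())


def contains_match(query: str, candidate: str) -> bool:
    """True if the normalized query occurs anywhere in the normalized candidate."""
    return normalize(query) in normalize(candidate)


for display in ["mellitus, Diabetes, Type 2", "Diabetes, type", "Suspect Diabetes II"]:
    print(display, "->", contains_match("Diabetes", display))
```

All three example strings from the message match a search for "Diabetes" under these assumptions.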
Yunwei Wang (Jun 22 2021 at 18:50):
:contains is substring search. When searching for "eight", a word-based search returns results like "every eight hour". A :contains substring search returns "body weight" as well.
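The difference can be sketched as follows. Both function names are hypothetical, and "word" is assumed to mean a run of word characters as matched by `\w+`.

```python
import re


def substring_match(query: str, text: str) -> bool:
    """:contains-style matching: the query may start anywhere, even mid-word."""
    return query.lower() in text.lower()


def whole_word_match(query: str, text: str) -> bool:
    """Word-based matching: the query must equal one complete word of the text."""
    words = re.findall(r"\w+", text.lower())
    return query.lower() in words


for text in ["every eight hour", "body weight"]:
    print(text, substring_match("eight", text), whole_word_match("eight", text))
```

Here "every eight hour" matches under both functions, while "body weight" matches only under `substring_match`.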
Yunwei Wang (Jun 22 2021 at 18:50):
@Paul Lynch
Daniel Venton (Jun 22 2021 at 20:16):
Yes, because if I'm a (non-clinical) user searching for my "weight" observations, I likely have no way to know that the actual dictionary definition of the observation is "body weight". Returning 0 records is dirty: you didn't specify the right magic word. There are no observations that "start with" weight. Which is why I say a token param with a text modifier like "code:text" does the same operation that a StringParam:contains does.
Paul Lynch (Jun 23 2021 at 18:32):
:contains is definitely an improvement over left-matching, but it has the drawback that it finds the middle of words, as @Yunwei Wang pointed out. What users would want is matching on the start of words, not the middle of words. So, (to borrow Yunwei's example) "weight" should match "body weight" but "eight" should not match "body weight". However, if :contains behavior were the best that could be done for code:text, I would take it.
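A minimal sketch of matching at the start of words, assuming a regular-expression word boundary (`\b`) is an acceptable definition of "start of a word". The function name `word_prefix_match` is hypothetical.

```python
import re


def word_prefix_match(query: str, text: str) -> bool:
    """True if some word in the text begins with the query (case-insensitive)."""
    # \b anchors the match at a word boundary; re.escape keeps user input literal.
    pattern = r"\b" + re.escape(query)
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(word_prefix_match("weight", "body weight"))  # True
print(word_prefix_match("eight", "body weight"))   # False
```

Because the query only needs to be a prefix of a word, this also works for partial input while a user is still typing.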
Grahame Grieve (Jun 23 2021 at 20:19):
why would you not want matching on the middle of words? Prefixing is a common practice...
Paul Lynch (Jun 23 2021 at 21:16):
If you are trying to do autocompletion, and the user types "h", they are expecting to see words beginning with "h", not containing "h".
Yunwei Wang (Jun 24 2021 at 13:54):
A use case I have is that when searching for "cancer lung", the results include "lung cancer", "cancer of left lung", and "cancer of right lung". So the search is based on "any concept that has both the word lung and the word cancer, ignoring the order".
Yunwei Wang (Jun 24 2021 at 13:56):
Simple substring :contains could not achieve that.
Paul Church (Jun 24 2021 at 18:24):
Generally, the Google implementation uses three kinds of search - exact, substring, and word-based (i.e. matches entire words, in any order, occurring anywhere in the string, like....google search). Prefix search is pretty uncommon.
Our token:text is word-based. This covers most of the cases people have mentioned, without running into the substring matching problems of "eight" matching "weight" which seems unhelpful. To be really fancy you'd do stemming and synonyms on top of that, so "weight" still matches "weights" (we're not doing that currently).
This is what we think it should do. For that matter, it would be nice if default string search did the same but I suppose that ship has sailed.
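A minimal sketch of word-based matching as described above: entire words, in any order, occurring anywhere in the string. This is an illustration only and not Google's implementation; the names `tokenize` and `all_words_match` are hypothetical, and no stemming or synonym handling is attempted.

```python
import re


def tokenize(text: str) -> set:
    """Split text into a set of lowercase words."""
    return set(re.findall(r"\w+", text.lower()))


def all_words_match(query: str, text: str) -> bool:
    """True if every word of the query appears as a whole word in the text."""
    query_words = tokenize(query)
    # An empty query matches nothing; otherwise require a subset relation.
    return bool(query_words) and query_words <= tokenize(text)


for text in ["lung cancer", "cancer of left lung", "cancer of right lung"]:
    print(text, all_words_match("cancer lung", text))
```

Under these assumptions "cancer lung" matches all three example strings, "eight" does not match "body weight", and "weight" does not match "weights" because there is no stemming.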
Daniel Venton (Jun 24 2021 at 19:43):
If your parameter is specified as a TokenOrList then would you not query by:
token:text=lung,cancer
check the text of the token for lung or cancer? And if your :text modifier executes as a string:contains....
Maybe a TokenAndListParam
token:text=lung&token:text=cancer
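The two query shapes above differ in FHIR search semantics: comma-separated values within one parameter are ORed, while repeated parameters are ANDed. A minimal sketch, assuming each value is tested with a case-insensitive contains; the function name `matches` is hypothetical, and escaped commas (`\,`) are not handled.

```python
from urllib.parse import parse_qs


def matches(query_string: str, param: str, text: str) -> bool:
    """Apply OR within comma-separated values and AND across repeated parameters."""
    # parse_qs keeps repeated parameters as a list: one entry per AND group.
    # A missing parameter yields no groups, i.e. no constraint (always True).
    and_groups = parse_qs(query_string).get(param, [])
    haystack = text.lower()
    return all(
        any(value.strip().lower() in haystack for value in group.split(","))
        for group in and_groups
    )


print(matches("token:text=lung,cancer", "token:text", "lung disease"))             # True
print(matches("token:text=lung&token:text=cancer", "token:text", "lung disease"))  # False
```

With the repeated-parameter form, only texts containing both "lung" and "cancer" match, regardless of word order.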
Yunwei Wang (Jun 24 2021 at 21:11):
What do you mean "TokenOrList"?
Paul Lynch (Aug 17 2021 at 18:58):
Paul Church said:
Our token:text is word-based.
I just wanted to note here that Google's "word-based" searching is not the same as word-boundary matching. In our experience with Google's FHIR implementation, searching on code:text only matches on complete words, and won't match prefixes (unless maybe we're missing some configuration option). To be able to match as the user types, to provide a continually updated list of suggestions, prefix matching at word boundaries is needed. For example, "w" should match "body weight" (but "e" should not match it).
I agree the specification's description of :text is ambiguous enough to allow for Google's interpretation, though.
nicola (RIO/SS) (Aug 18 2021 at 08:15):
Aidbox implemented it as ilike (contains)
Alexander Kiel (Aug 18 2021 at 16:55):
I would expect a full-text search here. There are many full-text search algorithms, such as trigram-based ones or distance-based ones like Levenshtein.
nicola (RIO/SS) (Aug 19 2021 at 12:48):
trigram is ilike optimization :)
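For readers unfamiliar with trigrams, a minimal sketch of trigram similarity. The padding (two spaces before, one after) is loosely modeled on what PostgreSQL's pg_trgm does per word, but here it is applied to the whole string as a simplification; the function names are hypothetical and the score is a plain Jaccard ratio. The remark above presumably refers to trigram indexes being used to speed up ilike queries, which this sketch does not show.

```python
def trigrams(text: str) -> set:
    """Return the set of 3-character windows over the padded, lowercased text."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# A misspelling still shares some trigrams with the intended word.
print(similarity("weight", "wieght") > 0)  # True
print(similarity("weight", "xyz"))         # 0.0
```

Unlike substring or word matching, a similarity score tolerates typos, at the cost of needing a threshold to decide what counts as a match.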
Last updated: Apr 12 2022 at 19:14 UTC