Stream: implementers
Topic: Should FHIR Search return a distinct set of resources?
Alexander Kiel (Sep 01 2021 at 21:10):
My question is in the title. Are all FHIR Search requests supposed to not return a resource twice? IMHO it would be common sense to require this behaviour. But I didn't find anything in the spec regarding this requirement.
Paul Church (Sep 01 2021 at 21:13):
It's enforced by the rules for bundles: http://hl7.org/fhir/bundle.html#bundle-unique
Lloyd McKenzie (Sep 01 2021 at 21:21):
So, within a page, they'll generally be unique. Across pages they might not be.
Paul Church (Sep 01 2021 at 21:23):
They're definitely not unique across pages for includes. I'm not sure if I can point to anything other than common sense that says the primary results shouldn't be duplicated across pages.
Lloyd McKenzie (Sep 01 2021 at 23:09):
I don't know that we guarantee uniqueness even for the base resource across pages. Servers SHOULD ensure uniqueness, but some paging mechanisms could potentially result in the same resource popping up more than once due to changes in the resource set. (Those same servers would also risk missing resources entirely.)
Alexander Kiel (Sep 02 2021 at 07:41):
@Paul Church Thanks for the pointer to the Bundle.
In my implementation, I have no problem providing a stable set (snapshot) of the database even through very long paginations. Currently, however, with OR searches it is possible that a resource appears multiple times across pages if it matches more than one OR case. So I somehow have to make the returned resources unique, and I'd like to do this in under O(n), where n is the total number of results. Currently each page costs O(k), where k is the number of results on one page.
Daniel Venton (Sep 02 2021 at 12:29):
I think the goal is that the final merged bundle (all pages combined into a single bundle) would have a distinct set of resources. (no duplicates)
Unfortunately due to the fact that we live in an ever changing world this will remain a goal.
If the server isn't doing snapshots, it's entirely possible that the last row of page n becomes the first row of page n+1 in the interval between the requests for pages n and n+1.
That also means that the new resource that pushed that resource to the next page, won't be seen at all by the requester.
I'd say that this is ok, as the situation is unlikely to occur during treatment, instead only during research where the study has 999,999 items instead of 1,000,000.
Any consumer will have to take into account that they might get the same row twice, when spanning pages and update their store appropriately.
If the server is doing snapshots, then the chances are you won't get duplicates, but you'll still miss that new data.
@Alexander Kiel If I do an OR query against your server where A=A or B=B, and there is a row that satisfies both conditions, you'll return the row twice? It seems like your snapshot would reduce that down to a single row.
Alexander Kiel (Sep 02 2021 at 12:38):
@Daniel Venton Yes, if I have an OR and a resource satisfies both criteria, it will be returned twice, because I'll find it as a match of both criteria. That is the case because I don't simply scan through all resources and evaluate the OR; instead, I use an index for each criterion and iterate over each index. I have to do an incremental distinct join somehow. So it has nothing to do with my snapshots, just with how the OR is evaluated.
Josh Mandel (Sep 02 2021 at 14:21):
The total work required can't be less than O(number of results), right? At the end of the day you're serializing all of these resources and sending them out over the wire. Are you concerned about how O(n) work is amortized across paginated requests, or mainly about the constant factor associated with the O(n) work?
Lloyd McKenzie (Sep 02 2021 at 16:00):
There are different ways to handle paging. Ideally, when you start a query, a snapshot is taken of the entire result set and subsequent paging happens against that snapshot. "includes" will absolutely be duplicated across pages - they must be because there's no presumption that a simple client accessing a new page has any access to information received on previous pages. Also, there's no guarantee that pages will be accessed in order. Some one might jump from first page to last page and then navigate backward - and then maybe jump back to the beginning and page forward.
Alexander Kiel (Sep 02 2021 at 16:06):
Josh Mandel said:
Are you concerned about how O(n) work is amortized across paginated requests, [...]
Yes, I'd like to be able to return the first page without O(n) work, and I'd like to keep the cost of returning a page constant. Most implementations either need O(n) work for the first page and have to store the result somehow, or use an offset into a stream of resources, which makes the first page cheap but the last page cost O(n) because you have to skip over all the unneeded pages.
My implementation uses a stream of resources that comes from an index of the first search param. For the first page, I take _count + 1 resources from that stream. In addition to the search value, I can also use a resource ID to seek into that index. So for the next link, in addition to the original search params, I store the resource ID of the first resource of that next page. If I have to deliver that next page, I can seek into the index as described. So every page has equal cost.
For OR searches, I have multiple streams that I concatenate. The problem with this approach is that I have to concatenate the streams in a way that yields one big stream of distinct resources.
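The seek-based paging Alexander describes (often called keyset or cursor pagination) can be sketched roughly like this. This is a minimal illustration, not his actual implementation: the in-memory sorted list stands in for the database's search-param index, and all names are hypothetical.

```python
import bisect

# Hypothetical search-param index: (search_value, resource_id) pairs kept
# in sorted order, standing in for the database index Alexander iterates.
INDEX = sorted([
    ("male", "p1"), ("male", "p3"), ("male", "p5"),
    ("male", "p7"), ("male", "p9"),
])

def page(search_value, count, start_id=None):
    """Return one page of resource ids plus the cursor for the next page.

    start_id is the resource id carried in the (opaque) next link; seeking
    directly to (search_value, start_id) makes every page cost O(count)
    rather than O(offset), so all pages are equally cheap.
    """
    key = (search_value, start_id) if start_id else (search_value, "")
    i = bisect.bisect_left(INDEX, key)
    # Take count + 1 entries: the extra one tells us where the next page starts.
    matches = [rid for val, rid in INDEX[i:i + count + 1] if val == search_value]
    next_id = matches[count] if len(matches) > count else None
    return matches[:count], next_id

ids, cursor = page("male", 2)            # -> (["p1", "p3"], "p5")
ids2, cursor2 = page("male", 2, cursor)  # -> (["p5", "p7"], "p9")
```

The key design point is that the next link stores a position in the index (search value plus resource id) rather than a numeric offset, so the server can seek straight to the start of any page.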
Alexander Kiel (Sep 02 2021 at 16:14):
Lloyd McKenzie said:
There are different ways to handle paging. Ideally, when you start a query, a snapshot is taken of the entire result set and subsequent paging happens against that snapshot.
As my database engine provides snapshots after every transaction(write) automatically to the upper application layer, I don't need to materialize query results. I can just obtain a stable snapshot of the database using a simple integer identifier, the transaction number. Also the snapshots are queryable. So I can run every query on every snapshot, subject to garbage collection after some time.
"includes" will absolutely be duplicated across pages - they must be because there's no presumption that a simple client accessing a new page has any access to information received on previous pages.
That is clear.
Also, there's no guarantee that pages will be accessed in order. Some one might jump from first page to last page and then navigate backward - and then maybe jump back to the beginning and page forward.
Because I don't use simple integer offsets, and links should be opaque anyway, clients can't jump to a page until they have obtained its link. I also provide only next links, no prev and no last link. For my use cases the next link is sufficient, but that may change in the future.
Josh Mandel (Sep 02 2021 at 16:34):
Thanks for the details here -- this is super clear and helpful. I don't think you can do <O(n) work to generate the first page and have constant work per page, but you can at least defer the set-difference calculations until they're actually required. Say you have multiple streams of resources (A, B, C) resulting from "OR" queries, so each stream is a proper set, but the streams may overlap with each other. You can create a table R of returned results that starts off empty and gets resources added to it only when you actually return a page of results. (R here is like a snapshot in that it accumulates state and grows with the size of your result set :( ... but it grows incrementally, so work is deferred :))

- Draw from A until A is consumed, accumulating into R and skipping items already in R (n.b. this won't actually skip any items yet, since items in A are unique)
- Draw from B until B is consumed, accumulating into R and skipping items already in R (n.b. this takes more than constant work because you can't be sure how much "skipping" you'll need to do, but the work is at least proportional to your returned page size)
- Draw from C as for B

I don't see how to do better than this while still giving clients a clean set of results. Of course, now you have snapshot management challenges (state management! what if they're big? how long do you keep them alive / when do you evict and "kill" an open search?)
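Josh's draw-and-skip scheme can be sketched as a lazy generator. This is only an illustrative model of the idea: the table R is a plain in-memory set here, whereas a real server would persist it between paginated requests.

```python
from itertools import chain, islice

def distinct_pages(streams, page_size):
    """Lazily yield pages of unique items from overlapping streams.

    `streams` are iterables (A, B, C, ...) that are each internally
    duplicate-free but may overlap with one another. `returned` plays the
    role of the table R: it grows only as pages are actually produced,
    so the deduplication work is deferred across page requests.
    """
    returned = set()  # the table R of already-returned items
    merged = chain.from_iterable(streams)

    def unique():
        for item in merged:
            if item not in returned:  # skip items already in R
                returned.add(item)
                yield item

    it = unique()
    while True:
        page = list(islice(it, page_size))
        if not page:
            return
        yield page

# Streams from three OR criteria; "p2" and "p4" each match two criteria.
A = ["p1", "p2", "p3"]
B = ["p2", "p4"]
C = ["p4", "p5"]
pages = list(distinct_pages([A, B, C], 2))
# pages == [["p1", "p2"], ["p3", "p4"], ["p5"]]
```

As the thread notes, the catch is that R itself is per-search state that must be kept alive (and eventually evicted) between page requests, which is exactly the state-management challenge discussed below.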
Alexander Kiel (Sep 03 2021 at 10:57):
Thanks @Josh Mandel. That's exactly my current solution. The state management part of it is the biggest challenge, especially as I have to sync that state between the different nodes in my cluster.
Last updated: Apr 12 2022 at 19:14 UTC