Stream: bulk data
Topic: Export of documents
Veliyan Georgiev (Oct 10 2019 at 12:40):
Hello everybody. We have the following use case: a client wants to export a large number of documents (10K+) from our system. The client already knows the IDs of these documents. The client wants the metadata and the actual content of the documents.
We want to deliver this functionality using $export, but we have 2 outstanding items:
- How do we use the 10K+ IDs as part of the $export command? (We started looking at Group, but Group doesn't allow DocumentReference as a member yet.)
- How do we return the content of the documents in the ndjson?
Michele Mottini (Oct 10 2019 at 13:29):
For (1), create a Group with the patients those documents belong to and then use that for the export.
For (2), I remember some discussion but no conclusions - but the document content can be inside the documents themselves.
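Something like this, for (1) (a sketch only; [base], the ids and the Patient references are all placeholders):

    (create the Group of patients)
    POST [base]/Group
    Content-Type: application/fhir+json

    {
      "resourceType": "Group",
      "type": "person",
      "actual": true,
      "member": [
        { "entity": { "reference": "Patient/123" } },
        { "entity": { "reference": "Patient/456" } }
      ]
    }

    (kick off the export for that Group)
    GET [base]/Group/[group-id]/$export?_type=DocumentReference
    Accept: application/fhir+json
    Prefer: respond-async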
Veliyan Georgiev (Oct 10 2019 at 13:32):
Thanks Michele.
1) The client actually doesn't want all documents for a patient. The client has the actual IDs of the documents of interest.
2) I'd love to dig deeper into it. I also remember conversations but no actual conclusion.
Josh Mandel (Oct 11 2019 at 01:45):
Interesting. The export operation as currently defined doesn't have a good way to enumerate large numbers of individual resources, but the general pattern of async search might help for (1): outside of $export, in theory you could do a search for DocumentReference?_id=1,2,3,500 (or, inside of $export, you could try the experimental _typeFilter parameter to pass through this kind of filter). Probably 10,000 individual IDs will break a single URL query, though.
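For example, the _typeFilter variant might look like this (a sketch; _typeFilter was experimental at the time, its value is a URL-encoded search query, and the id list is truncated to three for illustration):

    GET [base]/Group/[group-id]/$export?_type=DocumentReference&_typeFilter=DocumentReference%3F_id%3D1%2C2%2C3
    Accept: application/fhir+json
    Prefer: respond-async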
For (2), we mostly talked about using FHIR Binary resources and returning them via the standard async output mechanism, or just having the client issue a pile of parallel queries for the content if it's not modeled as Binary.
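A single ndjson output line carrying content as a Binary might look like this (a sketch; the base64 payload is truncated):

    {"resourceType":"Binary","id":"doc-1","contentType":"application/pdf","data":"JVBERi0xLjQK..."}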
Josh Mandel (Oct 11 2019 at 01:46):
(Clearly the use case you're describing isn't quite the thing $export is designed for, and it's a struggle to make it fit.)
Isaac Vetter (Oct 11 2019 at 01:57):
Is this part of a conversion? What's the use case? How does the client already know the document IDs?
This info may help to generalize the need.
Veliyan Georgiev (Oct 11 2019 at 12:26):
Josh - we thought of something similar and came to the same conclusion: GET will most likely break. Obviously, making $export a POST request will get around the limitation and, to some extent, even represent the change in state more accurately (an export command is being created on the server). The subsequent status checks can be GETs, of course.
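Something like this (a sketch only - the Bulk Data spec defined a GET kick-off at the time, though the general FHIR operations framework does allow invoking an operation via POST with a Parameters body):

    POST [base]/Group/[group-id]/$export
    Content-Type: application/fhir+json
    Prefer: respond-async

    {
      "resourceType": "Parameters",
      "parameter": [
        { "name": "_type", "valueString": "DocumentReference" },
        { "name": "_typeFilter", "valueString": "DocumentReference?_id=1,2,3" }
      ]
    }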
On (2) - I think it is worth brainstorming on this one a bit more, and even talking F2F next month. We proposed a very similar workaround to the dev team and committed to starting this conversation here. A couple of points that were brought up: if we fast-forward to the future, how would an $import work using the output from $export? And a 2-step process of getting the metadata first and then the binaries might end up making the job of resolving references much harder, if not impossible.
Isaac - this is part of a payer workflow in which the "payer" selects only those documents that need to be presented as evidence (so it is not all documents per patient). The closest example that comes to mind is a shopping-cart-style workflow: they browse the store (the document repository), select the products (documents), and eventually those products show up in a package (the FHIR export).
Josh Mandel (Oct 11 2019 at 14:16):
But for a shopping cart, the numbers just aren't that large. Is there something about your process for selecting documents that causes you to go from having 0 documents of interest to having huge numbers all at once?
John Moehrke (Oct 11 2019 at 18:00):
For a folder of documents there are DocumentManifest and List. Both of these are used in interfacing with XDS/XCA/XDM/XDR environments. A DocumentManifest is the same as an XD* SubmissionSet, and a List is the same as an XD* Folder.
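A minimal DocumentManifest sketch (ids are placeholders):

    {
      "resourceType": "DocumentManifest",
      "status": "current",
      "content": [
        { "reference": "DocumentReference/doc-1" },
        { "reference": "DocumentReference/doc-2" }
      ]
    }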
Veliyan Georgiev (Oct 14 2019 at 12:50):
Josh - yeah, there is a particular workflow. All documents are run through NLP, and the extracted information is what the user searches on. Once the user searches the NLP extract - say, "give me all documents between these 2 dates that are radiology reports and have mention of nodules, etc." (which could be tens of thousands) - then the user will request an export of the underlying documents. The only thing the user has at this time is pointers to the original documents (those are also stored in the NLP metadata, of course).
Paul Church (Oct 15 2019 at 15:38):
As a data point on something similar, the Google Cloud DICOM store has a filtered export operation (in alpha) that takes a path to a cloud storage bucket containing a flat file with a list of studies/series/instances to export. This may not be easily standardized, but it's indicative of what we had to do to facilitate certain use cases.
In any kind of cohort selection process the user will often come up with a very large "shopping list" of items to get from the server. Even if the items are patient bundles and Group $export applies, FHIR needs a scalable pattern for communicating that group to the server. The discussion of paginated Group/List with $add/$remove operations is enough to make it possible, but doing 100,000 $add operations to build a temporary Group as a parameter to a one-off export is inefficient.
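To make that concrete, a hypothetical $add call might look like the sketch below - no such operation was standardized at the time, and the "member" parameter name is invented for illustration - and building a 100,000-member Group this way means repeating it over and over:

    (the "member" parameter name here is hypothetical)
    POST [base]/Group/cohort-1/$add
    Content-Type: application/fhir+json

    {
      "resourceType": "Parameters",
      "parameter": [
        { "name": "member", "valueReference": { "reference": "Patient/123" } },
        { "name": "member", "valueReference": { "reference": "Patient/456" } }
      ]
    }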
Veliyan Georgiev (Oct 15 2019 at 18:32):
@Paul Church I was just about to type up today that we have the exact same situation with our DICOM repo as well. Very interesting to see how Google is addressing it
natus (Oct 19 2019 at 19:30):
@Veliyan Georgiev I would simply use the List resource:
- create a List with the 10k documents
- give the List id to your client
- the client uses the _list parameter over the Composition resource
In this scenario, there is no transfer of the 10k document ids.
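A sketch of that flow ([base] and ids are placeholders; DocumentReference is used as the search target here, since that is the resource holding the documents in this use case):

    (create the List of documents)
    POST [base]/List
    Content-Type: application/fhir+json

    {
      "resourceType": "List",
      "status": "current",
      "mode": "working",
      "entry": [
        { "item": { "reference": "DocumentReference/doc-1" } },
        { "item": { "reference": "DocumentReference/doc-2" } }
      ]
    }

    (the client then fetches everything on the List by id)
    GET [base]/DocumentReference?_list=[list-id]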