FHIR Chat · Ideas on $import · bulk data

Stream: bulk data

Topic: Ideas on $import


Paul Church (Mar 18 2019 at 22:05):

Some lessons learned from implementing an import operation in Google Cloud.
- We're loading data from cloud storage (GCS). Not much opinion on how to upload large data in general; cloud has already addressed that for us.
- Flexibility in input is good. We take wildcards. Limiting $import to accept only the file structure of $export would be undesirable; datasets come from many sources.
- $import should be an async operation with massive scalability: think gigabytes to terabytes of resources.
- Error handling: it's very helpful to have an out of band error channel - we take a user-specified GCS location where the server will write errors. If the client uploads 1M resources and 5K of them are bad, the server needs a scalable way to communicate those errors.
- We allow partial success and will import all the good resources. On the other hand, if the client uploads 1M resources and all of them are bad, maybe a fail-fast option?
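The partial-success behavior in the last two points can be sketched roughly as follows. This is an editor's illustration, not Google's implementation: the function name `import_ndjson_lines`, the in-memory `error_sink` list (standing in for a cloud storage error location), and the `fail_fast_threshold` parameter are all assumptions.

```python
import json

def import_ndjson_lines(lines, error_sink, fail_fast_threshold=1.0):
    """Import valid NDJSON lines; write one error record per bad line
    to error_sink (a stand-in for a user-specified storage location).
    If the running fraction of bad lines exceeds fail_fast_threshold,
    abort the whole import instead of continuing."""
    imported = []
    errors = 0
    for lineno, line in enumerate(lines, start=1):
        try:
            resource = json.loads(line)
            if "resourceType" not in resource:
                raise ValueError("missing resourceType")
            imported.append(resource)  # stand-in for the actual write
        except ValueError as exc:
            errors += 1
            # OperationOutcome-style error record, one per bad line,
            # kept out of band so 5K errors don't bloat the response
            error_sink.append({"line": lineno, "error": str(exc)})
        if errors / lineno > fail_fast_threshold:
            raise RuntimeError("too many errors; aborting import")
    return imported
```

With the default threshold the import always runs to completion (partial success); a caller who wants fail-fast behavior can set the threshold near zero.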

Paul Church (Mar 18 2019 at 22:05):

We have targeted a limited use case to start:
- empty FHIR store + many resources on disk
- non-empty FHIR store, but loading a chunk of self-contained resources like a terminology or profile

For this case, the import semantics are that every resource must have an ID (every operation is a PUT), referential integrity is ignored, and existing resources are overwritten without creating a _history entry.

We support two modes: ndjson where every line is a resource, and ndjson where every line is a bundle. In the bundle case, bundles are all treated as 'collection' type with no semantics (except for history bundles, handled separately). This allows importing the contents of searchset bundles, for example the output from Patient/$everything.

History bundles can be imported if the store config allows; they will populate the corresponding history as if it had occurred in the past. This is blocked from overwriting or adding to an existing history as the result might not make sense.
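The two NDJSON modes above can be sketched as a small dispatcher. This is an assumed shape, not the actual Google Cloud code; history bundles (handled separately per the post) are out of scope here.

```python
import json

def resources_from_line(line, mode):
    """Extract importable resources from one NDJSON line.
    mode "resource": the line is a single resource.
    mode "bundle": the line is a Bundle, treated as a 'collection'
    regardless of its declared type, so searchset output (e.g. from
    Patient/$everything) imports cleanly."""
    parsed = json.loads(line)
    if mode == "resource":
        return [parsed]
    if mode == "bundle":
        # Ignore request/search semantics; just take the entry resources.
        return [e["resource"] for e in parsed.get("entry", [])]
    raise ValueError(f"unknown mode: {mode}")
```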

Paul Church (Mar 18 2019 at 22:06):

Referential integrity, proper change history on overwrite, and batch/transaction semantics are all possible but likely to have a high tradeoff vs. throughput. Most of our users have wanted to "database restore" extremely large amounts of data so we have optimized for that.

I'm quite curious how people want to define the converse of the $export operation - the user journey is not clear to me. Suppose I call $export on several FHIR servers to accumulate data for a set of patients, can I call $import to get all of it into one place without additional client processing? If the data has several copies of Practitioner/123, how does the server decide whether those are actually the same resource and which is the most current?
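One possible answer to the Practitioner/123 question, offered only as an illustration: if the copies carry `meta.lastUpdated` timestamps, a server could treat same-type, same-ID copies as the same resource and keep the most recently updated one. This relies on ISO 8601 timestamps in a uniform timezone sorting lexicographically.

```python
def pick_latest(copies):
    """Given several copies of the same logical resource (same type
    and id) from different $export sources, keep the copy with the
    latest meta.lastUpdated. Copies without a timestamp sort first."""
    def last_updated(resource):
        return resource.get("meta", {}).get("lastUpdated", "")
    return max(copies, key=last_updated)
```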

Josh Mandel (Mar 18 2019 at 22:08):

These are great questions; look forward to digging in (and will propose discussion as part of a connectathon track too).

Josh Mandel (Mar 18 2019 at 22:09):

For resource IDs, is the current expectation that you'll only import resources whose IDs are valid within the target system? (Some servers impose limits on the ID space, like HAPI by default.)

Paul Church (Mar 18 2019 at 22:27):

Yes, that is the current expectation. It's not ideal.

Server-assigned IDs are conceptually not too bad to implement: just walk the input one additional time in advance, assign new IDs to everything, and keep track of how to rewrite references. Parallelizing that to handle very large input is another layer of complexity.

There is a tension between having the client write out the desired semantics explicitly (e.g. by giving batch bundles with the verbs they want) vs. the client handing the server a bunch of resources and saying "make it happen". The output from $export isn't batch bundles...
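The "walk the input, assign new IDs, rewrite references" idea can be sketched in two passes. The counter-based IDs and the flat `reference` fields are simplifying assumptions; a real implementation would walk nested elements and, as noted, parallelize both passes.

```python
import itertools

def assign_ids_and_rewrite(resources):
    """Pass 1: give every resource a fresh server-assigned ID,
    remembering old->new in id_map. Pass 2: rewrite any reference
    that points at a remapped ID."""
    counter = itertools.count(1)
    id_map = {}  # "Type/oldId" -> "Type/newId"
    for r in resources:
        old = f"{r['resourceType']}/{r['id']}"
        r["id"] = str(next(counter))
        id_map[old] = f"{r['resourceType']}/{r['id']}"
    for r in resources:
        # Simplification: only scans top-level fields; a real
        # implementation walks the full element tree.
        for value in r.values():
            if isinstance(value, dict) and value.get("reference") in id_map:
                value["reference"] = id_map[value["reference"]]
    return resources, id_map
```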

Grahame Grieve (Mar 18 2019 at 22:32):

one key question: if there are duplicate IDs, is that by accident, or is the intent to replace them?

Brian Postlethwaite (Mar 19 2019 at 00:49):

We are planning to use a similar mechanism in our server. We will move the "external" ID that comes in from the import into an identifier that we can match in our system, letting us locate the correct resource and perform the appropriate operation on it.

Josh Mandel (Mar 19 2019 at 00:50):

+1 I think this pattern is going to be pretty common. Do you have a strategy for assigning a system on these identifiers?

Brian Postlethwaite (Mar 19 2019 at 00:57):

The system will be the base URL where we got them from (at least that's what we were thinking); otherwise we would need a system parameter somewhere in our server
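A minimal sketch of this pattern, purely illustrative (the function name and the example base URL are assumptions): before the importing server assigns its own ID, the source ID is preserved as an identifier whose system is the source's base URL.

```python
def externalize_id(resource, source_base_url):
    """Move the source-assigned resource ID into an identifier with
    system = the base URL of the server it was exported from, so the
    resource can later be matched regardless of its local ID."""
    identifier = {
        "system": source_base_url,
        "value": f"{resource['resourceType']}/{resource['id']}",
    }
    resource.setdefault("identifier", []).append(identifier)
    del resource["id"]  # the importing server will assign its own
    return resource
```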

Paul Church (Mar 19 2019 at 00:59):

I like that. Rewrite all the references by searching on their external IDs as well?

Josh Mandel (Mar 19 2019 at 00:59):

Yeah

Paul Church (Mar 19 2019 at 01:22):

If the user wants to consolidate resources from different sources that represent the same thing (perhaps a rarely-changing entity like Location), they could have one resource with multiple external ID codes. Would make analysis easier.

Grahame Grieve (Mar 19 2019 at 01:36):

I was more thinking of reuploading new versions of existing content

Josh Mandel (Mar 19 2019 at 04:18):

Right; I think the same approach works there too as long as the source system's base URL is included

Paul Church (Mar 19 2019 at 16:06):

How do you want to handle referential integrity? Perhaps during the initial pass through the data the server could conditionally create stub resources for each resource in the import, so that during the actual import references are rewritten to point to a resource that is already valid on the server.
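The stub idea can be sketched as two passes over the input; the in-memory `store` dict here is an assumption standing in for the real server, and `setdefault` plays the role of a conditional create.

```python
def import_with_stubs(resources, store):
    """Pass 1: conditionally create an empty stub for every resource
    in the import, so that by pass 2 every reference already points
    at something valid on the server. Pass 2: upsert full content."""
    for r in resources:
        key = f"{r['resourceType']}/{r['id']}"
        store.setdefault(key, {"resourceType": r["resourceType"], "id": r["id"]})
    for r in resources:
        store[f"{r['resourceType']}/{r['id']}"] = r
    return store
```

Because pass 1 touches every resource before pass 2 writes any of them, an Observation can be imported before the Patient it references without tripping an integrity check.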

Josh Mandel (Apr 05 2019 at 17:04):

We've got a quick write-up for $import at the connectathon! I got some early input from several folks here, but I'll personally take responsibility for any ambiguities and mistakes :-)

Thanks @Kurt Ericson @nicola (RIO/SS) @Dan Gottlieb @Michael Hansen @James Agnew for initial discussions.

https://confluence.hl7.org/pages/viewpage.action?pageId=46892105#id-2019-05BulkDataandAnalyticsTrack-1.Testthedraft$importoperation

Please share any thoughts on the approach; we wanted to define something simple with limited scope, but not close the door to more complex use cases downstream.

Dan Gottlieb (Jun 07 2019 at 13:21):

@Josh Mandel @Kurt Ericson @nicola (RIO/SS) @Michael Hansen @James Agnew @Nikolai Schwertner (and anyone else who's interested) Want to find time on Monday or Tuesday at Dev Days for a followup discussion on $import (we could use some of the session to come up with a plan for the September connectathon)?

Nick Robison (Jun 07 2019 at 13:32):

@Dan Gottlieb Would love to join as well, I'm planning on being there for the Tuesday meeting, if that works for everyone else.

Kurt Ericson (Jun 07 2019 at 14:51):

Yes sounds good; I'll be there and available both days to meet.

Michael Hansen (Jun 07 2019 at 15:48):

I might want to include a few people from the team too like @Jack Liu

Dan Gottlieb (Jun 07 2019 at 18:18):

Great! Thoughts on timing - maybe during lunch on Tuesday?

nicola (RIO/SS) (Jun 07 2019 at 18:40):

Maybe a session rather than lunchtime? I see free slots in the schedule.

Michael Hansen (Jun 07 2019 at 18:50):

Isn't there a popup session option or something like that?

Dan Gottlieb (Jun 07 2019 at 19:14):

@Rien Wertheim can we schedule a popup session around defining an import operation for Tuesday?

Rien Wertheim (Jun 07 2019 at 19:31):

I think so. Let me check and send you an email.

Nikolai Schwertner (Jun 08 2019 at 21:39):

I have a morning talk on Tuesday until 10am. Any time after would work for me.

Edward Yurcisin (Jun 10 2019 at 17:49):

please include me in the popup session on Tuesday. edward.yurcisin@carejourney.com.

Dan Gottlieb (Jun 10 2019 at 18:59):

We're scheduled for a popup session to discuss an import operation on Tuesday from 2:10 to 2:50 in Sonora.

Andrey Lyashin (Jun 20 2019 at 15:59):

Hi! Could you please advise where I can find the latest draft spec for bulk import? Thanks.

Michele Mottini (Jun 20 2019 at 16:49):

https://github.com/smart-on-fhir/fhir-bulk-data-docs

Andrey Lyashin (Jun 20 2019 at 17:36):

Thank you. I'm sorry, but I have found only the 'export' docs.

Paul Church (Jun 20 2019 at 17:43):

I'm not sure if @Josh Mandel has moved it to a github location, but the working doc on $import from the connectathon is https://docs.google.com/document/d/1e9YYtDxc2Yed2RWU27jPLhfyIH0d6IH5Y9ligJUBFZE/edit#heading=h.bbcu6h9umbza

Josh Mandel (Jun 20 2019 at 18:20):

(We'll move to GH and update the doc to point there before the next connectathon)

Andrey Lyashin (Jun 21 2019 at 08:07):

Thank you!


Last updated: Apr 12 2022 at 19:14 UTC