FHIR Chat · Ideas on $import · bulk data

Stream: bulk data

Topic: Ideas on $import


Paul Church (Mar 18 2019 at 22:05):

Some lessons learned from implementing an import operation in Google Cloud.
- We're loading data from cloud storage (GCS). Not much opinion on how to upload large data in general; cloud has already addressed that for us.
- Flexibility in input is good. We take wildcards. Limiting $import to accept only the file structure of $export would be undesirable; datasets come from many sources.
- $import should be an async operation with massive scalability: think gigabytes to terabytes of resources.
- Error handling: it's very helpful to have an out of band error channel - we take a user-specified GCS location where the server will write errors. If the client uploads 1M resources and 5K of them are bad, the server needs a scalable way to communicate those errors.
- We allow partial success and will import all the good resources. On the other hand, if the client uploads 1M resources and all of them are bad, maybe a fail-fast option?
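The partial-success behavior in the last two points can be sketched roughly as follows. This is an editor's illustration, not Google's implementation: the function name `import_ndjson_lines`, the in-memory `error_sink` list (standing in for a cloud storage error location), and the `fail_fast_threshold` parameter are all assumptions.

```python
import json

def import_ndjson_lines(lines, error_sink, fail_fast_threshold=1.0):
    """Import valid NDJSON lines; write one error record per bad line
    to error_sink (a stand-in for a user-specified storage location).
    If the running fraction of bad lines exceeds fail_fast_threshold,
    abort the whole import instead of continuing."""
    imported = []
    errors = 0
    for lineno, line in enumerate(lines, start=1):
        try:
            resource = json.loads(line)
            if "resourceType" not in resource:
                raise ValueError("missing resourceType")
            imported.append(resource)  # stand-in for the actual write
        except ValueError as exc:
            errors += 1
            # OperationOutcome-style error record, one per bad line,
            # kept out of band so 5K errors don't bloat the response
            error_sink.append({"line": lineno, "error": str(exc)})
        if errors / lineno > fail_fast_threshold:
            raise RuntimeError("too many errors; aborting import")
    return imported
```

With the default threshold the import always runs to completion (partial success); a caller who wants fail-fast behavior can set the threshold near zero.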

Paul Church (Mar 18 2019 at 22:05):

We have targeted a limited use case to start:
- empty FHIR store + many resources on disk
- non-empty FHIR store, but loading a chunk of self-contained resources like a terminology or profile

For this case, the import semantics are that every resource must have an ID (every operation is a PUT), referential integrity is ignored, and existing resources are overwritten without creating a _history entry.

We support two modes: ndjson where every line is a resource, and ndjson where every line is a bundle. In the bundle case, bundles are all treated as 'collection' type with no semantics (except for history bundles, handled separately). This allows importing the contents of searchset bundles, for example the output from Patient/$everything.

History bundles can be imported if the store config allows; they will populate the corresponding history as if it had occurred in the past. This is blocked from overwriting or adding to an existing history as the result might not make sense.
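The two NDJSON modes above can be sketched as a small dispatcher. This is an assumed shape, not the actual Google Cloud code; history bundles (handled separately per the post) are out of scope here.

```python
import json

def resources_from_line(line, mode):
    """Extract importable resources from one NDJSON line.
    mode "resource": the line is a single resource.
    mode "bundle": the line is a Bundle, treated as a 'collection'
    regardless of its declared type, so searchset output (e.g. from
    Patient/$everything) imports cleanly."""
    parsed = json.loads(line)
    if mode == "resource":
        return [parsed]
    if mode == "bundle":
        # Ignore request/search semantics; just take the entry resources.
        return [e["resource"] for e in parsed.get("entry", [])]
    raise ValueError(f"unknown mode: {mode}")
```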

Paul Church (Mar 18 2019 at 22:06):

Referential integrity, proper change history on overwrite, and batch/transaction semantics are all possible but likely to have a high tradeoff vs. throughput. Most of our users have wanted to "database restore" extremely large amounts of data so we have optimized for that.

I'm quite curious how people want to define the converse of the $export operation - the user journey is not clear to me. Suppose I call $export on several FHIR servers to accumulate data for a set of patients, can I call $import to get all of it into one place without additional client processing? If the data has several copies of Practitioner/123, how does the server decide whether those are actually the same resource and which is the most current?
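One possible answer to the Practitioner/123 question, offered only as an illustration: if the copies carry `meta.lastUpdated` timestamps, a server could treat same-type, same-ID copies as the same resource and keep the most recently updated one. This relies on ISO 8601 timestamps in a uniform timezone sorting lexicographically.

```python
def pick_latest(copies):
    """Given several copies of the same logical resource (same type
    and id) from different $export sources, keep the copy with the
    latest meta.lastUpdated. Copies without a timestamp sort first."""
    def last_updated(resource):
        return resource.get("meta", {}).get("lastUpdated", "")
    return max(copies, key=last_updated)
```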

Josh Mandel (Mar 18 2019 at 22:08):

These are great questions; look forward to digging in (and will propose discussion as part of a connectathon track too).

Josh Mandel (Mar 18 2019 at 22:09):

For resource IDs, is the current expectation that you'll only import resources whose IDs are valid within the target system? (Some servers impose limits on the ID space, like HAPI by default.)

Paul Church (Mar 18 2019 at 22:27):

Yes, that is the current expectation. It's not ideal.

Server-assigned IDs are conceptually not too bad to implement: just walk the input one additional time in advance, assign new IDs to everything, and keep track of how to rewrite references. Parallelizing that to handle very large input is another layer of complexity.

There is a tension between having the client write out the desired semantics explicitly (e.g. by giving batch bundles with the verbs they want) vs. the client handing the server a bunch of resources and saying "make it happen". The output from $export isn't batch bundles...
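The "walk the input, assign new IDs, rewrite references" idea can be sketched in two passes. The counter-based IDs and the flat `reference` fields are simplifying assumptions; a real implementation would walk nested elements and, as noted, parallelize both passes.

```python
import itertools

def assign_ids_and_rewrite(resources):
    """Pass 1: give every resource a fresh server-assigned ID,
    remembering old->new in id_map. Pass 2: rewrite any reference
    that points at a remapped ID."""
    counter = itertools.count(1)
    id_map = {}  # "Type/oldId" -> "Type/newId"
    for r in resources:
        old = f"{r['resourceType']}/{r['id']}"
        r["id"] = str(next(counter))
        id_map[old] = f"{r['resourceType']}/{r['id']}"
    for r in resources:
        # Simplification: only scans top-level fields; a real
        # implementation walks the full element tree.
        for value in r.values():
            if isinstance(value, dict) and value.get("reference") in id_map:
                value["reference"] = id_map[value["reference"]]
    return resources, id_map
```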

Grahame Grieve (Mar 18 2019 at 22:32):

one key question: if there are duplicate IDs, is that by accident, or is the intent to replace them?

Brian Postlethwaite (Mar 19 2019 at 00:49):

We are planning to use a similar mechanism in our server. We will move the "external" ID that comes in from the import into an identifier that we can match in our system, letting us locate the correct resource and perform the appropriate operation on it.

Josh Mandel (Mar 19 2019 at 00:50):

+1 I think this pattern is going to be pretty common. Do you have a strategy for assigning a system on these identifiers?

Brian Postlethwaite (Mar 19 2019 at 00:57):

The system will be the base URL where we got them from (at least that's what we were thinking); otherwise we would need a system parameter somewhere in our server
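A minimal sketch of this pattern, purely illustrative (the function name and the example base URL are assumptions): before the importing server assigns its own ID, the source ID is preserved as an identifier whose system is the source's base URL.

```python
def externalize_id(resource, source_base_url):
    """Move the source-assigned resource ID into an identifier with
    system = the base URL of the server it was exported from, so the
    resource can later be matched regardless of its local ID."""
    identifier = {
        "system": source_base_url,
        "value": f"{resource['resourceType']}/{resource['id']}",
    }
    resource.setdefault("identifier", []).append(identifier)
    del resource["id"]  # the importing server will assign its own
    return resource
```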

Paul Church (Mar 19 2019 at 00:59):

I like that. Rewrite all the references by searching on their external IDs as well?

Josh Mandel (Mar 19 2019 at 00:59):

Yeah

Paul Church (Mar 19 2019 at 01:22):

If the user wants to consolidate resources from different sources that represent the same thing (perhaps a rarely-changing entity like Location), they could have one resource with multiple external ID codes. Would make analysis easier.

Grahame Grieve (Mar 19 2019 at 01:36):

I was more thinking of reuploading new versions of existing content

Josh Mandel (Mar 19 2019 at 04:18):

Right; I think the same approach works there too as long as the source system's base URL is included

Paul Church (Mar 19 2019 at 16:06):

How do you want to handle referential integrity? Perhaps during the initial pass through the data the server could conditionally create stub resources for each resource in the import, so that during the actual import references are rewritten to point to a resource that is already valid on the server.
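The stub idea can be sketched as two passes over the input; the in-memory `store` dict here is an assumption standing in for the real server, and `setdefault` plays the role of a conditional create.

```python
def import_with_stubs(resources, store):
    """Pass 1: conditionally create an empty stub for every resource
    in the import, so that by pass 2 every reference already points
    at something valid on the server. Pass 2: upsert full content."""
    for r in resources:
        key = f"{r['resourceType']}/{r['id']}"
        store.setdefault(key, {"resourceType": r["resourceType"], "id": r["id"]})
    for r in resources:
        store[f"{r['resourceType']}/{r['id']}"] = r
    return store
```

Because pass 1 touches every resource before pass 2 writes any of them, an Observation can be imported before the Patient it references without tripping an integrity check.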

Josh Mandel (Apr 05 2019 at 17:04):

We've got a quick write-up for $import at the connectathon! I got some early input from several folks here, but I'll personally take responsibility for any ambiguities and mistakes :-)

Thanks @Kurt Ericson @nicola (RIO/SS) @Dan Gottlieb @Michael Hansen @James Agnew for initial discussions.

https://confluence.hl7.org/pages/viewpage.action?pageId=46892105#id-2019-05BulkDataandAnalyticsTrack-1.Testthedraft$importoperation

Please share any thoughts on the approach; we wanted to define something simple with limited scope, but not close the door to more complex use cases downstream.

Dan Gottlieb (Jun 07 2019 at 13:21):

@Josh Mandel @Kurt Ericson @nicola (RIO/SS) @Michael Hansen @James Agnew @Nikolai Schwertner (and anyone else who's interested) Want to find time on Monday or Tuesday at Dev Days for a followup discussion on $import (we could use some of the session to come up with a plan for the September connectathon)?

Nick Robison (Jun 07 2019 at 13:32):

@Dan Gottlieb Would love to join as well, I'm planning on being there for the Tuesday meeting, if that works for everyone else.

Kurt Ericson (Jun 07 2019 at 14:51):

Yes sounds good; I'll be there and available both days to meet.

Michael Hansen (Jun 07 2019 at 15:48):

I might want to include a few people from the team too like @Jack Liu

Dan Gottlieb (Jun 07 2019 at 18:18):

Great! Thoughts on timing - maybe during lunch on Tuesday?

nicola (RIO/SS) (Jun 07 2019 at 18:40):

Maybe a session rather than lunchtime? I see free slots in the schedule.

Michael Hansen (Jun 07 2019 at 18:50):

Isn't there a popup session option or something like that?

Dan Gottlieb (Jun 07 2019 at 19:14):

@Rien Wertheim can we schedule a popup session around defining an import operation for Tuesday?

Rien Wertheim (Jun 07 2019 at 19:31):

I think so. Let me check and send you an email.

Nikolai Schwertner (Jun 08 2019 at 21:39):

I have a morning talk on Tuesday until 10am. Any time after would work for me.

Edward Yurcisin (Jun 10 2019 at 17:49):

please include me in the popup session on Tuesday. edward.yurcisin@carejourney.com.

Dan Gottlieb (Jun 10 2019 at 18:59):

We're scheduled for a popup session to discuss an import operation on Tuesday from 2:10 to 2:50 in Sonora.

Andrey Lyashin (Jun 20 2019 at 15:59):

Hi! Could you please advise where I can find the latest draft spec for bulk import? Thanks.

Michele Mottini (Jun 20 2019 at 16:49):

https://github.com/smart-on-fhir/fhir-bulk-data-docs

Andrey Lyashin (Jun 20 2019 at 17:36):

Thank you. I'm sorry, but I have found only the 'export' docs.

Paul Church (Jun 20 2019 at 17:43):

I'm not sure if @Josh Mandel has moved it to a github location, but the working doc on $import from the connectathon is https://docs.google.com/document/d/1e9YYtDxc2Yed2RWU27jPLhfyIH0d6IH5Y9ligJUBFZE/edit#heading=h.bbcu6h9umbza

Josh Mandel (Jun 20 2019 at 18:20):

(We'll move to GH and update the doc to point there before the next connectathon)

Andrey Lyashin (Jun 21 2019 at 08:07):

Thank you!


Last updated: Apr 12 2022 at 19:14 UTC