FHIR Chat · NDJSON Resource Expectations · bulk data

Stream: bulk data

Topic: NDJSON Resource Expectations


Matthew Spielman (Jun 30 2021 at 11:19):

Any likelihood of being more prescriptive on the permitted formats of the NDJSON output from a bulk FHIR endpoint? My concern as we implement this in our stack is that there is going to be ambiguity on the client side about what to expect when processing an NDJSON file. While I believe the language of the spec is very permissive (i.e. it does not prohibit a bundle of resources, a transaction bundle, or individual NDJSON files for each resource type), I am concerned that, in the field, implementations will be very slow and patchy, as clients written to the generic spec will struggle with compatibility with server payload formatting. Have I just missed some language in the IG that clarifies this?

Michele Mottini (Jun 30 2021 at 13:18):

The intent (and every implementation I have seen) is definitely to have different NDJSON files for each resource type, with the resources stored directly, not inside bundles. I guess you can have NDJSON of Bundles if that is a native resource type in that server, but that's it.

Josh Mandel (Jun 30 2021 at 13:44):

When you're performing an export, a client names a list of types of data to export. Those types correspond to the output types in the resulting manifest. If there is some ambiguity, we could add clarification, but we have not seen any practical deviations or confusion on this point, to my knowledge.
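
To make the type-to-file correspondence concrete, here is a minimal Python sketch of how a client might group the files listed in a completed export manifest by their declared type. The manifest below is a made-up example (the URLs, timestamp, and file names are assumptions for illustration), and `files_by_type` is a hypothetical helper, not part of any library.

```python
import json
from collections import defaultdict

# Hypothetical manifest for illustration only; URLs and values are invented.
MANIFEST_JSON = """
{
  "transactionTime": "2021-06-30T13:44:00Z",
  "request": "https://example.org/fhir/Patient/$export?_type=Patient,Observation",
  "requiresAccessToken": true,
  "output": [
    {"type": "Patient", "url": "https://example.org/files/patient_1.ndjson"},
    {"type": "Observation", "url": "https://example.org/files/observation_1.ndjson"},
    {"type": "Observation", "url": "https://example.org/files/observation_2.ndjson"}
  ],
  "error": []
}
"""

def files_by_type(manifest: dict) -> dict:
    """Group the manifest's output URLs by their declared resource type."""
    grouped = defaultdict(list)
    for item in manifest.get("output", []):
        grouped[item["type"]].append(item["url"])
    return dict(grouped)

grouped = files_by_type(json.loads(MANIFEST_JSON))
for resource_type, urls in grouped.items():
    print(resource_type, len(urls))
```

A client can then decide which types to download first, for example Patient files before the resources that reference them.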

Yunwei Wang (Jun 30 2021 at 13:54):

I think it is pretty clear: "Each file SHALL contain resources of only one type, but a server MAY create more than one file for each resource type returned."
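
A client that wants to guard against deviations from the quoted rule could check each downloaded file before processing it. The following is a rough sketch under the assumption that the file fits in memory as a string; `resource_types_in` and `is_single_type_file` are hypothetical helper names.

```python
import json

def resource_types_in(ndjson_text: str) -> set:
    """Return the set of resourceType values found in an NDJSON payload."""
    types = set()
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        types.add(json.loads(line)["resourceType"])
    return types

def is_single_type_file(ndjson_text: str, expected_type: str) -> bool:
    """True if every resource in the payload has the expected type.

    An empty payload is treated as valid in this sketch.
    """
    return resource_types_in(ndjson_text) <= {expected_type}

good = '{"resourceType":"Patient","id":"p1"}\n{"resourceType":"Patient","id":"p2"}\n'
mixed = '{"resourceType":"Patient","id":"p1"}\n{"resourceType":"Observation","id":"o1"}\n'
print(is_single_type_file(good, "Patient"), is_single_type_file(mixed, "Patient"))
```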

Cooper Thompson (Jun 30 2021 at 15:00):

I will mention that we have run into some challenges in our Bulk Client implementation, where having Transaction Bundles might actually make things easier. For example, a lab result will often be represented by a combination of DiagnosticReport and Observation resources. And since we are a facade server, we really want to file everything related to a single result at once. The current structure does make this difficult, as we basically need to process all resources in all files before we can start doing any data filing (which kinda defeats the purpose of ndjson). Have other folks (who probably aren't using FHIR as the data storage format) run into similar client-side processing issues?
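
The buffering problem described here can be illustrated with a small Python sketch. It assumes DiagnosticReport.result holds relative references of the form "Observation/<id>" (real exports may use other reference styles), and `group_lab_results` is a hypothetical helper. Note that every Observation has to be indexed before the first report can be filed, which is exactly the cost being discussed.

```python
import json

def load_ndjson(text):
    """Parse an NDJSON string into a list of resources, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def group_lab_results(report_ndjson, observation_ndjson):
    """Pair each DiagnosticReport with the Observations it lists in `result`."""
    # The whole Observation file is read and indexed up front.
    observations = {
        f"Observation/{o['id']}": o for o in load_ndjson(observation_ndjson)
    }
    grouped = []
    for report in load_ndjson(report_ndjson):
        refs = [r["reference"] for r in report.get("result", []) if "reference" in r]
        grouped.append((report, [observations[ref] for ref in refs if ref in observations]))
    return grouped

reports = (
    '{"resourceType":"DiagnosticReport","id":"dr1",'
    '"result":[{"reference":"Observation/o1"},{"reference":"Observation/o2"}]}\n'
)
observation_file = (
    '{"resourceType":"Observation","id":"o1"}\n'
    '{"resourceType":"Observation","id":"o2"}\n'
    '{"resourceType":"Observation","id":"o3"}\n'
)
for report, results in group_lab_results(reports, observation_file):
    print(report["id"], [o["id"] for o in results])
```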

Yunwei Wang (Jun 30 2021 at 15:46):

Has this been addressed in bulk $import?

Paul Church (Jun 30 2021 at 16:14):

The Google import operation has a mode that accepts bundles and unwraps them into individual resources (without regard for the bundle type or semantics, it's just a bag of resources - but this could be improved in the future) and users have found that useful. It comes up a lot in cases like synthea where the input is per-patient bundles, but could also apply to scenarios like this.

I think there's value in expanding beyond the "ndjson file per resource type" use case currently in $export.
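
The "bag of resources" unwrapping described above could look roughly like the Python sketch below. This is an illustration only and not Google's implementation; `unwrap_bundles` is a hypothetical helper that ignores Bundle.type and passes bare resources through unchanged.

```python
import json

def unwrap_bundles(ndjson_text):
    """Yield every resource found in Bundle.entry, ignoring the Bundle's type."""
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        resource = json.loads(line)
        if resource.get("resourceType") == "Bundle":
            for entry in resource.get("entry", []):
                if "resource" in entry:
                    yield entry["resource"]
        else:
            yield resource  # already a bare resource; pass it through

per_patient = (
    '{"resourceType":"Bundle","type":"transaction","entry":['
    '{"resource":{"resourceType":"Patient","id":"p1"}},'
    '{"resource":{"resourceType":"Observation","id":"o1"}}]}\n'
    '{"resourceType":"Patient","id":"p2"}\n'
)
print([r["id"] for r in unwrap_bundles(per_patient)])
```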

Josh Mandel (Jun 30 2021 at 16:52):

Paul Church said:

I think there's value in expanding beyond the "ndjson file per resource type" use case currently in $export.

Just to be clear: value in expanding the model within the $export paradigm or value in expanding to support other paradigms like $import ?

Paul Church (Jun 30 2021 at 16:58):

Probably both. The current $export paradigm is pretty narrow; for the case @Cooper Thompson raises, it sounds entirely reasonable that the server would want to organize resources in a way that makes more sense for the use case. Then on the other end, $import benefits from more flexibility anyway, as it may get input from sources that are not just $export.

Matthew Spielman (Jun 30 2021 at 17:11):

Yeah, I think others may have the same challenge as well. I am trying to build this into our commercial development/integration stack, which will ultimately service multiple internal products as well as our ISVs that use our technology. For some of our products and some of our customers, the internal APIs to grab patient data are optimized at the patient level (i.e. pull a patient and all the associated data) and are not efficient with requests where we are doing things like "grab all medications for this cohort of patients, then give me their problems, etc."

While it's not a technical blocker, it does force us into some really inefficient processing, where we need to collect the data for potentially thousands of patients and then re-sort all their data into different files by resource type, instead of just being able to pipe each patient's compartment into a bundle sequentially.
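
The re-sorting step described here can be sketched in Python as follows. This assumes the server can iterate over one patient's resources at a time; `partition_by_type` and the `open_sink` callback are hypothetical names, and in-memory `io.StringIO` buffers stand in for real output files.

```python
import io
import json

def partition_by_type(patient_compartments, open_sink):
    """Write resources from per-patient groups into one NDJSON sink per type."""
    sinks = {}
    for compartment in patient_compartments:  # one iterable of resources per patient
        for resource in compartment:
            rtype = resource["resourceType"]
            if rtype not in sinks:
                sinks[rtype] = open_sink(rtype)  # lazily open a sink for a new type
            sinks[rtype].write(json.dumps(resource) + "\n")
    return sinks

compartments = [
    [{"resourceType": "Patient", "id": "p1"}, {"resourceType": "Condition", "id": "c1"}],
    [{"resourceType": "Patient", "id": "p2"}, {"resourceType": "MedicationRequest", "id": "m1"}],
]
sinks = partition_by_type(compartments, lambda rtype: io.StringIO())
print(sorted(sinks))  # ['Condition', 'MedicationRequest', 'Patient']
```

Streaming into per-type sinks avoids holding every patient in memory, but no output file is complete until the last patient has been processed.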

Josh Mandel (Jun 30 2021 at 17:15):

It's definitely the case that for population-level export, we modeled the output in a way that biased it towards convenience of consumption, rather than convenience of production (we heard that it's often useful for clients to populate DB tables directly from these files).

Matthew Spielman (Jun 30 2021 at 17:33):

That is true if the client system has a relational data model that is aligned with FHIR and isn't too strict about referential integrity. Otherwise you need to spend a bit of time hopping between files to grab required data for a transaction. If the client system is actually mapping the FHIR resources to internal API calls, the same problem applies. I would posit that a lot of existing, established commercial offerings will have that kind of challenge as bulk FHIR clients.

Again, it's a challenge that our developers can solve (and probably get paid more to do so) but it does introduce some real inefficiencies.

Matthew Spielman (Jun 30 2021 at 17:34):

I don't necessarily know that there is a perfect answer that makes everyone happy but it could introduce some friction into the adoption of the IG

Cooper Thompson (Jun 30 2021 at 21:27):

Josh Mandel said:

It's definitely the case that for population-level export, we modeled the output in a way that biased it towards convenience of consumption, rather than convenience of production (we heard that it's often useful for clients to populate DB tables directly from these files).

I wish Epic had been about a year further along as a Bulk FHIR client so that we could have represented that side better in the Argo project last year. But as a client, we've decided that we have the same problems when consuming as we do when producing. Which is to say, our system is patient-oriented (just like @Matthew Spielman ) and it would be much easier if we could file all the data for a single patient at once before moving on to the next patient.
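
With today's per-type files, a patient-oriented client has to regroup resources itself. The Python sketch below shows one way, assuming resources link to their patient through a `subject` or `patient` element holding a relative "Patient/<id>" reference; `patient_key` and `group_by_patient` are hypothetical helpers, and resources with no patient link are collected under `None`.

```python
from collections import defaultdict

def patient_key(resource):
    """Return 'Patient/<id>' for a resource, or None if no patient link is found."""
    if resource.get("resourceType") == "Patient":
        return f"Patient/{resource['id']}"
    for field in ("subject", "patient"):
        ref = resource.get(field, {}).get("reference")
        if ref and ref.startswith("Patient/"):
            return ref
    return None

def group_by_patient(resources):
    """Bucket resources by the patient they belong to."""
    groups = defaultdict(list)
    for resource in resources:
        groups[patient_key(resource)].append(resource)
    return dict(groups)

sample = [
    {"resourceType": "Patient", "id": "p1"},
    {"resourceType": "Observation", "id": "o1", "subject": {"reference": "Patient/p1"}},
    {"resourceType": "AllergyIntolerance", "id": "a1", "patient": {"reference": "Patient/p2"}},
    {"resourceType": "Organization", "id": "org1"},
]
by_patient = group_by_patient(sample)
print({key: len(items) for key, items in by_patient.items()})
```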

Josh Mandel (Jun 30 2021 at 21:36):

That's fair, thanks for the comments Cooper. In terms of your ideal export configuration...

  • Would all data about a given patient be in the same ndjson file as discrete resources/lines?
  • Would all data about a given patient be wrapped up in a bundle within a single ndjson line? (I assume not, since these get unbounded.)
  • Would an ndjson file have data about lots of patients, or would you have one patient per file?
  • What about resources like organizations or practitioners that aren't associated with a single patient? Repeat them, or represent them in distinct files from the patient data?

Brian Postlethwaite (Jun 30 2021 at 21:57):

Could also be a step in staging the content, rather than the format.
Patient per file (or am I mistaking something there) would blow out the number of files in the status file...

Josh Mandel (Jun 30 2021 at 23:03):

Brian Postlethwaite said:

Patient per file (or am I mistaking something there) would blow out the number of files in the status file...

For sure; we hadn't ever really considered it, for that reason. Manifest can't be broken up, and it'd grow 1:1 with patient count. (But, maybe it's not totally untenable; like I said, we never really considered it.)

Yunwei Wang (Jun 30 2021 at 23:16):

I think output as Bundle is an option. Bulk Export does not exclude Bundle as an output resource. All it says is "Each file SHALL contain resources of only one type". So Bundle.ndjson is definitely feasible.

Josh Mandel (Jun 30 2021 at 23:20):

That's only okay (per the current spec) if you're exporting bundles -- the equivalent of GET /Bundle from a FHIR REST API -- asking the server for Bundles it has stored

Josh Mandel (Jun 30 2021 at 23:21):

It's not a way to somehow get your regular Patient data exported in Bundles.

Josh Mandel (Jun 30 2021 at 23:22):

If we want new behavior or flexibility, we need to specify it. I'm not convinced, but I am open minded / sympathetic / interested in learning more about what folks would find useful.

Matthew Spielman (Jul 01 2021 at 03:04):

I'd be willing to discuss more and explain some of the types of (ostensibly impedance mismatch) issues that exist between products and technologies written 10-20 years ago that I'm trying to address with our current development around FHIR and Bulk FHIR. I've started and stopped trying to summarize them in this thread a few times this evening but it just turns into a dissertation that no one is going to want to read. Perhaps there will be a good opportunity to discuss synchronously at some point in the future.

Josh Mandel (Jul 01 2021 at 03:55):

Yeah -- would you be up for a discussion next week, ideally one we can record/share?

Matthew Spielman (Jul 01 2021 at 14:17):

Josh Mandel said:

Yeah -- would you be up for a discussion next week, ideally one we can record/share?

Of course, more than happy to. My schedule is only a mild disaster next week but if you send me an email, perhaps Thursday or Friday? (Matthew.spielman@intersystems.com)

Sam Sayer (Jul 01 2021 at 18:42):

With the new PnP spec for $import, putting this Bundle capability into $export would fix it for both

Sam Sayer (Jul 01 2021 at 18:43):

This was a roadblock we've run into when trying to figure out PnP $import as it relies on $export and it forces us to basically "stage" all the resources and then figure out the references between them

Paul Church (Jul 01 2021 at 19:19):

Definite +1 on that, figuring out the references in any non-trivial import is a big problem. Solutions that work for 1000 resources might not work for 1 million resources, and 1 billion is even worse.

Having the exporting server assemble them into transaction bundles that are each self-contained is a good way out, at least for use cases where the exporter can anticipate what the reference structure looks like.
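
As a rough illustration of a self-contained transaction bundle, the Python sketch below wraps one patient's resources into a Bundle of type "transaction" with one PUT entry per resource, serialized as a single NDJSON line. The `base_url` value and the `to_transaction_bundle` helper are assumptions for illustration; nothing in the current $export specification defines this output.

```python
import json

def to_transaction_bundle(resources, base_url):
    """Wrap resources in a transaction Bundle with one PUT entry per resource."""
    entries = []
    for r in resources:
        relative = f"{r['resourceType']}/{r['id']}"
        entries.append({
            "fullUrl": f"{base_url}/{relative}",
            "resource": r,
            # PUT to the resource's own id keeps references inside the bundle stable.
            "request": {"method": "PUT", "url": relative},
        })
    return {"resourceType": "Bundle", "type": "transaction", "entry": entries}

patient_resources = [
    {"resourceType": "Patient", "id": "p1"},
    {"resourceType": "Observation", "id": "o1", "subject": {"reference": "Patient/p1"}},
]
bundle = to_transaction_bundle(patient_resources, "https://example.org/fhir")
ndjson_line = json.dumps(bundle)  # one bundle per NDJSON line
print(len(bundle["entry"]))
```

Because json.dumps emits no newlines by default, each bundle occupies exactly one line, though (as noted earlier in the thread) such lines can grow without bound for patients with large records.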

Josh Mandel (Jul 01 2021 at 21:14):

I think we'd want an explicit client-supplied option on export to say "partition data into transaction bundles" -- and also we'd want a way to distinguish these bundles from the ones that'd result from "GET /Bundle". Might require a new output type/flag/extension.

Notification Bot (Jul 02 2021 at 17:14):

This topic was moved by Josh Mandel to #ibm > NDJSON Resource Expectations

Sam Sayer (Jul 06 2021 at 14:57):

I think some kind of "partition resource" flag which indicates the resource to build the Bundle around would work. I think we'd still have to figure out how to de-dup things like providers (since the same provider would be referenced by multiple patient bundles in our case), but that's fairly straightforward using the IDs from the data provider.
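
The ID-based de-duplication mentioned here could be sketched in Python as below, assuming the data provider's resource IDs are stable across patient bundles. `dedupe` is a hypothetical helper that keeps the first copy of each (resourceType, id) pair.

```python
def dedupe(resources):
    """Keep the first occurrence of each (resourceType, id) pair, preserving order."""
    seen = set()
    unique = []
    for r in resources:
        key = (r["resourceType"], r["id"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

# The same Practitioner appears in two different patient bundles.
from_bundles = [
    {"resourceType": "Patient", "id": "p1"},
    {"resourceType": "Practitioner", "id": "dr-7"},
    {"resourceType": "Patient", "id": "p2"},
    {"resourceType": "Practitioner", "id": "dr-7"},
]
print([f"{r['resourceType']}/{r['id']}" for r in dedupe(from_bundles)])
```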

Josh Mandel (Jul 09 2021 at 20:18):

Thanks @Matthew Spielman, Alexander Brown, and @Dan Gottlieb for the discussion this afternoon. I've uploaded to https://youtu.be/75Rsgu1Rqnc in case others are interested

In this session, I catch up with Alexander Brown and Matthew Spielman (Intersystems) and Dan Gottlieb (Central Square Solutions). We discuss FHIR Bulk Data formats for $export and $import; dig into architecture for data processing; and compare strategies for grouping data. We review regulatory timeframes for ONC Certification / Cures Rule compliance, $export API extensibility, and techniques to balance the diverse needs of clients and servers.

Sam Sayer (Aug 04 2021 at 16:47):

Is anyone planning on doing anything related to this at the Connectathon?

John Silva (Aug 04 2021 at 18:42):

Where can one find (prelim?) specs on Bulk Data Import? Thanks in advance!

Josh Mandel (Aug 04 2021 at 18:42):

https://github.com/smart-on-fhir/bulk-import/blob/master/import.md

Josh Mandel (Aug 04 2021 at 18:43):

This links to two proposals; the "Ping and Pull Proposal" is our latest/best current thinking.

Raj Sankuratri (Oct 07 2021 at 13:18):

Hi Josh, is there any update to the discussion you posted on YouTube regarding grouping data for bulk $export and $import?

Josh Mandel (Oct 07 2021 at 13:39):

For $import, I would certainly encourage folks to experiment. We have a draft specification but it's only going to be through real world feedback that we might take this into a formal standards project.

For $export, these operations have been formally standardized and I don't think we would change their default behavior, but we could introduce new output modes to accommodate different grouping methods. If anyone is interested in prototyping this, I would suggest just adding an optional flag to the kickoff request, defaulting it to the currently standardized grouping method (by resource type) but introducing client-facing options for finer-grained control.
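
For anyone prototyping along these lines, a kickoff request with an optional grouping flag might be built as in the Python sketch below. The `_groupBy` parameter name and its "patient" value are purely hypothetical and do not exist in the Bulk Data specification; `_type`, the `Accept` header, and `Prefer: respond-async` are the standard parts of the kickoff request. Omitting the flag leaves the request identical to a standard export, so the default grouping by resource type is preserved.

```python
from urllib.parse import urlencode

def build_kickoff_request(base_url, types, grouping=None):
    """Return (url, headers) for a patient-level $export kickoff request."""
    params = {"_type": ",".join(types)}
    if grouping is not None:
        # HYPOTHETICAL parameter for prototyping; not defined by the specification.
        params["_groupBy"] = grouping
    url = f"{base_url}/Patient/$export?{urlencode(params, safe=',')}"
    headers = {"Accept": "application/fhir+json", "Prefer": "respond-async"}
    return url, headers

default_url, headers = build_kickoff_request(
    "https://example.org/fhir", ["Patient", "Observation"]
)
grouped_url, _ = build_kickoff_request(
    "https://example.org/fhir", ["Patient", "Observation"], grouping="patient"
)
print(default_url)
print(grouped_url)
```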


Last updated: Apr 12 2022 at 19:14 UTC