Stream: smart
Topic: flat-file export format
David Teirney (Aug 16 2017 at 21:27):
Is there any more information about what the flat-file export format might be, relevant to http://docs.smarthealthit.org/flat-fhir/?
We've started using NDJSON-formatted content for a number of our bulk export formats (http://ndjson.org/), the same format Elasticsearch uses, so we can stream the data out of the system without any significant memory burden on our servers.
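As an illustration of the streaming property described above, here is a minimal sketch (not any particular vendor's export code; the resource dicts stand in for rows pulled from a database cursor):

```python
import io
import json

def stream_ndjson(resources, out):
    """Write one JSON document per line, one resource at a time,
    so memory use stays flat regardless of export size."""
    for resource in resources:
        out.write(json.dumps(resource))
        out.write("\n")

# Hypothetical Patient resources supplied by a generator, so the full
# export set is never materialized in memory at once.
buf = io.StringIO()
stream_ndjson(({"resourceType": "Patient", "id": str(i)} for i in range(3)), buf)
```

Because each line is a complete JSON document, the output can be consumed line by line on the receiving end with the same flat memory profile.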
Where would we comment on what the export format might end up being?
Dan Gottlieb (Aug 17 2017 at 13:13):
I was initially thinking of FHIR bundles as the output format, but agree that newline delimited JSON seems like a better fit for large datasets. In general, we should probably use github issues to manage changes to the spec, but since it's at such an early stage I went ahead and just updated the proposal.
David Hay (Aug 17 2017 at 21:06):
So right now, it's just a proposal (the export part, I mean)? It's just that its inclusion in an RFI would suggest that it's a bit more mature than that...
Dan Gottlieb (Aug 18 2017 at 13:11):
Yup, just a proposal and not very mature. Also, to clarify, it's not in the published model RFP language, just in the optional add-on functionality section of the draft language we posted for comment, as a possible method of specifying bulk data export. That said, any thoughts on the best way to make it clearer on the flat FHIR page that it's still a very early stage proposal?
Grahame Grieve (Aug 18 2017 at 23:25):
using ndjson doesn't begin to cover the kind of use cases we have been talking about
Dan Gottlieb (Aug 19 2017 at 18:57):
Interesting - what's an example of such a use case?
Grahame Grieve (Aug 21 2017 at 06:43):
see the analytics on FHIR stream, and look for mention of Parquet
Josh Mandel (Aug 21 2017 at 16:30):
On flat file export: it might still be helpful to think about patient-oriented exports that leverage $everything
(which could mean newline-delimited JSON where each line is a whole patient-specific $everything bundle)
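A simplified sketch of that layout (not real $everything output; the Bundle contents here are invented for illustration): each NDJSON line holds a whole patient-specific Bundle, so filtering to a patient subset is a line-level operation that never parses unrelated patients' data.

```python
import json

# Hypothetical per-patient Bundles, one per NDJSON line.
bundles = [
    {"resourceType": "Bundle", "type": "searchset", "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {"resourceType": "Observation", "id": "o1"}}]},
    {"resourceType": "Bundle", "type": "searchset", "entry": [
        {"resource": {"resourceType": "Patient", "id": "p2"}}]},
]
ndjson = "\n".join(json.dumps(b) for b in bundles)

# Keep only the lines whose first entry is patient p1.
subset = [line for line in ndjson.splitlines()
          if json.loads(line)["entry"][0]["resource"]["id"] == "p1"]
```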
Dan Gottlieb (Aug 24 2017 at 19:58):
Thanks Grahame - so the use case is to have the data in a format that can be used natively in the Hadoop ecosystem? I don't have big data experience, so this may be a naive question, but how critical is this relative to exporting in ndjson and converting for use in Hadoop? It seems like going directly to a binary format like Parquet will complicate things on the FHIR server side (which can already serialize JSON), and preclude a bunch of other use cases (like easily loading a small or mid-sized dataset into MySQL).
Dan Gottlieb (Aug 24 2017 at 20:03):
That’s a good point Josh. One file per patient has the nice property of making it simple to filter patients and only ETL a subset, though it may result in a lot of files with just a few resources in each for a nightly update. As you pointed out, in either case, the ndjson format would work and has the nice property of being parallelizable in both the generation and ETL stages.
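The parallelizability claim above can be sketched concretely: because every NDJSON line is a complete, self-contained JSON document, workers can parse lines independently with no shared parser state. A minimal sketch (the Patient lines are hypothetical; a real ETL job would fan lines out across processes or machines rather than threads):

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical Patient resources, one complete JSON document per line.
lines = [json.dumps({"resourceType": "Patient", "id": f"p{i}"}) for i in range(4)]

def extract_id(line):
    # Parse a single line in isolation; no other line is needed.
    return json.loads(line)["id"]

# pool.map preserves input order while parsing lines concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    ids = list(pool.map(extract_id, lines))
```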
Grahame Grieve (Aug 24 2017 at 21:10):
that's not a definition of parallelizable I'm familiar with
Grahame Grieve (Aug 24 2017 at 21:11):
I don't think hadoop itself is a requirement, but the approach should be scalable. to me, ndjson isn't really - you still need to parse json
Dan Gottlieb (Aug 24 2017 at 21:43):
Why isn't parsing json scalable? A binary format would be smaller and faster, but json has the benefit of simplicity. Is the performance penalty really too big to use it as an intermediate data format?
Grahame Grieve (Aug 24 2017 at 21:45):
it's certainly simpler. But there's no answer to the second question - the only way to find out is to try in a variety of contexts
Last updated: Apr 12 2022 at 19:14 UTC