Stream: analytics on FHIR
Topic: Parquet
Grahame Grieve (Apr 18 2017 at 21:03):
so on the implementers channel, there's some interest from Cerner in defining and encouraging use of a parquet format for FHIR resources.
Grahame Grieve (Apr 18 2017 at 21:03):
does anyone have any comment about this?
Fahim Shariar (Apr 19 2017 at 09:36):
@Grahame Grieve I agree that Apache Parquet would be a nice fit for Hadoop and deep analytics. I would encourage it, but I think getting involved with the Parquet format will be a challenge because many of us are not familiar with it.
Grahame Grieve (Apr 19 2017 at 12:15):
but do your tools support it?
Fahim Shariar (Apr 22 2017 at 22:25):
No, our tools do not support it. We are trying a different (ontological) approach with a graph database, currently Neo4j. We convert FHIR JSON resources into graph nodes and edges. At the moment we are studying the FHIR resources and trying to find node relationships within a single resource such that there are no orphan child nodes and the nodes can connect to nodes from other resources.
nicola (RIO/SS) (Apr 23 2017 at 05:58):
@Grahame Grieve Parquet is related to Avro and to our JSON format (and JSON Schema) - if we fix polymorphics & extensions, we'll get a clean JSON Schema - and Avro + Parquet for free :)
To convert your JSON, you need to convert the records to Avro in-memory objects and pass those to Parquet; you don't need to convert a whole file to Avro and then to Parquet.
Conversion to Avro objects is already done for you (see Kite's JsonUtil) and is ready to use as a file reader. The conversion method needs an Avro schema, but you can use that same library to infer an Avro schema from the JSON data.
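A minimal sketch of the same JSON-to-Parquet idea, assuming Python with pyarrow rather than the Kite tooling mentioned above (the file names and the reliance on schema inference are illustrative assumptions, not part of the original discussion):

```python
# Sketch: convert newline-delimited FHIR JSON resources to Parquet,
# letting pyarrow infer a schema from the records.
# Assumes `pip install pyarrow`; file names are placeholders.
import json

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical input: one FHIR resource per line (NDJSON).
with open("patients.ndjson") as f:
    records = [json.loads(line) for line in f]

# The schema is inferred from the Python dicts; for real FHIR data an
# explicit schema derived from the FHIR definitions would be more robust.
table = pa.Table.from_pylist(records)
pq.write_table(table, "patients.parquet")
```

Schema inference from sample records is brittle for FHIR, since optional elements vary from resource to resource - which is exactly the schema question raised later in this thread.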
nicola (RIO/SS) (Apr 23 2017 at 06:02):
There are also Apache Arrow, protobuf, and plenty of other formats - YAML, EDN, MessagePack - and most of them are very sensitive to the FHIR JSON design.
Krzysztof Suchomski (Apr 24 2017 at 08:26):
We are also looking at analytics on FHIR, and introducing a Parquet/Avro format would help with that. Tools for data analytics work better with those formats than with pure JSON. As @nicola (RIO/SS) wrote, this would probably require some fixes to be applied, but it would open up a lot of possibilities around using Hadoop-based tools.
A side effect of this step would also be easier upgrades through schema versioning. https://avro.apache.org/docs/1.8.1/spec.html#Schema+Resolution
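As a hedged illustration of the schema resolution that link describes, here is a small Python sketch using the fastavro library (the record and field names are made up for the example):

```python
# Sketch of Avro schema resolution: data written with an old schema is read
# back with a newer reader schema that adds a field with a default value.
# Assumes `pip install fastavro`; the schemas are illustrative only.
import io

import fastavro

writer_schema = {
    "type": "record",
    "name": "Patient",
    "fields": [{"name": "id", "type": "string"}],
}

reader_schema = {
    "type": "record",
    "name": "Patient",
    "fields": [
        {"name": "id", "type": "string"},
        # New optional field: records written before it existed resolve to the default.
        {"name": "gender", "type": ["null", "string"], "default": None},
    ],
}

buf = io.BytesIO()
fastavro.writer(buf, writer_schema, [{"id": "example"}])
buf.seek(0)

for record in fastavro.reader(buf, reader_schema=reader_schema):
    print(record)  # {'id': 'example', 'gender': None}
```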
natus (Apr 25 2017 at 08:35):
Hi all,
FHIR resources can indeed be serialized to any nested data structure (JSON, Avro, Parquet, Hive tables... PostgreSQL/MySQL JSONB datatypes, Python/R-backed applications and more). But since resources are linked to each other with relational links, NoSQL stores such as CouchDB or MongoDB do not look like a good solution to me.
The Parquet format is, AFAIK, not intended for real-time applications because it is immutable (column-oriented). Moreover, the way one uses Parquet (through Apache Drill or Hive) is not user-friendly for querying nested structures.
Avro (row-oriented) is designed to ingest real-time data.
Beyond those technical aspects, can we discuss these steps for delivering a "FHIR-ready database":
1) define profiles and tell how you implement resources (is Furore's Spark the only existing tool?)
2) implement resources with data
3) implement resources with metadata (say data elements, coding systems...)
In theory this would allow data scientists to prepare programs using only the defined profiles?
natus (Apr 25 2017 at 10:12):
Hi,
I have found more details about Apache Drill, Parquet, and FHIR analytics.
http://amia.analystseim.com/
http://wiki.hl7.org/index.php?title=201705_Data_Analytics
Michel Rutten (Apr 25 2017 at 11:07):
Hi @natus FYI Forge is the FHIR profile editor by Furore, my company (spark is the name of our old and obsolete DSTU2 FHIR server, which is superseded by our new Vonk server). David Hay develops ClinFhir, an alternative profiling tool designed specifically for clinicians. Trifolia by Lantana is another online tool that supports FHIR profiling. And the MDHT tool for UML-based modelling also supports FHIR.
Michael Osborne (Apr 25 2017 at 14:26):
@Michel Rutten Does Vonk have an End Point yet?
Abbie Watson (Apr 25 2017 at 14:43):
"but do your tools support it?"
The Meteor/Node ecosystem apparently has a decently robust node-parquet package that we could pull into Meteor on FHIR. There's also a mongo/hadoop adapter on the Hadoop side of things. Wouldn't be our first choice of formats, but we can roll with parquet.
https://www.npmjs.com/package/node-parquet
https://github.com/mongodb/mongo-hadoop
Michel Rutten (Apr 25 2017 at 15:02):
@Michael Osborne certainly, the official Vonk STU3 release is scheduled for tomorrow. We will announce the new endpoint on Zulip. Stay tuned!
Ryan Brush (Apr 25 2017 at 21:18):
I'll add some context from the initial thread that kicked off this discussion. We (Cerner) are interested in an efficient binary format that can be used to share hundreds of millions of FHIR resources and load them into a variety of analytic tools. As others have pointed out in this thread, there are a number of good options here: things like Protocol Buffers or Avro are certainly up to the task as well. If those (or others) have a deeper integration into other ecosystems and are more easily used, then it's worth looking into.
I originally floated the idea of Parquet since it uses a columnar structure that can be efficiently and directly served by a large set of scalable analytic tools, requiring no (potentially expensive) transformation. This includes Apache projects like Spark, Impala, and Presto and several scalable database systems like Amazon Athena, Google BigQuery, Microsoft PolyBase, and HP Vertica.
Of course, other formats could be converted to Parquet (there already exists good tooling for bidirectional Avro/Parquet integration), so we don't necessarily need to conflate efficient data serving with efficient data sharing.
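For what it's worth, a minimal sketch of one direction of that Avro-to-Parquet conversion in Python, assuming fastavro and pyarrow (file names are hypothetical; the Hadoop-side tooling Ryan refers to is not shown here):

```python
# Sketch: read Avro records row by row and rewrite them as columnar Parquet.
# Assumes `pip install fastavro pyarrow`; file names are placeholders.
import fastavro
import pyarrow as pa
import pyarrow.parquet as pq

with open("resources.avro", "rb") as f:
    records = list(fastavro.reader(f))

pq.write_table(pa.Table.from_pylist(records), "resources.parquet")
```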
Grahame Grieve (Apr 25 2017 at 21:19):
Avro is better documented than Parquet.
Ryan Brush (Apr 25 2017 at 21:24):
@Grahame Grieve Agreed. Avro is also more mature and has evolved to be more user-facing, whereas Parquet evolved as a file format. Avro could make a great deal of sense for those reasons, and I think we'd be happy with that if it seems like a better fit for the community.
Grahame Grieve (Apr 25 2017 at 21:37):
Typically, I'd prefer a file format but I just can't find any documentation about Parquet, and nothing tells me how to handle things like schema modularity
natus (Apr 25 2017 at 21:40):
There is also the ORC columnar format. These days ORC is a good challenger: it is highly compatible with Hive and Presto. I plan to give it a try soon. https://orc.apache.org/ Moreover, Hive offers a friendly way to query nested structures - see http://thornydev.blogspot.fr/2013/07/querying-json-records-via-hive.html (section "The best option: rcongiu's Hive-JSON SerDe"). It is really better than Apache Drill's way of exploding JSON objects/arrays.
Peter Bernhardt (May 04 2017 at 17:12):
@Grahame Grieve parquet is a binary file format. Here's a link that provides details: http://bigdata.devcodenote.com/2015/04/parquet-file-format.html
Peter Bernhardt (May 04 2017 at 17:13):
We use it at RH
Grahame Grieve (May 04 2017 at 17:54):
well, that's a summary. It hardly constitutes 'details'
Peter Bernhardt (May 04 2017 at 17:55):
very funny , @Grahame Grieve
Grahame Grieve (May 04 2017 at 17:56):
I still haven't found enough documentation to describe the actual format so I could produce a file. It's weird...
Peter Bernhardt (May 04 2017 at 17:57):
fair nuff... i'll hunt around some more and let you know if i find anything better
Grahame Grieve (May 04 2017 at 17:59):
thanks
Ryan Brush (May 04 2017 at 19:34):
Coming back to this after some offline discussion with @Michelle M Miller, we took a stab at some criteria for what would make sense for bulk sharing, and listed some technology options below along those lines. This isn't comprehensive, and the proposed requirements are certainly subject to discussion and change, but this could help create some structure around our options:
Required criteria
- Supports billions of records
- Efficiently ETL-able into MPP databases or analytic tools, such as Apache Drill, Impala, Presto, Spark, Hive, or others.
- An efficient encoding
- Works with open source tools with liberal licensing for academic and commercial use.
Preferred Criteria
- A self-describing file format, so users need not find an external schema or other metadata to ingest the file. This lets such files be shared without having to pass additional metadata or references along with the file.
- The format is splittable, so multiple threads or processes can ingest a file in parallel rather than waiting on a single thread to go from the top to the bottom of a file. (This can make a big difference when ingesting billions of rows; both properties are illustrated for Parquet in the sketch after this list.)
Given these, the technologies break down as follows:
Meets Required and Preferred:
- Apache Parquet (a columnar format driven by Cloudera and Twitter)
- Apache Avro (a row-wise format with good tooling, probably the most frequently used binary format in the Hadoop ecosystem)
- Apache ORC (a columnar format and Parquet competitor driven by Hortonworks)
Meets Required but not Preferred:
- Apache Thrift (files are not self-describing)
- Google Protocol Buffers (files are not self-describing)
- Messagepack (the format is not easily splittable)
- HDF5 (the format is not easily splittable)
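To make the two preferred criteria concrete for Parquet in particular, a small sketch with pyarrow (the file name is a placeholder): the schema is read from the file's own footer, which is what makes it self-describing, and each row group can be read independently, which is what makes it splittable.

```python
# Sketch: a Parquet file carries its schema in the footer (self-describing)
# and is organized into row groups that can be read independently (splittable).
# Assumes `pip install pyarrow`; "resources.parquet" is a placeholder.
import pyarrow.parquet as pq

pf = pq.ParquetFile("resources.parquet")

# The schema comes from the file itself; no external metadata is needed.
print(pf.schema_arrow)

# Each row group is an independent unit of work for a reader thread or process.
for i in range(pf.num_row_groups):
    row_group = pf.read_row_group(i)
    print(i, row_group.num_rows)
```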
Anand Mohan Tumuluri (May 04 2017 at 22:17):
With Parquet, the code is the best documentation. You could have a look at the Thrift data types along with the README on GitHub. The trivial example below is also a good starting point: https://github.com/apache/parquet-cpp/blob/master/examples/reader-writer.cc
Of course this uses parquet-cpp, which is not always on par with the Java implementation, but it is quite good for understanding.
(Another user of Parquet + Spark in a previous life.)
Grahame Grieve (May 05 2017 at 03:33):
I've never used C++. Code is good documentation for very fine details, but it's hard to use as a general documentation format.
Grahame Grieve (May 05 2017 at 03:34):
Ryan, that's a good analysis. I agree with the required and preferred criteria, except for one additional criterion that should be required from my point of view: has a healthy and growing community.
Mythreyi Solai (Jun 23 2017 at 19:19):
Hi Ryan/Grahame,
I am working on loading a large number of FHIR resources as Apache Avro objects into Apache Spark for some batch processing. I need to build an Avro schema to facilitate the conversion of JSON into Avro. The http://avro4s-ui.landoop.com/ website creates an Avro schema for a specific JSON string, but that schema does not work for a different record of the same resource type. I am looking for a more flexible schema that would accommodate all possible variations of a FHIR resource. Do I need to create this schema manually, or is there an easy way to do it? Any pointers in this regard would be greatly appreciated.
Thanks,
Mythreyi
Mythreyi Solai (Jun 23 2017 at 19:23):
Hi Ryan/Grahame,
I see you have been discussing Apache Avro and Parquet. I have been using FHIR for a while, but I am new to Avro and Spark.
I am working on loading a large number of FHIR resources as Apache Avro objects into Apache Spark for some batch processing. I need to build an Avro schema to facilitate the conversion of JSON into Avro. The http://avro4s-ui.landoop.com/ website creates an Avro schema for a specific JSON string, but that schema does not work for a different record of the same resource type. I am looking for a more flexible schema that would accommodate all possible variations of a FHIR resource. Do I need to create this schema manually, or is there an easy way to do it? Any pointers in this regard would be greatly appreciated.
(I replied to your thread about Parquet with my question but then thought I should send you a private message also, just in case)
Thanks,
Mythreyi
Grahame Grieve (Jun 23 2017 at 22:06):
this is on my todo list but I haven't actually done anything yet
Michelle (Moseman) Miller (Jun 26 2017 at 21:25):
@Grahame Grieve In May, you suggested that you wanted to get a Connectathon track planned for September to have a more in-depth discussion about a bulk data serialization format for analytics. Do you know if that is still planned or not? I think @Ryan Brush is available, but I couldn't find the track details to share with him.
Grahame Grieve (Jun 26 2017 at 21:28):
There hasn't been a lot of interest, so I haven't actually proposed it
Grahame Grieve (Jun 26 2017 at 21:29):
I've tried to solicit some wider interest, but haven't come up with much beyond in-principle interest
Grahame Grieve (Jun 26 2017 at 21:29):
is it worth pursuing?
Radu Craioveanu (Jun 26 2017 at 21:32):
we are exploring FHIR and Avro. Whether it is Avro or Parquet, we would be interested in joining connectathons, etc... We want to put Spark on top of Avro/Parquet on top of FHIR. @Mythreyi Solai works for our team at Fresenius Medical Care North America.
Ryan Brush (Jun 27 2017 at 02:01):
I'm new to this process, but if there is enough interest we'd be happy to work through sharing data along these lines at the connectathon. We can share test datasets of several FHIR resources in Parquet, Avro, or ORC today, and can adapt to others if there is a better direction. I'm assuming this is the best venue to gauge interest, but if there are others, let me know.
Grahame Grieve (Jun 27 2017 at 02:03):
this is the place. I'm interested, though I'm pretty heavily committed to other streams. So we have 2.25 participants now. Traditionally, we want 3+ - but in the end it's up to you guys. If you think that 2.25 is enough, let's do it
David Taylor (Jun 27 2017 at 02:11):
I've been working with the Mitre.org Synthetic Mass data in FHIR format, in bulk, in a JSONL format with Spark SQL - it handles the nested JSON objects, one per line. I'm taking that data, creating flattened dataframes from it, and persisting them in Parquet format. Works well.
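A rough PySpark sketch of that kind of pipeline, assuming placeholder paths and a couple of illustrative Patient fields (this is not David's actual code):

```python
# Sketch: read newline-delimited FHIR JSON with Spark, flatten a few nested
# elements into columns, and persist the result as Parquet.
# Paths and the selected fields are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fhir-jsonl-to-parquet").getOrCreate()

# spark.read.json expects one JSON object per line (JSONL/NDJSON) by default.
patients = spark.read.json("synthea/Patient.ndjson")

# Flatten a couple of nested elements into top-level columns.
flat = patients.select(
    F.col("id"),
    F.col("gender"),
    F.col("birthDate"),
    F.col("name")[0]["family"].alias("family_name"),
)

flat.write.mode("overwrite").parquet("synthea/parquet/patient")
```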
Grahame Grieve (Jun 27 2017 at 06:27):
I don't know if that's an expression of interest in the connectathon stream?
Wayne Kubick (Jun 27 2017 at 20:32):
I had found several people with interest in the Biopharma community, but nobody who can get it together in time for San Diego. More likely for New Orleans. But will pass this on if something gets going.
Grahame Grieve (Sep 15 2017 at 19:29):
@Ryan Brush - there's lots of interest in using NDJson for bulk data. It doesn't meet some of the requirements that we talked about a few months ago, but it's like 100x easier to produce. Have you got any thoughts on this?
Grahame Grieve (Sep 16 2017 at 00:10):
@Chris Grenz you talked about converting NDJSON to Parquet - what library would you recommend for doing this from the command line on Windows?
Ryan Brush (Sep 18 2017 at 18:25):
@Grahame Grieve, I can appreciate the tradeoffs between efficiency and simplicity of implementation for NDJson, particularly if the workload involves relatively infrequent extracts where an N-times performance penalty isn't important. It would be interesting to know what N is in this example to make an informed decision here though. I might put together some simple benchmarks when time allows to give us an idea. Also, would we keep the door open for content negotiating a more efficient format if the client and server offer them?
Josh Mandel (Sep 18 2017 at 18:32):
We'd definitely want to keep the door open for content-type negotiation; the thought is that it's helpful to have a common "must support" baseline format. Our in-person discussions assumed an O(100x) load-time difference between NDJSON and Parquet.
Ryan Brush (Sep 18 2017 at 18:52):
This sounds reasonable to me. Lots of details to work out, and I'll want to better understand the impact on our workloads to determine how quickly we might pursue other formats, but as a broad direction I think this works.
Grahame Grieve (Sep 18 2017 at 21:06):
yes. mandating support for nd-json wouldn't preclude support for parquet by content-negotiation
Chris Grenz (Sep 20 2017 at 14:06):
I've been using Drill directly
Chris Grenz (Sep 20 2017 at 14:13):
Impala also supports parquet creation from anything it can read.
Chris Grenz (Sep 20 2017 at 14:15):
The tricky part is retaining the JSON structure - many tools will only convert CSV or another flat format. Drill does well with the nested structure.
Chris Grenz (Sep 20 2017 at 14:20):
Spark supports both nd-JSON and Parquet, so that's an option: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
Grahame Grieve (Sep 20 2017 at 21:56):
how would I do it using Drill directly?
Chris Grenz (Sep 21 2017 at 21:10):
1. Install Drill (download and unpack tarball) and run at command line: https://drill.apache.org/docs/installing-drill-on-linux-and-mac-os-x/
2. Open browser to http://localhost:8047/storage
3. Update dfs plugin - copy one of the workspaces and update the location as you see fit, set writable:true
4. Go to query and run something like: CREATE TABLE dfs.mywkspc.mypqfiles AS SELECT * FROM dfs.mywkspc.`some.file.json`
Chris Grenz (Sep 21 2017 at 21:13):
If you really want to do this from command line, once the workspace is configured you can run commands at the command line with the sqlline executable and the --run option: https://community.mapr.com/docs/DOC-1570
Chris Grenz (Sep 21 2017 at 21:49):
Note this silly issue: https://issues.apache.org/jira/browse/DRILL-5698
The workaround is to comment out lines 396-399 in the bin/drill-config.sh file
Grahame Grieve (Sep 21 2017 at 21:49):
yay for open source
Chris Grenz (Sep 22 2017 at 21:43):
Deep dive on Parquet for the nerds: https://www.youtube.com/watch?v=MZNjmfx4LMc
Lee Surprenant (Jan 21 2020 at 14:29):
@Gidon Gershinsky and his team have been doing some work with FHIR in Parquet and recently shared this in the bulk export channel: https://chat.fhir.org/#narrow/stream/179250-bulk-data/topic/Parquet.20Bulk.20Data.20format/near/186183104
Thought it might be of interest for folks in this analytics stream as well.
Last updated: Apr 12 2022 at 19:14 UTC