Stream: implementers
Topic: Bulk Serialization
Michelle (Moseman) Miller (Apr 03 2017 at 21:44):
For bulk serialization (to support research use cases), has there been any consideration or talk of FHIR supporting parquet to maximize efficiency?
Grahame Grieve (Apr 03 2017 at 22:14):
parquet?
Joel Schneider (Apr 03 2017 at 22:17):
Presumably https://parquet.apache.org/
Grahame Grieve (Apr 03 2017 at 22:19):
given that FHIR is based on tree structures with internal cardinality > 1, tables are problematic
Joel Schneider (Apr 03 2017 at 22:38):
A parquet schema can specify "repeated" fields, which could be a straightforward way to handle cardinality > 1.
Grahame Grieve (Apr 03 2017 at 22:39):
nested repeated?
Joel Schneider (Apr 03 2017 at 22:41):
I'm not an expert, but the doc says, "Parquet is built from the ground up with complex nested data structures in mind ..." https://parquet.apache.org/documentation/latest/
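To make "nested repeated" concrete, here is a minimal sketch of the kind of structure involved, expressed with the pyarrow Python library; the Patient-like field names are illustrative assumptions, not the actual schema discussed later in this thread.

```python
import pyarrow as pa

# Hypothetical, simplified shape of a FHIR Patient: 'name' repeats
# (cardinality > 1) and itself contains a repeated primitive 'given'.
human_name = pa.struct([
    ("family", pa.string()),
    ("given", pa.list_(pa.string())),  # repeated primitive nested in a group
])

schema = pa.schema([
    ("id", pa.string()),
    ("name", pa.list_(human_name)),  # repeated complex element
])

print(schema)  # written to Parquet, this maps onto nested repeated groups
```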
Grahame Grieve (Apr 03 2017 at 22:45):
hmm. well, it would be interesting to investigate. As would the R data frame format
Michelle (Moseman) Miller (Apr 04 2017 at 01:14):
@Grahame Grieve How can we help with the investigation process? We have an internal proof of concept of sorts where we are using Parquet files with (Apache) Spark. It is based on FHIR resource schema. Would it help if I emailed you a couple example Parquet files?
Grahame Grieve (Apr 04 2017 at 03:07):
it would help if they were publicly available. let me look at them and then propose a way to engage the community
Geoff Low (Apr 04 2017 at 17:01):
Interested to hear what particular research use cases are being tracked: is this relating to drug trial usage, or more population health research?
Ryan Brush (Apr 04 2017 at 20:26):
Hi, I'm a colleague of Michelle's and have some context behind this question. We've found Parquet to be interesting because:
1.) It is flexible enough to encode FHIR resources with arbitrarily nested or repeated elements. I've attached examples of a (tiny) Parquet file of Observation resources and the Parquet schema it uses. (The schema itself was generated from resource definitions, and the Parquet file was generated by code that encodes FHIR resources using the HAPI FHIR API; see the sketch after this message. This is proof-of-concept work, so I can't rule out bugs yet, but we will be able to share/open source/contribute it if that would be of interest.)
2.) Parquet is an efficient binary format with significant mind share in many analytic tools, and can be easily and efficiently loaded into them and queried.
Our specific use case is centered around deep analytics of population health data. The ability to share very large FHIR datasets that can be imported and queried supports that, and surely a number of other needs as well.
Side note: Since Parquet is a binary format, users will need something like parquet-tools (linked below) to inspect it directly, or import it into a system that understands the format.
In any case, I'm happy to engage on this topic in any way that could help.
[1]
https://github.com/apache/parquet-mr/tree/master/parquet-tools
EDIT: I might be having trouble with the attachment, but here is a link to the Parquet schema generated from the resource definition: https://gist.github.com/rbrush/af4b8f7bed8ec3cd5fc34419325c4e86
obs_example_parquet_schema.txt
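As a rough illustration of the pipeline described above (not the actual proof-of-concept code), here is a minimal PySpark sketch that writes FHIR resources to Parquet. The file names are hypothetical, and it lets Spark infer the schema from newline-delimited JSON rather than generating it from the FHIR resource definitions as the proof of concept does.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fhir-to-parquet").getOrCreate()

# Hypothetical input: newline-delimited JSON, one Observation resource per line.
# The schema is inferred from the data here; the proof of concept instead
# derives it from the FHIR resource definitions.
observations = spark.read.json("observations.ndjson")

observations.write.parquet("observations.parquet")
```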
Grahame Grieve (Apr 06 2017 at 00:26):
thanks. Can you comment on how the Parquet format can be used? How many analytical tools support it? What about R and the related AI community?
Joel Schneider (Apr 06 2017 at 20:44):
Parquet is commonly used in conjunction with Spark (spark.apache.org) and/or Hadoop. That ecosystem also includes Spark/R (http://spark.apache.org/docs/latest/sparkr.html).
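For example, a minimal PySpark sketch of loading and querying Parquet-encoded resources might look like this (the path and column name are assumptions; SparkR exposes equivalent read.parquet and sql operations from R):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fhir-parquet-query").getOrCreate()

# Load Parquet-encoded Observation resources (hypothetical path) and
# run an ad hoc aggregation with Spark SQL.
obs = spark.read.parquet("observations.parquet")
obs.createOrReplaceTempView("observation")
spark.sql("SELECT status, COUNT(*) AS n FROM observation GROUP BY status").show()
```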
Grahame Grieve (Apr 06 2017 at 21:53):
I can't find an assigned MIME type for the Parquet format?
Joel Schneider (Apr 06 2017 at 23:01):
I couldn't find an assigned MIME type for the Parquet format either, although the related Apache Thrift format does have assigned MIME types. For FHIR, maybe application/fhir+parquet would make sense.
Ryan Brush (Apr 07 2017 at 01:25):
As for how Parquet is used: it can be read in R and converted to data frames via Apache Spark, but it is also natively supported by a number of open source tools (Apache projects like Spark and Impala, plus Presto, as well as Java, Python, and C++ libraries) and several scalable database systems (like Amazon Athena, Google BigQuery, Microsoft PolyBase, and HP Vertica). I'm sure I'm missing others.
Our goal is to share large datasets (many terabytes) with users looking to do deep analysis in a standard format that is easily plugged into a variety of tools, and FHIR-in-Parquet makes for a good combination of standards, scalability, and mindshare across systems.
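As one concrete instance of the native Python support mentioned above, the pyarrow library can read a Parquet file directly into a pandas DataFrame with no cluster involved (file name hypothetical):

```python
import pyarrow.parquet as pq

# Read a Parquet file locally and convert it to a pandas DataFrame
# for interactive analysis (requires pandas to be installed).
table = pq.read_table("observations.parquet")
df = table.to_pandas()
print(df.head())
```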
Grahame Grieve (Apr 07 2017 at 02:19):
it's certainly a problem that we're interested in