Stream: analytics on FHIR
Topic: FHIR in Apache Spark
Ryan Brush (Nov 27 2017 at 21:42):
I thought this group might be interested in this. We've open-sourced a library that can import FHIR resources as native Apache Spark data structures. The goal is to make it easy to do deep analysis of FHIR data with popular tools at scale. Details are here:
http://engineering.cerner.com/blog/announcing-bunsen-fhir-data-with-apache-spark/
Ryan Brush (Nov 27 2017 at 21:44):
I'm juggling a couple other things today, but will be checking on this stream for any questions, or if anyone wants to discuss the topic further.
Grahame Grieve (Nov 27 2017 at 22:35):
looks good - I'll be giving it a go sometime in the next few weeks
Grahame Grieve (Nov 27 2017 at 22:35):
thanks
Ryan Brush (Nov 27 2017 at 22:46):
@Grahame Grieve Sounds good, and thanks. It's still early, but we plan on growing it based on emerging demands. R support and integration with emerging bulk data sharing specifications are likely early candidates.
Grahame Grieve (Nov 27 2017 at 23:07):
have you seen the R on FHIR library that Furore and I have been working on?
Ryan Brush (Nov 27 2017 at 23:14):
Yes, I read your posts on that. I think we might have complementary audiences here: There are R users looking for a good, straightforward way to tap into FHIR services, and also R users with a "data lake"-style environment looking to do analysis at a very large scale. That said, we'd be interested in harmonizing those efforts around any commonality that we find along the way.
Grahame Grieve (Nov 27 2017 at 23:27):
makes sense
Boris Tyukin (Dec 02 2017 at 15:09):
@Ryan Brush Cerner client here :wave: It is awesome to see Cerner doing these things! Have you thought about normalizing FHIR's nested structures into flat, non-nested tables, so one can use traditional BI tools that do not understand nested structures? I starred Bunsen's GitHub project and will definitely check it out!
Boris Tyukin (Dec 02 2017 at 22:09):
@Ryan Brush my problem with the Spark/Parquet approach is that it is not possible to make it real-time or to do updates/deletes: we get bundles from a third-party vendor, and they might resend the same bundle with changes more than once. Impala and Hive support JSON nicely, but I am thinking of using Kudu and normalizing FHIR resources into flat tables like patient_names, patient_history, etc. Kudu does not support nested structures (yet) but handles updates, deletes, and random lookups nicely, while still providing decent performance for analytical and BI-type queries through Impala. Do you think it is a bad approach? Another option I have is to load the JSON as a value into an Oracle database, but my concern there is scalability and performance.
Ryan Brush (Dec 05 2017 at 01:15):
Thanks, @Boris Tyukin!
To answer your questions: we want to keep the nested structures by default because there isn't a clean way to flatten nested lists without losing data, and quite a few FHIR resources take advantage of them. However, if you want a flattened view, it's simple to write a Spark SQL query that transforms a nested structure into a flattened view appropriate for other tools (see the sketch after this message). We actually do this quite a bit to create easily-consumed reports. Also, it's convenient to pass structures to user-defined functions, like the in_valueset function seen in the link above, so they can work with the full structure.
That all said, we'd be open to adding "flatten" and "expand" functions to this system if there is some agreed-upon convention for handling things like nested lists.
Regarding Parquet: while we use Parquet as our default format simply because it fits Spark well, this system isn't tightly coupled to it. A Kafka stream of these messages, for example, could be used for real-time processing of data. I'm sure we could find lots of permutations for this sort of thing...we just happened to start with an analytic and research workload rather than a streaming one.
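A minimal PySpark sketch of the flattened view described above - assuming an `observations` DataFrame already loaded with a nested FHIR Observation schema (for example via Bunsen); the column and table names here are illustrative, not part of Bunsen's API:

```python
from pyspark.sql.functions import col, explode

# `observations` is assumed to hold nested FHIR Observation rows.
# Explode the repeated code.coding list and project scalar fields
# into a flat table that nested-unaware BI tools can consume.
flat = (observations
    .select(
        col("subject.reference").alias("patient_ref"),
        explode(col("code.coding")).alias("coding"),
        col("valueQuantity.value").alias("value"),
        col("valueQuantity.unit").alias("unit"))
    .select(
        "patient_ref",
        col("coding.system").alias("code_system"),
        col("coding.code").alias("code"),
        "value",
        "unit"))

# Register the flattened view so it is discoverable alongside other tables.
flat.write.saveAsTable("observations_flat")
```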
Boris Tyukin (Dec 05 2017 at 03:25):
@Ryan Brush I like that! I am new to FHIR, and the more I learn about the FHIR models, the more I come to realize that it does not make sense to normalize resources, because of the complexity of nested lists. My other concern is that normalization rules would change from one FHIR version to another, so I totally see your point. I know you use HBase a lot over at Cerner - have you tried storing FHIR resources in HBase? I know it is not good for analytics, but I am curious if you tried that.
As far as Bunsen, any plans to support DSTU 2 soon? I think this is the version currently supported by Cerner Ignite. Our other vendor happened to be on DSTU 2 as well...
Ryan Brush (Dec 05 2017 at 05:39):
@Boris Tyukin We haven't stored FHIR resources themselves in HBase, but we store similarly nested internal structures there frequently, so I'm certain it can be done.
Also, I've logged the first issue against Bunsen, to add DSTU2 support, if you want to track it here: https://github.com/cerner/bunsen/issues/1. We started with DSTU3 since we knew we would need it no matter what, but I think we have a clean approach that we can use for DSTU2 as well.
Boris Tyukin (Dec 09 2017 at 16:18):
@Ryan Brush wanted to try it out with Synthea-generated data sets, but it looks like the library does not work with Spark 2.2, which is what we have. I saw a GitHub issue already opened for that. For now, I just ended up using Spark's native JSON processing, but I think I understand now why you guys built Bunsen - it takes too many lines of code to access FHIR data and model it properly using the FHIR schema. It was not clear to me at first why you built it.
Josh Mandel (Dec 10 2017 at 04:15):
@Boris Tyukin thanks for the report. Could you perhaps share some brief examples of the kinds of processing you're doing natively that's verbose/painful? It'd be nice to have some concrete examples to talk about.
Boris Tyukin (Dec 10 2017 at 19:08):
@Josh Mandel All I wanted to do was take a bunch of FHIR bundles from our vendor and load them into Hive/Impala, using Parquet as the file format. Every bundle has 5-10 different resource types, and depending on the use case (the vendor supports a few around specific medical problems), resources can have different elements even for the same resource type.
When I tried to load bundles as-is into Spark, Spark would infer a schema from the JSON, but that schema would include the elements of ALL resource types, so elements from a Patient resource and an Observation resource would get mixed together. I solved that by first parsing bundles by resource type and then storing those resources on HDFS separately (Patient Parquet files, Observation Parquet files, etc.). This requires an extra pre-processing step, which is not optimal, but I do not see a way around it (a sketch of this step follows this message).
The next challenge was doing SQL - even basic joins (e.g. joining patient data with observation and encounter data) require a good understanding of FHIR and the way identifiers work.
But the main problem is that both of our mainstream BI tools (Qlik and Oracle BI) do not quite work with nested data, so I am now thinking of mapping some of the data stored in nested Parquet files to flat columnar tables, which is quite a task. I wanted to come up with a generic approach for this, but realized it would not make much sense given the highly nested structures, the various versions of FHIR, and the fact that many FHIR resource elements are optional. At the same time, I do not expect our analysts to dig into the FHIR specs every time they want to run simple queries against FHIR data - it is just too time-consuming unless you deal with FHIR every day :)
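A minimal sketch of the pre-processing step Boris describes - splitting bundles by resource type so Spark infers a clean per-type schema. It assumes one JSON bundle per file; the HDFS paths and resource list are illustrative:

```python
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def split_bundle(bundle_text):
    """Yield (resourceType, resource-as-JSON) for each entry in a bundle."""
    bundle = json.loads(bundle_text)
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        yield resource.get("resourceType"), json.dumps(resource)

# One bundle per file; wholeTextFiles yields (path, content) pairs.
resources = (spark.sparkContext
             .wholeTextFiles("hdfs:///fhir/incoming/bundles")
             .flatMap(lambda pair: split_bundle(pair[1])))

# Writing each resource type to its own dataset avoids a schema that
# mixes elements from every resource type in the bundle.
for rt in ["Patient", "Observation", "Encounter"]:
    docs = (resources
            .filter(lambda kv, rt=rt: kv[0] == rt)
            .map(lambda kv: kv[1]))
    spark.read.json(docs).write.mode("overwrite") \
        .parquet("hdfs:///fhir/parsed/%s" % rt.lower())
```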
Boris Tyukin (Dec 10 2017 at 19:18):
@Josh Mandel just noticed you have this project on GitHub: https://github.com/jmandel/synthea-to-bigquery - interesting, I will check it out later. Looks like you also ended up parsing resources out of bundles, and I see quite a lot of code to generate schemas for resources. I was looking for a tool or library to generate Avro or Parquet schemas for FHIR resources based on profiles, but could not find one.
Boris Tyukin (Mar 30 2018 at 19:32):
@Ryan Brush Where/how do you guys store FHIR bundles? In my case, this can grow to hundreds of millions of FHIR bundles, and I do not want to store them as individual JSON files on HDFS. I am considering a document-oriented NoSQL database like MongoDB, but would prefer to stick with just Hive/Spark, since most of the analytics on these bundles will be ad hoc.
Ryan Brush (Apr 05 2018 at 18:56):
@Boris Tyukin Hi, my apologies, somehow I missed this message earlier. For our workload we're not natively storing the FHIR bundles, but are generating them as needed from an internal data model in our Spark jobs...and then storing the resulting datasets as Parquet files via Bunsen for analysis.
I have not really used MongoDB, but if your primary access pattern is Spark, then I'd recommend just storing them as rows in a Hive table. This will be the most efficient way to load them in the Hadoop ecosystem, and generally I like to try to keep everything we possibly can as a registered Hive table since it's easy to discover datasets and manage them from one place. Of course, this all depends on your access patterns, so your mileage may vary...
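A minimal sketch of the "bundles as rows in a Hive table" approach Ryan suggests - the HDFS path and table name are made up for illustration:

```python
from pyspark.sql import Row, SparkSession

spark = (SparkSession.builder
         .enableHiveSupport()
         .getOrCreate())

# One row per bundle: keep the raw JSON alongside its source path.
rows = (spark.sparkContext
        .wholeTextFiles("hdfs:///fhir/incoming/bundles")
        .map(lambda pair: Row(source=pair[0], bundle_json=pair[1])))

bundles = spark.createDataFrame(rows)

# A registered Hive table keeps the raw bundles discoverable and
# queryable from Spark SQL without running a separate document store.
bundles.write.mode("append").saveAsTable("raw_fhir_bundles")
```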
Boris Tyukin (Apr 05 2018 at 22:10):
Hi @Ryan Brush, this is my thought exactly. With your recent commit to Bunsen, it is now possible. I do not really want to manage another database on our cluster, and so far the Hive/Spark/Parquet/Bunsen combo has worked really well. I am hoping to blog about it soon, if you do not mind.
BTW, I have also looked at Apache Drill, but with Drill one has to dump individual JSON files into an HDFS folder - in my case that is not going to work, as we might end up with hundreds of millions of bundles once we are in production.
Ryan Brush (Apr 05 2018 at 23:54):
@Boris Tyukin Yes, agreed. And I would love to see a blog post on this topic. Please send it my way when it is up!
Christiaan Knaap (Apr 06 2018 at 18:30):
I think @Chris Grenz has experience with Apache Drill with lots of resources.
Chris Grenz (Apr 08 2018 at 18:07):
Indeed I do. Accessing resources in bundles with Drill is possible, but schema inconsistencies between resources can make it tricky at best. It's probably adequate for exploration.
Chris Grenz (Apr 08 2018 at 18:09):
One nice thing about Drill is the ability to create Parquet files from a query. A strategy that may work well is to leave the FHIR bundles as-is and use Drill to extract the subset you're interested in into purpose-built Parquet files. An added benefit: they are much more compact than the JSON, even gzipped, and queries against them perform an order of magnitude better.
Grahame Grieve (Apr 09 2018 at 02:53):
aren't bundles unnecessary? They're just a packaging mechanism. Wouldn't you be better off stripping that out?
Chris Grenz (Apr 10 2018 at 21:28):
They are - the value is in accessing the original data as-is, without creating a "stripped out" copy.