Stream: analytics on FHIR
Topic: Common Query Language
Grahame Grieve (Jul 06 2018 at 02:15):
hey all. There was some short discussion at FHIR DevDays about coming up with a standard query language for analytics - e.g. if you have a common data store of JSON resources, you could share queries.
is anyone interested in this?
Kevin Weaver (Jul 06 2018 at 11:05):
I'm interested. In what ways would this differ from Clinical Quality Language, if at all?
Grahame Grieve (Jul 06 2018 at 11:20):
very different... @Ryan Brush might want to comment further
Kevin Weaver (Jul 06 2018 at 11:33):
Something more akin to writing SQL queries against a JSON document store perhaps? Sort of like Cerner's Bunsen, or the kind of thing you could do with Amazon Athena or Redshift using their semi-structured data querying capabilities?
Grahame Grieve (Jul 06 2018 at 11:47):
yes, Bunsen is one of the projects in question
Alexander Zautke (Jul 09 2018 at 06:23):
I would be interested as well. Do you already have some use cases in mind for sharing the queries?
Chris Grenz (Jul 09 2018 at 14:22):
Bunsen looks interesting. I like solutions that stick very closely to the original format. My concern with the protobuf implementation is that inevitably it must make decisions about format conversion whereas other similar engines simply read the original JSON. Looks like Bunsen is the latter. Apache Drill over nd-json is another.
Aslan Brooke (Jul 09 2018 at 19:23):
@Grahame Grieve I’m interested as well. I’ve planned to look more closely at Bunsen (w/ AWS EMR/Redshift) and the Protobuf (w/ Google BigQuery) implementation. This is the first time I’ve heard of Apache Drill over nd-json, I’ll be sure to take a look at that too.
Ryan Brush (Jul 09 2018 at 19:30):
Hey everyone, author of Bunsen here. I'd be happy to answer any Bunsen-related questions or discuss these SQL-over-FHIR options more broadly. For those who weren't at the FHIR DevDays talk, much of the content is now a blog post at https://engineering.cerner.com/blog/data-engineering-with-bunsen/, and the examples I showed at DevDays are here: https://github.com/cerner/bunsen-tutorial/blob/master/data_engineering_tutorial.ipynb
Ryan Brush (Jul 09 2018 at 19:35):
While Bunsen itself is fully open source, I think we can broaden this approach by working towards common standards for representing FHIR in SQL-queryable structures. Ideally we could take the same structure-aware SQL queries over large FHIR datasets and run them on many different query engines, making our work portable. If this has broad interest it might be worth formalizing the effort. (After chatting informally with Eyal and Patrik from Google, I think they are open to the idea.)
Kenny Blanchette (Jul 10 2018 at 13:48):
I'm also interested. To @Alexander Zautke's question above, do we have an initial set of use cases?
Ryan Brush (Jul 10 2018 at 20:42):
Our use cases center on leveraging the excellent work done in defining and curating the FHIR data models in a "data lake"-style environment. So we might query billions of observations and hundreds of millions of conditions to identify people at risk of some given complication, for instance. Another example is using FHIR as a basis for machine learning workloads...by making it natively usable in a data lake, we can directly apply ML techniques to it. This makes predictive models easier to implement, since they are already expressed in terms of the FHIR standard.
We happen to use Apache Spark (and its Spark SQL query language) for this type of analysis, so the Bunsen project's goal is just to make very large collections of FHIR records easily used in Spark. However, this approach shouldn't be unique to any specific technology....there are several excellent analytic query engines that can handle nested structures like FHIR resources.
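As a rough illustration of the use case Ryan describes (combining conditions and observations to find at-risk people), here is a minimal sketch in plain Python rather than Spark. The sample resources, the choice of codes (SNOMED CT 44054006 for type 2 diabetes, LOINC 4548-4 for HbA1c), the 8.0 threshold, and the function names are all assumptions made for the example and do not come from Bunsen.

```python
import json

# Hypothetical newline-delimited JSON: one FHIR resource per line (made-up data).
NDJSON = "\n".join([
    json.dumps({"resourceType": "Condition", "subject": {"reference": "Patient/1"},
                "code": {"coding": [{"system": "http://snomed.info/sct", "code": "44054006"}]}}),
    json.dumps({"resourceType": "Observation", "subject": {"reference": "Patient/1"},
                "code": {"coding": [{"system": "http://loinc.org", "code": "4548-4"}]},
                "valueQuantity": {"value": 9.1, "unit": "%"}}),
    json.dumps({"resourceType": "Observation", "subject": {"reference": "Patient/2"},
                "code": {"coding": [{"system": "http://loinc.org", "code": "4548-4"}]},
                "valueQuantity": {"value": 5.4, "unit": "%"}}),
])


def has_code(resource, system, code):
    """True if any coding in resource.code matches the given system and code."""
    codings = resource.get("code", {}).get("coding", [])
    return any(c.get("system") == system and c.get("code") == code for c in codings)


def at_risk_patients(ndjson, threshold=8.0):
    """Patients with the condition AND at least one observation above the threshold."""
    with_condition, with_high_value = set(), set()
    for line in ndjson.splitlines():
        resource = json.loads(line)
        subject = resource.get("subject", {}).get("reference")
        kind = resource.get("resourceType")
        if kind == "Condition" and has_code(resource, "http://snomed.info/sct", "44054006"):
            with_condition.add(subject)
        elif kind == "Observation" and has_code(resource, "http://loinc.org", "4548-4"):
            if resource.get("valueQuantity", {}).get("value", 0) > threshold:
                with_high_value.add(subject)
    return sorted(with_condition & with_high_value)
```

At data-lake scale the same logic would be expressed as a join in a query engine; the loop above only shows the shape of the question being asked.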
nicola (RIO/SS) (Jul 12 2018 at 15:13):
I think it can be an extension of SQL, because SQL is now the lingua franca for analytics and is supported by many databases, and more and more of them have started to support JSON. We can use the SQL standard as a basis and design an embedded language to address JSON fields in resources, and then transpile it to different dialects (Postgres, BigQuery, etc.)
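To make nicola's transpiling idea concrete, here is a minimal sketch that turns one dotted path into a JSON-access expression for two dialects. The dotted path syntax, the `resource` column name, and the function names are hypothetical; only the target expressions (PostgreSQL's `#>>` operator and BigQuery's `JSON_EXTRACT_SCALAR`) are existing dialect features.

```python
def parse_path(expr):
    """Split a dotted path: 'name.0.family' -> ['name', 0, 'family']."""
    return [int(part) if part.isdigit() else part for part in expr.split(".")]


def to_postgres(column, path):
    # PostgreSQL: `col #>> '{a,0,b}'` extracts the value at a path as text.
    return f"{column} #>> '{{{','.join(str(p) for p in path)}}}'"


def to_bigquery(column, path):
    # BigQuery: JSON_EXTRACT_SCALAR(col, '$.a[0].b') extracts a scalar as a string.
    json_path = "$"
    for part in path:
        json_path += f"[{part}]" if isinstance(part, int) else f".{part}"
    return f"JSON_EXTRACT_SCALAR({column}, '{json_path}')"


DIALECTS = {"postgres": to_postgres, "bigquery": to_bigquery}


def transpile(expr, dialect, column="resource"):
    """Translate a dotted path into a JSON-access expression for one dialect."""
    return DIALECTS[dialect](column, parse_path(expr))
```

For example, `transpile("name.0.family", "postgres")` yields `resource #>> '{name,0,family}'`. A real design would also need to cover filters, arrays, and type casts, which this sketch ignores.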
Ryan Brush (Jul 12 2018 at 20:14):
For starters, I had something much less ambitious than a transpiled SQL dialect, although that would be interesting. There are multiple instances today where large stores of FHIR resources are being directly queried by analytic tools, such as Google BigQuery (as seen in these examples: https://github.com/google/fhir/tree/master/examples/bigquery) and Apache Spark (as seen in these examples: https://github.com/cerner/bunsen-tutorial/blob/master/data_engineering_tutorial.ipynb). I know Apache Drill is being used for similar workloads as well. The thing is, these all use a slightly different approach to representing the FHIR resources in SQL schemas (with nested structures), meaning queries aren't portable between these systems. A common way to represent FHIR in SQL schemas -- possibly combined with some high-value SQL UDFs like "in_valueset" -- would make it much easier to share and migrate large-scale analysis of FHIR data.
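The "in_valueset" UDF idea can be sketched with SQLite, which ships with Python and allows registering Python functions as SQL functions. This is only an illustration: the three-argument signature, the flattened `observation` table, and the value set contents are assumptions for the example and are not Bunsen's or BigQuery's actual schema or API.

```python
import sqlite3

# Illustrative value set contents (not a curated value set): (system, code) pairs.
VALUESETS = {
    "hba1c": {("http://loinc.org", "4548-4"), ("http://loinc.org", "17856-6")},
}


def in_valueset(system, code, valueset_name):
    """Return 1 if (system, code) belongs to the named value set, else 0."""
    return int((system, code) in VALUESETS.get(valueset_name, set()))


def run_demo():
    conn = sqlite3.connect(":memory:")
    # Register the Python function as a 3-argument SQL function.
    conn.create_function("in_valueset", 3, in_valueset)
    # Hypothetical flattened table; real proposals keep nested structures.
    conn.execute(
        "CREATE TABLE observation (id TEXT, code_system TEXT, code TEXT, value REAL)"
    )
    conn.executemany(
        "INSERT INTO observation VALUES (?, ?, ?, ?)",
        [
            ("obs1", "http://loinc.org", "4548-4", 9.1),
            ("obs2", "http://loinc.org", "8480-6", 120.0),
            ("obs3", "http://loinc.org", "17856-6", 6.2),
        ],
    )
    rows = conn.execute(
        "SELECT id, value FROM observation "
        "WHERE in_valueset(code_system, code, 'hba1c') ORDER BY id"
    ).fetchall()
    conn.close()
    return rows
```

If every engine exposed a function with the same name and semantics over the same schema, the `SELECT` above is the part that could be shared between systems unchanged.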
Ryan Brush (Jul 12 2018 at 20:19):
The above could also be a foundation for richer functionality, like the transpiler idea over resources. I'm not sure we can fully abstract away the query engine itself for many workloads, since a lot of involved data analysis leverages (for example) the creation of temporary views of large complex datasets that are cached for a user's session. But I think we can 1.) make SQL queries over large FHIR datasets largely portable and 2.) make it easy to move between query engines and representations (like Protocol Buffers), so users can easily apply whatever tool makes the most sense for their analysis.
Kenny Blanchette (Aug 03 2018 at 12:44):
@Grahame Grieve Are we planning to start a work group around this topic? Or is the next step a bit more informal - potentially a topic at the upcoming Connectathon in Sept?
Grahame Grieve (Aug 03 2018 at 13:01):
I'm not sure. Any further discussions @Eyal Oren and @Ryan Brush ?
Ryan Brush (Aug 03 2018 at 15:11):
I am interested in moving forward with this, even if it is informal to start. I'll reach out to @Eyal Oren as well...I've had a few conversations with him and @Patrik Sundberg, so I think they're interested too.
Patrik Sundberg (Aug 06 2018 at 21:14):
yes, we should move forward. reasonable next steps include getting a group together to highlight potentially tricky areas (choice types, contained resources, valuesets, extensions, etc) and see if we can come to a consensus. i'll be at the connectathon in sept and i believe @Ryan Brush will be as well.
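To show why choice types are one of the tricky areas Patrik lists: a FHIR element such as `value[x]` appears in JSON under a different key per type (`valueQuantity`, `valueString`, ...), which does not map onto a single SQL column. One possible approach, sketched below, folds the variants into a single struct with one field per type. The function name and the short list of types are assumptions for the example, and this is not presented as the representation any of the projects in the thread settled on.

```python
# Subset of the types a value[x] element may take; illustrative, not exhaustive.
CHOICE_TYPES = ("Quantity", "String", "Boolean", "CodeableConcept", "DateTime")


def normalize_choice(resource, element="value"):
    """Fold keys like valueQuantity / valueString into one nested `value` struct."""
    out, choice = {}, {}
    for key, val in resource.items():
        suffix = key[len(element):]
        if key.startswith(element) and suffix in CHOICE_TYPES:
            # 'Quantity' -> 'quantity', 'CodeableConcept' -> 'codeableConcept'
            choice[suffix[0].lower() + suffix[1:]] = val
        else:
            out[key] = val
    if choice:
        out[element] = choice
    return out
```

With this shape a query can address `value.quantity.value` on every row, and rows that carry a different type simply leave that field empty.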
nicola (RIO/SS) (Aug 07 2018 at 19:11):
@Kenny Blanchette, @Ryan Brush and @Patrik Sundberg - maybe reuse our new Storage and Analytics track for that?
nicola (RIO/SS) (Aug 07 2018 at 19:19):
Here is the repo - https://github.com/fhir-fuel/fhir-storage-and-analytics-track
nicola (RIO/SS) (Aug 07 2018 at 19:19):
We can start discussion in github issues and define scope for next connectathon
nicola (RIO/SS) (Aug 07 2018 at 19:27):
And here are our initial ideas for a storage format for databases - https://github.com/fhir-fuel/fhir-fuel.github.io/issues
Ryan Brush (Aug 07 2018 at 20:13):
@nicola (RIO/SS) These topics certainly seem to align, so using the same track makes sense to me.
Ryan Brush (Aug 07 2018 at 20:19):
As an aside, here are some initial notes we had on this topic: https://docs.google.com/document/d/1IPyI-4GYlF400vmZvrtjDYL6jauHdjeLfjGJMo6yWtA/edit?usp=sharing
I think a number of us are converging from different directions and use cases. So far I've emphasized a desire for simple and portable SQL over FHIR data models across a number of systems in the above link, but I think we can achieve that along with other objectives -- like ease of integration into machine learning tools.
nicola (RIO/SS) (Aug 07 2018 at 20:37):
Cool, I see your sections are very similar to our issues - maybe merge them?
Ryan Brush (Aug 07 2018 at 20:48):
I'm open to that. I have to confess I'm somewhat new to the process here. Are we far enough along to formally create a Connectathon track? There are a number of specific questions in that document (and that @Patrik Sundberg mentioned above) that we'll need to drill into. Are GitHub issues the best place to track those?
Also, just an FYI: I'm in the middle of a long-distance relocation this week, so I'll have limited connectivity for the next several days, but should be in good shape starting next week.
nicola (RIO/SS) (Aug 07 2018 at 21:03):
It's supposed to be self-organizing :) I like discussion in issues more than in chat or a document. @Patrik Sundberg ? For the track we have to create an announcement and a plan for participants. Potentially we can start weekly or biweekly calls to work on this track starting next week?
David Hay (Aug 07 2018 at 21:16):
In theory, the tracks were supposed to be proposed a few weeks back :) . In practice, it's not always as neat as that (but as soon as possible would be appreciated)! Here's the place to add the proposal: http://wiki.hl7.org/index.php?title=Category:201809_FHIR_Connectathon_Track_Proposals . There's a template that can be used...
nicola (RIO/SS) (Aug 08 2018 at 07:48):
@David Hay - I've just created a proposal for the next connectathon - http://wiki.hl7.org/index.php?title=201809_FHIR_Storage_and_Analytics; @Ryan Brush & @Patrik Sundberg please review and fix it.
David Hay (Aug 08 2018 at 22:08):
Will add to the connectathon wiki page (actually hopefully @Sandra Vance will!). We'll review at the FMG meeting next Wednesday (US time) so would be great if any edits were complete by then...
Grigory Nokhrin (Aug 09 2018 at 14:16):
@David Hay Hello. I’m going to work on this page with Nicola. Does this mean that we will receive feedback on Wednesday 15/08 or even later?
David Hay (Aug 09 2018 at 18:09):
Well, that will be more of a QA review rather than a comprehensive check - we check for basic things like a track lead, description and scenarios. Would be best if you have all that stuff entered ahead of then...
nicola (RIO/SS) (Aug 15 2018 at 13:40):
@Kenny Blanchette asked earlier: "@Grahame Grieve Are we planning to start a work group around this topic? Or is the next step a bit more informal - potentially a topic at the upcoming Connectathon in Sept?"
@Grahame Grieve proposed that we create a FHIR Foundation Project. It would be nice to have weekly calls/meetings, start working on the track for the next connectathon, and possibly produce some artefacts :). Should we move this discussion into a separate topic?
nicola (RIO/SS) (Aug 15 2018 at 13:41):
@Grahame Grieve can you tell us more - what does a FHIR Foundation Project mean?
Grahame Grieve (Aug 16 2018 at 00:15):
I'm not sure that this work sits nicely in HL7. And there'd be an overhead for HL7. The Foundation would have lower overhead, and is a better home for code etc
Grahame Grieve (Aug 16 2018 at 00:15):
who else is interested?
Patrik Sundberg (Aug 16 2018 at 06:11):
I'm definitely interested though I'm not sure what exactly i'm signing up for :).
As for the connectathon, I feel pretty confident we'll have code on github.com/google/fhir by then for people to play with. It would be nice to have a track where people bring code to transform/load sample resources, and we can all play with the results to get a feel for the strengths and weaknesses of various approaches.
nicola (RIO/SS) (Aug 16 2018 at 12:10):
@Patrik Sundberg we want to do the same with PostgreSQL and fhirbase. So people will be able to compare BigQuery and PostgreSQL :). David Hay asked us to formulate scenarios for the track! Maybe @Ryan Brush can do the same with Spark and Bunsen. Then we can discuss a common subset of SQL for FHIR.
Corey Pauley (Oct 16 2018 at 14:29):
Has anybody ever used Elasticsearch to store FHIR resources? It's a NoSQL database that stores documents as JSON objects and supports nested structures very well. The JSON query language has a steep learning curve, but the combination of Lucene black magic, inverted indices, and consistency sacrifices makes for unbelievably fast queries and aggregations over massive amounts of data.
Corey Pauley (Oct 16 2018 at 14:30):
Just curious if that route has been explored by anybody.
nicola (RIO/SS) (Oct 16 2018 at 14:35):
We played with Elastic - it works well for simple cases. But you need to generate a schema for FHIR data, and it is tricky to make it work for more complex queries, which require joining.
Christiaan Knaap (Oct 17 2018 at 09:56):
HAPI JPA Server uses Lucene and optionally ElasticSearch for its full text search.
Last updated: Apr 12 2022 at 19:14 UTC