FHIR Chat · OMOP · analytics on FHIR

Stream: analytics on FHIR

Topic: OMOP


view this post on Zulip Grahame Grieve (Feb 18 2020 at 07:22):

Hey, a few people have asked me about OMOP / OHDSI recently. Does anyone here have any opinions about when you would use OMOP and when it would be better to use a database of FHIR resources?

view this post on Zulip Alexander Zautke (Feb 18 2020 at 13:19):

@Georg Fette

view this post on Zulip Paul Church (Feb 18 2020 at 15:29):

I did some work converting MIMIC3 to OMOP a while back. My impression was that the biggest advantages of OMOP are the extensive suite of analytical tools (Atlas) and the federated query network for research (Arachne). Interoperability lives or dies on its network effects.

My observation 2 years ago was that when I went to a research event almost everyone was using OMOP and had never heard of FHIR, and when I went to an industry event it was exactly the reverse. But that has evolved a bit since then.

view this post on Zulip Jean Duteau (Feb 18 2020 at 16:56):

In many cases, systems already have an OMOP network of data and are putting a FHIR facade in front of it. I've been on two projects (one right now) where that is exactly the situation. All of the providers are already putting their data into an OMOP repository so we have a ready source of medical information. Converting that data into FHIR resources and having the providers change their interface to talk FHIR is probably a non-starter. Instead, we are putting our analytic apps on top of the OMOP database and providing a FHIR way of interfacing with it.

view this post on Zulip Georg Fette (Feb 19 2020 at 07:57):

For analytics I would currently prefer OMOP over FHIR because OMOP comes with a widely used query language (SQL) that most analysts already know. As long as there are no commercial vendors of CQL engines to ease the combined task of accessing FHIR data and directly performing analytics operations, access to FHIR has to be done using the FHIR search API. This requires the analyst to first export the FHIR data and then analyze it with another programming language (R, C#, Java). For rapid-prototyping research projects this is an obstacle. Perhaps this combined task is possible with GraphQL or FHIRPath (with which I do not have any experience yet).
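[Editor's note: the export-then-analyze workflow Georg describes can be sketched as follows. This is a minimal, hypothetical example in Python; in practice the Bundle would come from a paged search call such as `GET [base]/Observation?code=...`, and the flattened rows would feed a dataframe library.]

```python
def flatten_observations(bundle: dict) -> list[dict]:
    """Flatten a FHIR searchset Bundle of Observations into analysis-ready rows."""
    rows = []
    for entry in bundle.get("entry", []):
        obs = entry["resource"]
        coding = (obs.get("code", {}).get("coding") or [{}])[0]
        vq = obs.get("valueQuantity", {})
        rows.append({
            "patient": obs.get("subject", {}).get("reference"),
            "code": coding.get("code"),
            "system": coding.get("system"),
            "value": vq.get("value"),
            "unit": vq.get("unit"),
            "effective": obs.get("effectiveDateTime"),
        })
    return rows

# In practice this bundle would come from the FHIR search API (and be paged).
bundle = {
    "resourceType": "Bundle", "type": "searchset",
    "entry": [{"resource": {
        "resourceType": "Observation",
        "subject": {"reference": "Patient/1"},
        "code": {"coding": [{"system": "http://loinc.org",
                             "code": "718-7"}]},  # LOINC: hemoglobin
        "valueQuantity": {"value": 13.2, "unit": "g/dL"},
        "effectiveDateTime": "2020-01-15",
    }}],
}
rows = flatten_observations(bundle)
```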

view this post on Zulip Frank Oemig (Feb 19 2020 at 08:30):

But there are hard requirements imposed by the underlying relational database structure.

view this post on Zulip Georg Fette (Feb 19 2020 at 09:00):

that's true

view this post on Zulip Grahame Grieve (Feb 19 2020 at 11:05):

@Georg Fette one of the goals of the work on this thread is to define a common SQL schema for FHIR resources. Should that come to fruition and be successful, that would change your analysis, no?

view this post on Zulip Viet Nguyen (Feb 19 2020 at 18:08):

@Jon Duke - I know you and your folks at Georgia Tech have worked on OMOP to FHIR mappings and may have an opinion on this thread. For others, GaTech has a very active FHIR program including a graduate course in Health Informatics that teaches FHIR and in which students develop FHIR applications based on ideas submitted and led by mentors.

view this post on Zulip Frank Oemig (Feb 20 2020 at 07:14):

The SQL schema is given. Therefore the bidirectional mapping has to bridge those hard constraints, either by adding complexity to the mapping, or by mapping to appropriately restricted FHIR profiles.

view this post on Zulip Lee Surprenant (Feb 24 2020 at 19:15):

One of the things that has confused me about FHIR and OMOP is that, if FHIR is more transactional in nature (OLTP) and OMOP more analytical (OLAP), why are all the projects (like https://build.fhir.org/ig/HL7/cdmh/profiles.html) mapping OMOP-to-FHIR? I would have expected FHIR-to-OMOP mappings and some sort of ETL process for populating OMOP tables from FHIR data. Does anything like that exist?
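[Editor's note: a FHIR-to-OMOP ETL of the kind Lee asks about boils down, per resource type, to a mapping like the following sketch. The column names follow the OMOP CDM's CONDITION_OCCURRENCE table, but `lookup_standard_concept` is a stub for the vocabulary join a real ETL would perform against the OHDSI CONCEPT tables.]

```python
def lookup_standard_concept(coding: dict) -> int:
    # Stub: a real ETL would join the source code against the OMOP CONCEPT /
    # CONCEPT_RELATIONSHIP tables ("Maps to") to find a standard concept ID.
    # 0 is the OMOP convention for "no matching concept".
    return 0

def condition_to_omop(condition: dict, person_id: int) -> dict:
    """Map one FHIR Condition to a CONDITION_OCCURRENCE-shaped row (sketch)."""
    coding = (condition.get("code", {}).get("coding") or [{}])[0]
    return {
        "person_id": person_id,
        "condition_concept_id": lookup_standard_concept(coding),
        "condition_start_date": condition.get("onsetDateTime"),
        "condition_source_value": coding.get("code"),
    }

row = condition_to_omop(
    {"code": {"coding": [{"system": "http://snomed.info/sct",
                          "code": "44054006"}]},  # SNOMED: type 2 diabetes
     "onsetDateTime": "2019-05-01"},
    person_id=42,
)
```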

view this post on Zulip Grahame Grieve (Feb 24 2020 at 19:29):

there are such things, but not as far as I know open source ones. Right now they are secret sauce

view this post on Zulip Grahame Grieve (Feb 24 2020 at 19:30):

a more challenging question for me and a couple of OMOP users is: why is OMOP so much simpler than FHIR?

I think the answer is "OMOP has not yet encountered many use cases that the FHIR Community has" but the OMOP users I have talked to dislike that answer very greatly

view this post on Zulip Gidon Gershinsky (Feb 25 2020 at 09:57):

We believe there is a way to make analytics on FHIR simple and efficient,
https://www.devdays.com/amsterdam/schedule-2019/#event-801

starting with a "low hanging fruit": making Apache Parquet one of the FHIR bulk export formats. Parquet is the leading format in the big data analytics industry today, and fortunately FHIR maps to it directly (see a report in the bulk data channel). This allows running SQL etc. on FHIR data using top open-source tools like Apache Spark, which supports not only SQL but also machine learning on Parquet data. This works today, without dependency on the sql-on-fhir project, because Spark/Parquet can handle the existing FHIR/JSON structure.

An additional step could be using Parquet as the FHIR storage backend (for relevant resources only; the rest would still be stored in a database as done today). This is feasible (shown), and enables export-free analytics on FHIR data.

Then, as the sql-on-fhir project makes progress, its results can be applied to the two items above. This will allow further simplifying the SQL/FHIR queries and improving the user experience.
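[Editor's note: the reason a columnar format helps here can be illustrated with a toy row-to-column pivot in plain Python — this is conceptually the transformation a Parquet writer performs; a real pipeline would use a Parquet library such as pyarrow rather than dicts.]

```python
import json

# NDJSON (one resource per line, as produced by FHIR bulk export) is pivoted
# into column vectors, so a query engine can scan only the columns it touches
# instead of deserializing whole resources.
ndjson = "\n".join([
    '{"resourceType": "Patient", "id": "1", "gender": "female"}',
    '{"resourceType": "Patient", "id": "2", "gender": "male"}',
])
records = [json.loads(line) for line in ndjson.splitlines()]
columns = {key: [rec.get(key) for rec in records] for key in records[0]}

# A query over gender now reads one column, not every full resource:
n_female = sum(1 for g in columns["gender"] if g == "female")
```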

view this post on Zulip Jon Duke (Feb 25 2020 at 19:13):

We do a lot of work with OMOP at Georgia Tech and actually make an OMOP <-> FHIR package that translates between the two (https://github.com/omoponfhir). A great resource on OMOP/OHDSI is the Book of OHDSI (book.ohdsi.org or IRL on Amazon).

OMOP tends to be preferable for dealing with complex cohort definitions and the OHDSI community offers quite a few tools to design cohorts, run comparative effectiveness studies, predictive models, evaluate phenotype performance etc. OMOP works with the usual SQLs as well as Impala, Netezza, GBQ, Redshift, etc. So that is handy for many places that have existing infrastructure and may not have familiarity with something like Parquet.

As a long time OMOP person and a medium-time FHIR person, I'd say that getting phenotyping using FHIR to be as easy and fast to iterate as it is using something like OMOP is a very important goal to really advance FHIR for large scale population analytics.

view this post on Zulip Josh Mandel (Feb 25 2020 at 19:27):

I think the answer is "OMOP has not yet encountered many use cases that the FHIR Community has" but the OMOP users I have talked to dislike that answer very greatly

My usual explanation is that OMOP is happy to summarize data in ways compatible with / useful for observational research. And obviously summary data are simpler than full data.

view this post on Zulip Josh Mandel (Feb 25 2020 at 19:29):

Let's take translations, for instance: OMOP allows for the notion of one "source code" and one "standard code" for a given entry, and that's it. So you either throw away some translations, or you turn a single problem in your patient's problem list into two rows of the OMOP Condition table (which isn't "proper" use of OMOP but something I've seen in the real world when people bump up against this constraint).
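[Editor's note: a small sketch of that constraint, with hypothetical field names. A single FHIR problem-list entry carrying two codings has nowhere to put the second one in a single OMOP row, so the real-world workaround Josh describes fans the codings out into multiple rows.]

```python
def codings_to_omop_rows(condition: dict, person_id: int) -> list[dict]:
    """One FHIR CodeableConcept holds N codings; an OMOP CONDITION_OCCURRENCE
    row holds one source code. Spreading the codings across rows, as here, is
    the workaround seen in practice - not proper OMOP usage."""
    rows = []
    for coding in condition.get("code", {}).get("coding", []):
        rows.append({
            "person_id": person_id,
            "condition_source_value": coding.get("code"),
            "source_vocabulary": coding.get("system"),
        })
    return rows

problem = {"code": {"coding": [
    {"system": "http://snomed.info/sct", "code": "38341003"},       # SNOMED
    {"system": "http://hl7.org/fhir/sid/icd-10-cm", "code": "I10"}, # ICD-10-CM
]}}
rows = codings_to_omop_rows(problem, person_id=7)  # one problem -> two rows
```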

view this post on Zulip Paul Church (Feb 25 2020 at 19:31):

To build on that, OMOP has a very clear use case (observational clinical research) where FHIR has quite a broad portfolio of use cases. I think the difference in complexity is inevitable from the problem scope.

view this post on Zulip Grahame Grieve (Feb 25 2020 at 19:46):

for me the most obvious issues are around Observation (no reference range!) and Medication - simple options for dose, route, etc.

view this post on Zulip Gidon Gershinsky (Feb 26 2020 at 07:50):

Jon Duke said:

.. Impala, Netezza, GBQ, Redshift, etc. So that is handy for many places that have existing infrastructure and may not have familiarity with something like Parquet.

The idea, of course, is to make Parquet invisible to the user. The (FHIR) SQL queries can be run on familiar engines like Spark, Impala, Presto, pandas, Hudi, etc.; today, all analytic engines support the Parquet format under the hood.

view this post on Zulip Grahame Grieve (Feb 27 2020 at 05:46):

seems to me that somehow OMOP should piggy back on our requirements gathering exercise - they are missing stuff that seems pertinent to me

view this post on Zulip Grahame Grieve (Feb 27 2020 at 05:47):

though OMOP apparently has a few all-purpose abstractions that can be press-ganged into doing anything

view this post on Zulip Josh Mandel (Feb 27 2020 at 19:20):

Don't we all, though? ;-) i2b2 modifiers, FHIR extensions, and so on.

view this post on Zulip Grahame Grieve (Feb 27 2020 at 20:14):

right. I can't decide whether it counts as a pattern or an anti-pattern.

view this post on Zulip Adam Flinton (Mar 04 2020 at 08:00):

"The idea, of course, is to make Parquet invisible to the user. The (FHIR) SQL queries can be run on the familiar engines like Spark, Impala, Presto, pandas, Hudi, etc; today, all analytic engines, under the hood, support the Parquet format." Should work well with AWS, as most of their AWS-branded data analytics offerings sit on top of Parquet (or can), e.g. DMS > S3, Athena, etc.

view this post on Zulip Kendall Park (Mar 11 2020 at 22:40):

Of the current FHIR SQL solutions, I've seen Fhirbase, which uses JSONB. I also read about the Apache Parquet option in the thread. Are there other ways to get SQL queries using a FHIR schema?

view this post on Zulip Josh Mandel (Mar 11 2020 at 22:45):

Spark SQL against a pile of NDJSON (e.g., https://azure.microsoft.com/en-us/services/databricks/ makes this easy); BigQuery in Google Cloud; Apache Drill against json or parquet files...

view this post on Zulip Paul Church (Mar 11 2020 at 22:49):

Google Cloud Healthcare has a FHIR export to BigQuery operation that supports the analytics schema. We're in the process of promoting another feature from alpha to beta that streams resource mutations to BigQuery to keep a near-real-time analytics mirror of the FHIR server.

view this post on Zulip Kendall Park (Mar 11 2020 at 22:59):

Thank you! I have a heavy GCP bias, but unfortunately my institution does not. At this point, it's a lot easier to get a proof of concept running on-prem. Apache Drill seems promising.

view this post on Zulip Eric Whitley (Mar 23 2020 at 15:14):

I am really curious about how people are approaching change management around this. How are your organizations addressing the mindset where the status quo is "only ever relational"?

I keep running into scaling challenges with relational databases using a FHIR-like model, particularly around nested properties and extensions - but we keep being told "you have to use (our chosen) relational database." We've suggested the idea of inverting the model as has been described in this thread: instead of storing in a relational database, store documents and then expose them via SQL so people can query in a relational way. There's just a general dismissal of the idea that a relational engine isn't good enough/flexible enough/fast enough/etc., and an assumption that the nature of healthcare data directly lends itself to a relational model.

Maybe that's true, but short of embracing relational "anti-patterns" like key-value pairs, it feels like we'd be spending a lot of time designing a fixed, inflexible relational model that would be difficult to implement and even harder to scale. Can we implement a FHIR-oriented set of tables in a relational model? Sure. It feels like it's going to be complex, brittle, and likely slow, but it can be done...ish. Extensions are tough. Scaling is going to get expensive really quickly. I can see the economics of a document-centric model being more appealing - throwing commodity hardware at the problem and scaling out instead of up.

Sorry, at this point I'm rambling. How are you approaching groups that seem unwilling to embrace the notion that a data warehouse is bigger than a single tool/approach?

view this post on Zulip Grahame Grieve (Mar 23 2020 at 19:12):

I generally don't bother approaching them, I'm afraid. Having a prejudicial notion about this is a flag waving incompetence at this point. OTOH, being realistic and understanding you have constraints that mean you can't have those discussions in the way everyone wants... that's a good thing. Because everyone has constraints like that.

Having said that... my view is that this is rearranging the deck chairs on the Titanic; all the approaches will have problems one way or another

view this post on Zulip Théo Ryffel (Mar 27 2020 at 13:54):

Hey everyone!
I'm putting this here because many people have moved a bit away from OMOP to discuss SQL :D

I'm very interested in the SQL on FHIR initiative but I haven't found many details about it. In particular, what's cool about this idea is that virtually everyone knows this query model, while GraphQL would be smarter but less user-friendly for beginners.

In the meantime, I don't think anyone would want to really transform their FHIR data into SQL tables (well, at least not everyone!) and we shouldn't make any assumptions about how people actually store their data (mongo, postgres+jsonb, etc). The best way to query data would then be the REST API again, because this is the most used interface.

So, putting back together my ideas: I'd like a SQL-like language which would transform the query into (possibly >1) API calls and involve minimal post-processing on the data.

Does this sound conceptually feasible? We're trying to do a small proof of concept (open-source) but I'd like feedback on the overall direction before diving too deep into it :)

NB: many people have mentioned Azure, Google, etc. here. I'm living in France so this is not even an option for French hospitals haha + I believe solutions should also be available for people not using the cloud or who are beginner developers (not everyone can use Spark out of the box)
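[Editor's note: a toy sketch of the direction Théo proposes, under the assumption that the translator knows which search parameters the server supports (all names and the base URL here are hypothetical). Supported filters compile to a FHIR search call; the rest fall back to client-side post-processing, which is where the over-fetching concern raised later in the thread bites.]

```python
from urllib.parse import urlencode

# Hypothetical capability map: which search parameters the target FHIR server
# supports per resource type (real servers advertise this in their
# CapabilityStatement).
SUPPORTED = {
    "Observation": {"code", "date", "patient"},
    "Condition": {"code", "onset-date", "patient"},
}

def plan_query(resource: str, filters: dict,
               base_url: str = "https://fhir.example.org") -> tuple[str, dict]:
    """Compile supported filters into one FHIR search URL; return the rest
    for client-side filtering after the fetch."""
    server_side = {k: v for k, v in filters.items()
                   if k in SUPPORTED.get(resource, set())}
    client_side = {k: v for k, v in filters.items() if k not in server_side}
    return f"{base_url}/{resource}?{urlencode(server_side)}", client_side

url, post_filter = plan_query(
    "Observation", {"code": "718-7", "value-outlier": "true"})
```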

view this post on Zulip Bryn Rhodes (Apr 01 2020 at 02:36):

So, I don't know about "minimal post-processing on the data" :) but CQL is a query language that will do this. You express the query in an SQL-like language, and can then transform it to suit a particular target. The most developed of those targets right now is the FHIR API. Still lots of hard work to do no doubt, but it is progressing.

view this post on Zulip Josh Mandel (Apr 01 2020 at 13:16):

The challenge with translating queries from a rich language like SQL or CQL into a series of rest API calls (assuming you're thinking about the current FHIR REST API and not some future state where it expands dramatically) is that only a limited number of query constraints and clauses would be directly supported. You would need to over fetch by a huge margin in order to do client side filtering; for population level queries this is not likely to scale well. It's generally important to push as much logic as possible into the query engine that runs close to the data, for a system like this to perform well.

view this post on Zulip Josh Mandel (Apr 01 2020 at 13:16):

For a simple class of queries written in a rich language, of course these compile nicely to existing search capabilities :-)

view this post on Zulip Kendall Park (Apr 02 2020 at 04:21):

@Théo Ryffel Perhaps we should talk because I have followed this exact train of thought (including the GraphQL and open-source considerations).

Some thoughts I'd add...

1) SQL-style syntaxes do end up being somewhat influenced by the underlying architecture (for example, some things are possible in postgres that are not possible in mongo and vice versa). So rolling your own SQL-like syntax just makes an additional SQL flavor to throw into the mix. iirc a SQL syntax for FHIR is being discussed. Keeping in sync with whatever might be made into an official HL7 spec would be useful.

2) What you are proposing is similar to how people use GraphQL proxies to transition away from REST APIs. You're just using a different query syntax. The idea is that you throw a proxy GraphQL endpoint over the REST API to allow developers to work with the GraphQL API interface immediately. You will take a performance hit doing massive joins post-REST response. It's not an efficient solution but works as a great intermediate step when transitioning between the two API paradigms. The ultimate goal would be some sort of server-side adoption with server-side resolvers. I wonder if you could leverage GraphQL technology and simply change the query syntax.

3) The need to pull data through the FHIR REST API and perform inefficient joins is a problem specific to real-time analytics (assuming Bulk API support). In many cases we're stuck using whatever FHIR server an IT department has enabled to query live data. If you can tolerate up to 24 hours of lag time, you can use a tool like Apache Drill on the raw Bulk JSON files or import those JSON files into a PostgreSQL-based server like Fhirbase. Hypothetically, you could build a system where the same query is executed differently depending on whether you wanted live analytics. Currently, I'm focusing on the stale-data solution.

4) You probably want to also keep big batch/stream processing frameworks in mind. These joins are going to be expensive on big N.
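[Editor's note: for the "stale data" route in (3), even stdlib tooling can stand in for Drill or Fhirbase as a proof of concept: load bulk-export NDJSON lines into a table and query them with SQL's JSON functions. Sketch only; it assumes SQLite built with the JSON1 extension, which ships with modern Python.]

```python
import json
import sqlite3

# Two inline resources stand in for a bulk $export NDJSON file.
ndjson = "\n".join(json.dumps(r) for r in [
    {"resourceType": "Patient", "id": "1", "gender": "female"},
    {"resourceType": "Patient", "id": "2", "gender": "male"},
])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (resource TEXT)")
conn.executemany("INSERT INTO patient VALUES (?)",
                 [(line,) for line in ndjson.splitlines()])

# SQL over raw JSON resources - the same idea as Drill / Fhirbase at toy scale.
count = conn.execute(
    "SELECT COUNT(*) FROM patient "
    "WHERE json_extract(resource, '$.gender') = 'female'"
).fetchone()[0]
```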

view this post on Zulip Théo Ryffel (Apr 04 2020 at 16:59):

Thanks a lot for this feedback!
I'd love to have a small chat @Kendall Park :slight_smile:
I'm 100% with everything being said here. My use case would be Machine Learning, so this would mean loading lots of data (big N), but it doesn't have to be very efficient, because once the data is extracted the ML workflow is much slower anyway. So I would favour practicality over efficiency, I guess.

view this post on Zulip Kendall Park (Apr 13 2020 at 20:06):

Re: Parquet

Would Parquet allow optimized querying and encryption at rest? Having an unencrypted "pile of NDJSON" on disk/in blob makes me nervous.

view this post on Zulip Gidon Gershinsky (Apr 27 2020 at 04:44):

Kendall Park said:

Re: Parquet

Would Parquet allow optimized querying and encryption at rest? Having an unencrypted "pile of NDJSON" on disk/in blob makes me nervous.

Yep, Parquet is designed for query optimization (via columnar projection, predicate pushdown, encoding/compression) and now has native support for encryption that preserves this optimization. A couple of links:
https://conferences.oreilly.com/strata/strata-ny-2019/public/schedule/detail/77144 (slides inside)
https://chat.fhir.org/#narrow/stream/179250-bulk-data/topic/Parquet.20Bulk.20Data.20format

view this post on Zulip Keith Boone (Jul 21 2020 at 13:47):

The NLM is using OMOP / OHDSI for storage of data in the All of Us Research program. There's a gateway between Sync for Science and the OMOP database for All of Us hosted at Vanderbilt. I don't know many more details than that. But that would be a reason to use OHDSI instead of FHIR, because the platform for sharing is already using OHDSI.

view this post on Zulip Josh Mandel (Jul 21 2020 at 18:32):

s/NLM/NIH/

view this post on Zulip Georgie Kennedy (Jul 27 2020 at 06:46):

Grahame Grieve said:

for me the most obvious issues are around Observation (no reference range!) and Medication - simple options for dose, route, etc.

There's active interest starting (if not yet sufficient traction) within OMOP in more fully modelling sig data (expanding dose/route complexity) - would happily loop in with any FHIR experience on this to ensure compatibility

view this post on Zulip Grahame Grieve (Jul 27 2020 at 07:30):

great. Where is that work happening?

view this post on Zulip Georgie Kennedy (Jul 27 2020 at 11:46):

In the oncology working group. It's coming out of a desire to model treatment protocols in detail. The work is nascent but there are a number of interested parties, so some cross-pollination may be what's required to push it into active development.

view this post on Zulip Jose Costa Teixeira (Jul 29 2020 at 11:56):

The structuring of dose information is very interesting for a few projects (not related to outcomes, so not sure if relevant).
The most interesting one is UNICOM, which aims at cross-border medication management - from prescription to pharmacovigilance. One part of it is the structured dose information.

view this post on Zulip Jose Costa Teixeira (Jul 29 2020 at 11:58):

Not as rich as oncology in terms of dosing complexity, but also interesting because we want to support cross-border settings, where product identifiers are not common and the structuring of doses is not harmonised.. when we have to collect information we assume even the alphabets people write in are different.

view this post on Zulip May Terry (Mar 04 2021 at 17:06):

Georgie Kennedy said:

In the oncology working group. It's coming out of a desire to model treatment protocols in detail. The work is nascent but there are a number of interested parties, so some cross-pollination may be what's required to push it into active development.

Correct re: OMOP Oncology Vocabulary which I do attend on occasion. The work is nascent in some spots, particularly with chemo drug regimens. In a similar vein, HL7 is also rather nascent (despite multiple attempts) to model it so cross-pollination is helpful. I have recommended this to Rimma and Christian before so I suspect the paths will cross at some point.


Last updated: Apr 12 2022 at 19:14 UTC