Stream: python
Topic: FHIR to Dataset
Théo Ryffel (Feb 10 2020 at 13:39):
Hi @all,
I'm new to this chat :wave:
I was wondering, is there any Python tool out there to specify and extract a dataset from a FHIR API to do some machine learning?
Typical output structure would be a pandas dataframe or a PyTorch Dataset.
I'm looking for something that doesn't use Spark, only assumes a FHIR API is available, and returns a flat dataset containing things like patient age, body weight, and the date of the last episode of care.
Cheers!
Théo
Ward Weistra (Feb 11 2020 at 16:32):
Hi @Théo Ryffel, nothing that does exactly that already, but it shouldn't be too hard to create it with one of the previously mentioned clients: https://chat.fhir.org/#narrow/stream/179218-python/topic/FHIR.20101.20Jupyter.20notebook/near/187329666.
Build the query for the resources you need and flatten the JSON to a Pandas dataframe.
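A minimal sketch of that approach, using only the standard library. The server URL, the helper names (`build_search_url`, `flatten`, `bundle_to_rows`), and the sample Bundle are illustrative assumptions, not part of any client mentioned in this thread:

```python
from urllib.parse import urlencode


def build_search_url(base, resource_type, params):
    """Build a FHIR search URL; params is a list of (name, value) pairs."""
    return f"{base.rstrip('/')}/{resource_type}?{urlencode(params)}"


def flatten(value, prefix=""):
    """Flatten nested dicts and lists into one dict with dotted keys."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}.{key}" if prefix else key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}.{index}" if prefix else str(index)))
    else:
        flat[prefix] = value
    return flat


def bundle_to_rows(bundle):
    """Turn each resource in a search Bundle into one flat row."""
    return [flatten(entry["resource"]) for entry in bundle.get("entry", [])]


# Hypothetical server; the response below is hand-written, not fetched.
url = build_search_url("https://example.org/fhir", "Patient",
                       [("gender", "female"), ("_count", "50")])

sample_bundle = {
    "resourceType": "Bundle",
    "type": "searchset",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "1", "birthDate": "1980-04-02",
                      "name": [{"family": "Doe", "given": ["Jane"]}]}},
        {"resource": {"resourceType": "Patient", "id": "2", "gender": "female"}},
    ],
}
rows = bundle_to_rows(sample_bundle)
```

The resulting list of dicts can be passed to `pandas.DataFrame(rows)`; keys missing from a row become empty cells.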
Alex F. Bokov, Ph.D. (Feb 11 2020 at 17:35):
Funny that you posted this just now-- I also asked a similar question in #implementers
This extends beyond Python-- there needs to be a general-purpose tool for doing this, for any selection of FHIR resources.
Eric Haas (Feb 11 2020 at 20:14):
Pandas is the way to go... I use Pandas to create instances but not the other way around. (Well, technically I go Excel --> Pandas df --> FHIR instance.) Anybody going the other way?
Ward Weistra (Feb 12 2020 at 15:51):
@Nick George This seems like an excellent use case for Protobuf based FHIR, right? :up:
Alex F. Bokov, Ph.D. (Feb 12 2020 at 17:54):
Wow, I learned several new things googling what you just wrote.
You're talking about https://github.com/google/fhir , right?
But, how would converting a FHIR message to a protobuf string help turn it into a tabular format?
Or, are you saying that the values of individual cells would be chunks of the FHIR message serialized in the protobuf format?
Ward Weistra (Feb 12 2020 at 18:11):
One of the things they do when converting to Protobuf is converting some difficult fields to a format that's easier to work with in tabular format: https://youtu.be/iBkyZ86eiiw?t=1635. But also generally Protobuf might be a good format to work with when doing machine learning.
Eric Haas (Feb 13 2020 at 19:03):
Here is a simple, naive FHIR-instance-to-dataframe-to-CSV transform, with and without a flattening algorithm. It is a brute-force starting point with no round-tripping, though I think adding that would be relatively straightforward.
:notebook: https://mybinder.org/v2/gh/Healthedata1/MyBinder/master
To start the notebook:
- click on the FHIR-to-CSV folder
- click on the notebook file FHIR-to-csv.ipynb
Alex F. Bokov, Ph.D. (Feb 13 2020 at 23:09):
Cool! This is very helpful.
I tried an uglier, multi-resource query:
http://hapi.fhir.org/baseDstu2/Patient?_id=1834&_revinclude=Encounter:patient&_revinclude=Condition:patient&_revinclude=DiagnosticReport:patient&_include=DiagnosticReport:result&_revinclude=Procedure:subject
It still worked, though it didn't traverse the branches all the way down, which is to be expected since I didn't tell it how far down to go.
Do you think there is a way to tell squash() to look for distinct elements that share the same start, performedDateTime, issued, onsetDateTime, or effectiveDateTime (rounded down to the day), put each such set in its own row, and create empty cells for elements that are missing on that day?
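One way this grouping could work, sketched independently of the notebook's squash() (whose internals are not shown in this thread): take the first date-like element each resource carries, truncate it to the day, and merge everything sharing that day into one row. The path list assumes `start` means `period.start` and that `issued` is the DiagnosticReport element; the helper names and sample resources are hypothetical.

```python
from collections import defaultdict

# Candidate date elements as dotted paths into a resource (assumed list).
DATE_PATHS = ["period.start", "performedDateTime", "issued",
              "onsetDateTime", "effectiveDateTime"]


def get_path(resource, path):
    """Follow a dotted path through nested dicts; return None if absent."""
    current = resource
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def resource_day(resource):
    """Return the YYYY-MM-DD prefix of the first date element found."""
    for path in DATE_PATHS:
        value = get_path(resource, path)
        # Partial dates such as "2015" or "2015-02" are skipped.
        if isinstance(value, str) and len(value) >= 10:
            return value[:10]
    return None


def rows_by_day(resources):
    """One row per day, one column per resource type; gaps become None."""
    by_day = defaultdict(dict)
    for resource in resources:
        day = resource_day(resource)
        if day is not None:
            by_day[day].setdefault(resource["resourceType"], []).append(resource["id"])
    columns = sorted({rtype for cells in by_day.values() for rtype in cells})
    return [{"day": day, **{col: by_day[day].get(col) for col in columns}}
            for day in sorted(by_day)]


resources = [
    {"resourceType": "Encounter", "id": "e1",
     "period": {"start": "2015-02-07T13:28:17-05:00"}},
    {"resourceType": "Condition", "id": "c1", "onsetDateTime": "2015-02-07"},
    {"resourceType": "Procedure", "id": "p1",
     "performedDateTime": "2015-03-01T09:00:00Z"},
]
table = rows_by_day(resources)
```

Truncating the string keeps the day as written in each timestamp's own offset; converting to a common timezone first would need `datetime` parsing.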
FYI, in the process I learned that loads() gets confused by the escaped " characters in HAPI DSTU2 text.div values. This can be fixed by replacing the backslashes with double backslashes in the input data.
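One situation that produces this symptom, assuming the JSON was pasted into a regular Python string literal (the message above does not say how the data was loaded): Python consumes the backslashes before `json.loads` ever sees them. The sample payload here is made up.

```python
import json

# Regular literal: Python turns \" into a bare " first, so the quotes
# inside the div end the JSON string early and parsing fails.
pasted = '{"text": {"div": "<div class=\"hapiHeaderText\">Jane</div>"}}'
try:
    json.loads(pasted)
    pasted_ok = True
except json.JSONDecodeError:
    pasted_ok = False

# Raw literal: backslashes survive, so JSON sees valid \" escapes.
raw = r'{"text": {"div": "<div class=\"hapiHeaderText\">Jane</div>"}}'
div = json.loads(raw)["text"]["div"]

# Doubling each backslash in a regular literal yields the same string.
doubled = '{"text": {"div": "<div class=\\"hapiHeaderText\\">Jane</div>"}}'
```

If that is the cause, reading the response from a file or directly from the HTTP body should not need any backslash replacement.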
Eric Haas (Feb 14 2020 at 00:23):
Re the text.div: you can pop it out too if you don't need it.
I updated the notebook to show some simple blacklists in a custom flatten function (you could whitelist the same way). This function is a simplified version of the library version.
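A rough sketch of both ideas, not the notebook's actual function: the `exclude` parameter, the path convention (dotted, with list indices left out), and the sample resource are assumptions made for illustration.

```python
import copy


def flatten(value, key="", path="", exclude=frozenset()):
    """Flatten to dotted keys, skipping subtrees whose index-free path is excluded."""
    if path in exclude:
        return {}
    flat = {}
    if isinstance(value, dict):
        for name, child in value.items():
            flat.update(flatten(child,
                                f"{key}.{name}" if key else name,
                                f"{path}.{name}" if path else name,
                                exclude))
    elif isinstance(value, list):
        # List items keep their index in the key but share the parent's path.
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{key}.{index}", path, exclude))
    else:
        flat[key] = value
    return flat


patient = {
    "resourceType": "Patient",
    "id": "1834",
    "text": {"status": "generated", "div": "<div>Jane Doe</div>"},
    "identifier": [{"system": "urn:example", "value": "abc"}],
    "gender": "female",
}

# Option 1: pop the narrative from a copy before flattening.
trimmed = copy.deepcopy(patient)
trimmed.get("text", {}).pop("div", None)

# Option 2: blacklist paths while flattening.
BLACKLIST = frozenset({"text.div", "identifier"})
row = flatten(patient, exclude=BLACKLIST)
```

A whitelist would invert the check: keep a leaf only when its path appears in an allowed set.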
Eric Haas (Mar 23 2020 at 17:00):
Here are a couple of example scripts:
- The first takes a search bundle and converts it to a Pandas dataframe, flattening the data of interest (note that identifiers and CodeableConcepts are not flattened), then does some simple stats and plots.
- The second uses GraphQL, which does all the work of flattening the data, then converts the result to a Pandas dataframe and does some simple stats and plots.
To start the notebook:
:notebook: https://mybinder.org/v2/gh/Healthedata1/MyBinder/master
- click on the FHIR-to-Panda folder
- for 1. click on the notebook file Playing_with_pandas_pt4.ipynb
- for 2. click on the notebook file Playing_with_pandas_pt2.ipynb
Théo Ryffel (Mar 25 2020 at 14:02):
I'm looping back to ask whether there are any updates. @Alex F. Bokov, Ph.D. are you now working with :notebook: https://mybinder.org/v2/gh/Healthedata1/MyBinder/master ?
I'm very interested in the SQL on FHIR initiative but I haven't found many details about it. What's cool about the idea is that virtually everyone knows this query model, while GraphQL would be smarter but less user-friendly for beginners.
At the same time, I don't think everyone would want to actually transform their FHIR data into SQL tables, and we shouldn't make any assumptions about how people store their data (Mongo, Postgres + jsonb, etc.). The best way to query the data would then be the REST API again, because it is the most widely used interface.
So, putting my ideas together: I'd like a SQL-like language whose queries are translated into one or more API calls, with minimal post-processing of the data.
Does this sound conceptually feasible? We're trying to build a small open-source proof of concept, but I'd like feedback on the overall direction before diving too deep into it :)
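To make the idea concrete, here is a toy sketch of the translation step. It is not the proof of concept itself: the grammar (single resource, equality conditions joined by AND), the function name `sql_to_fhir_url`, and the server URL are all assumptions. Selected columns are mapped to the standard `_elements` search parameter.

```python
import re
from urllib.parse import urlencode

# Very small grammar: SELECT cols FROM Resource [WHERE a = b AND c = d]
QUERY_RE = re.compile(
    r"^\s*SELECT\s+(?P<columns>.+?)\s+FROM\s+(?P<resource>\w+)"
    r"(?:\s+WHERE\s+(?P<where>.+?))?\s*;?\s*$",
    re.IGNORECASE,
)


def sql_to_fhir_url(query, base):
    """Translate one SQL-like query into one FHIR search URL."""
    match = QUERY_RE.match(query)
    if match is None:
        raise ValueError(f"unsupported query: {query!r}")
    columns = [column.strip() for column in match["columns"].split(",")]
    params = []
    if match["where"]:
        for condition in re.split(r"\s+AND\s+", match["where"], flags=re.IGNORECASE):
            name, _, value = condition.partition("=")
            params.append((name.strip(), value.strip().strip("'\"")))
    if columns != ["*"]:
        # Ask the server to return only the selected elements.
        params.append(("_elements", ",".join(columns)))
    url = f"{base.rstrip('/')}/{match['resource']}"
    query_string = urlencode(params)
    return f"{url}?{query_string}" if query_string else url


url = sql_to_fhir_url(
    "SELECT birthDate, gender FROM Patient WHERE gender = 'female' AND _count = 10",
    "https://example.org/fhir",
)
```

Joins across resources are where more than one API call (or `_include` / `_revinclude`) would come in; this sketch does not attempt that.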
Grahame Grieve (Mar 25 2020 at 20:27):
you should ask this on the #analytics on FHIR channel
Théo Ryffel (Mar 27 2020 at 13:41):
Oh god I thought I was already on the #analytics on FHIR channel! Thanks a lot, I'm moving there
Last updated: Apr 12 2022 at 19:14 UTC