Stream: implementers
Topic: What is the state of FHIR->CSV export?
Alex F. Bokov, Ph.D. (Feb 11 2020 at 17:22):
I am looking to adopt if possible or write if necessary an app that will convert data in FHIR format to a tabular format.
My questions are:
- Does anybody know of any existing code for FHIR->CSV (or any other tabular format) conversion?
- Is this the correct stream on which to discuss this topic and if not, can anybody recommend appropriate streams?
In anticipation of the "what are you trying to do?" question:
From a statistics and data science point of view, a huge number of use cases are variations on the theme of "the analysis-ready dataset is a single flat table where each subject is a series of one or more rows ordered in time, each variable is a column; time-invariant values are repeated".
I expect that insofar as they exist, current FHIR->CSV converters will dump each resource as a separate table. If that's the case, the gap for me to fill will be joining these separate tables on patient and encounter in a resource-aware manner (single flat table where each subject is a series of one or more rows ordered in time) and providing the user with a way to choose between the various scalar representations supported by each resource (each variable is a column) with reasonable default selections as a fallback.
If this too already exists, then the thing for me to do will be to learn to use it, champion it, use it in my own research, and submit PRs to its maintainers.
In summary, I'm trying to wrap my head around what's already done in FHIR->CSV export so that I am filling a genuine gap instead of duplicating anybody else's work.
Paul Church (Feb 11 2020 at 17:30):
#analytics on FHIR is likely relevant
Lloyd McKenzie (Feb 11 2020 at 18:21):
FHIR data often has hierarchy, so a straight CSV conversion can be challenging and involves trade-offs. You need to tune the conversion based on the type of analysis you intend to perform.
Abbie Watson (Feb 11 2020 at 18:28):
FHIR to CSV is generally easier than the other direction, since the JSON/XML tree structures encode more entropy than a tabular format. Generally one wants to use a utility library like lodash
or underscore
, and use a tree plunking function such as get which lets you pluck values from the JSON and map it into the CSV. We went through and hand coded a mapping function for each of the Resources in the Common Clinical Data Set (aka Argonaut), but they're distributed among 60+ libraries. But we've been thinking of consolidating them all into a single place.
Are you looking for every field and value in the FHIR resource mapped to CSV? Or curating 'just the useful' stuff? The former will result in a CSV file that has many columns and empty cells, while the later is rather subjective.
Eric Haas (Feb 11 2020 at 20:24):
publisher creates CSV for conformance resources. @Mark Kramer created a flattened list too. There are some Python packages that flatten arbitrary json, but since some resources are deeply nested it gets ugly fast with lots of indexing.
Mark Kramer (Feb 12 2020 at 13:24):
Our "data dictionary" is created from the CIMPL source format, not from StructureDefinitions directly. In the future, we might do something similar from FHIR Shorthand, but not guarantee this will happen.
Alex F. Bokov, Ph.D. (Feb 12 2020 at 18:25):
FHIR data often has hierarchy, so a straight CSV conversion can be challenging and involves trade-offs. You need to tune the conversion based on the type of analysis you intend to perform.
The way my current non-FHIR, i2b2-specific app works is that some crude tuning is done up front-- e.g. is this data element numeric, does it ever use modifiers, does it ever have more than one distinct value per visit, and many other heuristics. This informs the default conversion to a scalar form, and the options for alternative conversion presented to the end-user.
I imagine what will be different with FHIR is that instead of deriving it from the data and from assumptions about how most deployers of i2b2 use the available fields, the properties of each data type will be spelled out somewhere in the FHIR specification.
Alex F. Bokov, Ph.D. (Feb 12 2020 at 18:45):
Are you looking for every field and value in the FHIR resource mapped to CSV? Or curating 'just the useful' stuff? The former will result in a CSV file that has many columns and empty cells, while the later is rather subjective.
I propose starting with 'just the useful stuff' so that the initial version covers at least some use-cases and then extending support to more categories with each release.
For retrospective mining of patient data a tentative starting list of resources is: Condition, Procedure, Observation, Medication, Patient, and EpisodeOfCare
For not-yet-supported resources some kind of graceful fallback based on what the FHIR specification says about their category, code, and value.
Alex F. Bokov, Ph.D. (Feb 12 2020 at 19:18):
The former will result in a CSV file that has many columns and empty cells
I submit that this is not a technical problem, it is an inherent quality of health data that needs to be confronted by the statistician or data scientist.
I guess an assumption I didn't explicitly state is that the researcher comes with a list of data elements deemed relevant to the question they are asking, they don't ask for "all the things". But you're right, usually there still are many columns, which is to be expected from high-dimensional data, though some columns (e.g. equivalent codes) can be merged.
The empty cells I encounter most often mean one of these:
- actual missing data (left empty)
- negative data, such as diagnoses (can be recoded to missing=FALSE, nonmissing=TRUE)
- one level of a categorical variable (can be assigned an explicit code)
- falling during an interval between observations, like a lab or vital (can be interpolated or carried forward from previous observations if there are any within a certain time interval and left empty otherwise)
With the rigorous definitions available for FHIR resources, it may be possible to move some of these common patterns up the pipeline to the extraction app and have them offered to the user up-front when they are first extracting the data to a tabular format.
Grahame Grieve (Feb 12 2020 at 19:36):
in general, there's more than just a technical transform that's part of the ETL to get to tabular data suitable for statistical analysis. A key part of this is that the data preparation depends on the underlying assumptions that you want to do statistics about
Alex F. Bokov, Ph.D. (Feb 12 2020 at 19:40):
in general, there's more than just a technical transform that's part of the ETL to get to tabular data suitable for statistical analysis. A key part of this is that the data preparation depends on the underlying assumptions that you want to do statistics about
Absolutely. As a statistician, I have observed a number of recurring patterns that I want to move to a reusable tool rather than re-written for each data analysis project.
I guess my underlying assumptions are that I am analyzing a multi-variable repeated measures dataset for a cohort of patients with many missing observations and irregularly spaced visits rounded to the nearest day (though it could be other time-intervals without fundamentally changing the use-case).
Grahame Grieve (Feb 12 2020 at 19:46):
right. so there's space for a nice sophisticated ETL tool for converting data to a tabular format. I haven't thought that there's anything inherently FHIR specific about that task, so I've been expecting that off the shelf tools will be applicable to the task
Of course, those tools do not avail themselves of the all the computable definitions available for FHIR content, which might help during the ETL design. I'm not sure how directly applicable that would be, but if I was looking into it, I'd be looking for some kind of open source tooling where there was a modular approach and I could write a module to inform it about FHIR content
Alex F. Bokov, Ph.D. (Feb 12 2020 at 19:49):
right. so there's space for a nice sophisticated ETL tool for converting data to a tabular format. I haven't thought that there's anything inherently FHIR specific about that task, so I've been expecting that off the shelf tools will be applicable to the task
FHIR brings to the table a self-documenting compatibility layer to unite i2b2, OMOP, PCORNet CDM, and of course EHR systems. All I/we need to write is a FHIR->tabular data ETL.
Alex F. Bokov, Ph.D. (Feb 13 2020 at 00:14):
I haven't thought that there's anything inherently FHIR specific about that task, so I've been expecting that off the shelf tools will be applicable to the task
I probably shouldn't be emphasizing the CSV/tabular aspect of the problem. The FHIR-specific problems that a converter needs to solve are programatically generating the correct joins and transforming the cell values into data types usable for analysis.
Joe Kesselman (Feb 13 2020 at 15:31):
right. so there's space for a nice sophisticated ETL tool for converting data to a tabular format.
If you're willing to go through XML rather than JSON, XSLT solutions to convert structured to tabular are common. Not that it would be hard to run XSLT over JSON, if the JSON community wasn't so vehemently opposed to using anything from the XML universe; the data models are essentially compatible. (Indeed, IBM's internal optimizing XSLT compiler (which was briefly seen as the "Traditional Websphere" XML Feature but otherwise sunk without a trace) was specifically designed to easily adapt to APIs other than XML's DOM and SAX. I wish we had time to revive that work...)
But JSON is the structured-data notation of the day. And while the JSON community has spawned several transformation languages -- and FHIR seems to be reinventing that yet again in the mapper proposal -- as far as I know nothing has yet been strongly accepted as a base for shared effort, and they tend to focus more on transcoding within the notation than on going from one notation to another.
If the FHIR mapper was farther along, I'd suggest adding text and/or CSV output capability to that, rather than being strictly FHIR-to-FHIR.
Meanwhile... well, I'm writing a FHIR-JSON-to-specific-CSV converter right now, and hope to grow it or replace it with something a bit more general. I'll note here that one of the challenges of going from JSON to CSV is deciding what to do about arrays containing structures and vice versa. Exploding an array to produce multiple CSV lines makes retaining context hard and will not always be the best solution.
Ward Weistra (Feb 13 2020 at 16:38):
FHIR brings to the table a self-documenting compatibility layer to unite i2b2, OMOP, PCORNet CDM, and of course EHR systems. All I/we need to write is a FHIR->tabular data ETL.
@Alex F. Bokov, Ph.D. Just to check: Are you familiar with the mappings between those models the Common Data Models Harmonization project is describing? http://build.fhir.org/ig/HL7/cdmh/cdmh-overview.html
Alex F. Bokov, Ph.D. (Feb 13 2020 at 22:07):
Alex F. Bokov, Ph.D. Just to check: Are you familiar with the mappings between those models the Common Data Models Harmonization project is describing? http://build.fhir.org/ig/HL7/cdmh/cdmh-overview.html
Yes.
Downstream from the mappings is CD2H's proposed FHIR Result Server, and that in turn would feed data to what the CD2H calls CSV/FHIR Export for Investigators. I was hoping to find work-in-progress code for that app so I could help complete it but as far as I can tell it's not there yet.
Alex F. Bokov, Ph.D. (Feb 13 2020 at 22:13):
@Joe Kesselman
If you're willing to go through XML rather than JSON
As far as I can tell, most of the FHIR servers can present output in either format. I have no problem going through XML.
If the FHIR mapper was farther along, I'd suggest adding text and/or CSV output capability to that, rather than being strictly FHIR-to-FHIR.
Do you mean the CDMH one that @Ward Weistra just mentioned?
Meanwhile... well, I'm writing a FHIR-JSON-to-specific-CSV converter right now, and hope to grow it or replace it with something a bit more general. I'll note here that one of the challenges of going from JSON to CSV is deciding what to do about arrays containing structures and vice versa. Exploding an array to produce multiple CSV lines makes retaining context hard and will not always be the best solutionquote
Aha. This may be the collaboration I have been seeking. Do you have a repo link you could share?
What are the main use-cases you have in mind for your converter?
Joe Kesselman (Feb 14 2020 at 15:09):
FHIR mapper: https://www.hl7.org/fhir/mapping-language.html. Note that its maturity is currently marked as 0; it's just getting started. I have quibbles with some of its base assertions, but it's not an unreasonable design. I don't remember the spec discussing anything but FHIR output, but in theory... I assume there's an open-source version of that in development.
My draft isn't ready to go public yet, even if IBM does eventually decide to open-source it. Right now it's targeted to adapt FHIR output to a specific application which currently insists on CSV-based input, but it may be generalizable. Or it may be a stopgap, discarded while we find better answers. If we open it up, I'll make a more formal announcement; I just wanted to say "yeah, we're considering the same issues". (And, admittedly, couldn't resist an XML/JSON rant. I have no objection to JSON, I just wish the JSON fans were more willing to leverage all the work we put into developing XML tools.)
Ward Weistra (Feb 17 2020 at 14:19):
The old wiki page (can't find the Confluence counterpart) on using the FHIR mapping language points to the Java and Pascal implementations: https://wiki.hl7.org/index.php?title=Using_the_FHIR_Mapping_Language
The C# one is now available as a plugin for Vonk: https://fire.ly/products/fhir-mapper/
nicola (RIO/SS) (Feb 17 2020 at 17:19):
If your server provides SQL on FHIR - it can be done with SQL. We have this handy operation - https://docs.aidbox.app/basic-concepts/bulk-api-1#usddump-sql
Last updated: Apr 12 2022 at 19:14 UTC