Stream: analytics on FHIR
Topic: Connectathon track
nicola (RIO/SS) (Aug 16 2018 at 12:14):
David Hay asked us to formulate scenarios for track!
@Patrik Sundberg we want to do the same with postgresql and fhirbase. So people will be able compare bigquery and postgresql :). May be @Ryan Brush can do the same with spark and bunsen. Then we can discuss common subset of SQL for FHIR.
@Patrik Sundberg @Ryan Brush @Kenny Blanchette
nicola (RIO/SS) (Aug 16 2018 at 12:25):
Here is what we tried to do on previous one - https://github.com/fhir-fuel/fhir-storage-and-analytics-track
nicola (RIO/SS) (Aug 16 2018 at 12:27):
For track we have to have a leader. @Patrik Sundberg would you like to lead us?
nicola (RIO/SS) (Aug 16 2018 at 12:31):
There is an idea to integrate this track with bulk load api and for track provide bulk loader for each specific database. Participants will be able to load data from existing FHIR server and start playing with SQL. (https://github.com/fhir-fuel/fhir-storage-and-analytics-track/issues/6)
nicola (RIO/SS) (Aug 16 2018 at 12:40):
We can work on scenarios as github issues in https://github.com/fhir-fuel/fhir-storage-and-analytics-track/issues
robert worden (Sep 21 2018 at 17:47):
I'll be coming to your connectathon track, although my interests are a little different from your main thrust. I am interested in getting FHIR in and out of legacy relational databases; generating the transforms, and generating the required SQL from a RESTFul FHIR search. So I may be watching and learning more than doing.
nicola (RIO/SS) (Sep 29 2018 at 14:35):
Here is online demo of fhirbase you can try - https://fbdemo.aidbox.app/
Ryan Brush (Sep 29 2018 at 18:08):
The (very) experimental representation of the CMS CQL content in SQL that we had discussed earlier an be found here: https://gist.github.com/rbrush/61243822298dc077659028bb36c1b02d.
nicola (RIO/SS) (Sep 29 2018 at 18:09):
SQL on FHIR breakout in the President's Room on 15:00 - join us!
Bryn Rhodes (Sep 29 2018 at 21:12):
https://github.com/cqframework/healthedecisions/tree/master/tooling/framework/SQL.Translation
Bryn Rhodes (Sep 29 2018 at 21:13):
@Ryan Brush and @Chris Grenz , this is the ELM-to-SQL translator we were talking about it.
Ryan Brush (Sep 29 2018 at 21:48):
Some notes from today's discussions:
There are two related but distinct projects here: One is a simplified projection of FHIR for SQL and analytic use. This should offer an excellent user experience to anyone looking to ask questions over large FHIR datasets, or create new knowledge for them. The second is a lossless storage format that is more efficient and easier to work with than the current JSON/NDJSON models.
Some notes on the simplified projections for SQL, which is my current focus:
-
They can be lossy when needed to improve the experience of the data science users. For instance:
* id fields may be omitted
* Numeric values are represented a NUMERIC or DECIMAL types in the projection (so the original precision might be unknown)
* Dates and times may be represented using date and datetime types in the projection, to simplify use of database-provided functionality. -
Systems should offer a way to retrieve the original, authoritative data in lossless JSON if they need to know what the original data was. (We do not yet specify how to retrieve that, since we want to offer some flexibility to find the best approach.)
- Each projected table SQL schema is generated based on a given set of Profiles.
- Extensions are projected as first-class fields based on the slice name of the projection. For instance, the US Core Patient profile has a first-class “race” field that can be queried like any other.
- Similarly, slicing should be used as appropriate for codes defined in profiles. So US Core Observation.code would explicitly have a loinc field, rather than the convention of the LOINC code being in Observation.code.coding[0]
- The SQL projections should be considered transient — so they can be discarded and re-recreated as needed, possibly with updated profiles (and corresponding extension fields).
- The SQL schema used should follow conventions of the documentation on fhir.hl7.org. Ideally, our analytic users should be able to simply use that web site as a reference for all of their SQL queries.
- Inline with the above, field names should use camelCase and exactly match the structures on fhir.hl7.org, as seen in that documentation.
Open questions include how close the projections should stick with the NDJSON format, allowing queries directly over that, or whether that will be improved with a future storage representation. For instance, whether option types should be stored as distinct fields like valueQuantity, valueString, and so on — or as nested fields such as value.quantity, value.string, etc. This is a tradeoff between consistency with the JSON model and the ability to easily to check for NULL choice types in SQL.
I didn’t go into the physical storage considerations here since the above is my current focus, but welcome others to comment on that.
David Hay (Sep 30 2018 at 10:46):
"Extensions are projected as first-class fields based on the slice name of the projection. For instance, the US Core Patient profile has a first-class “race” field that can be queried like any other." - interestingly, this is the approach I took when presenting a 'logical' view of a profile intended for clincian use... Here's US Core patient for example: Screen-Shot-2018-09-30-at-6.46.11-AM.png
Michel Rutten (Sep 30 2018 at 14:54):
@David Hay As you probably know, Forge re-orders elements for display, listing extensions _after_ standard elements. Maybe the 'logical' view in ClinFhir could also follow this convention, if that makes sense?
Grahame Grieve (Sep 30 2018 at 15:03):
I used to list extensions afterwards in some of the generated IG views, but got some strong pushback
Michel Rutten (Sep 30 2018 at 15:26):
Interesting. I've never received any push back about Forge's custom rendering order. I could easily implement a configuration option to toggle this behavior, however no user ever asked for it...?
Lloyd McKenzie (Sep 30 2018 at 15:37):
I would definitely like Forge to allow views that are consistent with how things need to appear in the instance. It's confused several of my users.
Michel Rutten (Sep 30 2018 at 15:46):
@Lloyd McKenzie thank you, that is good to know. Definitely makes the case for implementing a configuration option.
Patrik Sundberg (Sep 30 2018 at 18:05):
I agree with Ryan's summary and the proposed direction. It will be helpful to consider the analyst view of the data separately from the storage format for now, but there are obvious similarities. Longer term, we might find ways to combine the two approaches.
We discussed query rewrites in the session yesterday; can we transform simple queries containing terms like "observation.code.loinc" to a longer, more complex query over a generic data format? This is tempting for various reasons, but i believe it's ultimately impractical because most tools will not have natural plugins for such transformations, so any workflow that needs to use transformations will be burdensome.
Instead, I believe we should focus on tools that make it easy to go from (Resource.ndjson, StructureDefinition) pairs into files suitable for direct queries by standard database engines. This conversion can and should be lossy for simplicity. It should inline extensions like Ryan mentioned before, and it should map primitive types to single fields. We've implement parts of this in our github repo, for example, here's a sample protobuf definition (ObservationGenetics) with a fair number of inlined extensions: https://github.com/google/fhir/blob/master/proto/stu3/profiles.proto#L233 .
Key items left to formally decide in my opinion are:
Patrik Sundberg (Sep 30 2018 at 18:06):
- how do we name extensions?
- proposal: use slice names (this seems to be mostly agreed upon)
Patrik Sundberg (Sep 30 2018 at 18:06):
- how do we handle choice types?
- proposal: observation.value.quantity instead of observation.valueQuantity
Patrik Sundberg (Sep 30 2018 at 18:06):
- how do we represent decimal?
- proposal: as a native number, discarding precision (lossy conversion)
Patrik Sundberg (Sep 30 2018 at 18:06):
- how do we represent dates?
- proposal: as ISO8601 string, with timezone as appropriate
Patrik Sundberg (Sep 30 2018 at 18:06):
- how do we handle unknown extensions (those not explicitly listed in the input profile)?
- proposal: drop on the floor (lossy)
- proposal: keep, for queries of the type "count extension types in use group by extension_url"
Patrik Sundberg (Sep 30 2018 at 18:06):
- how do we handle ids and extensions on primitive fields?
- proposal: drop on the floor (lossy)
Patrik Sundberg (Sep 30 2018 at 18:06):
- how do we handle nested recursive types?
- proposal: define a max nesting level somewhere
- proposal: use avro files and built-in avro support
Patrik Sundberg (Sep 30 2018 at 18:06):
- do we duplicate sliced data (extensions, observation.code.loinc, etc), leaving it in the raw fields?
- proposal: no
- proposal: yes
Patrik Sundberg (Sep 30 2018 at 18:49):
- how do we handle situations where some resources do not comply with the profile, such as when querying for vitals in an observation table
- proposal: filter on resources that match the profile
Patrik Sundberg (Sep 30 2018 at 18:50):
- how do we represent a primitive field with a declared extension
- proposal: as a new type, where the original value is accessed as "value"
Ryan Brush (Sep 30 2018 at 19:01):
Thanks for this thoughtful enumeration of current considerations! On all items with a single proposal, I agree with the approach. I'll comment on a couple of the others...
- how do we handle nested recursive types?
- proposal: define a max nesting level somewhere
- proposal: use avro files and built-in avro support
While Avro and ProtoBuf allow for arbitrarily recursive structures, we still need to apply a max depth when creating an SQL-based view of the data for Spark or Presto (and for other analytic tools as well.) So I think the best we can do is to base that depth based on the data that exists, but tools SHOULD re-create that schema with further depth if new data is discovered that demands it.
Ryan Brush (Sep 30 2018 at 19:02):
- do we duplicate sliced data (extensions, observation.code.loinc, etc), leaving it in the raw fields?
- proposal: no
- proposal: yes
After an offline discussion, I think our current best option is to leave in the raw fields. This makes it easier for users to explore who may not be interested in the specific slices, and keeps it consistent with the documentation published on fhir.hl7.org
Ryan Brush (Sep 30 2018 at 19:04):
- how do we handle unknown extensions (those not explicitly listed in the input profile)?
- proposal: drop on the floor (lossy)
- proposal: keep, for queries of the type "count extension types in use group by extension_url"
Similarly, we discussed that keeping these extensions is worthwhile to support users doing exploration of the data, making sure their schemas cover all the extensions they need.
Patrik Sundberg (Sep 30 2018 at 19:19):
Missed one:
- how do we handle references?
- proposal: split typed references ("Patient/ABC") into subfields, e.g. patient_id
nicola (RIO/SS) (Sep 30 2018 at 19:21):
- how do we handle choice types?
- proposal: observation.value.quantity instead of observation.valueQuantity
- consider observation.value.Quantity vs observation.value.quantity. What with CodeableConcept or DateTime ? codeableconcept?
nicola (RIO/SS) (Sep 30 2018 at 19:22):
- reference & choice types meta attribute :
reference: {patient_id: ...., type: 'Patient'}, choice: {quantity: ..., type: 'Quantity'}
Patrik Sundberg (Sep 30 2018 at 19:25):
fyi here's the protobuf current Observation.value: https://github.com/google/fhir/blob/master/proto/stu3/resources.proto#L11657
Patrik Sundberg (Sep 30 2018 at 19:28):
- how do we handle contained resources?
- proposal: drop on the floor
nicola (RIO/SS) (Sep 30 2018 at 19:31):
* externalise
* move into reference (we do this using custom resource attribute in Reference)
nicola (RIO/SS) (Sep 30 2018 at 19:33):
- how do we represent a primitive field with a declared extension
- proposal: as a new type, where the original value is accessed as "value"
- move it into another element: birthDate.extension => birthDateTime
Christiaan Knaap (Sep 30 2018 at 19:37):
If you move the extension to a sibling element, and the original element repeats - how do you correlate the extension values to the elements they were originally contained in?
nicola (RIO/SS) (Sep 30 2018 at 19:39):
I do not know, but semantically this is a new element :)
Ryan Brush (Sep 30 2018 at 20:16):
I think our current proposal is that extensions are duplicated to sibling elements for convenience (to make it easy to query patient.race, for instance), but still preserve them in their original form as well. This seems like the "Least Surprising Behavior", offering an additional field for convenience while preserving the content of the original fields as much as possible.
Ryan Brush (Sep 30 2018 at 20:27):
I took a pass at incorporating the above proposals into the SQL on FHIR markdown document. Much of the above was already reflected in the document, but you can see the changes in the pull request here: https://github.com/rbrush/sql-on-fhir/pull/9/files.
Ryan Brush (Sep 30 2018 at 20:30):
Please feel free to make any comments or suggestions on that pull request. The document is still a work in progress and will surely change, but I think this gets us close enough to start building early systems on this specification.
The biggest course change we've made is to accept this is a lossy projection of the underlying FHIR data to create a great experience for analysts looking to explore FHIR data. In most cases the information loss won't matter to these users, but the original data should be available as well for users who need to know the exact precision used for each numeric field, for instance.
Ryan Brush (Sep 30 2018 at 20:32):
This does not attempt to define an approach to a non-lossy, efficient physical representation. However, this should be a useful reference for physical representations to ensure they can be easily projected onto the model described here.
Patrik Sundberg (Sep 30 2018 at 21:09):
fwiw this example: https://github.com/google/fhir/tree/master/examples/bigquery
exposes a few of these issues in practice. the schema in use here is not the lossy, profile-specific one discussed in this connectathon, but i will change it to be.
nicola (RIO/SS) (Oct 01 2018 at 00:47):
Proto of CMS metric in postgresql - https://fbdemo.aidbox.app/ - see CMS snippet
Ryan Brush (Oct 01 2018 at 16:31):
I went ahead and merged the pull request I mentioned to https://github.com/rbrush/sql-on-fhir that reflects our discussions here. We'll continue to update that document as further changes emerge.
One additional topic did come up:
- How to deal with extensions that are unknown when generating the SQL schema?
- Proposal: when generating the schema, users specify the profiles/extensions they are interested in querying. Additional extensions in the data will be visible as URLs in the projection so they can be detected, but won't have schema generated for them.
This assumes the schema can be easily re-generated if there is a desire to query these new extensions.
Grahame Grieve (Oct 05 2018 at 10:50):
how do we handle contained resources?
proposal: drop on the floor
That's my favorite idea, if only it was a realistic one
nicola (RIO/SS) (Oct 05 2018 at 14:17):
i think, exporter can have options - drop, externalise, internalise
Last updated: Apr 12 2022 at 19:14 UTC