Stream: analytics on FHIR
Topic: FHIR foundation project
nicola (RIO/SS) (Aug 16 2018 at 12:15):
I'm not sure that this work sits nicely in HL7, and there'd be overhead for HL7. The Foundation would have lower overhead and is a better home for code, etc.
@Grahame Grieve is there any information about FHIR Foundation projects - how they work?
nicola (RIO/SS) (Aug 16 2018 at 12:35):
It would be nice to start weekly calls. Who is interested? What day and time?
Bryn Rhodes (Aug 16 2018 at 18:01):
@nicola (RIO/SS), I am interested. Mondays are usually good for me, but I'm open to whenever.
Ryan Brush (Aug 17 2018 at 18:43):
I'm definitely interested. If there is a time that works during business hours of US/Pacific I could probably make that work, but don't mind the off-hour calls if necessary as well.
Ryan Brush (Aug 17 2018 at 18:45):
I'll also contribute the scenarios. Does "User-friendly standard SQL across very large FHIR datasets" seem like an appropriate scope for that? Or should each scenario be targeted at a more specific use case?
Grigory Nokhrin (Aug 19 2018 at 19:22):
> I'm definitely interested. If there is a time that works during business hours of US/Pacific I could probably make that work, but don't mind the off-hour calls if necessary as well.
Hi! Grigory from Health Samurai here.
I'll take care of all the administrative parts of the future Connectathon track, Foundation project, or whatever else relates to Storage & Analytics.
How about a conference call to define our needs, proposals, and justifications?
Please send me your exact availability for each day of next week, so I can compare availability and pick a day and time for the call among the interested people.
nicola (RIO/SS) (Aug 20 2018 at 08:08):
@Grigory Nokhrin can you provide something like https://doodle.com/polling-tool
Grigory Nokhrin (Aug 20 2018 at 08:42):
> @Grigory Nokhrin can you provide something like https://doodle.com/polling-tool
For sure.
https://doodle.com/poll/34wix33u5y7ycnab
Please vote for the time that suits you best for the conference call.
Michael van der Zel (Aug 20 2018 at 21:13):
+1
Ryan Brush (Aug 20 2018 at 22:07):
Sounds good! I went ahead and posted on the Doodle as well, but can probably make other times work if none of the current times do. Thanks for setting that up!
Grigory Nokhrin (Aug 23 2018 at 15:33):
Feel free to join the call in 30 minutes.
We will talk about the 19th Connectathon track "Storage & Analytics".
BTW, if you don't check Zulip very often or aren't getting notifications, maybe we can coordinate our conversations some other way? A WhatsApp group chat:
https://chat.whatsapp.com/AeZYpzVd0UoDkESPUX2k5M
nicola (RIO/SS) (Aug 27 2018 at 10:46):
Here is a Google calendar for the meetings - https://calendar.google.com/calendar?cid=aGVhbHRoLXNhbXVyYWkuaW9fOGVxZDgxdm1tNDRpNnNpZmI3aHRtaDdoaWdAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
nicola (RIO/SS) (Aug 27 2018 at 10:50):
If the time is not convenient for you, let's discuss it here.
Christiaan Knaap (Aug 28 2018 at 06:21):
At the last Connectathon I used a tool - VonkLoader - to upload data to my own FHIR server, Vonk; mainly Synthea collection bundles. I promised to see if we could publish the tool to the community. We did. If you haven't found it yourself yet, you can get it from Simplifier: https://simplifier.net/downloads.
Documentation is here: http://docs.simplifier.net/vonkloader/index.html
It is free to use, but there is no official support. You can ask questions here of course.
nicola (RIO/SS) (Aug 28 2018 at 08:18):
Can your loader load directly into a database?
nicola (RIO/SS) (Aug 30 2018 at 21:33):
Aug 31 meeting notes:
Work on scenarios for the track:
* loaders (Bulk API)
* SQL on FHIR (PostgreSQL, Spark, BigQuery)
* FHIRPath in the database
* more
Specify scenarios in GitHub issues: https://github.com/fhir-fuel/fhir-storage-and-analytics-track/issues
Work on the track definition: http://wiki.hl7.org/index.php?title=201809_FHIR_Storage_and_Analytics
Different types of scenarios, from tutorials to top level
Book a room for the "SQL on FHIR" discussion (@Ryan Brush will provide a proposal for SQL on FHIR)
Use the Synthea dataset as the default
@nicola (RIO/SS) is the track leader
Christiaan Knaap (Aug 31 2018 at 09:10):
No, it targets a FHIR server. It even starts by requesting its CapabilityStatement to check the endpoint.
Patrik Sundberg (Sep 01 2018 at 01:15):
Some reading material that i think is relevant for these discussions: http://www.healthintersections.com.au/?p=2824 and (if you have a lot of time) https://chat.fhir.org/#narrow/stream/4-implementers/topic/JSON.20Format.20for.20Primitives.20.26.20extensions
Ryan Brush (Sep 04 2018 at 16:22):
As discussed on the August 31 call, I wanted to start discussion with a straw-man proposal for analytic workloads via "SQL on FHIR". You can find that proposal here (although we can easily move it to a collaborative repository elsewhere):
https://github.com/rbrush/sql-on-fhir/blob/master/sql-on-fhir.md
This proposal focuses on the schema and views provided to SQL users -- which may or may not be a direct reflection of the underlying physical model. There are tradeoffs here, but I think it's important that we provide our SQL users with a model that strongly matches their intuitions. We have a lot of users working through a variety of statistical analyses of healthcare data, and I would very much like to have them do so on FHIR-based data models. This gives them access to a much broader dataset than they have had historically, enables more portable and reproducible analysis, and makes the results of such analysis much easier to implement in any system that speaks FHIR. This could be a significant gain for the creation and application of new knowledge.
However, this only works if we can bring this class of analyst and researcher on board to using FHIR-based data for their needs. Therefore the proposal in this document prefers offering a simpler view for the 99% case (such as primitives that do not have extensions) to make this an appealing environment for their work. It may be convenient for underlying physical storage to follow the same pattern, but that is not required.
In any case, my goal here is simply to offer a more concrete basis to start the conversation -- so feedback and suggestions are strongly desired!
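To make the intent concrete, here is a minimal sketch of the kind of query an analyst might write against such a schema. The table name, column layout, and Spark SQL-style array/struct syntax are illustrative assumptions, not part of the proposal itself:

```sql
-- Hypothetical "observation" table generated from the FHIR Observation
-- resource, with nested elements kept as structs and arrays (Spark SQL syntax).
SELECT subject.reference     AS patient_ref,
       code.coding[0].code   AS obs_code,
       valuequantity.value   AS obs_value,
       valuequantity.unit    AS obs_unit
FROM   observation
WHERE  code.coding[0].system = 'http://loinc.org'
  AND  valuequantity.value > 140
```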
Patrik Sundberg (Sep 04 2018 at 18:10):
Great writeup, thanks a lot for that. At a high level I agree with generating a schema from a profile, and in particular using the extension slices in that profile to simplify the use of extensions. I also agree with keeping choice types as structs, which is sort of what FHIRPath does. A few items for discussion:
- should we handle references differently in some way? Joining across resources is cumbersome with the standard representation (see the sketch below)
- decimal as VARCHAR vs DECIMAL: switching to a numeric type means we can't go back from SQL to FHIR without losing knowledge of precision. That's worth noting.
- why store date/datetime/time as VARCHAR? To retain knowledge of timestamp precision?
- other fields can be sliced; how do we handle that? Worth a section. I think slicing is actually a great asset for the SQL representation.
At a high level, we need to decide whether we expect to be able to round-trip from FHIR to SQL and back in a lossless manner, or whether we're giving that up.
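To illustrate the first point about references: with the standard string-valued Reference, every cross-resource join has to parse out the resource type. A hedged sketch, with table and column names assumed and Spark SQL-style functions:

```sql
-- With subject.reference stored as the raw string 'Patient/123', joining
-- Observation to Patient means stripping the type prefix in every query.
SELECT o.code.coding[0].code AS obs_code,
       p.birthdate
FROM   observation o
JOIN   patient p
  ON   p.id = regexp_replace(o.subject.reference, '^Patient/', '')
```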
Ryan Brush (Sep 04 2018 at 19:05):
Great questions. A few initial thoughts here:
- should we handle references differently in some way? joining across resources is cumbersome with the standard representation
Good idea. Perhaps we generate an additional field that sits in the Reference structure and can be equi-joined to other resources? Maybe something like subject.patient_reference in the generated SQL schema, so the type of the reference is clear from the field? (Sketched below.)
- decimal as VARCHAR vs DECIMAL: switching to a numeric type means we can't go back from SQL to FHIR without loss of knowledge of precision.
Agreed this is worth noting, but I think this tradeoff is worthwhile for SQL users. We had a similar system where we kept these as strings, and users would use comparison operators without realizing the numbers were still strings...leading to bad results. I think that justifies the loss of explicit precision information here.
- why store date/datetime/time as VARCHAR? to retain knowledge of timestamp precision?
Yes, this retains knowledge of precision and avoids ambiguity in timestamp fields. If we get 2018-09-01 in a datetime field but encode it as 2018-09-01T00:00:00 in a timestamp, we don't know if the source really meant midnight or just that day. Unlike the conversion to DECIMAL, this is less prone to induce hidden user error...but we can discuss the tradeoffs further.
- other fields can be sliced; how do we handle that? worth a section. i think slicing is actually a great assert for the SQL representation.
I hadn't thought through how we'd handle this yet, but agree it makes a lot of sense. (I'd much rather be able to reference .systolic and .diastolic fields rather than using some external reference to know they are [0] and [1] in some array...) Let's see if we can find a way to add this.
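Picking up the reference point above, a minimal sketch of what the suggested subject.patient_reference field could enable (everything besides that field name is an assumption for illustration):

```sql
-- If the schema generator emits a typed key next to the raw reference string,
-- the cross-resource join becomes a plain equi-join.
SELECT o.code.coding[0].code AS obs_code,
       p.birthdate
FROM   observation o
JOIN   patient p
  ON   p.id = o.subject.patient_reference
```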
Ryan Brush (Sep 04 2018 at 19:17):
As for lossless bi-directional conversion: I think we should aim for this, but there are likely to be edge cases where that isn't practical, such as recursive structures of unbounded depth or nested resources. I think we should aim for lossless conversion in all cases possible, and make sure the exceptions are well defined and documented
Users who need strong lossless guarantees can use this proposal merely as a view of underlying data, but also keep their data in a storage format that offers strong guarantees of round-trip lossless conversion...
Grahame Grieve (Sep 04 2018 at 21:11):
references: the FHIR spec anticipates that storage forms will add an additional element to the Reference data type that contains the resolution of the reference - usually a primary key. But the exact details of what the reference is will depend on design and technology. The question is: what table does the key refer into?
Grahame Grieve (Sep 04 2018 at 21:13):
the mismatches between the FHIR data types and the SQL data types in the corner cases are difficult. I would have thought that storing the value using the native type, with a precision field as a sibling, is a better tradeoff for native storage (and justified there, where it's not justified on the wire for exchange)
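A minimal sketch of that layout, with column names assumed for illustration:

```sql
-- Native types for easy querying, with the original precision kept alongside
-- so the FHIR lexical form can be reconstructed.
CREATE TABLE observation_values (
  value_quantity       DECIMAL(18, 6),  -- native numeric, safe to compare and aggregate
  value_quantity_text  VARCHAR(64),     -- original lexical form, preserves precision
  effective_datetime   TIMESTAMP,       -- native timestamp
  effective_precision  VARCHAR(10)      -- e.g. 'year', 'month', 'day', 'second'
);
```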
Ryan Brush (Sep 05 2018 at 02:16):
I like the idea of using the native type with precision as a sibling field. I'll look into a simple approach for doing so and add it to the proposal. (Suggestions on conventions here are welcome as well.)
As for what table references join to: I (perhaps naïvely) was thinking the referent table could be determined by generating a sibling field to the reference field, e.g., patient_reference, encounter_reference, and so on. This isn't expressed in the current draft, but Patrik's comments above got me thinking down these lines. There may be considerations here I'm overlooking...so if I'm missing something obvious please let me know. :)
Patrik Sundberg (Sep 05 2018 at 03:56):
In the proto representation we use { timestamp, precision } tuples for datetime types, and that carries over into our current experimental SQL representation. Same with references: a reference of the form "Patient/123" will be represented as "{ patient_id: 123 }". But both of these approaches have drawbacks as well. For example (and to Grahame's point), this approach for references more or less assumes a single table, which isn't always appropriate.
Ryan Brush (Sep 05 2018 at 04:11):
I'd imagine we would have and reference different tables based on the resource profiles...so (for example) in US Core there would be separate tables for results and vital signs, as opposed to just one observation table. Is a table-per-profile in an Implementation Guide the right path? I would be curious if there are examples where this isn't sufficient (realizing that we'd have to flesh out this pattern to really be sure we've covered all of the corner cases.)
That's something we implied earlier but should be explicit about: these schemas are generated in the context of an Implementation Guide, and therefore would handle extensions and reference fields in a way consistent with that particular guide...
Patrik Sundberg (Sep 05 2018 at 04:17):
agreed. a lot of this makes the most sense in the context of a complete implementation guide. we should be explicit about that.
even so, linking "Observation/abc" to Vitals vs Labs vs SomethingElse isn't going to be trivial. Where/when would that be done?
Grahame Grieve (Sep 05 2018 at 04:30):
I would advise against 'a table per profile' for the simple reason that there will be many cases where observations - in particular - meet multiple profiles.
Grahame Grieve (Sep 05 2018 at 04:31):
I've seen a pattern where you have a table for each resource, and a separate table for each profile that is just a key:key pair, so you can join by profile if you want to
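A hedged sketch of that pattern, with table and column names assumed:

```sql
-- One table per resource, plus a thin key:key table per profile.
CREATE TABLE observation (
  id VARCHAR(64) PRIMARY KEY
  -- ... generated columns for the base Observation resource ...
);

CREATE TABLE uscore_vital_signs (
  observation_id VARCHAR(64) REFERENCES observation (id)
);

-- Join by profile only when you need to:
SELECT o.*
FROM   observation o
JOIN   uscore_vital_signs v ON v.observation_id = o.id;
```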
Grahame Grieve (Sep 05 2018 at 04:34):
with regard to references: I think that there are 2 different issues:
- classic polymorphic joins issue - see e.g. https://hashrocket.com/blog/posts/modeling-polymorphic-associations-in-a-relational-database
- a FHIR-specific issue around the fact that references are much broader in functionality than a relational reference within the same database
Grahame Grieve (Sep 05 2018 at 04:38):
one of the challenges here is that I feel it's unlikely that people will buy into a single answer for the first part
Ryan Brush (Sep 05 2018 at 05:38):
Fair enough, a rote table-per-profile isn't where we want to go. At the same time, I don't think a table-per-resource is sufficient in all cases, either...since we want to generate concrete fields based on extensions that exist only in some profiles to make things easier to query.
Looking closer, I'm not sure that there is an algorithmic solution to define what tables should exist and how they are linked based purely on Implementation Guides as they exist today. It almost feels like we need to build an "SQL Implementation Guide" that evolves from usage patterns to supplement the existing guides, stating which tables exist, the collection of profiles they may serve, and how they are linked together. Just speculating here, but maybe it makes sense for vitals and labs to co-exist in an Observation table...but something significantly different (like genetics extensions to observations) would live in its own table and we handle these explicitly. Such polymorphic queries might still cause headaches, but we can isolate them to where they're really needed and explicitly define how to handle them.
Grahame Grieve (Sep 05 2018 at 07:17):
polymorphism is bad enough, but what is going on with the profiles is worse :-(
Grahame Grieve (Sep 05 2018 at 07:18):
but, in fact, a lake of JSON fragments should be superior, shouldn't it, if you can run SQL directly on them - isn't that what this is about?
Ryan Brush (Sep 05 2018 at 17:57):
A main goal (from my perspective) is to make FHIR data easily used by a variety of popular data science tools. If a lake of JSON fragments satisfies that, I'm all for it...although it's not obvious how to make that easily discoverable and understandable by a set of users that generally start with SQL-based tooling. (Also, JSON doesn't lend itself to efficient analytic operations at scale.)
Given the complexity here, I wonder if our first iteration should have a less ambitious goal to create a building block for larger systems. So our first goalpost might be to:
1. For a given StructureDefinition, generate an SQL schema that can represent the corresponding records in a way that is efficient and simple to query in a variety of analytic databases, and
2. Be able to load the NDJSON files from Bulk Extract (or other sources) into tables using the above schema.
As this evolves we can consider defining what tables should exist and how to join them in the context of an IG. But even if that's not practical, I think this first goalpost of conventions around SQL schema derived from a StructureDefinition is a useful asset -- and gets some currently ongoing projects using common building blocks in this space.
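For the second goalpost, one low-tech way to get started before the generated schemas exist is to stage the NDJSON as-is and expose a view over it. A PostgreSQL-flavored sketch; the table, view, and file names are assumptions:

```sql
-- Stage each bulk-export NDJSON file into a single jsonb column, then project
-- the fields analysts need. (Escaping caveats apply when COPYing raw JSON.)
CREATE TABLE patient_raw (resource JSONB);

-- In psql:  \copy patient_raw (resource) FROM 'Patient.ndjson'

CREATE VIEW patient AS
SELECT resource ->> 'id'        AS id,
       resource ->> 'gender'    AS gender,
       resource ->> 'birthDate' AS birthdate
FROM   patient_raw;
```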
nicola (RIO/SS) (Sep 06 2018 at 13:42):
Can we discuss specific topics in issues on Ryan's repo? It will be more structured. I created a couple of issues to discuss. We represent references as an {id: 'textid', resourceType: 'Patient'} object.
nicola (RIO/SS) (Sep 06 2018 at 13:45):
Another solution for polymorphic refs is to move the resource type into the key - subject: {Patient: 'pt-id'} - and this approach can be unified with polymorphic/union elements.
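For comparison, a hedged sketch of how the two reference shapes would be queried in a JSON-capable SQL engine (PostgreSQL JSON operators; the table name is an assumption):

```sql
-- Shape 1: subject = {"id": "pt-1", "resourceType": "Patient"}
SELECT *
FROM   encounter_raw
WHERE  resource -> 'subject' ->> 'resourceType' = 'Patient'
  AND  resource -> 'subject' ->> 'id' = 'pt-1';

-- Shape 2: subject = {"Patient": "pt-1"} (the type is the key itself)
SELECT *
FROM   encounter_raw
WHERE  resource -> 'subject' ->> 'Patient' = 'pt-1';
```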
nicola (RIO/SS) (Sep 06 2018 at 13:45):
I think we can have a logical representation of FHIR resources for storage, which can be used for both JSON and native databases.
nicola (RIO/SS) (Sep 06 2018 at 13:46):
@Ryan Brush can we split your initial doc into sections and have a discussion of each section separately?
Ryan Brush (Sep 06 2018 at 15:25):
I'm fine with discussing specific considerations as GitHub issues.
Does the initial goalpost I mentioned above make sense here? If so, I'll log specific issues for topics that have come up along that path (e.g., dealing with precision with numeric and time structures, etc.)
nicola (RIO/SS) (Sep 06 2018 at 15:59):
Weekly meeting: https://meet.google.com/zvs-ucbe-ufk
Ryan Brush (Sep 29 2018 at 14:32):
Some links for reference at the Connectathon. The straw man SQL on FHIR draft is here: https://github.com/rbrush/sql-on-fhir/blob/master/sql-on-fhir.md and the tutorial for using Spark on FHIR is here: https://github.com/cerner/bunsen-tutorial
Last updated: Apr 12 2022 at 19:14 UTC