Stream: implementers
Topic: synthea on bigquery
Josh Mandel (Nov 17 2017 at 14:25):
https://bigquery.cloud.google.com/dataset/fhir-org-starter-project:synthea is a copy of the Synthea "Synthetic MA" data that I just loaded into BQ. @Eyal Oren @Patrik Sundberg @Grahame Grieve
Josh Mandel (Nov 17 2017 at 14:26):
(Code used to do the loading -- very rough, since I just put this together during the DevDays keynote... https://github.com/jmandel/synthea-to-bigquery)
Josh Mandel (Nov 17 2017 at 14:30):
(and source .ndjson files in Google Cloud Stotrage are at https://storage.googleapis.com/fhir-synthetic-ma/2017-05-24/AllergyIntolerance.ndjson.gz and so on... changing the resource.)
nicola (RIO/SS) (Nov 17 2017 at 14:32):
+1
Eyal Oren (Nov 17 2017 at 14:33):
try count(*) from synthea.Encounter or so
Eyal Oren (Nov 17 2017 at 14:33):
it's fun
Eyal Oren (Nov 17 2017 at 14:34):
or e.g., select format_timestamp("%A", period.start), count(*) from synthea.Encounter group by 1
Josh Mandel (Nov 17 2017 at 14:34):
(Patient and Observation are still loading loaded.)
Josh Mandel (Nov 17 2017 at 15:31):
@Jason Walonoski FYI -- the synthea publications made this super easy, thanks!
Grahame Grieve (Nov 17 2017 at 16:18):
awesome, thanks @Josh Mandel now we need to do a nice little tutorial
Josh Mandel (Nov 17 2017 at 23:40):
Yeah, definitely. For anyone looking to whet their appetite, here's a mildly interesting kind of query that shows off a bunch of features -- it's definitely something that would be hard to do from the FHIR REST API alone. (Definitely a toy example, but with a realistic shape to it.)
Goal: see what fraction of patients are diagnosed with hypertension, stratified by the patient's maximum observed systolic blood pressure.
Josh Mandel (Nov 17 2017 at 23:42):
You start to see pretty quickly that the synthetic data are "way too clean" for real life, with a sharp diagnostic cutoff right at the guideline-compatible boundary of 140mmHg.
Grahame Grieve (Nov 18 2017 at 04:15):
We'd like to make the Mimic dataset available - a real de-identified value set. But it's not currently available in public, and so we need to negotiate with the owners to make it so
Grahame Grieve (Nov 18 2017 at 04:19):
@Eyal Oren reading @Josh Mandel's query makes me wonder about introducing terminology service logic into BigQuery. How would we get it so that instead of
where a_coding.code='55284-4'
Grahame Grieve (Nov 18 2017 at 04:20):
we can write something like
where a_coding.code in valueset "http://my.com/ValueSet/all-blood-pressures"
Grahame Grieve (Nov 18 2017 at 04:21):
? (and I know that we've specifically been ensuring you don't need to in this case, I'm interested in the myriad other cases where we won't and can't)
Grahame Grieve (Nov 18 2017 at 04:22):
in general, I'd look to define a closure table.... see http://build.fhir.org/terminology-service.html#closure
Josh Mandel (Nov 18 2017 at 06:13):
Having a set of authoritative vocabularies to join in here will be super helpful, I agree. Definite strength of the platform is being able to join across disparate data sets including reference data and instance data. (E.g. there is a public dataset of NPI data that could be directly joined with our "private" synthea data)
Josh Mandel (Nov 18 2017 at 06:14):
I can share an example of what the join would look like with rxnorm for example, since I have that loaded in a place that I can access. It's obviously not the same text that you suggested Graham but it is going to be reasonably straightforward and does allow on the fly creation of (effectively) value sets.
Grahame Grieve (Nov 18 2017 at 06:15):
I think what you are saying is that it would be useful for a terminology service to be able to push expansions to a bigquery data set for this purpose
Josh Mandel (Nov 18 2017 at 06:45):
Yes indeed :simple_smile: There are a couple of directions to explore, including what you've described here as well as loading tables directly from source schema (e.g. for rxnorm, the rxnrel
, rxnsat
, rxnconso
tables and so on). I think each approach will have some utility.
Josh Mandel (Nov 18 2017 at 08:08):
Well, my attempt to show a cool join is a bit delayed -- looks like the synthea data has MedicationRequest
resources with references like
"medicationReference": { "reference": "" }
@Jason Walonoski is this a known issue with the data set?
Josh Mandel (Nov 18 2017 at 08:09):
(Also, I will have to update my loader to turning bundle-internal urn:
references into regular id-based references.)
Eyal Oren (Nov 18 2017 at 17:06):
agreed w/ wanting to join terminologies in, esp directly from their definition URLs.
fyi that we have rxnorm already available on bigquery to be joined in : https://cloud.google.com/bigquery/public-data/rxnorm
we can & will do the same for loinc (i think folks are already worknig on that -- @Josh Mandel you could load loinc into this same project just to demonstrate)
Eyal Oren (Nov 18 2017 at 17:07):
also agreed that a condition such as "where a_coding.code = '55284-4' " is in reality unlikely to be very useful, given the myriad of blood pressures in loinc ;-)
Grahame Grieve (Nov 18 2017 at 18:05):
so of course you should load the terminologies. But I'm talking about value sets - not the same thing
Josh Mandel (Nov 18 2017 at 18:16):
Indeed the example query was just a toy (and for the Synthea data set, a single fixed code was good enough to demonstrate the rest of the logic -- it's quite homogenous).
Josh Mandel (Nov 18 2017 at 18:18):
@Grahame Grieve definitely terminologies are different from value sets, and we need support for both. For "intensional" value sets, having access to the full terminology structure can help define/reconstitute the value set directly within an analytical query; the flexibility can be nice.
Josh Mandel (Nov 18 2017 at 18:19):
Do we have a FHIR representation of some/all value sets from the NLM Value Set Authority Center?
Josh Mandel (Nov 18 2017 at 18:20):
That'd be nice for various reasons, and we could load it into an abalytical db as well.
Grahame Grieve (Nov 18 2017 at 18:23):
I've processed all the CDA value sets; I don't know if that's all the value sets.
Grahame Grieve (Nov 18 2017 at 18:24):
It's quite a lot of code to do expansions; let's say we decide to get on the terminology servers to push the expansions up (for those that are not infinite) - how would that happen?
Josh Mandel (Nov 18 2017 at 21:18):
Does VSAC host (definitions for) any non-finite value sets? I didn't think so.
Grahame Grieve (Nov 19 2017 at 04:46):
no VSAC doesn't. But FHIR has several of them - and it's because VSAC can't, not because there's no need
Josh Mandel (Nov 19 2017 at 19:44):
@Jason Walonoski one other note on the data: looks like Immunizations don't have any id
. Is this expected?
Josh Mandel (Nov 20 2017 at 13:37):
Cool -- so using the public rxnorm data set, a join with clinical synthea data might ask: "How many synthea meds contained hydrocodone as an ingredient?" Knowing that hydrocodone is RxNorm concept 5489
, the query can be done with:
SELECT count(*) FROM `fhir-org-starter-project.synthea.Medication`meds, unnest(meds.code.coding) med_coding join `bigquery-public-data.nlm_rxnorm.rxn_all_pathways_current` pathways on pathways.source_rxcui=cast(med_coding.code as string) where med_coding.system="http://www.nlm.nih.gov/research/umls/rxnorm" and pathways.target_rxcui="5489"
... this matches any single-ingredient drug, multi-ingredient drug, brand name, generic, pack, dose form, etc, that contains hydrocodone as an ingredient. It's otherwise quite challenging to do this kind of thing without pre-processing.
Josh Mandel (Nov 20 2017 at 23:01):
And now for an example query joining LOINC to clinical data! (This isn't an officially hosted copy of LOINC, but I've added it to our sample project just to illustrate the point.)
See how often deprecated, discouraged, or trial LOINC terms are used in the data set:
#standardSQL select loinc.status, count(*) as count from synthea.Observation, unnest(code.coding) observation_coding join `loinc_test.loinc` loinc on loinc.loinc_num=observation_coding.code where observation_coding.system = "http://loinc.org" group by loinc.status order by count
status | count |
---|---|
TRIAL | 202426 |
DISCOURAGED | 7037333 |
ACTIVE | 50339669 |
Josh Mandel (Nov 21 2017 at 01:45):
BTW, I re-loaded the synthea data with:
- Proper references now enabled across resources
- An updated schema that includes descriptions for all fields (drawn from the FHIR spec)
This approach avoids the auto-guessed schema that (wrongly) concluded thing like "Observation.coding.code
is an integer" (just 'cause only SNOMED codes were used in the data set, and those happen to be integers).
Grahame Grieve (Nov 21 2017 at 02:02):
thanks
nicola (RIO/SS) (Nov 21 2017 at 05:47):
we do terminology in postgresql by introducing Concept resource- this simplify valuesets queries
Michael Lawley (Nov 21 2017 at 09:00):
Is there some alternative to XMLHttpRequest the can be used in BigQuery's UDFs?
Josh Mandel (Nov 21 2017 at 13:28):
@Michael Lawley Do you mean for fetching external data from within a udf? I don't think so, from https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions . You'd need a js interpreter to expose a special synchronous http request function, but udfs run in an environment with only (a subset of) standards js built-ins. What's the use case you have in mind?
Jason Walonoski (Nov 21 2017 at 14:14):
Hmm... seems like you found some interesting things with the Synthea data, as well as some issues.
Did you load the STU3 or DSTU2 data? I'm assuming STU3.
Immunizations were not assigned IDs because they aren't referenced by other resources in the Bundle. It is assumed the FHIR server will assign the ID during the transaction.
At the time the data was generated, we had MedicationRequest reference Medication resources. We later changed this just to have medicationCodeableConcept instead (Medication doesn't add any value here, since we're just using RxNorm codes). Apparently we did not regenerate the data to include this change. At any rate, the medicationReferences should point to a Medication by "urn:uuid:#{resource_id}"
(that is Ruby syntax). Not sure why they are empty. But this is fixed in the latest version of Synthea.
Josh Mandel (Nov 21 2017 at 16:32):
Thanks @Jason Walonoski! I just downloaded the published data set, using the most recent version (Version 2, which uses STU3) listed at https://syntheticmass.mitre.org/download.html. Details here. Is there a newer release I should be using?
Jason Walonoski (Nov 21 2017 at 18:38):
@Josh Mandel Thanks for posting your work on this -- it is very interesting, and I always like to see how others use the data. The link to Version 2 is the latest release of the data, so that is good. The latest release of the code has fixed those problems. So, we'll need to update the data. TBD on timeline for that.
Michael Lawley (Nov 22 2017 at 01:47):
I'm looking for a way to connect up to a FHIR terminology server
Grahame Grieve (Nov 22 2017 at 01:51):
todo what?
Michael Lawley (Nov 22 2017 at 01:52):
Clearly there are performance concerns about REST calls during query evaluation. What would work is being able to populate a table as a result of REST call. In particular I'm thinking about the results of $closure. i.e., I'm hoping to be able to perform code:below type queries
Grahame Grieve (Nov 22 2017 at 02:10):
well, Josh is effectively saying that if you upload the code system directly to the bigquery repository, you can do those things in sql.
Grahame Grieve (Nov 22 2017 at 02:10):
unless the semantics of the code systems are as complicated as SNOMED Ct
Grahame Grieve (Nov 22 2017 at 02:11):
in which case the most effective way would be to generate the entire closure table and upload it once
Josh Mandel (Nov 22 2017 at 02:36):
I'm indeed saying what Grahame says I'm saying :simple_smile: It's very useful to load closure tables especially for these recursive structures since BQ doesn't have "connect by" or "with recursive".
Grahame Grieve (Nov 22 2017 at 02:40):
and this works for some cases. Michael and I are interested in the rest of the cases where the value set machinary brings real value
Grahame Grieve (Nov 22 2017 at 02:40):
though code:below is not one of those cases
Josh Mandel (Nov 22 2017 at 02:49):
Can you share some examples of where the value set machinery helps (and especially if helps in places where a closure table can't stand in for it)?
Grahame Grieve (Nov 22 2017 at 03:17):
so there's several things a txserver - with logic - does over this:
- in a value set, you make an arbitrary list of codes that are part of the value set that aren't purely derived from the semantics of the code system. So you really want the tx server to be able to push the final list up so you can join across them
- but there's some value sets where the list of codes is not finite (ones with grammar). So in this case, generating the join table (loosely called a closure table so far) is a shared responsibility - the data source says 'here's the codes I have' and the terminology server says 'ok, here's the table for this set of codes'
- the value set can cross code systems (makes people's skin crawl, but it happens)
Josh Mandel (Nov 22 2017 at 03:45):
Most of the VSAC value sets cross code systems, fwiw
Josh Mandel (Nov 22 2017 at 03:46):
I'm not sure I understand the interaction you're describing with respect to "join tables" - - what does the resulting table say/mean/include?
Josh Mandel (Nov 22 2017 at 03:47):
For value sets that are finite: pushing these up into an analytics environment seems very natural.
Grahame Grieve (Nov 22 2017 at 03:50):
yes. I think that's the first of the 2 use cases here: post the value set to the Tx Server, and it pushes the expansion to the analytics store
Grahame Grieve (Nov 22 2017 at 03:50):
then you can join using it
Grahame Grieve (Nov 22 2017 at 03:51):
the infinite value sets need an additional step where the TS knows the set of codes in the join space
Michael Lawley (Nov 22 2017 at 06:02):
Yep. And of course you can do most of this on a code system basis with SQL stuff. But generally that means you need to invest in learning about the table structure for every code system you're interested in. And, of course, there's the whole SNOMED thing with (to a lesser extent) post coordination, but more common (IMO) valuesets defined with ECL.
Then there's ConceptMaps, but they're a problem at the moment as there's only the code-by-code $translate rather than an analogue to $expand (or $closure) for bulk-translation
Grahame Grieve (Nov 22 2017 at 07:39):
well, when would you need an operation, as opposed to joining across a concept map content?
Michael Lawley (Nov 22 2017 at 11:14):
We have some ConceptMaps that are algorithmic - effectively infinite - so you can't grab the content, store it in a table and then do a join with your data. It's a similar scenario to the $closure use-case
Grahame Grieve (Nov 22 2017 at 14:18):
what's an example?
Michael Lawley (Nov 23 2017 at 01:35):
A concept map that looks at a code's display in order to find an appropriate match - we have a range of algorithms for this and expose them as different ConceptMaps. The usual use-case is mapping a large number of codes at once
Grahame Grieve (Nov 23 2017 at 01:38):
what does the concept map look like?
Michael Lawley (Nov 23 2017 at 02:22):
It's implicitly defined and doesn't have an extension, only a URI that you pass to $translate (although the missing parameter to $translate that allows this still hasn't been added to 3.1.0 - GF#13509)
Zoran Milosevic (Aug 31 2018 at 12:08):
We'd like to make the Mimic dataset available - a real de-identified value set. But it's not currently available in public, and so we need to negotiate with the owners to make it so
Has there been any progress in mapping Mimic to FHIR directly (but not via OMOP)? That is, anyone knows anyone doing this? :-)
Cinyoung Hur (Oct 20 2018 at 11:15):
@Zoran Milosevic Hi, I'm interested in mapping MIMIC 3.0 to FHIR, and recently started to work on it. https://github.com/linewalks/mimic-fhir
Last updated: Apr 12 2022 at 19:14 UTC