FHIR Chat · synthea on bigquery · implementers

Stream: implementers

Topic: synthea on bigquery


view this post on Zulip Josh Mandel (Nov 17 2017 at 14:25):

https://bigquery.cloud.google.com/dataset/fhir-org-starter-project:synthea is a copy of the Synthea "Synthetic MA" data that I just loaded into BQ. @Eyal Oren @Patrik Sundberg @Grahame Grieve

view this post on Zulip Josh Mandel (Nov 17 2017 at 14:26):

(Code used to do the loading -- very rough, since I just put this together during the DevDays keynote... https://github.com/jmandel/synthea-to-bigquery)

view this post on Zulip Josh Mandel (Nov 17 2017 at 14:30):

(and source .ndjson files in Google Cloud Stotrage are at https://storage.googleapis.com/fhir-synthetic-ma/2017-05-24/AllergyIntolerance.ndjson.gz and so on... changing the resource.)

view this post on Zulip nicola (RIO/SS) (Nov 17 2017 at 14:32):

+1

view this post on Zulip Eyal Oren (Nov 17 2017 at 14:33):

try count(*) from synthea.Encounter or so

view this post on Zulip Eyal Oren (Nov 17 2017 at 14:33):

it's fun

view this post on Zulip Eyal Oren (Nov 17 2017 at 14:34):

or e.g., select format_timestamp("%A", period.start), count(*) from synthea.Encounter group by 1

view this post on Zulip Josh Mandel (Nov 17 2017 at 14:34):

(Patient and Observation are still loading loaded.)

view this post on Zulip Josh Mandel (Nov 17 2017 at 15:31):

@Jason Walonoski FYI -- the synthea publications made this super easy, thanks!

view this post on Zulip Grahame Grieve (Nov 17 2017 at 16:18):

awesome, thanks @Josh Mandel now we need to do a nice little tutorial

view this post on Zulip Josh Mandel (Nov 17 2017 at 23:40):

Yeah, definitely. For anyone looking to whet their appetite, here's a mildly interesting kind of query that shows off a bunch of features -- it's definitely something that would be hard to do from the FHIR REST API alone. (Definitely a toy example, but with a realistic shape to it.)

Goal: see what fraction of patients are diagnosed with hypertension, stratified by the patient's maximum observed systolic blood pressure.

view this post on Zulip Josh Mandel (Nov 17 2017 at 23:42):

You start to see pretty quickly that the synthetic data are "way too clean" for real life, with a sharp diagnostic cutoff right at the guideline-compatible boundary of 140mmHg.

view this post on Zulip Grahame Grieve (Nov 18 2017 at 04:15):

We'd like to make the Mimic dataset available - a real de-identified value set. But it's not currently available in public, and so we need to negotiate with the owners to make it so

view this post on Zulip Grahame Grieve (Nov 18 2017 at 04:19):

@Eyal Oren reading @Josh Mandel's query makes me wonder about introducing terminology service logic into BigQuery. How would we get it so that instead of

where a_coding.code='55284-4'

view this post on Zulip Grahame Grieve (Nov 18 2017 at 04:20):

we can write something like

where a_coding.code in valueset "http://my.com/ValueSet/all-blood-pressures"

view this post on Zulip Grahame Grieve (Nov 18 2017 at 04:21):

? (and I know that we've specifically been ensuring you don't need to in this case, I'm interested in the myriad other cases where we won't and can't)

view this post on Zulip Grahame Grieve (Nov 18 2017 at 04:22):

in general, I'd look to define a closure table.... see http://build.fhir.org/terminology-service.html#closure

view this post on Zulip Josh Mandel (Nov 18 2017 at 06:13):

Having a set of authoritative vocabularies to join in here will be super helpful, I agree. Definite strength of the platform is being able to join across disparate data sets including reference data and instance data. (E.g. there is a public dataset of NPI data that could be directly joined with our "private" synthea data)

view this post on Zulip Josh Mandel (Nov 18 2017 at 06:14):

I can share an example of what the join would look like with rxnorm for example, since I have that loaded in a place that I can access. It's obviously not the same text that you suggested Graham but it is going to be reasonably straightforward and does allow on the fly creation of (effectively) value sets.

view this post on Zulip Grahame Grieve (Nov 18 2017 at 06:15):

I think what you are saying is that it would be useful for a terminology service to be able to push expansions to a bigquery data set for this purpose

view this post on Zulip Josh Mandel (Nov 18 2017 at 06:45):

Yes indeed :simple_smile: There are a couple of directions to explore, including what you've described here as well as loading tables directly from source schema (e.g. for rxnorm, the rxnrel, rxnsat, rxnconso tables and so on). I think each approach will have some utility.

view this post on Zulip Josh Mandel (Nov 18 2017 at 08:08):

Well, my attempt to show a cool join is a bit delayed -- looks like the synthea data has MedicationRequest resources with references like

"medicationReference": {
  "reference": ""
}

@Jason Walonoski is this a known issue with the data set?

view this post on Zulip Josh Mandel (Nov 18 2017 at 08:09):

(Also, I will have to update my loader to turning bundle-internal urn: references into regular id-based references.)

view this post on Zulip Eyal Oren (Nov 18 2017 at 17:06):

agreed w/ wanting to join terminologies in, esp directly from their definition URLs.
fyi that we have rxnorm already available on bigquery to be joined in : https://cloud.google.com/bigquery/public-data/rxnorm
we can & will do the same for loinc (i think folks are already worknig on that -- @Josh Mandel you could load loinc into this same project just to demonstrate)

view this post on Zulip Eyal Oren (Nov 18 2017 at 17:07):

also agreed that a condition such as "where a_coding.code = '55284-4' " is in reality unlikely to be very useful, given the myriad of blood pressures in loinc ;-)

view this post on Zulip Grahame Grieve (Nov 18 2017 at 18:05):

so of course you should load the terminologies. But I'm talking about value sets - not the same thing

view this post on Zulip Josh Mandel (Nov 18 2017 at 18:16):

Indeed the example query was just a toy (and for the Synthea data set, a single fixed code was good enough to demonstrate the rest of the logic -- it's quite homogenous).

view this post on Zulip Josh Mandel (Nov 18 2017 at 18:18):

@Grahame Grieve definitely terminologies are different from value sets, and we need support for both. For "intensional" value sets, having access to the full terminology structure can help define/reconstitute the value set directly within an analytical query; the flexibility can be nice.

view this post on Zulip Josh Mandel (Nov 18 2017 at 18:19):

Do we have a FHIR representation of some/all value sets from the NLM Value Set Authority Center?

view this post on Zulip Josh Mandel (Nov 18 2017 at 18:20):

That'd be nice for various reasons, and we could load it into an abalytical db as well.

view this post on Zulip Grahame Grieve (Nov 18 2017 at 18:23):

I've processed all the CDA value sets; I don't know if that's all the value sets.

view this post on Zulip Grahame Grieve (Nov 18 2017 at 18:24):

It's quite a lot of code to do expansions; let's say we decide to get on the terminology servers to push the expansions up (for those that are not infinite) - how would that happen?

view this post on Zulip Josh Mandel (Nov 18 2017 at 21:18):

Does VSAC host (definitions for) any non-finite value sets? I didn't think so.

view this post on Zulip Grahame Grieve (Nov 19 2017 at 04:46):

no VSAC doesn't. But FHIR has several of them - and it's because VSAC can't, not because there's no need

view this post on Zulip Josh Mandel (Nov 19 2017 at 19:44):

@Jason Walonoski one other note on the data: looks like Immunizations don't have any id. Is this expected?

view this post on Zulip Josh Mandel (Nov 20 2017 at 13:37):

Cool -- so using the public rxnorm data set, a join with clinical synthea data might ask: "How many synthea meds contained hydrocodone as an ingredient?" Knowing that hydrocodone is RxNorm concept 5489, the query can be done with:

SELECT count(*) FROM
  `fhir-org-starter-project.synthea.Medication`meds, unnest(meds.code.coding)  med_coding
   join `bigquery-public-data.nlm_rxnorm.rxn_all_pathways_current` pathways
     on pathways.source_rxcui=cast(med_coding.code as string)
where
  med_coding.system="http://www.nlm.nih.gov/research/umls/rxnorm" and
  pathways.target_rxcui="5489"

... this matches any single-ingredient drug, multi-ingredient drug, brand name, generic, pack, dose form, etc, that contains hydrocodone as an ingredient. It's otherwise quite challenging to do this kind of thing without pre-processing.

view this post on Zulip Josh Mandel (Nov 20 2017 at 23:01):

And now for an example query joining LOINC to clinical data! (This isn't an officially hosted copy of LOINC, but I've added it to our sample project just to illustrate the point.)

See how often deprecated, discouraged, or trial LOINC terms are used in the data set:

#standardSQL
select loinc.status, count(*) as count
from
  synthea.Observation,
  unnest(code.coding) observation_coding
  join `loinc_test.loinc` loinc on loinc.loinc_num=observation_coding.code
where
  observation_coding.system = "http://loinc.org"
group by loinc.status
order by count
status count
TRIAL 202426
DISCOURAGED 7037333
ACTIVE 50339669

view this post on Zulip Josh Mandel (Nov 21 2017 at 01:45):

BTW, I re-loaded the synthea data with:

  • Proper references now enabled across resources
  • An updated schema that includes descriptions for all fields (drawn from the FHIR spec)

This approach avoids the auto-guessed schema that (wrongly) concluded thing like "Observation.coding.code is an integer" (just 'cause only SNOMED codes were used in the data set, and those happen to be integers).

view this post on Zulip Grahame Grieve (Nov 21 2017 at 02:02):

thanks

view this post on Zulip nicola (RIO/SS) (Nov 21 2017 at 05:47):

we do terminology in postgresql by introducing Concept resource- this simplify valuesets queries

view this post on Zulip Michael Lawley (Nov 21 2017 at 09:00):

Is there some alternative to XMLHttpRequest the can be used in BigQuery's UDFs?

view this post on Zulip Josh Mandel (Nov 21 2017 at 13:28):

@Michael Lawley Do you mean for fetching external data from within a udf? I don't think so, from https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions . You'd need a js interpreter to expose a special synchronous http request function, but udfs run in an environment with only (a subset of) standards js built-ins. What's the use case you have in mind?

view this post on Zulip Jason Walonoski (Nov 21 2017 at 14:14):

Hmm... seems like you found some interesting things with the Synthea data, as well as some issues.

Did you load the STU3 or DSTU2 data? I'm assuming STU3.

Immunizations were not assigned IDs because they aren't referenced by other resources in the Bundle. It is assumed the FHIR server will assign the ID during the transaction.

At the time the data was generated, we had MedicationRequest reference Medication resources. We later changed this just to have medicationCodeableConcept instead (Medication doesn't add any value here, since we're just using RxNorm codes). Apparently we did not regenerate the data to include this change. At any rate, the medicationReferences should point to a Medication by "urn:uuid:#{resource_id}" (that is Ruby syntax). Not sure why they are empty. But this is fixed in the latest version of Synthea.

view this post on Zulip Josh Mandel (Nov 21 2017 at 16:32):

Thanks @Jason Walonoski! I just downloaded the published data set, using the most recent version (Version 2, which uses STU3) listed at https://syntheticmass.mitre.org/download.html. Details here. Is there a newer release I should be using?

view this post on Zulip Jason Walonoski (Nov 21 2017 at 18:38):

@Josh Mandel Thanks for posting your work on this -- it is very interesting, and I always like to see how others use the data. The link to Version 2 is the latest release of the data, so that is good. The latest release of the code has fixed those problems. So, we'll need to update the data. TBD on timeline for that.

view this post on Zulip Michael Lawley (Nov 22 2017 at 01:47):

I'm looking for a way to connect up to a FHIR terminology server

view this post on Zulip Grahame Grieve (Nov 22 2017 at 01:51):

todo what?

view this post on Zulip Michael Lawley (Nov 22 2017 at 01:52):

Clearly there are performance concerns about REST calls during query evaluation. What would work is being able to populate a table as a result of REST call. In particular I'm thinking about the results of $closure. i.e., I'm hoping to be able to perform code:below type queries

view this post on Zulip Grahame Grieve (Nov 22 2017 at 02:10):

well, Josh is effectively saying that if you upload the code system directly to the bigquery repository, you can do those things in sql.

view this post on Zulip Grahame Grieve (Nov 22 2017 at 02:10):

unless the semantics of the code systems are as complicated as SNOMED Ct

view this post on Zulip Grahame Grieve (Nov 22 2017 at 02:11):

in which case the most effective way would be to generate the entire closure table and upload it once

view this post on Zulip Josh Mandel (Nov 22 2017 at 02:36):

I'm indeed saying what Grahame says I'm saying :simple_smile: It's very useful to load closure tables especially for these recursive structures since BQ doesn't have "connect by" or "with recursive".

view this post on Zulip Grahame Grieve (Nov 22 2017 at 02:40):

and this works for some cases. Michael and I are interested in the rest of the cases where the value set machinary brings real value

view this post on Zulip Grahame Grieve (Nov 22 2017 at 02:40):

though code:below is not one of those cases

view this post on Zulip Josh Mandel (Nov 22 2017 at 02:49):

Can you share some examples of where the value set machinery helps (and especially if helps in places where a closure table can't stand in for it)?

view this post on Zulip Grahame Grieve (Nov 22 2017 at 03:17):

so there's several things a txserver - with logic - does over this:
- in a value set, you make an arbitrary list of codes that are part of the value set that aren't purely derived from the semantics of the code system. So you really want the tx server to be able to push the final list up so you can join across them
- but there's some value sets where the list of codes is not finite (ones with grammar). So in this case, generating the join table (loosely called a closure table so far) is a shared responsibility - the data source says 'here's the codes I have' and the terminology server says 'ok, here's the table for this set of codes'
- the value set can cross code systems (makes people's skin crawl, but it happens)

view this post on Zulip Josh Mandel (Nov 22 2017 at 03:45):

Most of the VSAC value sets cross code systems, fwiw

view this post on Zulip Josh Mandel (Nov 22 2017 at 03:46):

I'm not sure I understand the interaction you're describing with respect to "join tables" - - what does the resulting table say/mean/include?

view this post on Zulip Josh Mandel (Nov 22 2017 at 03:47):

For value sets that are finite: pushing these up into an analytics environment seems very natural.

view this post on Zulip Grahame Grieve (Nov 22 2017 at 03:50):

yes. I think that's the first of the 2 use cases here: post the value set to the Tx Server, and it pushes the expansion to the analytics store

view this post on Zulip Grahame Grieve (Nov 22 2017 at 03:50):

then you can join using it

view this post on Zulip Grahame Grieve (Nov 22 2017 at 03:51):

the infinite value sets need an additional step where the TS knows the set of codes in the join space

view this post on Zulip Michael Lawley (Nov 22 2017 at 06:02):

Yep. And of course you can do most of this on a code system basis with SQL stuff. But generally that means you need to invest in learning about the table structure for every code system you're interested in. And, of course, there's the whole SNOMED thing with (to a lesser extent) post coordination, but more common (IMO) valuesets defined with ECL.

Then there's ConceptMaps, but they're a problem at the moment as there's only the code-by-code $translate rather than an analogue to $expand (or $closure) for bulk-translation

view this post on Zulip Grahame Grieve (Nov 22 2017 at 07:39):

well, when would you need an operation, as opposed to joining across a concept map content?

view this post on Zulip Michael Lawley (Nov 22 2017 at 11:14):

We have some ConceptMaps that are algorithmic - effectively infinite - so you can't grab the content, store it in a table and then do a join with your data. It's a similar scenario to the $closure use-case

view this post on Zulip Grahame Grieve (Nov 22 2017 at 14:18):

what's an example?

view this post on Zulip Michael Lawley (Nov 23 2017 at 01:35):

A concept map that looks at a code's display in order to find an appropriate match - we have a range of algorithms for this and expose them as different ConceptMaps. The usual use-case is mapping a large number of codes at once

view this post on Zulip Grahame Grieve (Nov 23 2017 at 01:38):

what does the concept map look like?

view this post on Zulip Michael Lawley (Nov 23 2017 at 02:22):

It's implicitly defined and doesn't have an extension, only a URI that you pass to $translate (although the missing parameter to $translate that allows this still hasn't been added to 3.1.0 - GF#13509)

view this post on Zulip Zoran Milosevic (Aug 31 2018 at 12:08):

We'd like to make the Mimic dataset available - a real de-identified value set. But it's not currently available in public, and so we need to negotiate with the owners to make it so

Has there been any progress in mapping Mimic to FHIR directly (but not via OMOP)? That is, anyone knows anyone doing this? :-)

view this post on Zulip Cinyoung Hur (Oct 20 2018 at 11:15):

@Zoran Milosevic Hi, I'm interested in mapping MIMIC 3.0 to FHIR, and recently started to work on it. https://github.com/linewalks/mimic-fhir


Last updated: Apr 12 2022 at 19:14 UTC