Stream: bulk data
Topic: Sample data
Toby Hu (Apr 26 2019 at 20:38):
For participants in the Bulk Data and Analytics track at the Montreal Connectathon, a Google Cloud project has been created to host a small and a large FHIR demo dataset. The data was generated using Synthea (https://github.com/synthetichealth/synthea), an open-source project under the Apache License, Version 2.0, owned by MITRE Corporation. To access the datasets, please join the access group https://groups.google.com/forum/#!forum/gcp-fhir-demo-dataset-readonly. You can then find the data files at https://console.cloud.google.com/storage/browser/gcp-fhir-demo-dataset-synthea.
Josh Mandel (Apr 29 2019 at 18:24):
Thanks @Toby Hu ! For anyone who wants to access this content (open to public) via Azure:
Files are available via https://synthea2019.blob.core.windows.net/synthea-may-2019 + path, as in https://synthea2019.blob.core.windows.net/synthea-may-2019/r4-small/Condition.ndjson
/
├── /r4
│   ├── /r4/AllergyIntolerance.ndjson
│   ├── /r4/CarePlan.ndjson
│   ├── /r4/Claim.ndjson
│   ├── /r4/Condition.ndjson
│   ├── /r4/DiagnosticReport.ndjson
│   ├── /r4/Encounter.ndjson
│   ├── /r4/ExplanationOfBenefit.ndjson
│   ├── /r4/Goal.ndjson
│   ├── /r4/ImagingStudy.ndjson
│   ├── /r4/Immunization.ndjson
│   ├── /r4/MedicationRequest.ndjson
│   ├── /r4/Observation.ndjson
│   ├── /r4/Organization.ndjson
│   ├── /r4/Patient.ndjson
│   ├── /r4/Practitioner.ndjson
│   ├── /r4/Procedure.ndjson
│   ├── /r4/hospitalInformation1555536562902.json
│   └── /r4/practitionerInformation1555536562902.json
├── /r4-small
│   ├── /r4-small/AllergyIntolerance.ndjson
│   ├── /r4-small/CarePlan.ndjson
│   ├── /r4-small/Claim.ndjson
│   ├── /r4-small/Condition.ndjson
│   ├── /r4-small/DiagnosticReport.ndjson
│   ├── /r4-small/Encounter.ndjson
│   ├── /r4-small/ExplanationOfBenefit.ndjson
│   ├── /r4-small/Goal.ndjson
│   ├── /r4-small/ImagingStudy.ndjson
│   ├── /r4-small/Immunization.ndjson
│   ├── /r4-small/MedicationRequest.ndjson
│   ├── /r4-small/Observation.ndjson
│   ├── /r4-small/Organization.ndjson
│   ├── /r4-small/Patient.ndjson
│   ├── /r4-small/Practitioner.ndjson
│   ├── /r4-small/Procedure.ndjson
│   ├── /r4-small/hospitalInformation1555535281086.json
│   └── /r4-small/practitionerInformation1555535281086.json
├── /stu3
│   ├── /stu3/AllergyIntolerance.ndjson
│   ├── /stu3/CarePlan.ndjson
│   ├── /stu3/Claim.ndjson
│   ├── /stu3/Condition.ndjson
│   ├── /stu3/DiagnosticReport.ndjson
│   ├── /stu3/Encounter.ndjson
│   ├── /stu3/ExplanationOfBenefit.ndjson
│   ├── /stu3/Goal.ndjson
│   ├── /stu3/ImagingStudy.ndjson
│   ├── /stu3/Immunization.ndjson
│   ├── /stu3/MedicationRequest.ndjson
│   ├── /stu3/Observation.ndjson
│   ├── /stu3/Organization.ndjson
│   ├── /stu3/Patient.ndjson
│   ├── /stu3/Practitioner.ndjson
│   └── /stu3/Procedure.ndjson
└── /stu3-small
    ├── /stu3-small/AllergyIntolerance.ndjson
    ├── /stu3-small/CarePlan.ndjson
    ├── /stu3-small/Claim.ndjson
    ├── /stu3-small/Condition.ndjson
    ├── /stu3-small/DiagnosticReport.ndjson
    ├── /stu3-small/Encounter.ndjson
    ├── /stu3-small/ExplanationOfBenefit.ndjson
    ├── /stu3-small/Goal.ndjson
    ├── /stu3-small/ImagingStudy.ndjson
    ├── /stu3-small/Immunization.ndjson
    ├── /stu3-small/MedicationRequest.ndjson
    ├── /stu3-small/Observation.ndjson
    ├── /stu3-small/Organization.ndjson
    ├── /stu3-small/Patient.ndjson
    ├── /stu3-small/Practitioner.ndjson
    └── /stu3-small/Procedure.ndjson

4 directories, 68 files
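The "base URL + path" pattern above can be sketched in Python as follows. This is only an illustration, not something posted in the thread: the helper names (`blob_url`, `parse_ndjson`, `fetch_resources`) are hypothetical, and the sketch assumes the container is still publicly readable and that each `.ndjson` file holds one JSON resource per line.

```python
import json
from typing import Any, Dict, List
from urllib.request import urlopen

# Base URL taken from the message above; everything else is illustrative.
BASE_URL = "https://synthea2019.blob.core.windows.net/synthea-may-2019"


def blob_url(dataset: str, filename: str) -> str:
    """Join the base URL with a dataset directory (e.g. 'r4-small') and a file name."""
    return f"{BASE_URL}/{dataset}/{filename}"


def parse_ndjson(text: str) -> List[Dict[str, Any]]:
    """Parse newline-delimited JSON, skipping blank lines."""
    # Split on "\n" only: str.splitlines() would also break on characters
    # such as U+2028, which may legally appear inside JSON strings.
    return [json.loads(line) for line in text.split("\n") if line.strip()]


def fetch_resources(dataset: str, filename: str) -> List[Dict[str, Any]]:
    """Download one file and parse it (reads the whole file into memory)."""
    with urlopen(blob_url(dataset, filename)) as resp:
        return parse_ndjson(resp.read().decode("utf-8"))
```

For example, `fetch_resources("r4-small", "Condition.ndjson")` would request the URL given as the example above. For the larger `r4` and `stu3` files, reading line by line is probably a better fit than loading everything at once.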
Jie Fan (May 04 2019 at 14:05):
There is an issue with the synthetic dataset (references are invalid). We've cleaned most of the data and are trying to export it to BigQuery. I'll post an update once that is done.
Jason Walonoski (May 04 2019 at 14:12):
Are the invalid references from Synthea itself, or are they a result of some post-processing or repackaging? If the former, please file a Synthea issue so I can fix it. Thanks.
Jie Fan (May 04 2019 at 14:13):
It's from Synthea. I'll file a bug in a minute.
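The thread does not say exactly how the references were invalid, so the following is only a rough sketch of one way to look for references that do not resolve within a set of parsed resources. It assumes a reference is valid only if it has the relative form `ResourceType/id` and points at a resource in the same set; the function names are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def iter_references(node: Any) -> Iterator[str]:
    """Yield every string stored under a 'reference' key, at any nesting depth."""
    if isinstance(node, dict):
        ref = node.get("reference")
        if isinstance(ref, str):
            yield ref
        for value in node.values():
            yield from iter_references(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_references(item)


def find_dangling_references(
    resources: Iterable[Dict[str, Any]],
) -> List[Tuple[str, str]]:
    """Return (source, reference) pairs whose target is not in the given set.

    Assumption: targets are addressed as 'ResourceType/id'. Any other form
    (for example 'urn:uuid:...') is reported as dangling.
    """
    resources = list(resources)
    known = {
        f"{r['resourceType']}/{r['id']}"
        for r in resources
        if "resourceType" in r and "id" in r
    }
    dangling = []
    for r in resources:
        source = f"{r.get('resourceType')}/{r.get('id')}"
        for ref in iter_references(r):
            if ref not in known:
                dangling.append((source, ref))
    return dangling
```

To check a whole dataset, the resources from all of its `.ndjson` files would need to be loaded into one list first, since a Condition typically points at a Patient or Encounter stored in a different file.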
Nik Klassen (May 04 2019 at 14:41):
The sample data can also be queried directly in BigQuery at https://console.cloud.google.com/bigquery?project=gcp-fhir-demo-dataset (accessible once the group has been joined)
Josh Mandel (May 04 2019 at 14:42):
(Do these sample data in bq currently have the same bug that @Jie Fan mentioned ?)
Jie Fan (May 04 2019 at 14:43):
No, but we forgot to export the data in the analytics schema. @Benard Ebinu is re-exporting the data.
Jie Fan (May 04 2019 at 14:44):
@Benard Ebinu will post an update once that's done.
Jie Fan (May 04 2019 at 15:01):
FYI, the cleaned stu3 small dataset is here: https://pantheon.corp.google.com/storage/browser/gcp-fhir-demo-dataset-synthea/stu3-small-cleaned. Let me know if you encounter any issues using it.
Jie Fan (May 04 2019 at 15:03):
(Please join https://groups.google.com/forum/#!forum/gcp-fhir-demo-dataset-readonly to use the dataset and access BigQuery for data analytics)
Brian Wright (May 04 2019 at 15:16):
I am getting a Google SSO login prompt when trying to access this.
Brian Wright (May 04 2019 at 15:16):
Specifically this link: https://pantheon.corp.google.com/storage/browser/gcp-fhir-demo-dataset-synthea/stu3-small-cleaned
Dan Gottlieb (May 04 2019 at 15:21):
Yep - same issue. @Jie Fan can you open it to non-Google folks?
Nik Klassen (May 04 2019 at 15:24):
That link should be https://console.cloud.google.com/storage/browser/gcp-fhir-demo-dataset-synthea/stu3-small-cleaned
Jie Fan (May 04 2019 at 15:25):
Thanks Nik, this should be the correct link.
Dan Gottlieb (May 04 2019 at 15:29):
Is the cleaned data also accessible through https://console.cloud.google.com/bigquery?project=gcp-fhir-demo-dataset or is this the previous version?
Benard Ebinu (May 04 2019 at 15:36):
The new cleaned data is now accessible through https://console.cloud.google.com/bigquery?project=gcp-fhir-demo-dataset under dataset stu3_small
Jie Fan (May 04 2019 at 15:40):
Here is the reference for BigQuery standard SQL: https://cloud.google.com/bigquery/docs/reference/standard-sql/
Jie Fan (May 04 2019 at 15:40):
Let one of us know if you have issues querying the data :)
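As a starting point for querying, here is a minimal sketch using the `google-cloud-bigquery` Python client. Several things are assumptions rather than facts from this thread: the table name `Patient`, the helper names, and the idea that query jobs run in (and are billed to) your own Google Cloud project while reading from `gcp-fhir-demo-dataset`. It also assumes you have joined the access group and have application default credentials configured.

```python
from typing import Any, Dict, List


def build_count_query(
    table: str,
    project: str = "gcp-fhir-demo-dataset",
    dataset: str = "stu3_small",
) -> str:
    """Build a standard SQL query that counts the rows of one table."""
    # The table name is supplied by the caller; 'Patient' is only a guess at
    # how the exported tables are named.
    return f"SELECT COUNT(*) AS n FROM `{project}.{dataset}.{table}`"


def run_query(sql: str, billing_project: str) -> List[Dict[str, Any]]:
    """Run a query and return the rows as plain dictionaries."""
    # Imported here so the query builder works without the client installed
    # (pip install google-cloud-bigquery).
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    return [dict(row.items()) for row in client.query(sql).result()]
```

Usage would look like `run_query(build_count_query("Patient"), billing_project="my-own-project")`, where `my-own-project` is a placeholder for a project you control.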
Jie Fan (May 04 2019 at 15:59):
There is an issue where the resource ids are omitted from the BigQuery tables. We are fixing it.
Josh Mandel (May 04 2019 at 16:08):
I also updated https://gist.github.com/jmandel/fd9683f11c9bc3eeb2316f017c35ddac with links to the "fixed" files in Azure.
Jason Walonoski (May 04 2019 at 16:11):
Posted bug report: https://github.com/synthetichealth/synthea/issues/513
Jason Walonoski (May 04 2019 at 19:44):
I fixed this bug in Synthea proper. Thanks for finding it, @Jie Fan.
Jie Fan (May 04 2019 at 20:17):
Great, thank you! Credit actually goes to @Toby Hu :)
Dan Gottlieb (May 04 2019 at 21:37):
@Jie Fan are you rebuilding the data? I don't see any tables in BigQuery under stu3_small anymore.
Jie Fan (May 04 2019 at 22:37):
@Benard Ebinu is trying to regenerate the source data since Jason fixed the bug. Benard, did you delete the tables or back them up somewhere?
Dan Gottlieb (May 05 2019 at 12:28):
Thanks - looks like the tables are back now! Were you able to add the identifier field to patient (I don't see a .id in the resource)?
Jie Fan (May 05 2019 at 13:43):
gcp-fhir-demo-dataset:stu3_sql_schema has resources exported based on the lossless schema.
Brian Wright (May 05 2019 at 16:13):
Thanks. I was able to complete my scenario using the stu3_sql_schema data in BigQuery.
Jie Fan (May 05 2019 at 16:25):
:+1:
Last updated: Apr 12 2022 at 19:14 UTC