FHIR Chat · The T in ETL · bulk data

Stream: bulk data

Topic: The T in ETL


René Spronk (Dec 11 2020 at 14:47):

In an environment in which all systems expose a FHIR API, and where Bulk Data is used for the E and L in ETL, what would the T look like? What kind of steps would one recommend (as a general best practice)? Systems may use different FHIR profiles or versions, to mention just one of the challenges.

I've created an initial list, but I'm pretty sure (if you're in the data warehousing business) you know of others that would be appropriate. (The list will be used as part of a module in one of Firely's FHIR training courses, so it's technology/system/platform agnostic).

* Cleaning (correct corrupt or incomplete data):
  * De-duplication, data quality assessment of candidate data
  * Use of statistical methods
* Mapping / Enriching:
  * FHIR profile based mapping, e.g. translating codes
  * Subsetting: removing data elements or resources
  * Aggregation, summarization (new data elements or resources)
* Review:
  * Validation (against profiles, business rules, "master data" registries), referential integrity between FHIR resources
  * (Partial) rejection of data
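As a rough illustration of how a few of these T steps might be combined in a pipeline over Bulk Data ndjson output, here is a minimal sketch. The code map, the dropped elements, and the business rule are all invented for the example, not taken from any profile:

```python
import json

# Hypothetical code map: translate local codes into a standard system.
CODE_MAP = {"loc-123": ("http://loinc.org", "718-7")}

def transform(resource):
    """Apply a few illustrative T steps to one FHIR resource."""
    # Subsetting: remove data elements not needed downstream.
    resource.pop("text", None)  # e.g. drop the narrative

    # Mapping: translate codes into the target system.
    for coding in resource.get("code", {}).get("coding", []):
        if coding.get("code") in CODE_MAP:
            coding["system"], coding["code"] = CODE_MAP[coding["code"]]

    # Review: (partial) rejection of data failing a basic business rule.
    if resource.get("resourceType") == "Observation" and "status" not in resource:
        return None
    return resource

def run(ndjson_lines):
    """Stream ndjson lines through de-duplication and transformation."""
    seen = set()  # de-duplicate by (resourceType, id)
    for line in ndjson_lines:
        resource = json.loads(line)
        key = (resource.get("resourceType"), resource.get("id"))
        if key in seen:
            continue
        seen.add(key)
        out = transform(resource)
        if out is not None:
            yield json.dumps(out)
```

Real pipelines would of course use a terminology service for code translation and a FHIR validator for the review step; this only shows the shape of the flow.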

Dan Gottlieb (Dec 15 2020 at 14:59):

@René Spronk, this seems like a good start!

Dan Gottlieb (Dec 15 2020 at 14:59):

I'd also think about adding de-identification (for example, https://github.com/microsoft/FHIR-Tools-for-Anonymization which works directly on ndjson files), and NLP to build structured FHIR from unstructured notes (for example, https://github.com/BD2KOnFHIR/NLP2FHIR and https://aws.amazon.com/blogs/aws/amazon-comprehend-medical-natural-language-processing-for-healthcare-customers/) to the "enriching" section.
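The tools linked above implement full de-identification pipelines; the basic idea can be sketched in a few lines. The list of redacted elements and the date-generalization rule here are an illustrative subset, not a complete or compliant de-identification scheme:

```python
# Elements commonly redacted during de-identification (illustrative subset).
REDACT = ["name", "telecom", "address", "photo", "contact"]

def deidentify(resource):
    """Strip direct identifiers and coarsen dates on a Patient resource."""
    for field in REDACT:
        resource.pop(field, None)
    # Generalize birthDate to the year to reduce re-identification risk.
    if "birthDate" in resource:
        resource["birthDate"] = resource["birthDate"][:4]
    return resource
```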

Dan Gottlieb (Dec 15 2020 at 15:01):

Also, as part of a larger project, I've done some early-stage work drafting and prototyping a set of transformations to make direct querying of the imported FHIR data more efficient at https://github.com/sync-for-science/a3/blob/master/algorithm.md . This includes things like standardizing references and converting all date elements into periods for querying (e.g. a date of "2019" will be transformed into a period of "2019-01-01" to "2019-12-31").
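The date-to-period expansion described above could be sketched like this (the linked algorithm document is authoritative; this is only a minimal illustration of the idea):

```python
import calendar

def date_to_period(value):
    """Expand a partial FHIR date into a (start, end) period for querying.

    "2019"       -> ("2019-01-01", "2019-12-31")
    "2019-02"    -> ("2019-02-01", "2019-02-28")
    "2019-02-14" -> ("2019-02-14", "2019-02-14")
    """
    parts = value.split("-")
    year = int(parts[0])
    if len(parts) == 1:  # year only
        return (f"{year:04d}-01-01", f"{year:04d}-12-31")
    month = int(parts[1])
    if len(parts) == 2:  # year and month
        last_day = calendar.monthrange(year, month)[1]
        return (f"{year:04d}-{month:02d}-01", f"{year:04d}-{month:02d}-{last_day:02d}")
    return (value, value)  # full date: period collapses to a single day
```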

René Spronk (Dec 16 2020 at 08:41):

Thanks. I've had a call with someone with a data warehousing background, and these kinds of T activities are clearly not limited to data warehousing. They apply wherever one imports or receives data from another system (while also holding data of one's own). Zooming out, these are all aspects of data quality management.

René Spronk (Dec 16 2020 at 08:43):

NLP may be acceptable in some projects (and not in others), but it's certainly worthwhile to discuss. Standardizing references: e.g. converting them all into full URLs?
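One simple reading of "standardizing references" is rewriting relative literal references as full URLs against an assumed server base. A minimal sketch (the base URL is an assumption; contained references like "#p1" are left alone):

```python
def absolutize_references(resource, base_url):
    """Rewrite relative literal references (e.g. "Patient/123") as full URLs."""
    def walk(node):
        if isinstance(node, dict):
            ref = node.get("reference")
            if isinstance(ref, str) and "://" not in ref and not ref.startswith("#"):
                node["reference"] = f"{base_url.rstrip('/')}/{ref}"
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)
    walk(resource)
    return resource
```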

René Spronk (Dec 16 2020 at 08:45):

Even if one has a single FHIR-based system (with little or no exchange with others), data quality is an organizational/provenance type of issue, so discussing data quality is valuable even in such scenarios. Detecting 'low quality data' and labeling it as such for additional processing or human review would be useful in any setting.
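Labeling low-quality data rather than rejecting it could be done with a tag on the resource's meta element. A minimal sketch, where the tag system and the quality check are both invented for illustration:

```python
def flag_for_review(resource, issues):
    """Label a resource via meta.tag so it can be routed to human review."""
    if not issues:
        return resource
    tags = resource.setdefault("meta", {}).setdefault("tag", [])
    tags.append({
        "system": "https://example.org/fhir/data-quality",  # hypothetical system
        "code": "needs-review",
        "display": "; ".join(issues),
    })
    return resource

def basic_checks(resource):
    """A couple of illustrative quality checks; real rules would be profile-driven."""
    issues = []
    if resource.get("resourceType") == "Patient" and "birthDate" not in resource:
        issues.append("missing birthDate")
    return issues
```

Downstream consumers can then filter on the tag to route flagged resources to a review queue while still loading the rest.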


Last updated: Apr 12 2022 at 19:14 UTC