Stream: implementers
Topic: bulk data import
Sergei Gnezdov (Oct 18 2018 at 19:01):
Hi, I originally posted this question to the HAPI group; this group appears to be a more appropriate place for it.
We are evaluating FHIR as a workflow-management backend. I have written a simple CSV import into a FHIR system; I have run it against a Spark server and am now running it against HAPI server 3.5.0.
I observed that import performance is among the slowest I have ever worked with, so I am looking at how the FHIR import process can be made faster.
Here is the high-level algorithm I have applied so far:
For each CSV record:
For each FHIR resource type in the CSV record:
1. Build a FHIR object from the CSV data
2. Derive a unique FHIR identifier from the CSV data
3. Search the FHIR service by that unique identifier to see whether the resource already exists. If it exists, make an update call with the FHIR object's id set to the found id; if it is not found, leave the id unset and create a new resource in the FHIR data store.
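A minimal Java sketch of this per-row flow, using the HAPI FHIR generic client against a DSTU3 endpoint; the server URL, identifier system, and Patient fields are illustrative placeholders rather than the poster's actual code:
```java
import ca.uhn.fhir.context.FhirContext;
import ca.uhn.fhir.rest.client.api.IGenericClient;
import org.hl7.fhir.dstu3.model.Bundle;
import org.hl7.fhir.dstu3.model.Patient;

public class CsvRowImport {

    // Per-row flow from the algorithm above: build, derive an identifier,
    // search, then update or create. One search plus one write per resource.
    static void importPatientRow(IGenericClient client, String mrn, String familyName) {
        // Step 1: build a FHIR object from the CSV data
        Patient patient = new Patient();
        patient.addIdentifier().setSystem("urn:example:mrn").setValue(mrn);
        patient.addName().setFamily(familyName);

        // Steps 2-3: search by the derived identifier to see whether the resource exists
        Bundle found = client.search()
                .forResource(Patient.class)
                .where(Patient.IDENTIFIER.exactly().systemAndCode("urn:example:mrn", mrn))
                .returnBundle(Bundle.class)
                .execute();

        if (found.getEntry().isEmpty()) {
            // Not found: leave the id unset and create a new resource
            client.create().resource(patient).execute();
        } else {
            // Found: set the id to the existing one and update
            patient.setId(found.getEntryFirstRep().getResource()
                    .getIdElement().toUnqualifiedVersionless());
            client.update().resource(patient).execute();
        }
    }

    public static void main(String[] args) {
        FhirContext ctx = FhirContext.forDstu3();
        IGenericClient client = ctx.newRestfulGenericClient("http://localhost:8080/baseDstu3");
        importPatientRow(client, "12345", "Smith");
    }
}
```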
Here is a log of the sequence used to create different resource types:
- Location: updated existing ID 4952
- Device: updated existing ID 5124
- Patient: created with ID 88366
- Encounter: created with ID 88367
- Practitioner: updated existing ID 40318
- Practitioner: updated existing ID 7
- Organization: updated existing ID 4957
- Organization: updated existing ID 4958
- Specimen: created with ID 88368
- ProcedureRequest: created with ID 88369
- Procedure: created with ID 88370
- ImagingStudy: created with ID 88371
- DiagnosticReport: created with ID 88372
- Task: created with ID 88373
- Task: created with ID 88374
- Task: created with ID 88375
- Task: created with ID 88376
- Task: created with ID 88377
- Task: created with ID 88378
- Task: created with ID 88379
Finished 5711 row; duration: 00:00:20.1776926
From the log statement above you can see that:
I am on row #5711 and the import time is now 20 seconds for a single CSV record! The import started at under 1 second per record on a default HAPI installation.
I made 20 search calls, 6 update calls (which generated new FHIR object versions even though the data had not changed), and 14 create calls.
Obviously this kind of import process is inefficient and I am looking into ways to improve it.
As I understand it, a Bundle could be the answer, but I am not quite sure how to specify ids for objects that don't exist yet.
So, any advice is appreciated.
Thank you.
nicola (RIO/SS) (Oct 18 2018 at 19:22):
Sounds interesting - if we have a Bulk Export API we should have a Bulk Import API :)
Josh Mandel (Oct 18 2018 at 20:05):
Agreed -- this has come up in discussions on the export side, and I think there'd be a pretty natural analog that we can define.
James Agnew (Oct 18 2018 at 21:26):
A couple of things that might bridge the gap between "full blown bulk import service" (which would be great) and your current level of performance:
- The default config for HAPI is definitely not tuned for write performance. It has tons of search parameters enabled (basically all of the built-in FHIR ones), fulltext indexing enabled, "missing search param" indexing enabled, etc. You should consider disabling any search parameters you don't need, disabling Lucene if you're not using _text, etc.
- The workflow of doing a whole bunch of searches and creates/updates individually for each row sounds like a killer for performance. You should consider using FHIR transactions with features like placeholder IDs between your resources, conditional creates, conditional updates, etc. This lets HAPI optimize quite a bit in terms of how the write works.
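A minimal sketch of the transaction approach described above, using the HAPI FHIR Java API for DSTU3 (matching the HAPI 3.x server in this thread). The endpoint URL, identifier system, and resource contents are illustrative assumptions; a conditional update would instead use HTTPVerb.PUT with a search URL such as "Patient?identifier=urn:example:mrn|12345" in place of the If-None-Exist conditional create shown here.
```java
import ca.uhn.fhir.context.FhirContext;
import ca.uhn.fhir.rest.client.api.IGenericClient;
import org.hl7.fhir.dstu3.model.Bundle;
import org.hl7.fhir.dstu3.model.Encounter;
import org.hl7.fhir.dstu3.model.IdType;
import org.hl7.fhir.dstu3.model.Patient;
import org.hl7.fhir.dstu3.model.Reference;

public class TransactionImport {
    public static void main(String[] args) {
        FhirContext ctx = FhirContext.forDstu3();
        IGenericClient client = ctx.newRestfulGenericClient("http://localhost:8080/baseDstu3");

        Bundle bundle = new Bundle();
        bundle.setType(Bundle.BundleType.TRANSACTION);

        // Patient with a placeholder urn:uuid id; the conditional create (If-None-Exist)
        // tells the server to reuse an existing Patient with this identifier instead of
        // creating a duplicate.
        Patient patient = new Patient();
        patient.setId(IdType.newRandomUuid());
        patient.addIdentifier().setSystem("urn:example:mrn").setValue("12345");
        bundle.addEntry()
                .setFullUrl(patient.getIdElement().getValue())
                .setResource(patient)
                .getRequest()
                .setUrl("Patient")
                .setIfNoneExist("identifier=urn:example:mrn|12345")
                .setMethod(Bundle.HTTPVerb.POST);

        // Encounter referencing the Patient via its placeholder id; the server rewrites
        // the urn:uuid reference to the real Patient id when it processes the transaction.
        Encounter encounter = new Encounter();
        encounter.setSubject(new Reference(patient.getIdElement().getValue()));
        bundle.addEntry()
                .setResource(encounter)
                .getRequest()
                .setUrl("Encounter")
                .setMethod(Bundle.HTTPVerb.POST);

        // All entries for one CSV row are processed in a single round trip.
        Bundle response = client.transaction().withBundle(bundle).execute();
        System.out.println(ctx.newJsonParser().setPrettyPrint(true).encodeResourceToString(response));
    }
}
```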
nicola (RIO/SS) (Oct 19 2018 at 14:51):
I think an efficient bulk import can sometimes, for performance, go directly into the database with some asynchronous (deferred) validation.
James Agnew (Oct 19 2018 at 22:21):
Yeah, a mechanism for that would be much better still, I'd agree.
Brian Postlethwaite (Oct 20 2018 at 06:05):
That's what ours does: it processes the import file in several passes, then does a direct import into our native table format.
natus (Oct 20 2018 at 09:39):
We have an implementation of HAPI FHIR based on a plain server (not JPA), and it's trivial
to bulk load into the database through basic ETL. BTW, the database backend is a
10-machine Hadoop cluster running Apache Phoenix, combined with Apache Solr for _text filtering.
Once all resources are implemented, it's comfortable to work with.
Cinyoung Hur (Oct 20 2018 at 11:31):
Hi all, I think I can get good advice from this group.
I'm transforming EHR data to FHIR. At first I started with a small EHR dataset, using SQL procedures and a couple of bash scripts.
But now I have to handle much larger volumes (terabytes).
Have you used big-data tooling like Apache Kafka or Apache NiFi to do ETL?
I found this document, but I would like to know about other references from this group: https://community.hortonworks.com/articles/138249/nifi-in-healthcare-ingesting-hl7-data-in-nifi.html
natus (Oct 20 2018 at 21:42):
@Cinyoung Hur On my side, the ETL has focused on batch processing through Apache Hive/Spark/Sqoop to load the historical part (multiple terabytes of RDBMS EHR databases). Kafka, NiFi, and Spark Streaming are more relevant for ingesting data streams in real time, but I have not plugged them in yet.
Cinyoung Hur (Oct 21 2018 at 03:41):
@natus How often do you run batch processing?
natus (Oct 21 2018 at 09:05):
Right now, on a daily basis
sandeep diddi (Aug 16 2021 at 10:22):
Team,
We need to use FHIRloader to do multiple loads. How can we integrate it with Azure Data Factory or Databricks to perform bulk uploads to the FHIR server?
Lloyd McKenzie (Aug 16 2021 at 13:37):
What is FHIRloader?
sandeep diddi (Aug 16 2021 at 15:09):
@Lloyd McKenzie It's a tool from Microsoft for bulk loading to FHIR.
Gino Canessa (Aug 16 2021 at 15:13):
@Caitlin Voegele @Brendan Kowitz
Caitlin Voegele (Aug 16 2021 at 15:35):
sandeep diddi said:
Team,
We need to use FHIRloader to do multiple loads. How can we integrate it with Azure Data Factory or Databricks to perform bulk uploads to the FHIR server?
If you are looking at this tool here: https://github.com/microsoft/fhir-loader, please feel free to open an issue against the GitHub repo with questions so that one of the implementers of the tool can review. There are also examples on the page about the architecture.
Last updated: Apr 12 2022 at 19:14 UTC