Stream: bulk data
Topic: Large data sets
Bryan Schofield (May 14 2020 at 17:50):
I'm curious what the community considers a large data set to be? For example, if exporting patient data, would we consider 100,000 patients large? 500,000? A million? At what threshold does bulk FHIR begin to break down, because either there are too many files, too many records per file, or the time it takes to generate the files in response to a kick-off request is too long to be manageable?
Jenni Syed (May 14 2020 at 17:52):
I think this is specifically targeted at "big data" - so it's explicitly beyond the "norms" of how we likely handle data today (unless you're a big data system)
Jenni Syed (May 14 2020 at 17:52):
Some of our initial "very small" sets of testing data were tens of gigs
Jenni Syed (May 14 2020 at 17:53):
(which doesn't work well on hotel wifi, at most of our connectathon events ;) )
Bryan Schofield (May 14 2020 at 17:53):
how many records would you guess 10G to be?
Jenni Syed (May 14 2020 at 17:53):
I think patient list size comes into it, but really you would be looking at the size of the record per patient
Bryan Schofield (May 14 2020 at 17:54):
interesting... our use case so far hasn't been _per patient_, it's been _all of it_.
Vladimir Ignatov (May 14 2020 at 17:56):
I've tested it with about 400 million resources. Bulk-data itself works just fine, as long as everything is streaming. The hard part is to design your back-end and DB to handle such volumes properly.
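(A minimal sketch of the streaming idea, assuming a hypothetical `iter_resources()` generator that would in practice pull rows from a server-side DB cursor; the point is that no whole result set is ever held in memory:)

```python
import json

def iter_resources(resource_type):
    # In practice this would stream rows from a server-side DB cursor;
    # two dummy resources stand in here so the sketch actually runs.
    yield {"resourceType": resource_type, "id": "example-1"}
    yield {"resourceType": resource_type, "id": "example-2"}

def export_ndjson(resource_type, path):
    # One resource per line (NDJSON); nothing is buffered beyond a single row.
    with open(path, "w", encoding="utf-8") as out:
        for resource in iter_resources(resource_type):
            out.write(json.dumps(resource) + "\n")

export_ndjson("Observation", "Observation.ndjson")
```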
Paul Church (May 14 2020 at 17:58):
One billion resources is a large data set, enough to stress the parallelization of every part of a system. Resources seem to average around the ballpark of 1KB so that's 1TB on disk. Number of resources per patient varies wildly depending on the application.
Jenni Syed (May 14 2020 at 17:58):
well, from data IRL :)
Jenni Syed (May 14 2020 at 17:59):
We have large clients that have populations in the millions of patients and single EHR record sets in the 100s of TB
Bryan Schofield (May 14 2020 at 17:59):
We're well into the billions. Observations alone are over 1.5B.
Jenni Syed (May 14 2020 at 17:59):
that's a single health system
Bryan Schofield (May 14 2020 at 18:01):
Any recommendations on balancing number of files with records per file? Or do consumers just assume they may be downloading hundreds or thousands of files?
Branden Rauch - CareEvolution (May 14 2020 at 18:04):
We limit the number of resources per file, currently set at 50k/file
Paul Church (May 14 2020 at 18:06):
We consolidate the output into one file per resource type no matter how large they get.
Vladimir Ignatov (May 14 2020 at 18:11):
That might depend on your implementation. For example, can you handle 1B results from a single SQL query, or do you prefer to split that over smaller "pages"? The clients should work with any size, unless you reach some OS file size limit.
Bryan Schofield (May 14 2020 at 18:13):
Fascinating ... at 50k records, I'd be looking at, say, 60,000 files, but unlimited records would yield a file that is 1.5TB. Both seem extreme for a client, but perhaps this is within normal operating bounds
Jenni Syed (May 14 2020 at 18:16):
Most of the "export it all" would have to be from big data system to big data system, over pretty nice internet connections :)
Vladimir Ignatov (May 14 2020 at 18:17):
Having 60,000 files looks worse to me than having a single 1.5TB file (which shouldn't be extreme)
Paul Church (May 14 2020 at 18:17):
Given how the files are individually listed in the bulk data status response once finished, it would be a mess to have 60,000 files.
Jenni Syed (May 14 2020 at 18:17):
We've talked about using chunking for the file transfers since these are so large (free-ish in HTTP/2; sometimes you have to do work for HTTP/1.1)
Jenni Syed (May 14 2020 at 18:18):
You also need to make sure your timeouts etc are going to handle how long this will take :)
Jenni Syed (May 14 2020 at 18:19):
@Paul Church it's not so much a mess since it's not a human reading it - it's a system
Jenni Syed (May 14 2020 at 18:19):
the system already has to handle many files
Branden Rauch - CareEvolution (May 14 2020 at 18:19):
The bulk data responses are meant to be consumed by a computer service, not a human. Dealing with lots of files versus a few shouldn't make much difference. What if network issues prevent downloading a 1.5TB file?
Bryan Schofield (May 14 2020 at 18:20):
Network issues would concern me with large files.
Branden Rauch - CareEvolution (May 14 2020 at 18:20):
50k resources a file was a totally arbitrary number we came up with. It's better suited to resources that contain a lot of data as opposed to "thinner" resources like Observation. In the long run I think we will limit the number of resources per file depending on the resource type.
Michele Mottini (May 14 2020 at 18:22):
Use a file size limit instead of a # of resources limit?
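(A rough sketch of both cap styles, assuming the same kind of NDJSON stream as above; the 50k count, the 100 MB size, and the file naming are illustrative, not anything from the spec:)

```python
import json

MAX_RESOURCES_PER_FILE = 50_000          # count cap (illustrative, per the 50k example above)
MAX_BYTES_PER_FILE = 100 * 1024 * 1024   # size cap (illustrative)

def export_with_rotation(resource_type, resources):
    """Write NDJSON output, starting a new file whenever either cap is hit."""
    part, count, size, out = 0, 0, 0, None
    for resource in resources:
        line = json.dumps(resource) + "\n"
        line_bytes = len(line.encode("utf-8"))
        if out is None or count >= MAX_RESOURCES_PER_FILE or size + line_bytes > MAX_BYTES_PER_FILE:
            if out:
                out.close()
            part += 1
            out = open(f"{resource_type}_{part:05d}.ndjson", "w", encoding="utf-8")
            count = size = 0
        out.write(line)
        count += 1
        size += line_bytes
    if out:
        out.close()
```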
Paul Church (May 14 2020 at 18:22):
It's just that the API response to the status request is usually pretty small, but with many files there's no bound on it and it's not paginated or anything.
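(For context, the completion manifest returned by the status endpoint lists every output file individually, roughly like this; with 60,000 files the `output` array simply has 60,000 entries in one un-paginated response. URLs are illustrative:)

```python
# Roughly the shape of the Bulk Data completion manifest (URLs illustrative).
manifest = {
    "transactionTime": "2020-05-14T18:00:00Z",
    "request": "https://example.org/fhir/$export",
    "requiresAccessToken": True,
    "output": [
        # ...one entry per generated file; nothing in the spec caps this list...
        {"type": "Observation", "url": "https://example.org/files/Observation_00001.ndjson"},
        {"type": "Observation", "url": "https://example.org/files/Observation_00002.ndjson"},
    ],
    "error": [],
}
```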
Vladimir Ignatov (May 14 2020 at 18:28):
Resumable downloads are doable but could be hard to implement. The easiest option for the clients seems to be to retry a certain number of times.
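(A client-side sketch of the retry option, with an optional Range header to resume a partial download where the file server happens to support it; the function and its parameters are illustrative, not anything defined by the spec:)

```python
import os
import time
import requests

def download_with_retry(url, path, token, attempts=5):
    """Retry a bulk file download, resuming from the current offset when the
    file server honors HTTP Range requests (otherwise it starts over)."""
    for attempt in range(attempts):
        offset = os.path.getsize(path) if os.path.exists(path) else 0
        headers = {"Authorization": f"Bearer {token}", "Accept": "application/fhir+ndjson"}
        if offset:
            headers["Range"] = f"bytes={offset}-"
        try:
            with requests.get(url, headers=headers, stream=True, timeout=300) as resp:
                resp.raise_for_status()
                # 206 = server resumed at our offset; 200 = full file from byte 0.
                mode = "ab" if resp.status_code == 206 else "wb"
                with open(path, mode) as out:
                    for chunk in resp.iter_content(chunk_size=1 << 20):
                        out.write(chunk)
            return
        except requests.RequestException:
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # simple backoff between retries
```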
Ritika Jain (Jan 15 2021 at 19:54):
Does the BITS protocol help here, and how could it be added to an implementation? @Josh Mandel
Josh Mandel (Jan 15 2021 at 20:05):
I'm not sure what this is; can you share a link or specific details about what you have in mind?
Ritika Jain (Jan 15 2021 at 20:17):
BITS transfers files between machines using idle network bandwidth. I am wondering if something like this could be incorporated, since downloading many files (say 60k files) or a huge amount of data (TBs) could both be a problem
Josh Mandel (Jan 15 2021 at 20:20):
My framework here is: the spec allows for extensions that can explore these sorts of capabilities; I'd look to specific real-world performance issues here before we try to optimize in a first-class, "in-spec" way.
Aditya Chhabra (Mar 02 2021 at 07:59):
The major concern around bulk data export is exporting a large chunk of data, especially when we talk about system-level export.
Any suggestions on how to handle system-level export and also meet the ONC compliance guidelines?
Josh Mandel (Mar 02 2021 at 16:24):
I'm not sure what level of suggestion you're thinking about. In general the ONC requirement is to support Bulk Data Export for a group of patients.
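(For reference, the group-level flow from the Bulk Data spec looks roughly like this; the base URL, group id, and token handling are illustrative:)

```python
import time
import requests

BASE = "https://example.org/fhir"              # illustrative FHIR base URL
HEADERS = {
    "Accept": "application/fhir+json",
    "Prefer": "respond-async",
    "Authorization": "Bearer <access-token>",  # e.g. obtained via SMART Backend Services
}

# Kick off a group-level export; the server replies 202 Accepted with a
# Content-Location header pointing at the status endpoint.
kickoff = requests.get(f"{BASE}/Group/example-group/$export", headers=HEADERS)
status_url = kickoff.headers["Content-Location"]

# Poll the status endpoint until the manifest is ready (200 means complete,
# 202 means still in progress), then walk the output list and fetch each file.
while True:
    status = requests.get(status_url, headers={"Authorization": HEADERS["Authorization"]})
    if status.status_code == 200:
        manifest = status.json()
        for item in manifest["output"]:
            print(item["type"], item["url"])   # each file is downloaded separately
        break
    time.sleep(int(status.headers.get("Retry-After", 30)))
```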