FHIR Chat · Large data sets · bulk data

Stream: bulk data

Topic: Large data sets


Bryan Schofield (May 14 2020 at 17:50):

I'm curious what the community considers a large data set to be. For example, if exporting patient data, would we consider 100,000 patients large? 500,000? A million? At what threshold does bulk FHIR begin to break down, because there are too many files, too many records per file, or the time it takes to generate the files in response to a kick-off request is too long to be manageable?
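For reference, the "kick-off request" here is the asynchronous $export call defined by the Bulk Data spec. A minimal sketch in Python, assuming a hypothetical server base URL and the `requests` library:

```python
import requests

BASE = "https://fhir.example.com/R4"  # hypothetical FHIR base URL

# Kick off a patient-level bulk export; the Accept and Prefer headers
# are required by the Bulk Data spec for the async pattern.
resp = requests.get(
    f"{BASE}/Patient/$export",
    headers={
        "Accept": "application/fhir+json",
        "Prefer": "respond-async",
    },
)
resp.raise_for_status()

# A successful kick-off returns 202 Accepted with a Content-Location
# header pointing at the status endpoint to poll until files are ready.
status_url = resp.headers["Content-Location"]
print("poll:", status_url)
```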

Jenni Syed (May 14 2020 at 17:52):

I think this is specifically targeted at "big data" - so it's explicitly beyond the "norms" of how we likely handle data today (unless you're a big data system)

Jenni Syed (May 14 2020 at 17:52):

some of our initial "very small" sets of testing data were 10s of gigs

Jenni Syed (May 14 2020 at 17:53):

(which doesn't work well on hotel wifi, at most of our connectathon events ;) )

Bryan Schofield (May 14 2020 at 17:53):

how many records would you guess 10G to be?

Jenni Syed (May 14 2020 at 17:53):

I think patient list size comes into it, but really you would be looking at the size of the record per patient

Bryan Schofield (May 14 2020 at 17:54):

interesting... our use case so far hasn't been _per patient_, it's been _all of it_.

Vladimir Ignatov (May 14 2020 at 17:56):

I've tested it with about 400 million resources. Bulk-data itself works just fine, as long as everything is streaming. The hard part is to design your back-end and DB to handle such volumes properly.
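To make the streaming point concrete: below is a sketch of a back-end export loop that never holds the full result set in memory. The table and column names are hypothetical; any DB-API cursor with fetchmany works the same way:

```python
import json

def export_ndjson(conn, out_path, batch_size=10_000):
    """Stream resources from the database straight into an NDJSON file.

    Memory use is bounded by one batch, not by the total export size,
    which is what lets bulk data scale to hundreds of millions of rows.
    """
    cur = conn.cursor()
    cur.execute("SELECT resource FROM observation")  # hypothetical table/column
    with open(out_path, "w", encoding="utf-8") as out:
        while True:
            rows = cur.fetchmany(batch_size)
            if not rows:
                break
            for (resource,) in rows:
                # One resource per line, per the bulk data NDJSON format.
                line = resource if isinstance(resource, str) else json.dumps(resource)
                out.write(line)
                out.write("\n")
```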

Paul Church (May 14 2020 at 17:58):

One billion resources is a large data set, enough to stress the parallelization of every part of a system. Resources seem to average somewhere in the ballpark of 1KB, so that's 1TB on disk. The number of resources per patient varies wildly depending on the application.

Jenni Syed (May 14 2020 at 17:58):

well, from data IRL :)

Jenni Syed (May 14 2020 at 17:59):

We have large clients that have populations in the millions of patients and single EHR record sets in the 100s of TB

Bryan Schofield (May 14 2020 at 17:59):

We're well into the billions. Observations alone are over 1.5B.

Jenni Syed (May 14 2020 at 17:59):

that's a single health system

Bryan Schofield (May 14 2020 at 18:01):

Any recommendations on balancing number of files with records per file? Or do consumers just assume they may be downloading hundreds or thousands of files?

Branden Rauch - CareEvolution (May 14 2020 at 18:04):

We limit the number of resources per file, currently set at 50k/file
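A sketch of that rotation logic, assuming the resources arrive as an iterable of already-serialized NDJSON lines (the cap and file-naming scheme are illustrative):

```python
def write_rotated(lines, prefix, max_per_file=50_000):
    """Write NDJSON output, starting a new file every max_per_file resources."""
    out, count, file_index = None, 0, 0
    try:
        for line in lines:
            if out is None or count >= max_per_file:
                if out is not None:
                    out.close()
                file_index += 1
                count = 0
                out = open(f"{prefix}_{file_index:05d}.ndjson", "w", encoding="utf-8")
            out.write(line + "\n")
            count += 1
    finally:
        if out is not None:
            out.close()
```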

Paul Church (May 14 2020 at 18:06):

We consolidate the output into one file per resource type no matter how large they get.

Vladimir Ignatov (May 14 2020 at 18:11):

That might depend on your implementation. For example, can you handle 1B results from a single SQL query, or do you prefer to split that over smaller "pages"? The clients should work with any size, unless you reach some OS file size limit.

Bryan Schofield (May 14 2020 at 18:13):

fascinating ... at 50k records, I'd be looking at, say, 60,000 files, but unlimited records would yield a file that is 1.5TB. Both seem extreme for a client, but perhaps this is within normal operating bounds

Jenni Syed (May 14 2020 at 18:16):

most of the "export it all" would have to be from big data system to big data system, over pretty nice internet connections :)

Vladimir Ignatov (May 14 2020 at 18:17):

Having 60,000 files looks worse to me than having a single 1.5TB file (which shouldn't be extreme)

Paul Church (May 14 2020 at 18:17):

Given how the files are individually listed in the bulk data status response once finished, it would be a mess to have 60,000 files.
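For context, the status response Paul mentions is a single JSON manifest whose output array lists every file. A polling sketch in Python, using the status_url from the kick-off (Retry-After handling simplified):

```python
import time
import requests

def wait_for_export(status_url, token=None):
    """Poll the bulk data status endpoint until the export finishes,
    then return the manifest's list of output file descriptors."""
    headers = {"Accept": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"  # hypothetical bearer token
    while True:
        resp = requests.get(status_url, headers=headers)
        if resp.status_code == 202:
            # Still running; servers may suggest a wait via Retry-After.
            time.sleep(int(resp.headers.get("Retry-After", "10")))
            continue
        resp.raise_for_status()
        manifest = resp.json()
        # Each output entry carries a resource "type" and a file "url".
        # With 60,000 files this array is what gets unwieldy, since the
        # spec does not paginate it.
        return manifest["output"]
```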

Jenni Syed (May 14 2020 at 18:17):

We've talked about using chunking for the file transfers since these are so large (free-ish in HTTP/2; sometimes you have to do work for 1.1)

Jenni Syed (May 14 2020 at 18:18):

You also need to make sure your timeouts etc are going to handle how long this will take :)

Jenni Syed (May 14 2020 at 18:19):

@Paul Church it's not so much a mess since it's not a human reading it - it's a system

Jenni Syed (May 14 2020 at 18:19):

the system already has to handle many files

Branden Rauch - CareEvolution (May 14 2020 at 18:19):

The bulk data responses are meant to be consumed by a computer service, not a human. Dealing with lots of files versus a few shouldn't make much difference. What if network issues prevent downloading a 1.5TB file?

Bryan Schofield (May 14 2020 at 18:20):

Network issues would concern me with large files.

Branden Rauch - CareEvolution (May 14 2020 at 18:20):

50k resources a file was a totally arbitrary number we came up with. It's better suited to resources that contain a lot of data as opposed to "thinner" resources like Observation. In the long run I think we will limit the number of resources per file depending on the resource type.

Michele Mottini (May 14 2020 at 18:22):

Use a file size limit instead of a # of resources limit?
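That's a small change to the rotation approach: track bytes written instead of resource count. A sketch, with an arbitrary 100 MB threshold:

```python
def write_size_rotated(lines, prefix, max_bytes=100 * 1024 * 1024):
    """Rotate output files on size rather than resource count, so 'thin'
    and 'fat' resource types end up with similarly sized files."""
    out, written, file_index = None, 0, 0
    try:
        for line in lines:
            data = (line + "\n").encode("utf-8")
            if out is None or written >= max_bytes:
                if out is not None:
                    out.close()
                file_index += 1
                written = 0
                out = open(f"{prefix}_{file_index:05d}.ndjson", "wb")
            out.write(data)
            written += len(data)
    finally:
        if out is not None:
            out.close()
```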

Paul Church (May 14 2020 at 18:22):

It's just that the API response to the status request is usually pretty small, but with many files there's no bound on it and it's not paginated or anything.

Vladimir Ignatov (May 14 2020 at 18:28):

Resumable downloads are doable but could be hard to implement. The easiest option for the clients seems to be to retry a certain number of times.
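A sketch of that client-side strategy, combining both ideas: retry a bounded number of times, and resume from the last received byte when the server honors HTTP Range requests (the attempt count and chunk size are arbitrary):

```python
import os
import requests

def download_with_resume(url, dest, max_attempts=5):
    """Download a possibly huge bulk data file, resuming after failures
    instead of restarting a multi-hour transfer from zero."""
    for attempt in range(max_attempts):
        have = os.path.getsize(dest) if os.path.exists(dest) else 0
        headers = {"Range": f"bytes={have}-"} if have else {}
        try:
            with requests.get(url, headers=headers, stream=True, timeout=60) as r:
                if have and r.status_code != 206:
                    # Server ignored the Range header; restart from scratch.
                    have = 0
                r.raise_for_status()
                with open(dest, "ab" if have else "wb") as f:
                    for chunk in r.iter_content(chunk_size=1 << 20):
                        f.write(chunk)
            return dest
        except requests.RequestException:
            if attempt == max_attempts - 1:
                raise
    return dest
```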

Ritika Jain (Jan 15 2021 at 19:54):

Does the BITS protocol help here, and how would one add that to an implementation? @Josh Mandel

Josh Mandel (Jan 15 2021 at 20:05):

I'm not sure what this is; can you share a link or specific details about what you have in mind?

Ritika Jain (Jan 15 2021 at 20:17):

BITS transfers files between machines using idle network bandwidth. I am wondering if something like this could be incorporated, since downloading many files (say 60k files) or a huge volume of data (in TBs) could both be a problem

Josh Mandel (Jan 15 2021 at 20:20):

My framework here is: the spec allows for extensions that can explore these sorts of capabilities; I'd look to specific real-world performance issues here before we try to optimize in a first-class, "in-spec" way.

Aditya Chhabra (Mar 02 2021 at 07:59):

The major concern around bulk data export is exporting a large chunk of data, especially when we talk about system-level export.

Any suggestions on how to handle the system level export and also meet the ONC compliance guidelines?

Josh Mandel (Mar 02 2021 at 16:24):

I'm not sure what level of suggestion you're thinking about. In general the ONC requirement is to support Bulk Data Export for a group of patients.
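For concreteness, the group-level kick-off that the ONC requirement points at follows the same async pattern as the patient-level one, just scoped by the path. A sketch with a hypothetical base URL and group id:

```python
import requests

BASE = "https://fhir.example.com/R4"  # hypothetical FHIR base URL

# Group-level bulk export: identical async flow to Patient/$export,
# but limited to the patients in one Group resource.
resp = requests.get(
    f"{BASE}/Group/example-group-id/$export",  # hypothetical group id
    headers={"Accept": "application/fhir+json", "Prefer": "respond-async"},
)
resp.raise_for_status()
print("poll:", resp.headers["Content-Location"])
```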


Last updated: Apr 12 2022 at 19:14 UTC