FHIR Chat · Bulk Data Access

Stream: implementers

Topic: Bulk Data Access

Grahame Grieve (Sep 20 2017 at 01:45):

New proposal for January connectathon / R4:

Grahame Grieve (Sep 20 2017 at 01:45):

http://www.healthintersections.com.au/?p=2689

Grahame Grieve (Sep 20 2017 at 01:51):

Comments welcome

Jason Walonoski (Sep 20 2017 at 14:25):

Makes me think of our https://syntheticmass.mitre.org server.

We have over 1 million synthetic patients available using FHIR STU3. Scaling that large was a challenge. People often ask for the whole dataset (22 GB), so we give them a download link to a zip file.

Large complex queries are difficult to handle with performance. Often times, people are searching for a cohort. They run a search, and page through the results (many many pages).

Should bulk data access support a query, or does that just complicate matters? My instinct says to avoid query support to keep matters simple, but it would help limit very large files. How long are those files supposed to persist for?

Josh Mandel (Sep 20 2017 at 14:30):

One minor point about filesize is: the server can split data into as many chunks as it wants. Doesn't change the total size of the data, but individual files aren't necessarily big.
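For anyone following along, here is a minimal sketch of the chunking Josh describes, assuming the server already holds the resources as Python dicts. The function name `chunk_ndjson` and the `max_lines` limit are hypothetical and not part of the proposal; a real server would more likely stream from its store and cap by bytes.

```python
import json
from typing import Iterable, Iterator, List


def chunk_ndjson(resources: Iterable[dict], max_lines: int) -> Iterator[str]:
    """Yield NDJSON documents holding at most max_lines resources each."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    lines: List[str] = []
    for resource in resources:
        # One compact JSON object per line; json.dumps escapes embedded newlines.
        lines.append(json.dumps(resource, separators=(",", ":")))
        if len(lines) == max_lines:
            yield "\n".join(lines) + "\n"
            lines = []
    if lines:
        # Final, possibly shorter, chunk.
        yield "\n".join(lines) + "\n"


# Hypothetical sample data: five tiny Patient resources split two per file.
patients = [{"resourceType": "Patient", "id": str(i)} for i in range(5)]
chunks = list(chunk_ndjson(patients, max_lines=2))
```

The total amount of data is unchanged; only the size of each individual file is bounded.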

Chris Grenz (Sep 20 2017 at 14:37):

Based on the requests for data I've received, nearly all would have benefitted from query based scoping. Very few want or can handle a whole system dump, and the limiting criteria are often more complex than simply a list of patients (although that is common).

Chris Grenz (Sep 20 2017 at 14:57):

On the nd-json format - I'd assume it would be preferable to always compress these files (gzip?). I don't think application/fhir+ndjson+zip is legal?
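As a rough illustration of what compressing these files involves, here is a sketch of a gzip round trip over NDJSON using only the Python standard library. The helper names are hypothetical, and how the compression is signalled on the wire (a media type, or something like an HTTP `Content-Encoding: gzip` header) is an assumption left open here, as it is in the thread.

```python
import gzip
import json
from typing import Iterable, List


def compress_ndjson(resources: Iterable[dict]) -> bytes:
    """Serialize resources as NDJSON (one JSON object per line) and gzip it."""
    body = "".join(json.dumps(r) + "\n" for r in resources).encode("utf-8")
    return gzip.compress(body)


def read_gzipped_ndjson(blob: bytes) -> List[dict]:
    """Decompress a gzipped NDJSON payload back into a list of resources."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]


# Hypothetical sample data for the round trip.
sample = [
    {"resourceType": "Patient", "id": "p1"},
    {"resourceType": "Observation", "id": "o1"},
]
blob = compress_ndjson(sample)
restored = read_gzipped_ndjson(blob)
```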

John Moehrke (Sep 20 2017 at 15:56):

Just use http/2 and you get compression free...

Michele Mottini (Sep 20 2017 at 19:50):

...not completely on topic, but considering that of the 11 original Argonaut organizations only two or three have actually implemented it so far, wouldn't it be better to still focus on that at the connectathon, instead of adding an entire new protocol?

Grahame Grieve (Sep 20 2017 at 23:23):

@Jason Walonoski I think it makes a great deal of sense to support this on your server, and to support this for search ("query"). I've supported this for search and that was my primary test case. really big searches can get really long

Grahame Grieve (Sep 20 2017 at 23:25):

@Michele Mottini indeed, that was my first take on this too. But ONC asked us to take this on. We don't set priorities for the argonaut project - they do, under, ahh, advice, from the ONC and the market. It will be up to the market and ONC to influence the priorities of the vendors...

Michele Mottini (Sep 20 2017 at 23:31):

Got it, thanks Grahame

Jason Walonoski (Sep 21 2017 at 13:47):

Well, supporting bulk data access is something I'm willing to explore starting in... maybe October.

John Moehrke (Oct 04 2017 at 12:59):

@Grahame Grieve The Security WG looked at the Bulk Data Access proposal, and we have some Privacy and Security "Considerations". We recognize that you are simply trying to address a technical need for a solution to Bulk Data Access, and you have been careful to indicate that you have not addressed Security. We would agree with the modularity of focusing on the access method without conflating Security/Privacy; but we also want to make sure that this Bulk Data Access has a Privacy and Security Considerations so that use of the method DOES consider these issues. The Security WG minutes contain these 'issues', we stayed away from declaring 'solutions' although we do think there are ready accessible solutions. We would like to be included on development to help address these issues. See http://wiki.hl7.org/index.php?title=October_3,_2017_Security_Conference_Call#Bulk_Data_Transfer_Access_Control_.26_AuthorizationQuestions:

John Moehrke (Oct 04 2017 at 13:10):

Generally we are not clear what kind of use-case would allow this level of access, especially with a PERMIT without exceptions. Given that we expect many exceptions, we are concerned that a mechanism is needed to recognize the exceptions and obligations. Second, we would like to ensure that audit logging is specified in a way that is actionable but not unnecessarily verbose.

John Moehrke (Oct 04 2017 at 13:10):

There was also discussion and support for a companion method for service to de-identify, such as I outline on my blog https://healthcaresecprivacy.blogspot.com/2017/09/fhir-and-bulk-de-identification.html

Grahame Grieve (Oct 04 2017 at 19:57):

in general, I expect that the API itself will be used for a variety of purposes of use

Grahame Grieve (Oct 04 2017 at 19:57):

in general, for all the uses, there will be consent/authorization/access control issues.

Grahame Grieve (Oct 04 2017 at 19:57):

as yet, we have had no requests for anything related to these to appear on the API itself.

Grahame Grieve (Oct 04 2017 at 19:58):

we did refer a number of policy issues back to ONC for clarification - these are US specific things

Grahame Grieve (Oct 04 2017 at 19:58):

I agree that there's a place for de-identification in the service, but as yet this is not required or asked for

Grahame Grieve (Oct 04 2017 at 19:59):

finally, audit logging is a policy question. We don't anywhere specify that audit logging is required, and nor should we in the base specification

John Moehrke (Oct 11 2017 at 13:34):

I agree that many of these issues are not the issues we would address in a core specification. But we can put them in a "Security/Privacy Considerations" section so that those that are using the core capability are informed of some of the 'considerations' . This is minimally all we ask. This is what IETF, W3C, ISO, IHE, and DICOM are doing today. This is what our Security Risk Assessment Cookbook intended to drive to happen.

Grahame Grieve (Oct 11 2017 at 15:44):

do you think this is different to any other access that FHIR permits?

John Moehrke (Oct 11 2017 at 15:47):

Yes, for the above reasons...

John Moehrke (Oct 11 2017 at 15:48):

The use-cases that SMART focused on were targeting either Treatment (this patient, or any patient I have access to), or accesses by the Patient themselves. These are the use-case focus that has driven the development of the scope patterns today.

John Moehrke (Oct 11 2017 at 15:51):

Basically, the 'risk' -- 'impact' for bulk data access is 'everyone'; where a Provider or Patient access is more clearly one-patient-at-a-time.

Grahame Grieve (Oct 12 2017 at 10:52):

so this is actually an issue with SMART, not the FHIR API. Because the FHIR API itself has never been one patient or provider at a time

John Moehrke (Oct 12 2017 at 13:07):

Fine, but YOU pointed at SMART as the solution for security.. therefore you introduced it as a problem with your bulk data access proposal.

Isaac Vetter (Oct 19 2017 at 15:31):

Hey @Grahame Grieve , since the Sept HL7 meeting, we've continued to look at this proposed specification and wanted to help by sharing some practical challenges to bring this API to scale.

Isaac Vetter (Oct 19 2017 at 15:32):

Performance: Use of the API mustn't impact patient care. Practically, this could mean throttling the bulk query or running it against a replica or other non-operational database such that the API might not return real-time data. This might not have any functional impact, but it's an important expectation to set.

Isaac Vetter (Oct 19 2017 at 15:32):

Reciprocity: Towards our goal of improving patient care, this API should be reciprocal. A health system should be able to both query and be queried for bulk data.

Isaac Vetter (Oct 19 2017 at 15:33):

Patient consent: The spec doesn't currently address the exchange of patient consent. (@John Moehrke made this point as well). For similar integrations, health systems re-use or define their own authorization form, which is signed by the patient and provided as part of the query for clinical information. If authorization is determined to be incomplete or not automatically verifiable, it's examined by an actual person before clinical information is released. By not addressing consent within the specification, the ability to quickly implement and scale the API is diminished.

Isaac Vetter (Oct 19 2017 at 15:34):

The time based filtering could be really hard to implement accurately. FHIR isn't a native database format in either operational or reporting production databases. Determining the current time of $everything accurately is prohibitively expensive. Perhaps specific existing search criteria on the "focus" resources would both give the spec the flexibility you want and be more implementable?

Isaac Vetter (Oct 19 2017 at 15:34):

Further, as the data passes out of the organization's protection, so should the liability for data loss.

Grahame Grieve (Oct 22 2017 at 20:18):

what do you think we should do about patient consent? I would like to address consent but any attempt to do so degenerates into a free for all that never resolves. Anyway, in this case, the use case doesn't involve consent because there are BAA agreements - at least, that's how I understand it

Grahame Grieve (Oct 22 2017 at 20:19):

I don't understand the point you are making in regards to time. The time service that we look to support is timestamp on the source records. That's got nothing to do with FHIR as a native database format or not. Why is making use of that prohibitively expensive?

Grahame Grieve (Oct 22 2017 at 20:20):

agree about the liability for data loss, but.... so?

Grahame Grieve (Oct 22 2017 at 22:50):

Further to this, @Isaac Vetter @Jenni Syed @Kevin Shekleton please see the connectathon proposal at http://wiki.hl7.org/index.php?title=201801_Bulk_Data

Grahame Grieve (Oct 22 2017 at 23:14):

note that I signed you both up ;-)

John Moehrke (Oct 25 2017 at 00:11):

Grahame, I think that we should not address Access Control at all. In the core definition of the bulk data access 'methodology'. We can describe the security and privacy concerns. If you have specific policy, we could write that in an IG that expresses the policy that eliminates the need for consent checking. There could be a set of these policy specific IGs. These might be very light weight IGs. I would however stay away from declaring this kind of thing in the core specification where bulk data access methodology is defined.

John Moehrke (Oct 25 2017 at 00:13):

I think we should define the AuditEvent. Likely one AuditEvent for each patient that was accessed, with the query parameters. I say one for each patient as this makes mining the AuditEvent easier.
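To make the "one AuditEvent per patient" idea concrete, here is a deliberately simplified sketch. The function name is hypothetical, the field names loosely follow the STU3 `AuditEvent.entity` element (`reference` plus a base64-encoded `query`), and required elements such as `type`, `agent` and `source` are left out, so the output is not a valid AuditEvent as written.

```python
import base64
from datetime import datetime, timezone
from typing import Iterable, List, Optional


def audit_events_for_export(
    patient_ids: Iterable[str],
    query: str,
    recorded: Optional[datetime] = None,
) -> List[dict]:
    """Build one simplified AuditEvent-like dict per patient in an export."""
    recorded = recorded or datetime.now(timezone.utc)
    # The query string is carried base64-encoded alongside the patient reference.
    encoded = base64.b64encode(query.encode("utf-8")).decode("ascii")
    return [
        {
            "resourceType": "AuditEvent",
            "recorded": recorded.isoformat(),
            "entity": [
                {"reference": {"reference": f"Patient/{pid}"}, "query": encoded}
            ],
        }
        for pid in patient_ids
    ]


# Hypothetical patient ids; the query mirrors the export call discussed here.
events = audit_events_for_export(["a", "b"], "Patient/$everything")
```

Keeping one record per patient means a later question like "who exported this patient's data?" becomes a lookup on a single reference rather than a scan through the contents of a whole-export record.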

Grahame Grieve (Oct 25 2017 at 00:16):

that's what I think, and why the API as described doesn't address it. Maybe you're an internal application that has the right to complete access subject to access control, for instance. Else the consent process might be 'the group is everyone who has consented'. But OTOH, maybe it will help EHR companies implement this in some cases if we can sort out access control. For instance, maybe we need to specifically document purpose of use on the request?

Grahame Grieve (Oct 25 2017 at 00:17):

as for audit event, I'll generate an audit event for the operation whatever, though I don't currently generate an audit entry per... patient?.... and it would be a good idea

John Moehrke (Oct 25 2017 at 00:17):

Yup, that is why I mentioned it. I expected one auditEvent for the whole transaction is natural, and good to do too...

John Moehrke (Oct 25 2017 at 00:22):

Doesn't SMART include the PoU? If it does, then it is possible for the request to come in with a PoU like "PUBHLTH" when it is a government purpose for some public health investigation, for which the user and system must clearly also be authorized... But some PoU like this would be exempt from Consent in the USA due to 'required government reporting' (I am not a lawyer... but know there are some PoU that do fall along these lines). --- This is what I am thinking is in an IG. Possibly a 1 page IG. But something that is use-case specific, and thus policy is already known.

Grahame Grieve (Oct 25 2017 at 00:30):

where does SMART include the PoU? I don't know of a place

Grahame Grieve (Oct 25 2017 at 00:30):

unless it's entirely out of band

John Moehrke (Oct 25 2017 at 14:28):

drat... you are right. Smart today doesn't have PoU... but I know I put in a ballot comment on that...

Jenni Syed (Oct 25 2017 at 17:27):

Our server is similar to Grahame's: we audit what we feel is necessary/required and probably wouldn't "trust" an AuditEvent from an external source as "truth" (we may eventually use it to add more context to audits, when it's something only the application calling us will have)

Jenni Syed (Oct 25 2017 at 17:28):

when I say "trust" - if the incoming audit event said a different application or user was accessing it than what was actually authorized/is tied to the auth context, we wouldn't replace those fields with supplied data :)

Josh Mandel (Nov 09 2017 at 22:13):

Quick plug for the FHIR API for Bulk Data Export Connectathon Track!

Do you yearn to share huge piles of data using FHIR, providing or consuming FHIR resources for things like assessing the value of the care provided, conducting population analyses, identifying at-risk populations, and tracking progress on quality improvement? We're piloting a FHIR API for Bulk Data Export at January's Connectathon in New Orleans.

Grahame Grieve (Nov 09 2017 at 22:18):

I yearn for that, yes

Josh Mandel (Nov 09 2017 at 22:29):

:p I guess I deserve that.

Josh Mandel (Nov 16 2017 at 22:26):

Looking at your server's implementation, @Grahame Grieve I'm seeing three things that surprise me (and I think @Dan Gottlieb shared my interpretation):

1. The final resulting .ndjson files cover only 49 entries, rather than the full set. Is this a bug?
2. When I call /Patient/$everything I was expecting to generate .ndjson files pertaining to various resource types -- but I just get a Patient.ndjson file, and no Observation.ndjson, etc. In other words, I thought /Patient/$everything was synonymous with /Group/everybody/$everything (if everybody happens to be the id of a group containing all patients), and would return files representing the whole common clinical data set. Did I interpret this differently from you?
3. When I try to retrieve the status of an export, the request fails (500) if I leave off an Accept header (and indeed, if I provide anything other than Accept: application/fhir+ndjson). But this request expects a status response with no body, so an absent Accept header should be A-OK.

Josh Mandel (Nov 16 2017 at 22:29):

For anyone following along, I'm doing:

$ curl -v -H "Accept: application/fhir+ndjson" -H "Prefer: respond-async" 'http://test.fhir.org/r3/Patient/$everything' 2>&1 | grep -i content-location:
< Content-Location: http://test.fhir.org/r3/task/4a1f26d6-4948-4647-8f7c-c6849a252950
$ curl -v -H "Accept: application/fhir+ndjson" "http://test.fhir.org/r3/task/4a1f26d6-4948-4647-8f7c-c6849a252950" 2>&1 | grep -i Link:
< Link: <http://test.fhir.org/r3/task/4a1f26d6-4948-4647-8f7c-c6849a252950/Bundle.ndjson>;rel=item, <http://test.fhir.org/r3/task/4a1f26d6-4948-4647-8f7c-c6849a252950/Patient.ndjson>;rel=item, <http://test.fhir.org/r3/task/4a1f26d6-4948-4647-8f7c-c6849a252950.zip>;rel=collection
$ curl -s "http://test.fhir.org/r3/task/4a1f26d6-4948-4647-8f7c-c6849a252950/Patient.ndjson" -o Patient.ndjson
$ wc -l Patient.ndjson
49 Patient.ndjson
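For clients that would rather not grep, here is a small sketch of pulling the file URLs out of a `Link` header like the one above. The `parse_link_header` name and the regular expression are illustrative assumptions; this handles the simple `<url>;rel=name` shape shown in the transcript, not the full HTTP Link header grammar.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches <url>;rel=name or <url>; rel="name" segments of a Link header.
LINK_PART = re.compile(r'<([^>]*)>\s*;\s*rel="?([^",;]+)"?')


def parse_link_header(value: str) -> Dict[str, List[str]]:
    """Group the URLs in a Link header by their rel value."""
    links = defaultdict(list)
    for url, rel in LINK_PART.findall(value):
        links[rel.strip()].append(url)
    return dict(links)


# Header value copied from the transcript above.
header = (
    "<http://test.fhir.org/r3/task/4a1f26d6-4948-4647-8f7c-c6849a252950/Bundle.ndjson>;rel=item, "
    "<http://test.fhir.org/r3/task/4a1f26d6-4948-4647-8f7c-c6849a252950/Patient.ndjson>;rel=item, "
    "<http://test.fhir.org/r3/task/4a1f26d6-4948-4647-8f7c-c6849a252950.zip>;rel=collection"
)
links = parse_link_header(header)
```

With that, `links["item"]` holds the per-resource-type .ndjson files and `links["collection"]` holds the zip of the whole export.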

Grahame Grieve (Nov 17 2017 at 04:01):

with regard to #3 - it didn't occur to me to check for no Accept header.

Grahame Grieve (Nov 17 2017 at 05:07):

#2 your interpretation is correct, I just had a bug. Should be fixed shortly

Grahame Grieve (Nov 17 2017 at 05:07):

I'm not sure what you mean by #1.

Josh Mandel (Nov 17 2017 at 07:03):

Number one was the easy one, I hope. I expected to see > 1000 patients and instead I only have 49 (just one Patient file output generated, and it had only 49 lines).

Josh Mandel (Nov 17 2017 at 07:04):

Number 3, fair enough-- do you agree that it should work without an Accept header?

Grahame Grieve (Nov 17 2017 at 08:48):

yes it will shortly

Grahame Grieve (Nov 17 2017 at 09:16):

ok updated

Grahame Grieve (Nov 17 2017 at 09:17):

number 1... it's clear that I'm not populating an internal field properly. I should be generating somewhere near 80000 resources for that query, but I'm only returning about 1500. It's something that I'm not populating when I write the resources into the store, and I'll look at it tonight. So for now, you're getting more data, but far from the proper set of data

Grahame Grieve (Nov 23 2017 at 10:42):

@Josh Mandel , this should be all good now

Alan Viars (Apr 16 2018 at 13:15):

I'm just doing a little housekeeping and wanted to add this link here: https://docs.google.com/document/d/1Ps9OJVixnGO8qgLhxkprpZrH4icmBJC7-VAXLoOYd1U

Last updated: Apr 12 2022 at 19:14 UTC
