Stream: bulk data
Topic: how can a client filter the data they want?
Isaac Vetter (Jun 07 2018 at 19:26):
Hi everybody!
On our bulk data Argonaut calls, we've been talking through the possibility of permitting a bulk data client to pre-filter the FHIR resources that it receives from a bulk data server. Currently, we say that "all data" should be returned by the server. All data very likely includes a lifelong history from a clinical data repository, while a very common use of the bulk data API will simply be to retrieve active, current, clinical data. Certainly, a bulk data client can simply discard historical clinical data in this case -- which will likely be a majority of the data.
So, how can a client differentiate "data that matters" from "lifelong history in a clinical data repository"?
One strawman proposal that the Argo group came up with was to allow a client to specify FHIR resource statuses as part of the bulk data query:
_type=MedicationRequest%3Fstatus%3Dactive%2Ccompleted,MedicationStatement%3Fstatus%3Dactive,MedicationAdministration%3Fstatus%3Dactive
Semantics would be: client can ask for all these filters, and server could ignore and just pay attention to types (thereby returning more than the client asked for, but still "working" to return all resources the client wanted)
What thoughts do y'all have?
1) Is this worthwhile?
2) How could we design this to be better than the above proposal?
Isaac
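A minimal sketch of how a client might assemble the strawman _type value above, and how a server that ignores the filters could still recover the bare resource types. The helper names build_type_param and types_only are hypothetical, and the encoding rule (percent-encode each "Type?query" entry, then join with literal commas) is inferred from the example value rather than from any published spec.

```python
from urllib.parse import quote, unquote


def build_type_param(filters):
    """Join 'Type?query' entries into one _type value.

    Each entry is percent-encoded so that '?', '=' and ',' inside a filter
    cannot be confused with the commas that separate entries.
    """
    entries = []
    for resource_type, query in filters:
        entry = f"{resource_type}?{query}" if query else resource_type
        entries.append(quote(entry, safe=""))
    return ",".join(entries)


def types_only(type_param):
    """What a server that ignores the filters would act on: just the types."""
    return [unquote(entry).split("?", 1)[0] for entry in type_param.split(",")]


value = build_type_param([
    ("MedicationRequest", "status=active,completed"),
    ("MedicationStatement", "status=active"),
    ("MedicationAdministration", "status=active"),
])
print(value)
print(types_only(value))
```

Because the types survive even when the filters are dropped, the client still has to be prepared to discard resources it did not ask for.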
Dan Gottlieb (Jun 07 2018 at 19:44):
To add to Isaac's comment, another potential option we discussed would be to pick one (or a few) top level query params that cover the 80% use case. Given the interest in potentially only retrieving active resources, perhaps the status search parameter would be sufficient. Resources that have a status element could be filtered on the server, and other filters could always be applied on the client side.
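A rough sketch of the client-side fallback, assuming the export arrives as newline-delimited JSON with one resource per line. The function name and the sample resources are made up for illustration; resources with no status element are kept, since there is nothing to filter them on.

```python
import json


def filter_ndjson_by_status(ndjson_text, wanted_statuses):
    """Keep resources whose status is wanted, plus those with no status at all."""
    kept = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        resource = json.loads(line)
        status = resource.get("status")
        if status is None or status in wanted_statuses:
            kept.append(resource)
    return kept


# Hypothetical sample data, not taken from any real server.
sample = "\n".join([
    json.dumps({"resourceType": "MedicationRequest", "id": "a", "status": "active"}),
    json.dumps({"resourceType": "MedicationRequest", "id": "b", "status": "stopped"}),
    json.dumps({"resourceType": "Patient", "id": "c"}),
])
kept = filter_ndjson_by_status(sample, {"active", "completed"})
print([resource["id"] for resource in kept])
```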
Isaac Vetter (Jun 07 2018 at 19:49):
Yes, so that could look something like:
_type=MedicationRequest,MedicationStatement,MedicationAdministration&status=active,completed
This is definitely a lot cleaner and comes at the cost of limiting the ability of the client to express more in-depth queries, which might be the right trade-off. (Of particular note - the relevant Financial FHIR resources don't actually have a status element...)
Grahame Grieve (Jun 07 2018 at 19:52):
it would be appropriate to focus on search parameters not elements. Syntax should be some variant of _include syntax. generally, dates and status elements seem like the highest value items to me
Lloyd McKenzie (Jun 07 2018 at 19:57):
Your most voluminous resource will probably be Observation. And excluding all of the Observations with status other than "completed" won't be much of a filter...
John Moehrke (Jun 11 2018 at 12:48):
isn't the use-case for bulk, to get the bulk? If you have needs for filtering, then use normal REST?
Michael Donnelly (Jun 11 2018 at 19:05):
If an insurance company wants to get a weekly download of their members' data from an EHR, they likely only want new information from the last week and not everything to the beginning of time.
John Moehrke (Jun 11 2018 at 19:14):
the bulk data api includes parameters for the list of resources (_type), and timeframe (_since). The complexity Isaac is adding is about things more than that.. right? should that specific use-case be added with a (_status) parameter?
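For the weekly-download case, the two existing parameters already go a long way. Here is a small sketch of building such a request URL. The base URL and group id are placeholders, and the Group/[id]/$export path is an assumption based on the export docs, so check it against the current spec.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def weekly_export_url(base_url, group_id, resource_types, now):
    """Build an export URL limited to given types and the last seven days."""
    since = (now.astimezone(timezone.utc) - timedelta(days=7)).strftime(
        "%Y-%m-%dT%H:%M:%SZ"
    )
    params = {"_type": ",".join(resource_types), "_since": since}
    # Leave ',' and ':' readable; everything else is percent-encoded.
    return f"{base_url}/Group/{group_id}/$export?" + urlencode(params, safe=",:")


url = weekly_export_url(
    "https://ehr.example.org/fhir",  # placeholder server
    "members",                       # placeholder group id
    ["Encounter", "Observation"],
    now=datetime(2018, 6, 11, 19, 0, tzinfo=timezone.utc),
)
print(url)
```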
Isaac Vetter (Jun 11 2018 at 19:24):
Hey John!
To extend Michael's described use-case, imagine the full download that a health plan performs -
Let's say the provider only has electronic data for the past 15 years and has cared for a million patients over that time. Almost all of the data downloaded will be FHIR resources with completed or other "inactive" statuses. Does the client care about completed procedures? Meds? Entered-in-error allergies?
We don't know! There currently isn't a method for the client to ask.
To state the obvious, in five years, this timespan will have increased to 20 years' worth of data (this problem will get worse over time). Compute cycles and storage cost real dollars. It seems like a minimal ability for the client to be at least somewhat specific on the bulk data that they want is reasonable and would be a feature of the spec.
Whaddya think?
Isaac
Lloyd McKenzie (Jun 11 2018 at 19:44):
I think what might make more sense is to pass a "business agreement" identifier. Filtering by statuses is going to have limited utility. The reality is that you're going to care about different time-ranges for different data. A blood type observation that's 20 years old is still relevant. But if there's more than one, you really only want the most recent. For something like cholesterol or blood pressure, you might want no more than 1/month for the past 5 years for trending, but you wouldn't want the "every 5 minute" values taken while the patient was in the ER (or the every minute versions while the patient was undergoing surgery). A business agreement could specify filters that were "smarter" and applied to the particular business case for which the data was being retrieved. The downside is that the business agreement would necessarily need a standard mechanism to define it. Perhaps something exists in the CDS space? @Bryn Rhodes
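To make the two "smarter" rules concrete, here is an illustrative sketch: keep only the most recent reading for something like blood type, and at most one reading per calendar month for something like blood pressure. The function names, the (date, value) tuple shape, and the sample readings are all hypothetical, and this is just one way to read "1/month".

```python
from datetime import date


def most_recent(readings):
    """Return the single latest (date, value) pair."""
    return max(readings, key=lambda reading: reading[0])


def latest_per_month(readings, earliest):
    """Keep the latest reading in each calendar month on or after `earliest`."""
    latest = {}
    for taken_on, value in readings:
        if taken_on < earliest:
            continue  # outside the trending window
        key = (taken_on.year, taken_on.month)
        if key not in latest or taken_on > latest[key][0]:
            latest[key] = (taken_on, value)
    return [latest[key] for key in sorted(latest)]


# Made-up sample readings.
blood_type = [(date(1998, 3, 1), "O+"), (date(2015, 7, 9), "O+")]
systolic = [
    (date(2010, 1, 1), 130),
    (date(2018, 5, 1), 120),
    (date(2018, 5, 20), 118),
    (date(2018, 6, 2), 125),
]
print(most_recent(blood_type))
print(latest_per_month(systolic, earliest=date(2013, 6, 1)))
```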
Lloyd McKenzie (Jun 11 2018 at 19:44):
@Bryn Rhodes
Isaac Vetter (Jun 11 2018 at 19:52):
Hey Lloyd,
"Filtering by statuses is going to have limited utility."
No doubt you're exactly right. I was just looking for a blunt instrument to hack off a majority of the data some of the time. We also haven't defined the mechanisms of creating the Group resource. Would this be a reasonable place for some interoperable business rules?
Blood type is a great example. It'd be a shame to download all Observations ever for all patients to get all patients' blood types.
The basic interoperability needs (I think) for the most well defined use-case is the exchange of recent encounters, meds & observations, and updates to problems and allergies. The part of the patient's chart that's actionable.
Bryn Rhodes (Jun 11 2018 at 19:54):
Thanks for the ping, and apologies for the delay, this had dropped off my radar. In the quality measurement and decision support space, we have the notion of a DataRequirement to specify this. Would that work as a way to structurally identify filters for types of interest?
Michael Donnelly (Jun 11 2018 at 19:57):
What's a DataRequirement?
Bryn Rhodes (Jun 11 2018 at 20:34):
http://build.fhir.org/metadatatypes.html#datarequirement
Bryn Rhodes (Jun 11 2018 at 20:34):
Structural representation of a restricted class of queries.
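As a rough illustration of what such a structure could look like for the medication example earlier in the thread, here is a DataRequirement written as a plain dictionary, with a toy matcher. The `type`, `codeFilter`, `path` and `code` names follow my reading of the datatype page linked above. The matches function is hypothetical and only handles simple top-level string elements, so it is nowhere near a full evaluator.

```python
# A DataRequirement-shaped dict: MedicationRequests that are active or completed.
med_request_requirement = {
    "type": "MedicationRequest",
    "codeFilter": [
        {"path": "status", "code": [{"code": "active"}, {"code": "completed"}]}
    ],
}


def matches(requirement, resource):
    """Toy check: resource type matches and every codeFilter is satisfied."""
    if resource.get("resourceType") != requirement["type"]:
        return False
    for code_filter in requirement.get("codeFilter", []):
        allowed = {coding["code"] for coding in code_filter.get("code", [])}
        if resource.get(code_filter["path"]) not in allowed:
            return False
    return True


print(matches(med_request_requirement,
              {"resourceType": "MedicationRequest", "status": "active"}))
print(matches(med_request_requirement,
              {"resourceType": "MedicationRequest", "status": "stopped"}))
```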
Lloyd McKenzie (Jun 11 2018 at 20:40):
I'd been thinking of DataRequirement but I'd forgotten it was a data type.
Lloyd McKenzie (Jun 11 2018 at 20:45):
The _since parameter makes it easy to get updates once you've gotten the initial "relevant" set. The trick is defining the "relevant" set. For allergies, you want all of them that aren't entered in error, as you probably want the refuted ones too. (Though if you can get a curated set, that's better.) For procedures, it's going to depend. A hand amputation 25 years ago is probably relevant. The fact they had physiotherapy 25 years ago probably doesn't matter. The fact they had a filling 25 years ago might be relevant, depending on whether they still have that tooth... The same thing is going to be true of labs and other Observations. My guess is there'll be a "common" set of relevant information but there will also be more specialized versions of "relevant".
Christiaan Knaap (Jun 12 2018 at 05:33):
What if we see bulk data just as a way of performing searches allowing for a potentially very large result set, returned async in ndjson? So simply reuse the search framework as a whole (give or take a few limitations maybe)? The search framework already allows a server to express what it supports (in the CapabilityStatement), so we don't need something similar but separate for bulk data for that either.
Given the nature of bulk data to request a coherent set of data, support for GraphDefinition will probably be very useful.
This feels natural to me from a generic-fhir-server standpoint, but maybe less so for FHIR interfaces on existing EHRs?
Michael Donnelly (Jun 12 2018 at 13:42):
For bulk data, we use different retrieval mechanisms than we do for regular searches.
Michael Donnelly (Jun 12 2018 at 13:46):
The patient cohort comes from a predefined registry.
Michael Donnelly (Jun 12 2018 at 13:47):
For each of those patients, we do use conventional search logic to retrieve the set of resources requested.
Simone Heckmann (Jun 15 2018 at 14:45):
I agree with @Christiaan Knaap. We are currently looking at Bulk Data as a candidate for migrating / exporting patient data as well as whole system data and may or may not have different requirements from what is currently specified. (We don't have anything specific yet).
If Bulk Data had a generic approach, we'd be much more comfortable with laying our hopes on the fact that Bulk Data will solve our problems in the near future.
Dan Gottlieb (Jun 15 2018 at 18:34):
@Simone Heckmann, @Grahame Grieve recently added async semantics to FHIR that support using the bulk data approach with any standard FHIR query http://build.fhir.org/async.html (the docs are a bit out of date relative to recent updates on the $export operation at https://github.com/smart-on-fhir/fhir-bulk-data-docs/blob/master/export.md). Our goal for the $export bulk data operation has been to carve out a subset of the uses of the async API that can be widely supported in a consistent way across a variety of systems ranging from full FHIR servers to data warehouses with limited FHIR support (specifically, export of all data in the system, export of data for all patients or export of data for a group of patients). Taken together, do you think the general async capabilities and the $export subset will cover your set of use cases?
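A sketch of the client side of that async pattern, under these assumptions: the status endpoint answers 202 while the job is running and 200 with a JSON body whose `output` array lists file URLs once it is done. That is my reading of the linked docs, so verify it against the current version. The HTTP call is passed in as a function so the sketch stays independent of any particular HTTP library; wait_for_export and get_status are hypothetical names.

```python
import time


def wait_for_export(get_status, status_url, delay_seconds=5.0, max_polls=100):
    """Poll until the export finishes, then return the output file URLs.

    `get_status(url)` must return a (status_code, parsed_json_or_None) pair.
    """
    for _ in range(max_polls):
        status_code, body = get_status(status_url)
        if status_code == 202:
            time.sleep(delay_seconds)  # still in progress, try again later
            continue
        if status_code == 200:
            return [item["url"] for item in body.get("output", [])]
        raise RuntimeError(f"export failed with HTTP {status_code}")
    raise TimeoutError("export did not complete within the polling budget")


# Simulated server: two "in progress" answers, then a completed job.
fake_responses = iter([
    (202, None),
    (202, None),
    (200, {"output": [{"type": "Patient", "url": "https://ehr.example.org/files/1.ndjson"}]}),
])
urls = wait_for_export(
    lambda url: next(fake_responses),
    "https://ehr.example.org/status/123",
    delay_seconds=0,
)
print(urls)
```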
Dan Gottlieb (Jun 15 2018 at 18:36):
More broadly, I suppose one way to look at adding filter options to $export operation beyond the three we've already defined (resource type, modified date, patient group) is to consider whether they are sufficiently important to enabling real world use of the operation that we want to accept the trade off of the added implementation complexity. One potential approach would be to keep the limited filter options we have now in v1, see what additional client side filtering backend services end up doing in production, and then use that to inform our approach to server side filter options in v2.
Josh Mandel (Jun 19 2018 at 14:48):
@Simone Heckmann just checking back to make sure Dan's answer makes sense to you -- and that you have what you need to build out additional, fine-grained async interactions on top of what we've defined for Bulk Data (i.e. use the async infrastructure; ignore the $export operation itself if you don't need it).
Simone Heckmann (Jun 19 2018 at 15:13):
@Josh Mandel @Dan Gottlieb Thank you for the information. I haven't had a chance to read up on the async spec but from what I see, it looks like a perfect fit for anything that goes beyond $export. I'm sure we'll be fine :)
Isaac Vetter (Jun 28 2018 at 19:00):
building on Grahame and Dan's suggestions, how about a simple syntax like this:
_type=MedicationRequest,Condition&MedicationRequest:status=active,completed&Condition:clinical-status=active
Instead of adding sub-search parameters on each resource in _type, we borrow from the _include syntax and define resource-level filtering as top-level query parameters?
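A quick sketch of how a server might pick apart a query written in this proposed "ResourceType:param" style. The parse_export_query name is hypothetical, and treating any parameter name containing ":" as a per-resource filter is an assumption made only for this illustration.

```python
from urllib.parse import parse_qsl


def parse_export_query(query):
    """Split a query string into requested types and per-type filters."""
    types = []
    filters = {}
    for name, value in parse_qsl(query):
        if name == "_type":
            types = value.split(",")
        elif ":" in name:
            resource_type, search_param = name.split(":", 1)
            filters.setdefault(resource_type, {})[search_param] = value.split(",")
    return types, filters


types, filters = parse_export_query(
    "_type=MedicationRequest,Condition"
    "&MedicationRequest:status=active,completed"
    "&Condition:clinical-status=active"
)
print(types)
print(filters)
```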
Isaac Vetter (Jun 28 2018 at 19:45):
Nix the above. We're going to look at experimentally enabling standard FHIR search syntax for a client to better describe the data that it wants. Josh to write a PR on the bulk data gh site.
Josh Mandel (Jul 02 2018 at 14:08):
Documented the experimental _typeFilter parameter in PR#54
Last updated: Apr 12 2022 at 19:14 UTC