FHIR Chat · Since parameter and group membership changes · bulk data

Stream: bulk data

Topic: Since parameter and group membership changes


view this post on Zulip Rick Hawes (Oct 07 2019 at 22:23):

Greetings. I want the group's advice about a bulk data export scenario. First, a little background: CMS is setting up regular bulk data exports of our claims information (see the DPC and BCDA posts elsewhere in this forum). For each client, we define a group for bulk-export. We are expecting our clients to call our bulk-export endpoint weekly to get the latest information for a group. The first time a client calls CMS, they receive the full dataset of their group. After the first export, clients only need the updates to a group's dataset. An export with a _since parameter is well suited for this use-case: it exports only data that has changed after the passed-in timestamp and does not include information that the client already has. Because we have an append-only dataset, we expect this optimization to make our exports a couple of orders of magnitude faster.

The problem that I've run into is what to do when a group's roster changes. Specifically, how can a client get the initial full dataset for only new members of a group? A _since export only includes resources that have a lastUpdated time after the supplied since parameter. A full export would include the new member's resources, but it is very inefficient because the number of new members is small. One solution would be to use a synchronous method to get an individual member's resources. Another solution would be for clients to request a full export periodically. Both of these solutions are inefficient. I want to offer an export operation that includes all resources updated since a timestamp and every resource that belongs to a member added since a timestamp.

So, has anybody run into this problem? It seems to have been discussed a little on subscriptions and since parameter interaction. Thanks in advance for any advice.

view this post on Zulip Michele Mottini (Oct 08 2019 at 02:21):

Very good point...no idea about a solution though

view this post on Zulip Paul Church (Oct 08 2019 at 02:58):

Google has encountered a pattern like that, often posed as:

My app/ML model/dataset/whatever has access to only patients currently admitted to facility X, and I subscribe to or periodically export changes to those patients. When an admission happens I want to go get all of the historical record for the patient that I newly gained access to.

In this context we lean towards "subscribe to patient admissions, call synchronous patient-everything" because the flow of data is near-real-time and the rate of patients appearing is reasonably low. For weekly updates this is not going to work.

There is a related problem in subscriptions of "I want to subscribe to the event that I gain access to a resource that I did not previously have access to" which applies this question to a push instead of pull relationship. (And its troublesome converse "...that I lose access to a resource...")

view this post on Zulip Paul Church (Oct 08 2019 at 03:03):

I have a fairly clear idea for making Group/$export do what you want, but it requires that your Group has a properly maintained Group.member.period for every member.

When doing Group/$export?_since=T, for each member if period.start < T you export Patient/$everything?_since=T, and if period.start >=T you just export Patient/$everything because it's a new addition. This gets you what you want in one go.
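
A minimal sketch of the selection rule described above, not of how the spec currently reads. It assumes every member carries a Group.member.period.start and every resource a meta.lastUpdated; the `Member` and `Resource` classes and the function name `select_for_group_export` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Member:
    patient_id: str
    period_start: datetime  # stands in for Group.member.period.start


@dataclass
class Resource:
    patient_id: str
    resource_id: str
    last_updated: datetime  # stands in for Resource.meta.lastUpdated


def select_for_group_export(
    members: List[Member], resources: List[Resource], since: Optional[datetime]
) -> List[Resource]:
    """Pick the resources to export under the proposed _since rule."""
    start_by_patient = {m.patient_id: m.period_start for m in members}
    selected = []
    for res in resources:
        start = start_by_patient.get(res.patient_id)
        if start is None:
            continue  # patient is not in the group
        if since is None or start >= since:
            # No _since, or member joined at/after T: export everything.
            selected.append(res)
        elif res.last_updated > since:
            # Existing member: export only what changed after T.
            selected.append(res)
    return selected


if __name__ == "__main__":
    utc = timezone.utc
    t = datetime(2020, 4, 27, tzinfo=utc)
    members = [
        Member("old", datetime(2020, 1, 1, tzinfo=utc)),
        Member("new", datetime(2020, 4, 30, tzinfo=utc)),
    ]
    resources = [
        Resource("old", "a", datetime(2020, 4, 20, tzinfo=utc)),
        Resource("old", "b", datetime(2020, 4, 29, tzinfo=utc)),
        Resource("new", "c", datetime(2020, 4, 24, tzinfo=utc)),
    ]
    print([r.resource_id for r in select_for_group_export(members, resources, t)])
```

The rule only works if period.start is reliable for every member, which is the precondition stated above.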

view this post on Zulip Rick Hawes (Oct 08 2019 at 04:10):

I have a fairly clear idea for making Group/$export do what you want, but it requires that your Group has a properly maintained Group.member.period for every member.

When doing Group/$export?_since=T, for each member if period.start < T you export Patient/$everything?_since=T, and if period.start >=T you just export Patient/$everything because it's a new addition. This gets you what you want in one go.

Thanks Paul. You are right, that is what I want for the semantics of the operation, but returning a resource with a lastUpdated value before the since parameter would be incompatible with the export spec. Here's the line that I'm looking at:

Resources will be included in the response if their state has changed after the supplied time (e.g. if Resource.meta.lastUpdated is later than the supplied _since time).

To me, this description means that I can't return everything for a new member and still be compatible with the spec. What to do?

view this post on Zulip Paul Church (Oct 08 2019 at 12:36):

Yes, to be clear I am suggesting a way that the spec could be changed or extended, not how it currently is.

view this post on Zulip Michele Mottini (Oct 08 2019 at 13:34):

Change the specs

view this post on Zulip Michele Mottini (Oct 08 2019 at 13:36):

The export system would need to know when patients have been added to the group, but that should not be too difficult

view this post on Zulip Josh Mandel (Oct 11 2019 at 02:15):

I agree that it is worth describing this functionality specifically in the next version of the $export specification and would welcome a PR + GForge tracker.

I should also note that the proposed solution would not technically violate what we have written today, since:

Resources will be included in the response if their state has changed after the supplied time (e.g. if Resource.meta.lastUpdated is later than the supplied _since time).

says "included in the response if" and not "only if". It seems like an over-read, but we do have precedent for saying that it is okay for servers to return more data than a client requested (e.g., additional operation outcomes on a search; or as we said in Sync for Science, it's okay for a server to return results from a wider range of time than the client requests).

view this post on Zulip Yunwei Wang (Oct 11 2019 at 13:21):

but we do have precedent for saying that it is okay for servers to return more data than a client requested (e.g., additional operation outcomes on a search; or as we said in Sync for Science, it's okay for a server to return results from a wider range of time than the client requests).

That is a headache for clients and testing systems, because I cannot validate whether a server's export satisfied the value in the _since parameter.

view this post on Zulip Josh Mandel (Oct 11 2019 at 14:12):

Yeah, that's true. You have the output's "request" property at least

view this post on Zulip Josh Mandel (Oct 11 2019 at 14:13):

(And less of a headache than having no efficient way to get the data.)

view this post on Zulip Rick Hawes (Oct 12 2019 at 13:45):

I should also note that the proposed solution would not technically violate what we have written today, since:

Resources will be included in the response if their state has changed after the supplied time (e.g. if Resource.meta.lastUpdated is later than the supplied _since time).

says "included in the response if" and not "only if". It seems like an over-read, but we do have precedent for saying that it is okay for servers to return more data than a client requested (e.g., additional operation outcomes on a search; or as we said in Sync for Science, it's okay for a server to return results from a wider range of time than the client requests).

@Isaac Vetter I'm wondering what the EPIC bulk-export client would do if a bulk-export end-point returned more than what was requested per Josh's interpretation.

view this post on Zulip Matt Snook (May 04 2020 at 17:35):

I'd like to revive this thread... :)

BCDA is looking at experimental ways to simplify the problem Rick posed many months ago; and hopefully through experimentation, user feedback, and feedback from the FHIR community we can iterate and come to a solution.

We currently provide /Patient, which allows an authorized user to request bulk data for all patients associated with their organization. The mapping of patients-to-organization is handled by an ETL process that our system executes on a monthly basis. We are looking for ways to utilize _since in a world where group rosters change monthly.

We currently offer an experimental group identifier called "all", which essentially does exactly what the /Patient endpoint does: it retrieves data for all patients associated with the requesting organization. We are experimenting with a new group ID called "new", which would perform a grouping for patients after a request is made by an authorized organization. The "new" group identifier would compare the patients from the current month's ETL load against the patients from the previous month's ETL load, and only include the "new" patients (i.e., those that exist in this month's load but not last month's load) in the bulk export job. This provides a way for users of the API to "slim down" the bulk data requested, in situations where only new patient data is needed. Note that we intend to use this in conjunction with _since, where _since would be used to retrieve newer claims data and /Group/new would be used to retrieve newer patients attributed to the calling organization.
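
As a rough illustration of the roster comparison just described (a sketch only; the function name and the idea of representing each ETL load as a plain collection of patient IDs are assumptions, not BCDA's implementation):

```python
from typing import Iterable, List


def new_group_members(current_load: Iterable[str], previous_load: Iterable[str]) -> List[str]:
    """Patient IDs that exist in this month's load but not in last month's."""
    return sorted(set(current_load) - set(previous_load))


if __name__ == "__main__":
    april = ["p1", "p2", "p3"]
    may = ["p2", "p3", "p4", "p5"]
    # Patients dropped from the roster (p1) are simply ignored here.
    print(new_group_members(may, april))
```

Because only two adjacent loads are compared, the result says nothing about patients who joined in earlier periods.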

We are aware of this statement from the Bulk FHIR Spec: "Note: How these Groups are defined is specific to each FHIR system’s implementation.".... but we do want to make sure we are building something that does not stray from the spec.

We are looking to the FHIR community to provide comments/feedback on this approach, from a FHIR-compliance perspective. We are especially interested in hearing comments about our approach of creating simple group identifiers.

I can provide more clarification in our approach if necessary, just let me know where that might be.

Thanks in advance.  Your input is greatly appreciated.

view this post on Zulip Jenni Syed (May 04 2020 at 17:52):

I'm actually curious why this would vary too far from what we have said about _lastUpdated or challenges in subscription. IE: the current parameters (eg: _since) apply to the "new" state. This would result in you getting data from new patients but honoring the _since...

view this post on Zulip Jenni Syed (May 04 2020 at 17:53):

The approach above isn't much different than requiring the caller (if they use _since) to occasionally download the full set of data. Group membership is only one tiny reason that what is included in the data may change from call to call

view this post on Zulip Jenni Syed (May 04 2020 at 17:54):

security and other things that you don't typically trigger alerts on are another, more tricky reason

view this post on Zulip Jenni Syed (May 04 2020 at 17:55):

Your approach does allow for a much smaller amount of data - but I think there may be other reasons they need the "full" set. Also, just because a patient has been added to a group I don't think it implies that the caller should be authorized to all previous data through the lifetime of the patient

view this post on Zulip Jenni Syed (May 04 2020 at 17:56):

eg: as a payor, insuring a new member who is 80 yrs old, why should I get data prior to that time they were added?

view this post on Zulip Dan Gottlieb (May 04 2020 at 18:16):

@Matt Snook good timing! The issue of obtaining historical data on patients added to a group is one of the enhancements to the bulk IG we're focusing on this year in the Argonaut bulk data working group. I put together a straw man proposal around extending the meaning of the _since param that we discussed at the meeting last week (issue: https://github.com/HL7/bulk-data/issues/65 , proposal: https://github.com/HL7/bulk-data/pull/68/files , meeting notes: https://bit.ly/argo20-bulk ). As a follow-up, I'm writing up an alternate proposal that doesn't require the server to maintain a history of group membership that we'll be discussing at the next meeting (either next Thursday or two weeks after that depending on the FHIR Connectathon schedule - details will be posted to the Argonaut announcement thread above). Perhaps you could join the call to talk through your use case?

view this post on Zulip Dan Gottlieb (May 04 2020 at 18:17):

Regarding the "all" group, why not just use the patient endpoint, since all requests are filtered by what the user is allowed to access?

view this post on Zulip Gino Canessa (May 04 2020 at 18:17):

There was discussion around some of this during the Argonaut Bulk Data call last week (notes).

The filed issue against the spec uses the case where someone is added to the group, but their newest data is before your last _since (e.g., checking today (May 4), last check was April 27, the patient was added this past week, but newest data is from April 24).

Keeping in mind that Bulk Data is generally concerned with large scales (e.g., millions of records and/or several GB of data), there is a consensus that the spec should clarify this with a 'standard process' for handling it.

Can read the notes to check, but I believe that (and please jump in if I'm wrong):

  • This is common enough to warrant handling
  • It could be handled purely by clients, but:
    • There is a lot of complexity to handling this properly, which may not be reasonable in some implementations
    • There are more clients than servers, so this will result in more repeated work (and opportunities to get wrong)
  • It probably should be handled by servers, but:
    • Servers cannot be expected to track the dataset for every query in order to compare in the future
    • This is particularly problematic around characteristic groups (e.g., groups that are only instantiated during requests to resolve them)

Other than the last point, I think a subscription would actually work to bridge the gap (e.g., notify me when group membership changes, so that I can build a list of changed patients). That last point doesn't work with subscriptions unless we define it somehow (e.g., instantiate a group over some period and build a change notification based on that?). That last one may be worth bringing up for subscriptions during the connectathon as well.

view this post on Zulip Dan Gottlieb (May 04 2020 at 18:17):

Also, if we went with a structure like "Group/new", then users with access to more than one group wouldn't be able to see additions to a specific group, right?

view this post on Zulip Gino Canessa (May 04 2020 at 18:17):

Lol for hitting enter the same time as Dan, as well.

view this post on Zulip Dan Gottlieb (May 04 2020 at 18:18):

Ha, at least we didn't contradict each other!

view this post on Zulip Gino Canessa (May 04 2020 at 18:19):

Dan Gottlieb said:

Ha, at least we didn't contradict each other!

Hmm... let me edit :wink:

view this post on Zulip Dan Gottlieb (May 04 2020 at 18:23):

@Jenni Syed agree that whether a user gets historical data on patients added to a group certainly depends on the business rules involved, but this has been a challenge in implementations and is forcing clients to download many GB of data on each request just to find the small set of new data being sent for this reason.

view this post on Zulip Jenni Syed (May 04 2020 at 18:34):

I worry that the person being added (or removed?) is the "easy" problem to solve, but there are many "hidden" problems that servers may not be able to track which may require a client to download the full population sometimes - NOT every time

view this post on Zulip Jenni Syed (May 04 2020 at 18:36):

I agree we should try to make it more performant/easy (noting that _since isn't a required parameter per spec nor per ONC regs), but I think we're going to find edge cases where this could get "fun" beyond just group membership

view this post on Zulip Jenni Syed (May 04 2020 at 18:36):

IE: X Person was always part of the group, but this app didn't have privs to see them?

view this post on Zulip Dan Gottlieb (May 04 2020 at 18:45):

Yeah, that wouldn't necessarily be addressed by the more robust _since approach (unless the server is sophisticated enough to consider privilege changes in the delta calculation). It should work with the other approach we discussed at the meeting of first querying for a list of patients in the group and then having the client restrict the full data query to a set of patients ids it hasn't seen before via a new "patient" param.

view this post on Zulip Gino Canessa (May 04 2020 at 18:46):

Agreed Jenni. But if it's hard for this group of people, I wouldn't expect every individual implementer to get it right.
I like the security use case too, but Dan handily beat my reply this time :-)

view this post on Zulip Matt Snook (May 05 2020 at 14:12):

Dan Gottlieb said:

Also, if we went with a structure like "Group/new", then users with access to more than one group wouldn't be able to see additions to a specific group, right?

Technically, in BCDA, users are Accountable Care Organizations (ACOs) and thus users have access to just one group - the beneficiaries attributed to their ACO. This is why a call to /Patient returns the same data as /Group/all (in our implementation, which is essentially why I'm looking for feedback on our approach). Because of this "grouping" on an ACO basis, we were looking at making use of pre-defined group identifiers to group patients within the ACO based on their "newness" to the ACO, so that the workflow plays nicely with _since. Thanks for giving this thread attention.

view this post on Zulip Josh Mandel (May 05 2020 at 19:44):

I'm not sure I'm following @Matt Snook -- are you describing a behavior you think we might want to standardize through the Bulk Data IG? Is there a use case you're looking to handle that's not being addressed by the PRs currently being proposed? It'd be great if you can join next week's Argonaut call.

view this post on Zulip Matt Snook (May 06 2020 at 20:35):

Josh Mandel said:

I'm not sure I'm following Matt Snook -- are you describing a behavior you think we might want to standardize through the Bulk Data IG? Is there a use case you're looking to handle that's not being addressed by the PRs currently being proposed? It'd be great if you can join next week's Argonaut call.

@Josh Mandel I'm mainly interested in hearing comments/ feedback on the approach we are considering with a pre-defined group identifier called "new", which is used to only retrieve beneficiaries which are new this period (where "period" might be defined as 1 month).

view this post on Zulip Karl M. Davis (May 06 2020 at 20:51):

@Matt Snook What if an ACO forgets to pull data during a period? Or has an outage spanning the entire period?

view this post on Zulip Matt Snook (May 06 2020 at 21:02):

Karl M. Davis said:

Matt Snook What if an ACO forgets to pull data during a period? Or has an outage spanning the entire period?

Great question. I think, in that case, they would need to pull "all" data again if they'd like historical data for new beneficiaries. Also, remember, the idea of "new" representing 1 month is just for purposes of discussion. It could also mean "new this year" (for example) if we think that is a better approach for ACO's workflows. The intent here is to see how these pre-defined group identifiers either stray or align with FHIR and the community.

view this post on Zulip Karl M. Davis (May 06 2020 at 21:13):

Okay. How would they determine, programmatically, that they need to do that? What code would they write to automatically say, "hey, looks like we need to pull all the data from scratch again -- let's do that"?

view this post on Zulip Josh Mandel (May 06 2020 at 21:26):

In general we try not to assign any meaning to FHIR Resource.id values, since they're not always under a server's control (e.g., some architectures assign these randomly or through counters, etc).

view this post on Zulip Dan Gottlieb (May 07 2020 at 13:51):

@Matt Snook - in terms of alternate options, the proposed spec changes outlined in https://github.com/HL7/bulk-data/pull/71/files?short_path=985cef1#diff-985cef1e2f9cd85436f27813aa6de763 are an attempt at a generalizable approach to returning historical data for patients new to a group.

view this post on Zulip Dan Gottlieb (May 07 2020 at 13:51):

Servers would be able to either return this data as part of the group export request based on the timestamp of the _since parameter and/or support a pattern where the client can determine which patients have been added to a group and request just their data.

view this post on Zulip Dan Gottlieb (May 07 2020 at 13:51):

This has the advantages of not being tied to a particular time period and working on servers where a user can have access to more than one group. Would this approach work for your use case?

view this post on Zulip Matt Snook (May 07 2020 at 14:01):

Dan Gottlieb said:

Matt Snook - in terms of alternate options, the proposed spec changes outlined in https://github.com/HL7/bulk-data/pull/71/files?short_path=985cef1#diff-985cef1e2f9cd85436f27813aa6de763 are an attempt at a generalizable approach to returning historical data for patients new to a group.

This approach is particularly interesting. To clarify the meaning of this addition to the _since parameter:
In the case of a Group level export, servers MAY return additional resources if the resource belongs to the patient compartment of a patient added to the Group after the supplied time (this behavior should be clearly documented by the server).

I read this as, when /Group is called with a date specified in _since:

  • then we return data for all patients in the specified group which have had data updated since the specified date
  • we also, optionally, could return ALL data (including historical data prior to the specified date input in _since param) for any new patients added to the group after the specified date. If we chose to optionally implement this strategy, the API should be heavily documented to clearly indicate the behavior.

If that is the intended usage of this proposed addition, then I think it might work for our use case! I'd love to hear thoughts from @Karl M. Davis on this.

view this post on Zulip Rick Hawes (May 08 2020 at 13:08):

@Dan Gottlieb At the CMS, our use-cases can have large groups (the largest I know of has >1M members). The patient query parameter in the proposal seems very well suited for small lists of patients, but for medium-sized lists of patients, it will run against URL length limits.

I suspect this issue came up in your discussion. What is the workgroup's thinking?

view this post on Zulip Josh Mandel (May 08 2020 at 13:14):

The plan is to allow submissions via POST to accommodate larger payloads.

view this post on Zulip Josh Mandel (May 08 2020 at 13:16):

It still limits us to what's reliably going to fit in a single HTTP request body, but this seems workable. (To be clear though, CMS could implement the diff logic server side and save clients the trouble of computing/submitting a diff.)
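
To make the client-side variant concrete, here is a sketch of a client working out which patients it has not seen yet and wrapping them in a FHIR Parameters resource for a POST body. The parameter name "patient" comes from the proposal discussed in this thread; encoding each entry as a valueReference, and the helper name itself, are assumptions for illustration rather than anything confirmed here.

```python
import json
from typing import Dict, Iterable


def build_new_patient_export_body(current_ids: Iterable[str], seen_ids: Iterable[str]) -> Dict:
    """Build a Parameters body listing only patients the client has not exported before."""
    new_ids = sorted(set(current_ids) - set(seen_ids))
    return {
        "resourceType": "Parameters",
        "parameter": [
            # One repeated "patient" parameter per unseen patient (assumed encoding).
            {"name": "patient", "valueReference": {"reference": f"Patient/{pid}"}}
            for pid in new_ids
        ],
    }


if __name__ == "__main__":
    body = build_new_patient_export_body(["p1", "p2", "p3"], ["p1"])
    print(json.dumps(body, indent=2))
```

The body grows linearly with the number of new patients, so it is still bounded by what fits in one request, as noted above.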

view this post on Zulip Rick Hawes (May 08 2020 at 13:23):

@Josh Mandel Yes. A POST version of the request would be much more workable for large lists. Thanks for the quick response.

view this post on Zulip Dan Gottlieb (May 08 2020 at 13:27):

The POST support Josh mentioned is in the proposal here: https://github.com/HL7/bulk-data/pull/71/files#diff-985cef1e2f9cd85436f27813aa6de763R56 and I outlined a few implementation questions around it in the pull request at https://github.com/HL7/bulk-data/pull/71 .

view this post on Zulip Dan Gottlieb (May 08 2020 at 13:29):

Also, when servers have a history of group changes, it may be cleaner to just supply this data based on the _since param as outlined at https://github.com/HL7/bulk-data/pull/71/files#diff-985cef1e2f9cd85436f27813aa6de763R121 rather than implementing the patient param and making clients track and request the history.

view this post on Zulip Richard Braman (May 14 2020 at 16:04):

@Dan Gottlieb I know AdHoc/CMS engineers are interested in continuing this discussion. Is this something we can do during Connectathon or would next WG meeting be more appropriate?

view this post on Zulip Dan Gottlieb (May 14 2020 at 16:12):

@Richard Braman thanks for following up! The current plan is to have this be the main topic of discussion at the next Argonaut meeting (two weeks from today), since a few of the folks involved are focusing on other Connectathon tracks.

view this post on Zulip Dan Gottlieb (May 14 2020 at 16:16):

That said, if the team has time, it might be worth using the connectathon to review the proposed spec changes at https://github.com/HL7/bulk-data/pull/71/files?short_path=985cef1#diff-985cef1e2f9cd85436f27813aa6de763 and the open questions at https://github.com/HL7/bulk-data/pull/71 (possibly with some prototyping) to develop specific feedback based on your implementation.

view this post on Zulip Vladimir Ignatov (Aug 10 2020 at 22:02):

I understand that POST was only added to handle large patient payloads, but I think it may need a bit more detail. Some questions that could use explicit answers in the spec are:

  • Can I use the POST parameters for every export parameter, including _since, _elements, _type, etc?
    - If no, specify which parameters are allowed via POST
    - If yes:
      - Can I mix POST and GET parameters, i.e. set some of them in the query string and others in the POST body?
      - Can I duplicate a parameter and specify it both in the query string and in the body?
        - If no, return an error?
        - If yes, which one takes precedence?

view this post on Zulip Josh Mandel (Aug 10 2020 at 22:23):

I'm hoping the core FHIR operation semantics provides answers to most of these.

  • (Idempotent) Operations can be invoked by GET or POST, with the same parameters available in both. (Exception: parameters with complex types like Reference can't be passed via GET, so if you need them, you need POST.)

  • Any given invocation is either via POST (in which case, parameters are passed in the body) or GET (in which case, parameters are passed in the URL). There's no splitting/mixing.
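
A small sketch of what "no splitting/mixing" can look like on the client side: the same parameters are encoded either entirely in the URL (GET) or entirely in a Parameters body (POST). The function name, the example base URL, and the choice of value[x] key per parameter are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import urlencode

# Assumed value[x] keys; any parameter not listed here is sent as a string.
VALUE_KEYS = {"_since": "valueInstant"}


def encode_export_request(
    base_url: str, method: str, params: Dict[str, str]
) -> Tuple[str, Optional[Dict]]:
    """Return (url, body) with all parameters in exactly one place."""
    if method == "GET":
        # Everything goes in the query string; there is no body.
        url = f"{base_url}?{urlencode(params)}" if params else base_url
        return url, None
    if method == "POST":
        # Everything goes in the body; the URL carries no query string.
        body = {
            "resourceType": "Parameters",
            "parameter": [
                {"name": name, VALUE_KEYS.get(name, "valueString"): value}
                for name, value in params.items()
            ],
        }
        return base_url, body
    raise ValueError(f"unsupported method: {method}")


if __name__ == "__main__":
    params = {"_since": "2020-04-27T00:00:00Z", "_type": "Patient,Coverage"}
    base = "https://example.org/fhir/Group/123/$export"  # placeholder URL
    print(encode_export_request(base, "GET", params))
    print(encode_export_request(base, "POST", params))
```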

view this post on Zulip Vladimir Ignatov (Aug 11 2020 at 00:52):

Thanks Josh! I guess I'll have to get used to thinking of Bulk Data as part of FHIR :)


Last updated: Apr 12 2022 at 19:14 UTC