FHIR Chat · Anonymized Resources · implementers

Stream: implementers

Topic: Anonymized Resources


Newt (Feb 04 2021 at 08:30):

Hi all, I am wondering what patterns exist for optionally anonymizing certain fields for particular consumers of a FHIR API.
The scenario we have is exposing an API to multiple consumers; based on permissions, some can see all data, whereas some should only be able to see de-identified or anonymized versions of the data.

Some ideas I have so far are:

  1. Provide a separate service with a different version of the resources and just not return the fields at all - e.g. for Patient, just remove all names, addresses etc - so each consumer has access to one of the two APIs
  2. Make all the fields for resources that might be anonymized non-mandatory, so just don't return them for certain consumers
  3. Alter the fields, maybe returning * or similar in place of actual characters (essentially redacting the information but keeping the field with some kind of value)
  4. Provide a new custom resource for the anonymized ones in parallel to the usual ones and control who has access to which

Are there any other options people can think of? Or any opinions on what the best way to go would be?

I am trying to determine what most consumers would expect as the best solution.

Some things I have found while searching: there is an AuditEvent extension, though I am not quite sure what the context for using it is:
https://www.hl7.org/FHIR/extension-auditevent-anonymized-definitions.html

I also found this, which I am investigating, but it seems more intended for redacting information to extract it and send, and I am not sure if it works for real-time APIs:
https://github.com/microsoft/FHIR-Tools-for-Anonymization

John Moehrke (Feb 04 2021 at 13:13):

What has been discussed in the security wg is that you can provide a degraded resource based on things like application credentials, user credentials, purposeOfUse, etc. The degraded resource would need to be marked in the .meta.security with a tag that indicates that it is degraded.

John Moehrke (Feb 04 2021 at 13:14):

The FHIR spec has one of these use-cases baked in: the "_summary" search parameter https://www.hl7.org/fhir/search.html#summary
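
A minimal sketch of what that looks like from a client, assuming a hypothetical base URL and resource id; the _summary parameter itself is from the spec page linked above.

```python
# Hedged sketch: fetch the summary form of a Patient via _summary.
# The base URL and resource id are assumptions, not from this thread.
import requests

BASE = "https://fhir.example.org/r4"

resp = requests.get(
    f"{BASE}/Patient/123",
    params={"_summary": "true"},               # server returns only summary elements
    headers={"Accept": "application/fhir+json"},
)
patient_summary = resp.json()
# The spec expects subsetted resources to be marked (SUBSETTED) in meta so they
# are not mistaken for, or written back as, complete resources.
print(patient_summary.get("meta"))
```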

John Moehrke (Feb 04 2021 at 13:16):

Some useful security tags for when the resource is degraded for security or privacy reasons are in the Security Integrity valueset
https://www.hl7.org/fhir/v3/SecurityIntegrityObservationValue/vs.html

such as ANONYED, MASKED, REDACTED, etc...
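
For illustration, a degraded Patient carrying one of those codes might look roughly like this; the terminology.hl7.org system URI and the resource content are my assumptions, not something stated in the thread.

```python
# Sketch of a degraded resource marked in meta.security, per John's description.
degraded_patient = {
    "resourceType": "Patient",
    "id": "example",
    "meta": {
        "security": [
            {
                "system": "http://terminology.hl7.org/CodeSystem/v3-ObservationValue",
                "code": "REDACTED",       # or ANONYED / MASKED, depending on the degradation
                "display": "redacted",
            }
        ]
    },
    # identifying elements (name, address, telecom, ...) withheld by the server
    "gender": "female",
    "birthDate": "1970",                  # precision reduced as part of the degradation
}
```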

John Moehrke (Feb 04 2021 at 13:18):

see FHIR Core security page on this very topic https://www.hl7.org/fhir/secpriv-module.html#deId

Craig Newman (Feb 04 2021 at 13:25):

I believe that the case reporting IG keeps the key demographic elements mandatory but requires support for the dataAbsentReason extension with a value of "masked" for those cases when de-identified data needs to be shared. http://hl7.org/fhir/us/ecr/STU1/StructureDefinition-ecr-patient.html
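
A rough sketch of that pattern, assuming the standard data-absent-reason extension (the exact eCR profile constraints are in the link above): the mandatory element is still present, but carries only the extension with the code "masked".

```python
# Hedged sketch of "present but masked" elements using data-absent-reason.
masked_patient = {
    "resourceType": "Patient",
    "name": [
        {
            "extension": [
                {
                    "url": "http://hl7.org/fhir/StructureDefinition/data-absent-reason",
                    "valueCode": "masked",
                }
            ]
        }
    ],
    # For primitive elements the extension sits on the "_" companion property:
    "_birthDate": {
        "extension": [
            {
                "url": "http://hl7.org/fhir/StructureDefinition/data-absent-reason",
                "valueCode": "masked",
            }
        ]
    },
}
```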

Grahame Grieve (Feb 04 2021 at 20:08):

I think that the simplest approach is different end-points - one for real data, one for anonymised data.

Grahame Grieve (Feb 04 2021 at 20:09):

I say this for 2 reasons:

  • caches cannot get polluted by anonymised and non-anonymised versions of the same resource
  • really good anonymization involves both providing fake dates (so ages are still roughly correct) and eliminating whole classes of links and link targets, so that some URLs are not allowed

John Moehrke (Feb 04 2021 at 20:13):

@Grahame Grieve this same argument should then be put toward the _summary feature of FHIR... I don't disagree with your urging, just that it seems your urging is counter to a feature that is in FHIR core.

Grahame Grieve (Feb 04 2021 at 20:15):

_summary is less dangerous because it's always explicit in the URL, and therefore the caching problem is not there. But yes, there is still potential danger here if clients are not careful

John Moehrke (Feb 04 2021 at 20:15):

I will note that real de-identification can't be done on a resource-by-resource basis, but rather must be done on a population of data. Thus the more realistic approach is to use bulk-data access to extract a population of data, run that through an analytics-driven de-identification, and produce a new dataset that is de-identified for the given purpose.
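
As a sketch of the first step of that pipeline, a Bulk Data kick-off request might look like this; the endpoint, auth token and what happens downstream are assumptions rather than anything prescribed in the thread.

```python
# Hedged sketch: kick off a Bulk Data export so a whole population can be
# extracted and de-identified offline.
import requests

BASE = "https://fhir.example.org/r4"          # hypothetical full-fidelity endpoint

kickoff = requests.get(
    f"{BASE}/Patient/$export",                # patient-level export
    headers={
        "Accept": "application/fhir+json",
        "Prefer": "respond-async",            # required by the Bulk Data spec
        "Authorization": "Bearer <token>",    # backend-services token assumed
    },
)
# On success the server replies 202 Accepted and points at a status endpoint;
# the resulting NDJSON files would then feed the de-identification pipeline.
print(kickoff.status_code, kickoff.headers.get("Content-Location"))
```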

John Moehrke (Feb 04 2021 at 20:16):

It was not clear from the proposal whether this was an attempt to do de-identification, or something closer to _summary-style filtering.

Newt (Feb 09 2021 at 02:57):

Excellent, thanks so much for the insight so far - very thought provoking!

I agree with your sentiment around de-identifying the population. I guess that will be more of a problem if we are exposing documents/text. I will comment on that more below, but first some more context: we are trying to create a FHIR Compliant API on top of an EHR/EMR type system. We have two options:
A) Create a facade layer that queries the raw data and converts it to FHIR on the fly
B) Host a FHIR server where we synchronise the data, store it in FHIR and allow the FHIR server to provide the API

One of the considerations is that there are various consumers of the API and they require different permissions. Given there are an infinite number of consumers, and some require close-to-real-time extraction whereas some can do batch, we just want to provide two levels of data to keep it simple:
1) anonymized/de-identified
2) full
We don't want to have to create different permission levels for different consumers, they will have to conform to one of these categories.

So in order to help decide if we would go with option A or B with the API design, a big consideration is the cost of implementing and running the API.

We are leaning towards Option B - hosting a FHIR server - but to achieve the anonymization requirement, we would have to decide whether to put a layer on top of the FHIR server that deals with it (since it doesn't seem to be a built-in feature of FHIR servers), or maybe have literally two FHIR servers running, one storing the full data and one storing anonymized data, with the logic for what we put in each decided before the data gets in there.

The latter (two servers) feels like the simpler option, although it seems wrong to duplicate so much. It does allow for an iterative approach: exclude documents and other tricky resources initially, then, if people require them, use tools to redact text etc. over time as those requirements are uncovered, all while giving full data access to those that need it without slowing them down.

Any thoughts on that?

Lloyd McKenzie (Feb 09 2021 at 03:17):

You might need more than two if you're going to have a wide variety of customers with a variety of use-cases. Anonymization and de-identification are generally best done with the use-case in mind. For example, to de-identify, there's often a need to fuzz or add randomness to dates (because knowing specific dates, or even the precise duration between particular dates, can be enough to re-identify a record). However, if the data is being used for research purposes, it may be that certain offsets need to be known with a particular degree of precision. As a result, the rules that govern the data fuzzing might need to take that intended use into account. Similar considerations might apply to geolocation information, etc.

Note that this consideration is somewhat orthogonal to the choice of transformation at time of query vs. creating a separate queriable repository. The only trick is that if doing run-time conversion, the algorithm that de-identifies/anonymizes needs to consistently convert data. E.g. ensuring that patient identity, provider identity, date fuzzing, etc. is performed consistently, even on data queried days or months apart - for a given user/use.
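
One possible way to get that consistency (an illustration, not Lloyd's prescription) is to derive pseudonyms and date offsets deterministically from a per-consumer secret, so the same input always converts the same way no matter when it is queried; the key handling and offset range here are assumptions.

```python
# Hedged sketch: deterministic pseudonyms and per-patient date shifts.
import hashlib
import hmac
from datetime import date, timedelta

SECRET_KEY = b"per-consumer-secret"           # one key per user/use, kept server-side

def pseudonym(original_id: str) -> str:
    """Stable, non-reversible pseudonym for an identifier."""
    return hmac.new(SECRET_KEY, original_id.encode(), hashlib.sha256).hexdigest()[:16]

def shift_date(original_id: str, d: date, max_days: int = 30) -> date:
    """Stable per-patient date shift, so intervals within one record are preserved."""
    digest = hmac.new(SECRET_KEY, b"date:" + original_id.encode(), hashlib.sha256).digest()
    offset = int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days
    return d + timedelta(days=offset)

# The same patient id yields the same pseudonym and shift, even months apart.
print(pseudonym("patient-123"), shift_date("patient-123", date(2021, 2, 9)))
```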

John Moehrke (Feb 09 2021 at 14:06):

Lloyd said exactly what I was going to say. De-identification is not just blinding some elements. You have indicated that you have some apps that should get a degraded form of the data. This needs to be fully analyzed for the kinds of re-identification risks that those systems could pose: they would be using all the data that comes back to them, so a population of data is used to re-identify. This is what drives the points Lloyd brings up, and it is why de-identification is done on a population of data, with statistical analysis proving that the resulting data population is protected from re-identification.

Given the little bit of information you have provided, it would seem logical that for these apps you should create a standalone de-identified database, which is also likely to lag behind the real-time system due to the mechanics of doing this level of de-identification and risk assessment. This includes mechanisms for blinding, fuzzing, id-reassignment, pseudonyms, etc., as well as analysis of the data (e.g. k-anonymity).
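
As a toy illustration of that population-level analysis (not a production method), a k-anonymity check over a few quasi-identifiers could look like this:

```python
# Hedged sketch: does every quasi-identifier combination occur at least k times?
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k=5):
    """rows: list of dicts; quasi_identifiers: keys such as birth year or postcode."""
    groups = Counter(tuple(row.get(q) for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

sample = [
    {"birthYear": 1970, "postcode": "3000", "gender": "female"},
    {"birthYear": 1970, "postcode": "3000", "gender": "female"},
]
print(is_k_anonymous(sample, ["birthYear", "postcode", "gender"], k=2))  # True
```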

Meanwhile, you have a full-fidelity FHIR API built directly on top of your database. This API is used by the trusted apps, and this API is what the de-identified database mechanism pulls from on an occasional basis (the delay mentioned).

The independent databases would be at a different root URL and would need all id values substituted, else there is a risk of re-identification through correlation between trusted and untrusted networks.

Newt (Feb 10 2021 at 03:03):

Excellent, thanks so much for your thoughts Lloyd and John.

I guess some examples would be helpful. Let's say we have 3 scenarios we need to fill:

  1. An online booking system built on top of a real-time FHIR API - this requires full data access
  2. An analytics solution that provides insights into data and events going on, built on top of a real-time FHIR API - this can only have a subset of data, no sensitive information - but the data can be "missing" as opposed to fuzzed (i.e. names and addresses and so on not available at all)
  3. A research solution that takes a daily/monthly extract of data in FHIR format; it requires the data to be there, but it must be de-identified and needs to be re-identifiable on demand

I am hoping we can have a standard API (or two APIs) for scenarios 1 and 2 that meets the needs of the vast majority of integration scenarios, but then for #3 I would envisage that would probably need to be set up differently for each use case; it's probably not reusable as a whole.

I am really now wanting to figure out the best way to re-use as much logic/storage/development effort as possible to achieve #1 and #2, and assume it will not meet the needs of #3. So with that in mind, I think where my head is at is that the best solution would be to have different endpoints for both full and anonymised data, each with their own profiles defining which fields are available. How we achieve that under the hood (facade/server) then needs to be determined. Does that feel like a sane approach?

Lloyd McKenzie (Feb 10 2021 at 03:26):

  1. Removing names and addresses but retaining all dates with precision could still allow re-identification. On the other hand, eliminating the dates is going to significantly impact what analytics can be performed. So the trick is to figure out what the re-identification risks are and what the analytic requirements are, and then to adjust the dates to the degree necessary to minimize the risks while maximizing the analytic value. There's no exact right answer - it's a question of risk tolerance. If there are multiple analytics solutions, it's possible that the risk profile and the data needs would be different, meaning the algorithm would also need to be different.

  2. Being able to re-identify is easy - just include a random identifier on the patients and retain a private map of that to the 'real' identifiers (see the sketch at the end of this message). So long as that mapping is appropriately secured, that's straightforward. The same notion of de-identification applies here - different research projects might have distinct re-identification risks and also different needs for date correlation - meaning that different research consumers might also need distinct algorithms. For example, a research project focused on newborns might need precision around date and time down to the hour or even minute, but precision around year could be completely different and could be fuzzed separately for each patient. On the other hand, research related to the health impact of a real-world event would possibly need year and month to remain un-fuzzed, but might be able to strip or fuzz the day and time.

Similarly, research conducted by an organization with access to location-based data that could identify proximity to healthcare organizations might be deemed higher risk than research by a trusted partner who has little access to such information.

In short, a one-size-fits-all or even a three-sizes-fit-all solution is questionable, and certainly would not be best practice.
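
A minimal sketch of the re-identification map from point 2 above: random study identifiers with a private, access-controlled mapping back to the real ones. The storage and governance details are assumptions.

```python
# Hedged sketch: reversible pseudonymization via a secured private map.
import secrets

class PseudonymVault:
    """Holds the only link between study ids and real ids; must be well secured."""

    def __init__(self):
        self._to_real = {}     # study id -> real id (private)
        self._to_study = {}    # real id  -> study id (keeps the mapping consistent)

    def pseudonymize(self, real_id: str) -> str:
        if real_id not in self._to_study:
            study_id = secrets.token_hex(8)
            self._to_study[real_id] = study_id
            self._to_real[study_id] = real_id
        return self._to_study[real_id]

    def reidentify(self, study_id: str) -> str:
        # Only under whatever governance process authorizes re-identification.
        return self._to_real[study_id]

vault = PseudonymVault()
sid = vault.pseudonymize("patient-123")
assert vault.reidentify(sid) == "patient-123"
```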

John Moehrke (Feb 10 2021 at 12:57):

again, Lloyd beat me to it... It is good to have more people able to express these concerns. We can't address how well you are handling #2 or #3. If your risk assessment proves that you are doing it okay, then this is possible.

My proposal had your #3 being a different service, that was fed using the #1 API. This service would need to be trusted to do the right de-identification. This model is proposed in many places. I have used diagrams that @Josh Mandel put together to show this.

John Moehrke (Feb 10 2021 at 13:10):

There is a flavor of #2 that is at the basis of some demonstration projects that the VA has done. They were designed to show Data Segmentation for Privacy (aka DS4P). They would evaluate the post-search bundle to remove resources that the recipient was not authorized to receive. In this way the same API can serve Normal patient data and Restricted patient data, where some users (or apps) would have full access and some only access to Normal. This also leveraged a "Security Labeling Service" to analyze the Bundle content and tag each resource in the bundle for the kind of sensitivity found; this service looks at the data and in theory could be quite complex and comprehensive (for the demo it just looked for some codes). After the data are sensitivity-labeled, when the recipient should only receive Normal data, the sensitive data are removed. In this case it was removal at the resource level, thus not leading to the concern Lloyd brings up.
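
My reading of that flow, as a rough sketch rather than the demo's actual code: once a labeling service has tagged each resource, bundle entries whose confidentiality exceeds the recipient's clearance are dropped before the response is returned. The system URI and the "treat unlabeled as most restricted" rule are assumptions.

```python
# Hedged sketch of post-search bundle filtering by confidentiality label.
CONFIDENTIALITY = "http://terminology.hl7.org/CodeSystem/v3-Confidentiality"
ORDER = {"U": 0, "L": 1, "M": 2, "N": 3, "R": 4, "V": 5}

def confidentiality_of(resource):
    for label in resource.get("meta", {}).get("security", []):
        if label.get("system") == CONFIDENTIALITY:
            return ORDER.get(label.get("code"), ORDER["V"])
    return ORDER["V"]            # unlabeled treated as most restricted (a policy choice)

def filter_bundle(bundle, clearance="N"):
    """Keep only entries the recipient is cleared to receive (e.g. Normal)."""
    bundle["entry"] = [
        entry for entry in bundle.get("entry", [])
        if confidentiality_of(entry.get("resource", {})) <= ORDER[clearance]
    ]
    return bundle
```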

Current development is trying to do this at the level of individual elements within a Resource. In that case the whole resource would be considered sensitive if anything within it was - the same rule as before - but the post-tagging process could eliminate an element rather than the whole Resource. As Lloyd points out, doing this absent analysis of the results could be ineffective, or could produce a resource of no value or even a dangerous one. I have railed against this kind of element-level work as more dangerous than useful. The other post-processing option could be to return the fully tagged content to a recipient that is trusted to further handle it properly, meaning the recipient is trusted to get Restricted data and trusted to further refine it based on that recipient's internal policy enforcement. Thus the restricted data are exposed, but sensitive elements are tagged at the element level.

Most of the FHIR resources are more likely to be damaged than helped by this kind of element level effort (e.g. Observation).

John Moehrke (Feb 10 2021 at 13:15):

There are a set of resources for which this element-level manipulation is more likely to be useful and less likely to be damaging - things like Patient or Encounter. Your use-case might be focused on these. I think one might be able to come up with justifiable algorithms.

For example, Patient is critical to align all resources about a subject, but it includes a bunch of demographics that not all clients need access to. Removing some elements is easy, like removing address, telecom and name. This feels less damaging, and feels like it would be effective. However, there are ways for an attacker to re-identify based on other things, like the visit information found in the Observations. However, if the recipients are "mostly trusted", and under obligation to "not attempt to re-identify", then this model can lower the risk. BUT I stress this is not a general algorithm; this is a targeted algorithm for a specific purpose under a defined policy.
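
A simple sketch of that targeted degradation, with the label placement borrowed from the meta.security examples earlier in the thread; which elements to strip is of course the policy question being described here.

```python
# Hedged sketch: strip direct demographics from Patient and mark it as redacted.
import copy

def degrade_patient(patient: dict) -> dict:
    degraded = copy.deepcopy(patient)
    for field in ("name", "telecom", "address"):
        degraded.pop(field, None)
    degraded.setdefault("meta", {}).setdefault("security", []).append({
        "system": "http://terminology.hl7.org/CodeSystem/v3-ObservationValue",
        "code": "REDACTED",
    })
    return degraded
```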

John Moehrke (Feb 10 2021 at 13:20):

Here are some of my articles. I haven't written the above up yet. You will note some of these are very old; that is because they are based on the concept and risk analysis - concepts that were defined millennia ago.

Chris Grenz (Mar 19 2021 at 14:49):

@John Moehrke If I was forced to tag elements (e.g. ElementDefinitions) with a sensitivity label, is there a sensibly standard way to do so? ElementDefinition.code seems wrong. Expecting an extension?

Lloyd McKenzie (Mar 19 2021 at 14:52):

I think an extension is the only way at the moment

Chris Grenz (Mar 19 2021 at 14:55):

Are you aware of anyone already defining such an extension?

Lloyd McKenzie (Mar 19 2021 at 14:57):

I'm not. @k connor is the most likely person to know if something exists somewhere.

k connor (Mar 19 2021 at 15:26):

Hi Chris, I'm working with @Mohammad Jafari on adding an approach to subresource/element labeling in the FHIR Data Segmentation for Privacy (DS4P) IG we are balloting in May. Current version is http://hl7.org/fhir/uv/security-label-ds4p/2020May/.

After considering all the options we could think of and discussed on the FHIR Security calls, we are proposing the simplest approach - to add a security label extension in meta.security as an indicator that there are 1..* subresource labels. The subresource labels will have the same structure as the current IG's meta.security labels - see http://hl7.org/fhir/uv/security-label-ds4p/2020May/artifacts.html.

The security label tags, e.g. for sensitivity, are already standard codes from the HL7 Privacy and Security Classification System (HCS) - see https://www.hl7.org/fhir/v3/InformationSensitivityPolicy/vs.html. If you'd like to discuss your use case further, please join the FHIR Security calls on Mondays at 12 ET. Zoom: https://us02web.zoom.us/j/82557619633?pwd=SnhFdWNMRGM5VXFLVjZ0OHVXUlhXdz09 Agendas are listed at https://confluence.hl7.org/pages/viewpage.action?pageId=31690524
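
For illustration only, a resource carrying one of those sensitivity labels might look like this; the specific code, display text and system URI are my assumptions drawn from the linked value set, not from the IG text.

```python
# Hedged sketch: an Observation tagged with an HCS-style sensitivity label.
labelled_observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {"text": "example"},
    "meta": {
        "security": [
            {
                "system": "http://terminology.hl7.org/CodeSystem/v3-ActCode",
                "code": "HIV",        # sensitivity category from the linked value set
                "display": "HIV/AIDS information sensitivity",
            }
        ]
    },
}
```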

Jose Costa Teixeira (Mar 22 2021 at 01:09):

I think it makes more sense to tag the resource definition (profiles), not the instances. This is the use case for the Permission resource, which can accompany the resource instances if that is needed


Last updated: Apr 12 2022 at 19:14 UTC