Stream: implementers
Topic: Anonymization libraries
Abbie Watson (May 19 2017 at 21:31):
Hi,
So I'm currently working on some logging libraries, and am thinking about how apps log warnings, errors, and other info to the server and browser console logs. Many apps have a lot of debugging info floating around; and while best practice is to turn off logging in production, I've learned to anticipate that such things can get left on by devops teams for weeks on end while they collect devops statistics.
Anyhow... does anybody have any thoughts on anonymizing resources? My particular use case right now is interoperability as it pertains to resources getting written to a winston or loggly or splunk transport pipe; but it could just as easily apply to an inbound/outbound RESTful endpoint. Are there any libraries out there for anonymizing a Patient resource? Or flagging resources in a bundle as belonging to one of the ~17 PII types?
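To make it concrete, what I'm picturing is something along these lines (just a sketch; anonymizePatient and the field list are hypothetical, not an existing library):
```js
// Hypothetical sketch, not an existing library: strip the obvious direct
// identifiers from a FHIR Patient resource and truncate birthDate to the year.
const DIRECT_IDENTIFIER_FIELDS = ['identifier', 'name', 'telecom', 'address', 'photo', 'contact'];

function anonymizePatient(patient) {
  const scrubbed = JSON.parse(JSON.stringify(patient)); // don't mutate the original
  DIRECT_IDENTIFIER_FIELDS.forEach(function (field) {
    delete scrubbed[field];
  });
  if (scrubbed.birthDate) {
    scrubbed.birthDate = scrubbed.birthDate.substring(0, 4); // keep the year only
  }
  return scrubbed;
}
```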
Elliot Silver (May 19 2017 at 22:38):
I have some (somewhat random and contradictory) thoughts on this.
- I'd question why you need to write out PII in the first place: you could probably get by logging the resource id rather than the actual resource content (see the sketch after this list).
- IHE ATNA and DICOM have done a pretty good job of defining useful audit records that do not disclose unnecessary PII, and you might be able to find inspiration there.
- There is the question of whether the logs need to be anonymized: if the admin who can see the logs can also browse the whole datastore, there isn't much to be gained from anonymizing the logs.
- Why are you logging telemetry/statistics to the same file as error logging? If you separate them, you can pass around the telemetry without worrying about accidental disclosure.
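For example, something like this is usually enough for debugging. This is a rough sketch assuming winston 3's createLogger API; the event name and sample resource are made up:
```js
const winston = require('winston');

const logger = winston.createLogger({
  transports: [new winston.transports.Console()]
});

// Log a reference to the resource, not the resource itself.
// "Patient/123" tells you what was touched without disclosing its contents.
function logResourceEvent(event, resource) {
  logger.info(event, { reference: resource.resourceType + '/' + resource.id });
}

logResourceEvent('patient-updated', { resourceType: 'Patient', id: '123' });
```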
John Moehrke (May 21 2017 at 18:52):
@Abigail Watson anonymization is not a function; it is a process. That is, it is a process of determining how to apply algorithms to data such that the resulting data is both useful for some defined purpose and also privacy-preserving. The resulting anonymization algorithms do not bring risk to zero; they bring the risk to an acceptable level based on the controls applied to that resulting data set. There are standards. See https://healthcaresecprivacy.blogspot.com/p/topics.html#DEID
John Moehrke (May 21 2017 at 18:55):
That said, we could apply a first-level analysis to FHIR, identifying which elements are direct identifiers, quasi-identifiers, or non-identifying data. I would be happy to do this, but would not want to do it as a purely academic exercise. The exercise needs to be done in context, and often needs a good bit of history behind the decisions. For example, DICOM has done this, and the result is a set of de-identification profiles. These profiles do not bring risk to zero, but they are re-usable patterns that work for specific types of use with specific types of target protection environments. Something similar could be done with FHIR, but not until we have far more experience with FHIR.
John Moehrke (May 21 2017 at 18:57):
To your question about libraries... I will point out that, given the experience in DICOM and its profiles, there are libraries. They, however, only do part of the work. There is still significant residual risk to be managed, and risk calculations to be done.
Abbie Watson (May 21 2017 at 20:03):
@Elliot Silver - No worries. I have various contradictory thoughts on this also. I was asked to write a follow-up article to this blog post, and am sorting out my own thoughts on the matter.
https://www.loggly.com/blog/pii-logs-managing-log-data-loggly-fluentd/
I agree with most of the things you're saying, and think that having separate logging channels/transports for HIPAA and devops data makes a lot of sense. I'm also thinking about cloud hosting and software-as-a-service devops use cases. As John points out, there is residual risk to be managed with SaaS and cloud hosting, which is where anonymization libraries might be really useful. Being able to put an interface into an anonymized mode seems like it could be extremely useful across many use cases.
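Something like this is what I mean by an anonymized mode. This is a sketch only, assuming winston 3's custom-format API, with the hypothetical anonymizePatient helper from above doing the scrubbing:
```js
const winston = require('winston');

// Custom format that scrubs any Patient resource attached to a log entry
// before it reaches a transport.
const scrubResources = winston.format(function (info) {
  if (info.resource && info.resource.resourceType === 'Patient') {
    info.resource = anonymizePatient(info.resource); // hypothetical helper from above
  }
  return info;
});

// Separate channels: devops telemetry goes straight to the console, while
// anything that might carry PHI passes through the scrubbing format on its
// way to its own transport.
const devopsLogger = winston.createLogger({
  transports: [new winston.transports.Console()]
});

const auditLogger = winston.createLogger({
  format: winston.format.combine(scrubResources(), winston.format.json()),
  transports: [new winston.transports.File({ filename: 'audit.log' })]
});
```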
@John Moehrke - Yes! Great blog! This is the type of stuff I was looking for! Thank you!
Yeah, developing an inventory of all the PII direct and quasi-identifiers across the FHIR API could be really useful. I'd go further and suggest that it would be really useful to add it directly to the schema definitions. We're already tracking cardinality, field requirement, and data type... tracking PII identifiers seems like it would be exactly within the purview of the FHIR API.
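Something like this is the shape I have in mind, sitting alongside the cardinality and type metadata (completely hypothetical; nothing like this exists in the spec today):
```js
// Hypothetical PII classification for a few Patient elements, using
// John's direct / quasi / non-identifying split.
const patientPiiProfile = {
  'Patient.identifier':    'direct',
  'Patient.name':          'direct',
  'Patient.telecom':       'direct',
  'Patient.address':       'direct',
  'Patient.birthDate':     'quasi',
  'Patient.gender':        'quasi',
  'Patient.maritalStatus': 'quasi',
  'Patient.active':        'non-identifying'
};
```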
I'd be happy to help work on this. I've mentioned before that we use isomorphic JavaScript, and wind up using the same libraries on database, server, and client. So for us, an anonymization library would get really great reuse and would be high-value. We could use it on the server for inbound or outbound APIs, in logging functions for our devops teams, on the database in map/reduce functions and analytic reports, on the client UI to improve HIPAA compliance, etc. So for us, it's a no-brainer. And like you describe in the conversation around deidentify=true, we'd use it in conjunction with an Access Control Service.
Taking an inventory of the ~17 PII types across the FHIR API seems like a great place to start. Given my own resources, I'd probably start with the ~20 resources supported by both Epic and Cerner.
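For reference, the list I keep coming back to is the HIPAA Safe Harbor one (usually counted as 18 categories, paraphrased here from 45 CFR 164.514(b)(2)):
```js
// Starting checklist for the inventory: the HIPAA Safe Harbor categories.
const SAFE_HARBOR_CATEGORIES = [
  'names',
  'geographic subdivisions smaller than a state',
  'dates (other than year) related to an individual',
  'telephone numbers',
  'fax numbers',
  'email addresses',
  'social security numbers',
  'medical record numbers',
  'health plan beneficiary numbers',
  'account numbers',
  'certificate or license numbers',
  'vehicle identifiers and serial numbers, including license plates',
  'device identifiers and serial numbers',
  'web URLs',
  'IP addresses',
  'biometric identifiers, including finger and voice prints',
  'full-face photographs and comparable images',
  'any other unique identifying number, characteristic, or code'
];
```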
John Moehrke (May 22 2017 at 13:00):
I don't expect it will be all that helpful, as just about everything will be a quasi-identifier, and those all need special handling. Direct identifiers are rather clear, and must be removed. Non-identifying data elements are very few. Most elements are quasi-identifiers... That said, tagging all elements this way seems like a useful thing to do in the RDF flavor (Turtle?). Not sure if this is right, but it seems right.