Stream: terminology
Topic: Using a hash algorithm for valueset.expansion.identifier
Robert McClure (Aug 13 2018 at 17:09):
I've been considering the value of suggesting the use of a hash algorithm applied to the output of a "standardized" value set expansion resource as a way of creating a business identifier for value set expansions (valueset.expansion.identifier) that can be used as a quick way to see if something in the expansion has changed. I'm interested in this because we have a number of implementers in the C-CDA and probably the USA eCQM space that would like to have an easy, but reliable, way to tell if the expansion has changed. Obviously using a hash would not say anything about what changed, just that something did. Also, this would mean that for a terminology service to support such a thing it would have to declare "a standard set of included elements" in the expansion. I'm thinking that this standard expansion would then be fed to a hash algorithm to generate a hash key. I've tried MD5 and it seems to work. But I'm not very familiar with hashes and would like to get others to give me some feedback. Yes, I would assume this would be server-specific, which is fine for now, and obviously dependent on format and everything else.
Interested in thoughts on this @Lloyd McKenzie , @Grahame Grieve , @Peter Jordan , @Michael Lawley , @Jack Bowie , @Rob Hausam
Michael Lawley (Aug 13 2018 at 22:09):
My first response is to wonder what the scope of change is. Order of codes, display text, designation text, time stamp of the expansion, etc. I'm guessing it will vary for the use case?
Also, are we talking about the actual expansion, its serialisation as JSON or XML (now white space becomes an issue), or the logical (theoretical) expansion?
Also, can you allow false positives? That is, a hash mismatch when nothing has changed is acceptable, but the hash must never match if there is a change?
Is this needed to work across terminology servers? If not, then why can't you just hash the json/XML output of your TS of choice, avoiding timestamped fields. I would expect a single server to have a stable set of elements.
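A minimal sketch of hashing a single server's JSON output while avoiding timestamped fields, as suggested above. The set of fields to skip (`VOLATILE_FIELDS`) and the function name are assumptions for illustration, not anything agreed in this thread, and SHA-256 stands in for the MD5 that was tried. Serialising with sorted keys and fixed separators sidesteps the white-space and key-order issue; the order of codes in `contains` is deliberately left significant.

```python
import hashlib
import json

# Hypothetical list of expansion-level fields that change on every request
# and therefore must not feed the hash. A real profile would have to agree on it.
VOLATILE_FIELDS = {"timestamp", "identifier"}


def expansion_fingerprint(value_set: dict) -> str:
    """Return a hex digest of the expansion, ignoring volatile top-level fields."""
    expansion = value_set.get("expansion", {})
    stable = {k: v for k, v in expansion.items() if k not in VOLATILE_FIELDS}
    # Canonical form: sorted keys, no insignificant white space.
    # List order (e.g. the order of codes) is preserved and so affects the hash.
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two expansions that differ only in `timestamp` give the same fingerprint; a changed display string or an added code gives a different one.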
Grahame Grieve (Aug 13 2018 at 23:26):
I do this internally - if the hash of the set of conditions for an expansion matches the hash of the set of conditions for an expansion I've already returned (and still remember) then I'll return the previous one.
Grahame Grieve (Aug 13 2018 at 23:27):
Note that this is inside the terminology where I have a high degree of confidence that the conditions are all accounted for, and the hash doesn't miss anything
Robert McClure (Aug 14 2018 at 00:07):
@Michael Lawley Not across terminology servers as I don't see the juice worth the squeeze. Yes, I was just hashing the JSON but you are right, we need to control for any date items.
@Grahame Grieve What specifically do you mean by "conditions for an expansion"?
I'm really interested in the content of an expansion including descriptions, code system version, etc.
Grahame Grieve (Aug 14 2018 at 00:20):
so I have an engine that generates expansions. There's a cache sitting right in front of it that takes all the parameters that are inputs to the engine, and hashes them. Because those are all the inputs to the engine, I don't need to cache the output
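A rough sketch of that arrangement, assuming the engine can be treated as a function of its parameters. The class name, the parameter names in the usage note and the choice of SHA-256 are all hypothetical; this is not the actual server code.

```python
import hashlib
import json


class ExpansionCache:
    """Cache in front of an expansion engine, keyed on a hash of the engine's inputs."""

    def __init__(self, engine):
        self._engine = engine  # callable: dict of parameters -> expansion
        self._store = {}

    @staticmethod
    def _key(params: dict) -> str:
        # Canonical JSON so that the same parameters always hash the same way.
        canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def expand(self, params: dict):
        key = self._key(params)
        if key not in self._store:
            # Only run the engine for a combination of inputs not seen before.
            self._store[key] = self._engine(params)
        return self._store[key]
```

For example, `ExpansionCache(engine).expand({"url": "http://example.org/vs", "filter": "abc"})` runs the engine once and returns the remembered result on a repeat call. This is only safe if `params` really does capture every input that can affect the result, including the versions of the code systems involved.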
Grahame Grieve (Aug 14 2018 at 00:20):
I'm not sure how much more explanation you would want?
Michael Lawley (Aug 14 2018 at 01:35):
We have essentially the same thing - it's what drives the ETag machinery
Paul Lynch (Aug 14 2018 at 16:03):
What about hash collisions? If you get a different hash value, you know something has changed, but if you get the same hash value, you can't be 100% sure nothing has changed.
Robert McClure (Aug 14 2018 at 16:13):
@Paul Lynch What I'm suggesting is this is done in the context of a single value set, so you are not comparing hash across all value sets. I'm not sure that specifically decreases the likelihood of a collision, but it certainly decreases the number of hashes to be compared. I'd be interested to know why collisions occur. If it's just a random occurrence over some large number of instances, I'd be fine assuming we are safe. This is not a patient safety issue with no human involved.
Robert McClure (Aug 14 2018 at 16:18):
@Grahame Grieve So if we did as I suggested - define a specific set of value set expansion elements in a specific format (JSON?) and then run that JSON artifact through a defined hash (perhaps we could all use the same specific implementation?) would this be similar to what you are doing? Again, I'm not saying this needs to be consistently implemented across servers, but it might be nice to have a consistent approach.
@Michael Lawley what is the ETag machinery?
Paul Lynch (Aug 14 2018 at 16:37):
@Robert McClure Hash collisions happen for any kind of hash function, simply because you are mapping a large number of bits into a smaller number of bits, which can't represent the same number of variations. Collisions should be rare, and you are unlikely to see one even with MD5. It is just a question of whether you can tolerate ever making a wrong conclusion that nothing has changed.
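To put a rough number on "rare", here is a small sketch using the standard birthday-bound approximation, p ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n distinct inputs and a b-bit hash. It assumes the hash behaves like a uniformly random function and only covers accidental collisions, not deliberately constructed ones (which are practical against MD5). The function name is an assumption for illustration.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_items distinct inputs share a hash."""
    exponent = -n_items * (n_items - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


# A million distinct expansions of one value set, 128-bit hash such as MD5.
p = collision_probability(10**6, 128)
```

With a 128-bit hash the result for a million inputs is far below anything that would matter in practice, whereas a 32-bit checksum over 100,000 inputs would more likely than not contain a collision.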
Lloyd McKenzie (Aug 14 2018 at 16:39):
There can be clinical consequences to a wrong answer. If I'm running a check "is patient's condition in value set x" and the answer is wrong, the decision support resulting could be wrong.
Lloyd McKenzie (Aug 14 2018 at 16:40):
Now, in practice, the likelihood is that the answer could be wrong for all sorts of other reasons too, so hopefully we have protections built in for that situation.
Michael Lawley (Aug 14 2018 at 22:13):
@Robert McClure when we return an expansion we include an ETag in the HTTP headers. A subsequent expansion request can provide this value in an If-None-Match header which allows us to return a 304 status code if there's no change. Note this is a general HTTP thing, not specific to ValueSet $expand and so applies to Read and Search operations as well.
The "ETag machinery" is the internal Ontoserver code that both generates and checks the ETag for a given request.
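A simplified sketch of the checking half of such a mechanism, written as a plain function rather than real server code. It is not Ontoserver's implementation; the function name and return shape are assumptions. It follows the general HTTP rule that If-None-Match is evaluated with weak comparison, so a leading `W/` is ignored.

```python
from typing import Optional, Tuple


def _opaque(tag: str) -> str:
    """Drop the weak-validator prefix so tags can be compared weakly."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def conditional_response(current_etag: str, if_none_match: Optional[str],
                         body: bytes) -> Tuple[int, dict, bytes]:
    """Return (status, headers, body) for a GET that may carry If-None-Match."""
    headers = {"ETag": current_etag}
    if if_none_match is not None:
        candidates = [_opaque(t) for t in if_none_match.split(",")]
        if "*" in candidates or _opaque(current_etag) in candidates:
            # The client's copy is still current: no body is sent.
            return 304, headers, b""
    return 200, headers, body
```

A client that stored the ETag from an earlier expansion sends it back in If-None-Match; a 304 means its copy can be reused, while a 200 carries a fresh expansion and a new ETag.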
Robert McClure (Aug 15 2018 at 00:48):
@Michael Lawley is the expectation that users would use this to see if something changed, or is this for internal functions to see if you need to regenerate something? I'm trying to get a sense of whether you all think a hash used as an expansion business id is something users could then check to determine if the expansion changed. I know this is something different from your and Grahame's use, but do you think it is another good use of a hash?
Michael Lawley (Aug 15 2018 at 01:09):
If you're using the FHIR API then ETags give you a standard mechanism to achieve the goal of determining if something has changed (with the caveat that a non-304 response doesn't imply change, only that there might have been change).
It's not quite as strong as your use case, I think, which would work for stand-alone resources but only if you know they were generated by the same TS.
Lloyd McKenzie (Aug 15 2018 at 01:11):
ETag tells you if the resource has changed. It doesn't tell you if the expansion (based on all the descendant versions) has changed.
Michael Lawley (Aug 15 2018 at 01:12):
Using a hash as you describe is not a bad idea, but I worry that it has the bad smell of statically generated and exchanged expansions (okay in some contexts, but not good for clinical data kind of uses)
Michael Lawley (Aug 15 2018 at 01:13):
It does in our case - the ETag derives from all versions of value sets and code systems used in the expansion
Michael Lawley (Aug 15 2018 at 01:14):
Or an approximation thereof (we might have an ETag mismatch when there's no change)
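One way such a derivation could look, as a sketch only: the ETag is computed from the (URL, version) pairs of every value set and code system consulted, not from the expansion content itself. The function name, the pair representation and the use of a truncated SHA-256 are assumptions, not a description of the actual server.

```python
import hashlib


def etag_from_versions(dependencies) -> str:
    """Build a weak ETag from (canonical_url, version) pairs used by an expansion."""
    # Sort so the tag does not depend on the order the resources were consulted in.
    joined = "\n".join(f"{url}|{version}" for url, version in sorted(dependencies))
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    return f'W/"{digest[:32]}"'
```

Because only versions are hashed, a new code system version that leaves the member codes untouched still changes the tag, which is the "mismatch when there's no change" case; the reverse error, a matching tag despite a real change, would require content to change without a version change.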
Robert McClure (Aug 16 2018 at 14:09):
@Michael Lawley I think the vast majority of users download, use and exchange a point-in-time expansion (a specific value set definition and a specific code system version.) I'm not sure if that was what you meant, but in my experience it's extremely common. Do you agree that in that use case, this hash idea within the context of a single TS would be useful?
Last updated: Apr 12 2022 at 19:14 UTC