Stream: genomics
Topic: CodeSystem URIs for genomics
Dora Walter (Oct 04 2018 at 11:48):
:thumbs_up: :thumbs_up:
Patrick Werner (Oct 04 2018 at 16:04):
Starting the work on URIs for all needed CodeSystems, I was wondering whether the existing URIs at http://build.fhir.org/terminologies-systems.html (under "Code Systems for Genetics") were created following the guidance on how to define URIs for CodeSystems: http://build.fhir.org/terminologies.html#system
Patrick Werner (Oct 04 2018 at 16:05):
Were these URIs defined by the publisher of the systems?
Kevin Power (Oct 04 2018 at 16:07):
I am not sure, maybe @Gil Alterovitz knows? Also CC: @James Jones
Patrick Werner (Oct 04 2018 at 16:07):
If not i would try to find out if it is possible to change them to a http://hl7.org/fhir namespace following best-practice guidance given yesterday by @Lloyd McKenzie
Patrick Werner (Oct 04 2018 at 16:08):
I know this is a very sensitive topic as it would break systems already using these CodingSystems
Kevin Power (Oct 04 2018 at 16:10):
Personally, I am not concerned about passivity for the genetic specific code systems. Perhaps we should be concerned.
Patrick Werner (Oct 04 2018 at 16:10):
The working doc @Julian Sass and myself are working on today : https://docs.google.com/spreadsheets/d/14sIVVED2rI0Ab2ZQ3oV6v8njzX-sRezkIWcgkkepUUg/edit?usp=sharing
Patrick Werner (Oct 04 2018 at 16:14):
As the CodeSystem URI list is under maturity level 3 and carries the following paragraph: "This list of names is incomplete and subject to change. Some values may be dropped and others will likely be added in the coming months as HL7 institutes formal processes around URIs in vocabulary", I would try to change them and align them with the guidance on how to define CodeSystem URIs.
Jamie Jones (Oct 04 2018 at 16:37):
I know Gil has some perspective with these internal code systems; I'll go over the notes from the sessions we missed yesterday for the tutorial and ask for some guidance here.
Kevin Power (Oct 04 2018 at 16:38):
To our conversation yesterday - we should consider if the genetic specific ones even belong in FHIR core or should only be contained in our IG
Kevin Power (Oct 04 2018 at 16:58):
For anyone who wants an example, the Breast Cancer Data IG has defined their own ValueSets and CodeSystems (there are other examples as well)
Patrick Werner (Oct 04 2018 at 17:57):
This is definitely an important question. I just want to point to the difference, that Breast Cancer is US Realm and we are universal.
Yesterday Lloyd was in favor of putting our systems into core, with the intention that everyone uses the same URIs (which makes sense to me).
As the page we are talking about belongs to Vocab, ultimately it's up to them to decide, I think.
Kevin Power (Oct 04 2018 at 18:05):
Likely a stupid question ( but that has never stopped me before ) - in our case, what is the difference between the systems listed in Core versus a universal IG? If our IG is universal, wouldn't we end up with one URI anyway? Or does having them in Core give us some advantage? I can see how they would likely be more visible - but are there other reasons?
Lloyd McKenzie (Oct 04 2018 at 18:10):
URIs that appear in the core spec are binding on everyone. That's the only authoritative HL7 list for code system URLs
Patrick Werner (Oct 04 2018 at 18:10):
The reason is the visibility and that the URIs are defined at one place. Also the scope of the URI in core would be universal, so other usecases/WGs using the CodeSystem are using the same URI.
If it is defined inside of an IG then the scope is the IG and other IGs could use other URIs
Patrick Werner (Oct 04 2018 at 18:11):
oh. Lloyd was faster :grinning:
Kevin Power (Oct 04 2018 at 18:15):
OK, fair enough.
Patrick Werner (Oct 04 2018 at 18:47):
After a short discussion with Grahame: https://chat.fhir.org/#narrow/stream/48-terminology/subject/Changing.20CodeSystem.20URIs we came to the conclusion that it is dangerous to change URIs. For errata we can change the Source
Patrick Werner (Oct 04 2018 at 18:53):
e.g.: http://www.ncbi.nlm.nih.gov/nuccore REFSEQ : National Center for Biotechnology Information (NCBI) Reference Sequences
changing it to http://www.ncbi.nlm.nih.gov/refseq isn't possible, but changing the Source to something containing Nucleotides is possible
Kevin Power (Oct 04 2018 at 18:56):
Seems reasonable - I think we just need to catalog all of this on your spreadsheet (what we have now, what we need to change, what we need to add)?
Patrick Werner (Oct 04 2018 at 18:56):
@Julian Sass already discovered the first issue. nuccore contains codes from different source CodeSystems. So it is rather a ValueSet, containing multiple CodeSystems like: NCBI RefSeq IDs, GenBank IDs ....
Patrick Werner (Oct 04 2018 at 18:56):
"Seems reasonable - I think we just need to catalog all of this on your spreadsheet (what we have now, what we need to change, what we need to add)?"
yes. We are on it here: https://docs.google.com/spreadsheets/d/14sIVVED2rI0Ab2ZQ3oV6v8njzX-sRezkIWcgkkepUUg/edit?usp=sharing
Kevin Power (Oct 04 2018 at 19:13):
and we don't have RefSeq and GenBank (nor any of the other source databases) as code systems yet, right?
DDBJ
EMBL
GenBank
INSDC (GenBank)
PDB
PIR
RefSeq
UniProtKB / Swiss-Prot
Kevin Power (Oct 04 2018 at 19:15):
Is there some point at which we should just treat nuccore as a Code System, and perhaps document that it is fed by many databases? I am a bit concerned that if we take the pure approach here, we will have trouble finding every code system that might be involved.
Kevin Power (Oct 04 2018 at 19:25):
nuccore seems to treat the codes from each of the other databases as its own "code" as well -
Genbank MF167456.1
RefSeq NG_016465.4
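A minimal sketch of the point above, assuming the nuccore URI quoted earlier in the thread is used as the Coding system. The function names are hypothetical, and the accession check is only a heuristic based on the RefSeq convention of a two-letter prefix followed by an underscore; it is not an official parser.

```python
import re

# System URI quoted earlier in this thread for nuccore.
NUCCORE_SYSTEM = "http://www.ncbi.nlm.nih.gov/nuccore"

# Heuristic: RefSeq accessions start with two uppercase letters and an underscore,
# optionally followed by ".version". GenBank accessions carry no underscore.
_REFSEQ_LIKE = re.compile(r"^[A-Z]{2}_[A-Z0-9]+(?:\.\d+)?$")


def guess_source_database(accession):
    """Guess which source database an accession comes from (heuristic only)."""
    return "RefSeq" if _REFSEQ_LIKE.match(accession) else "other (e.g. GenBank)"


def nuccore_coding(accession):
    """Build a FHIR Coding (as a dict) that uses the single nuccore system URI."""
    return {"system": NUCCORE_SYSTEM, "code": accession}


for accession in ["MF167456.1", "NG_016465.4"]:
    print(guess_source_database(accession), nuccore_coding(accession))
```

Both accessions end up under the same system URI even though they originate from different databases, which is the ambiguity being discussed.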
Patrick Werner (Oct 04 2018 at 19:38):
We shouldn't treat it as a single CodeSystem, as it isn't one and we can't. The versions of the source databases are independent, with individual update cycles. In the next step we can define a ValueSet containing all codes from all CodeSystems in this area.
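A rough sketch of what such a ValueSet could look like, written as a Python dict that mirrors the FHIR JSON. The `example.org` URIs and the `build_value_set` helper are placeholders invented for illustration; no URIs for RefSeq or GenBank had been agreed at this point in the thread.

```python
import json

# Hypothetical system URIs; the real ones were still being catalogued.
REFSEQ_SYSTEM = "http://example.org/fhir/CodeSystem/refseq"
GENBANK_SYSTEM = "http://example.org/fhir/CodeSystem/genbank"


def build_value_set(url, name, systems):
    """Build a ValueSet that includes every code from each listed code system."""
    return {
        "resourceType": "ValueSet",
        "url": url,
        "name": name,
        "status": "draft",
        # One compose.include entry per source code system, so each system
        # keeps its own identity and can be versioned independently.
        "compose": {"include": [{"system": system} for system in systems]},
    }


nucleotide_vs = build_value_set(
    "http://example.org/fhir/ValueSet/nucleotide-sequences",
    "NucleotideSequences",
    [REFSEQ_SYSTEM, GENBANK_SYSTEM],
)
print(json.dumps(nucleotide_vs, indent=2))
```

Keeping one `include` per source system leaves room to pin a `version` on each entry later, which fits the independent update cycles mentioned above.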
Kevin Power (Oct 04 2018 at 19:48):
Yea, I know dang it. Was just taking a shot at keeping it simple. Thanks for keeping me honest.
Patrick Werner (Oct 04 2018 at 20:39):
It gets even more complicated. Looking at https://www.ebi.ac.uk/ipd/imgt/hla/ i'm not sure if this qualifies as a CodeSystem. The underlying nomenclature: http://hla.alleles.org/nomenclature/naming.html certainly does (in reality it is rather a grammar, but it is the same with UCUM so this seems to be ok as you can understand a grammar as a complex post-coordinated CodeSystem).
But is IPD-IMGT/HLA a CodeSystem or a Database - or both?
Kevin Power (Oct 04 2018 at 20:45):
Why wouldn't it qualify as a Code System? I think they recently started publishing the allele database on GitHub.
I wonder how that nomenclature/naming reference is different from gl-string? CC: @Bob Milius @Joel Schneider
Patrick Werner (Oct 04 2018 at 21:47):
after looking into Github i agree. I now understand it as a defined list, using the HLA naming nomenclature. Will give this some more thought together with @Julian Sass . Differentiating the different CodeSystems and deciding what is a CodeSystem, what could be ValueSets and not mixing things up is quite complicated in our domain. We are documenting our decisions also in the table so we can discuss this in a call.
Joel Schneider (Oct 04 2018 at 22:13):
@Kevin Power The HLA nomenclature defines a code system for recognized HLA alleles, serotypes, proteins, etc. The GL String code (GLSC) code system combines the official nomenclature with several related code systems, and adds a robust grammar. This is needed for describing nomenclature-level HLA typing results/observations, which typically (unless homozygous) consist of two HLA alleles per gene, and may also be ambiguous.
Kevin Power (Oct 04 2018 at 23:00):
Thanks @Joel Schneider - So, this just talks about the naming structure of an allele, and GL String talks about pulling together the two HLA alleles per gene, and expressing the possible ambiguities and such. Yes? I will understand HLA someday, I promise. :confused:
Joel Schneider (Oct 04 2018 at 23:04):
Correct.
Andrew Patterson (Oct 06 2018 at 07:43):
Can we also document (in the same spreadsheet?) the preferred actual code formatting? So a lot of them are prefixed with a qualifier in the ontology source data - that is a proxy for the code system. So we need to make a decision if that prefix stays in the actual code values.
e.g from disease ontology - taking "bacterial sepsis"
DOID:0040085 (as per their website)
DOID_0040085 (seen some people replacing colons with underscore for non FHIR uses)
0040085 (no need for a prefix as that is what the role of our codesystem in FHIR is for but leave in the leading zeros)
40085 (take out the leading zeros and say the code is an integer - I am seeing a mixture of codes with and without leading zeros _within_ the same ontology, so do we normalise?)
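To make the four spellings above concrete, here is a small sketch, assuming the "DOID:" prefix form is the target. The function name is hypothetical. It only repairs the prefix and separator; it deliberately does not pad digits, because a stripped leading zero cannot be restored reliably when the ontology itself mixes both styles.

```python
import re

# Accepts "DOID:0040085", "DOID_0040085", or a bare digit string.
_DOID_VARIANT = re.compile(r"^(?:DOID[:_])?(\d+)$")


def with_official_prefix(raw):
    """Return the code with a 'DOID:' prefix, or None if the input is not recognizable.

    Digits are kept exactly as given; no zero-padding is attempted.
    """
    match = _DOID_VARIANT.match(raw.strip())
    if match is None:
        return None
    return "DOID:" + match.group(1)


for spelling in ["DOID:0040085", "DOID_0040085", "0040085", "40085"]:
    print(spelling, "->", with_official_prefix(spelling))
```

The first three spellings map to the same string, while "40085" becomes "DOID:40085", a different code string. That is the cost of dropping leading zeros.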
Andrew Patterson (Oct 06 2018 at 07:45):
The same questions for HGNC and HPO (is HPO on the list of terminologies for genomics btw?) and probably others I haven't looked at yet
Kevin Power (Oct 06 2018 at 16:39):
I think that is a great idea. I too have seen some inconsistencies in the wild.
Patrick Werner (Oct 08 2018 at 08:57):
I would stay with the officially published format of a code. So in this case: DOID:0040085. We can give some guidance about the proper format of a valid code (maybe we need a "How to utilize CodeSystems in Genomics" section in the IG), but the decision whether a code is valid or not is made by the Terminology Server/Service used in an implementation context.
This service can then decide to accept all possible (not official) representations of a code, or only the official one, or accepting other formats with warnings.
For disease ontology: I don't understand their replacement of ":" with "_" in their JSON representation. Colons are valid in JSON Strings, but some people think they aren't.
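The three service behaviours described above (accept everything, accept only the official form, accept other forms with a warning) could be sketched roughly as follows. `KNOWN_CODES` is a stand-in for the content a real terminology server would hold, and `validate_code` and `ValidationResult` are hypothetical names, not part of any FHIR library.

```python
from dataclasses import dataclass

# Stand-in for the codes a terminology service knows for Disease Ontology.
KNOWN_CODES = {"DOID:0040085"}


@dataclass
class ValidationResult:
    valid: bool
    severity: str  # "ok", "warning", or "error"
    message: str


def validate_code(code, strict=False):
    """Validate a code; in lenient mode, unofficial spellings pass with a warning."""
    if code in KNOWN_CODES:
        return ValidationResult(True, "ok", "official form")

    # Try to map an unofficial spelling ("DOID_...", bare digits) to the official one.
    candidate = code.replace("_", ":", 1)
    if not candidate.startswith("DOID:"):
        candidate = "DOID:" + candidate

    if candidate in KNOWN_CODES:
        if strict:
            return ValidationResult(False, "error", f"use official form {candidate}")
        return ValidationResult(True, "warning", f"accepted, official form is {candidate}")
    return ValidationResult(False, "error", "unknown code")


print(validate_code("DOID_0040085"))
print(validate_code("DOID_0040085", strict=True))
```

Whether lenient or strict mode is appropriate is left to the implementation context, as stated above.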
Robert McClure (Oct 10 2018 at 20:23):
@Patrick Werner I agree. Code system authorities can sometimes muddy the water but the point of a code system is that someone is the authority and the rest of us need to play along. I looked at the DO site and agree that they seem to prepend all the codes in the code system with "DOID:" and they also have zeros that follow that colon, so the entire string DOID:0040085 seems to be the code. And yes, we need to allow that there are "alternative" codes for a concept even within the same code system. That is not the same as a mapping to another code system such as SNOMED CT and that really would be a property of the concept if the mapping is maintained at a concept level. I see that unfortunately DO seems to use the exact same structure to identify these mappings - code system identifier:ID, but that should be seen as a shorthand for what a FHIR code system resource should do.
Bob Dolin (Oct 16 2018 at 16:06):
@Patrick Werner Patrick, here are a couple locus-specific databases that I was thinking about during today's call: http://www.hgvs.org/locus-specific-mutation-databases; http://www.lovd.nl/3.0/home. It seems as though we need to define a mechanism to communicate identifiers from these databases, just as we need to communicate identifiers from, say, ClinVar.
Patrick Werner (Oct 16 2018 at 16:23):
Thanks for pointing us in the right direction here. I think we should define URIs for all CodeSystems used in Genomics; if we don't do that, everyone will use their own system identifier.
Patrick Werner (Oct 16 2018 at 16:26):
If someone wants to communicate something from a system which is not yet defined in the core spec or inside our IG, they can "invent" an identifier. But I get the point that this field is moving rapidly, so we would need some guidance for implementers: if they want to use a new CodeSystem, give them best practice for naming the system and for telling us about it so that we can include the identifier in our IG.
Patrick Werner (Oct 16 2018 at 16:28):
Lloyd mentioned that it is best to get the CodeSystem URI into the Core Spec for better visibility to implementers. I agree with that argument. I'm only a little concerned about whether Vocab would want a huge list of genomics-specific systems, and whether they would want to update this page frequently.
Maybe these locus specific Databases should be mentioned or defined only inside the IG?
Bob Dolin (Oct 16 2018 at 16:37):
Yes, that makes sense, to mention them in the IG and/or provide guidance for implementers, since as Bret mentioned it will be hard to enumerate them all and keep the enumeration up to date
Kevin Power (Oct 16 2018 at 17:14):
Outside of genomics, how are "lab specific codes" handled? Are labs with specific codes required to register themselves as a code system?
Lloyd McKenzie (Oct 16 2018 at 18:36):
Registration is where multiple independent applications will be capturing data using that code system. If the only application that assigns codes using the lab's internal code system is the lab, no need to register it.
Lloyd McKenzie (Oct 16 2018 at 18:36):
(Though it'll still need a URL.)
Joel Schneider (Oct 16 2018 at 19:04):
FYI, vocab has an open task to move the tracking of external code system URIs out of the core spec, and into some kind of registry.
Kevin Power (Oct 16 2018 at 19:35):
So, by "need a URL" - do we mean any lab can send "https://mylab.com/" stamped on their own Code and call it good? Or it has to be registered somewhere or it will be rejected? Up to the server receiving the data?
Grahame Grieve (Oct 16 2018 at 19:38):
labs need to send their own codes. Any lab can assign their own URL to their own codes and go ahead and use them. That's legal. Which doesn't mean it will be accepted by a particular target, who might say: no unknown codes. Or they might say: you can send us unknown codes as long as you also send us a known code. Or they should say: you can send us unknown codes as long as you send us text (CodeableConcept.text) too
Lloyd McKenzie (Oct 16 2018 at 19:39):
Saying "no unknown codes" is super-bad form. So long as the sender sends you a coding you recognize, there's no reason to complain about other codings you don't recognize.
Lloyd McKenzie (Oct 16 2018 at 19:39):
Where there might be an issue is if you only send unknown codes.
Grahame Grieve (Oct 16 2018 at 19:40):
even that is bad form (not super-bad). In lab data particularly, it's saying: I won't accept any lab report where you make an observation I haven't heard about. That's a really bad idea
Lloyd McKenzie (Oct 16 2018 at 19:43):
That depends on context. If you're submitting "reportable lab results", it's reason to yell if the test isn't on the "reportable" list. But yes, in general you'll need to allow for tests you don't recognize. Yelling about non-recognized codes if there is a recognized code is bad pretty much regardless of context.
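A minimal sketch of the receiver-side behaviour argued for in this exchange, assuming a CodeableConcept arrives as a plain dict. The lab URL, the local code, the `review_codeable_concept` name, and the three outcome labels are all invented for illustration; only the LOINC system URI and `CodeableConcept.text` come from FHIR itself.

```python
# Code systems this hypothetical receiver recognizes.
KNOWN_SYSTEMS = {"http://loinc.org"}

# Hypothetical URL a lab assigned to its own local codes.
LAB_SYSTEM = "https://mylab.example/codes"


def review_codeable_concept(concept, known_systems=KNOWN_SYSTEMS):
    """Decide how to treat a CodeableConcept without rejecting unknown codings."""
    codings = concept.get("coding", [])
    if any(c.get("system") in known_systems for c in codings):
        # A recognized coding is present; extra unknown codings are simply ignored.
        return "accept"
    if concept.get("text"):
        # Only unknown codings, but the human-readable text is still usable.
        return "accept-on-text"
    # Nothing recognizable: flag rather than reject outright.
    return "flag-for-review"


both = {
    "coding": [
        {"system": LAB_SYSTEM, "code": "LOCAL-123"},
        {"system": "http://loinc.org", "code": "718-7"},
    ],
    "text": "Hemoglobin",
}
local_only = {"coding": [{"system": LAB_SYSTEM, "code": "LOCAL-123"}], "text": "In-house panel"}
bare = {"coding": [{"system": LAB_SYSTEM, "code": "LOCAL-123"}]}

for concept in (both, local_only, bare):
    print(review_codeable_concept(concept))
```

A context such as reportable lab results might tighten the last two branches, as noted above, but the first branch should hold regardless of context.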
Kevin Power (Oct 16 2018 at 19:44):
I think this is where these Locus Specific Databases fit (sorry for the lab code analogy, but it works I think) - there are many of these specific databases that might pop up, and if the lab chooses to send a code from one of them with OR without a known code (say ClinVar in our case), it seems we should do our best to make that an expectation for those implementing our IG.
Lloyd McKenzie (Oct 16 2018 at 19:48):
The trick with a Locus-specific database is that if you have 10 different labs using it, we need them all to use the same URL for codes from that database - and that sort of coordination typically requires registration.
Kevin Power (Oct 16 2018 at 19:59):
Fair point. We will just have trouble scaling that up until the registration process is outside of the core spec.
Grahame Grieve (Oct 16 2018 at 20:00):
sounds like what UTG is for
Kevin Power (Oct 16 2018 at 20:01):
UTG++
Lloyd McKenzie (Oct 16 2018 at 20:05):
Registration process is outside the core spec. Registration publication needs to be too I guess...
Last updated: Apr 12 2022 at 19:14 UTC