Stream: genomics / eMerge Pilot
Topic: Genotype-Haplotype-Variant
Larry Babb (May 08 2019 at 00:32):
Not 100% clear on how to deal with representing a diplotype like CYP2C19 *2/*3. We need to represent the actual set of variants that comprise the individual haplotypes, in this case a CYP2C19 *2 haplotype and a CYP2C19 *3 haplotype. For brevity's sake, let's say that there are 3 locations along the CYP2C19 gene sequence that are actually observed and used to determine the in-cis changes from which the *2 and *3 haplotypes are derived. Since there are 3 positions on each copy of the gene, we would presumably need to represent 6 variant findings (3 for the *2 haplotype and 3 for the *3 haplotype).
I was going to simply list the unique variants (including any that are the same as reference and needed to define the haplotypes). I was going to leave out the zygosity of these variants so they would not have any phase information. However, these types of variant findings are not really sensible in that you might have the same variant be referenced by both haplotypes. In actuality these variants are different since they are on different alleles, and this may cause confusion or concern.
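To make the counting concrete, here is a minimal Python sketch of the two haplotypes as per-site calls. The site names and the ref/alt calls are placeholders chosen for illustration, not real CYP2C19 positions or star-allele definitions.

```python
# Hypothetical per-site calls for each gene copy; site names and alleles are
# placeholders, not real CYP2C19 star-allele definitions.
haplotype_calls = {
    "*2": {"site1": "alt", "site2": "ref", "site3": "ref"},
    "*3": {"site1": "ref", "site2": "alt", "site3": "ref"},
}

# One finding per (haplotype, site): 2 copies x 3 sites = 6 findings.
findings = [
    (hap, site, allele)
    for hap, calls in haplotype_calls.items()
    for site, allele in calls.items()
]

# Dropping the haplotype (and with it the phase) collapses identical calls.
unique_variants = {(site, allele) for _, site, allele in findings}

# Calls that both haplotypes would have to reference once phase is dropped.
shared = {
    (site, allele)
    for site, allele in unique_variants
    if all(calls[site] == allele for calls in haplotype_calls.values())
}

print(len(findings), len(unique_variants), sorted(shared))
```

With these placeholder calls, the six findings collapse to five unique variants, and the one shared reference call is exactly the kind of variant that both haplotypes would end up pointing at.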
If the variant were purely a definitional representation rather than an observational one, then associating it with the haplotype would give it its observational context, and the reuse would make sense. But we do not have the ability to define variants, only variant observations.
If we share the variants with their zygosity (or the genotype form of the variant) then it becomes confusing in referencing it from the haplotype as it wouldn't be possible to directly reference the alternate variant in this heterozygous form.
I realize this is tricky to communicate in this form, but the issue is that it seems difficult or impossible to accurately use the genotype->haplotype->variant structure without referencing the same variant for both copies of a genotype. But I may be overthinking it.
Anyone who has tried this, please share your thoughts.
Bob Dolin (May 08 2019 at 01:38):
Hi @Larry Babb ,
We haven't been using the derivedFrom link between Haplotype and Variant. We do use the SequencePhaseRelationship profile to assert CIS or TRANS relationships between variants where we have that information from the VCF file. It may be easier to just report haplotypes and variants independently, ignoring derivedFrom, especially since haplotypes are generally a best guess anyhow, based on a variety of algorithms (e.g. PharmCAT, population frequency-based). I'm not exactly sure how anyone would use the derivedFrom information, and I personally feel that Haplotype.method is more important. If haplotypes are directly observed from long reads, then they would be derivedFrom the MolecularSequence.
Larry Babb (May 08 2019 at 10:17):
Thanks @Bob Dolin for the quick response. This is an interesting perspective. I'm going to check with the two eMERGE sequencing centers whether they would find it helpful or useful to send only star allele representations, without the supporting detail needed to represent them more accurately. I do recognize that it would be easier to do so, but I think the challenge is to send information that is computationally useful. I've been running under the assumption that the star allele calls by themselves may be helpful, but what they really mean can differ significantly from lab to lab.
I may end up following your suggestion, but I will find out from the labs if simply stating the star allele name is good enough for CDS without the backing information to verify it or its precision. I'll let you know what they say.
In any case, I still have an issue with how to use the Genotype derivedFrom Haplotype relationship. And I think the concern would still apply if I used the SequencePhaseRelationship profile as you suggested.
The point is that each Genotype, Haplotype (and Variant) is based on the notion of an observation, so they each represent an instance of an observed Genotype, Haplotype, etc. So if we find that a patient is *2/*2 (homozygous, 2 copies of *2), do I create two separate Haplotype observations, one for each instance of *2, or do I create one and refer to it twice from the derivedFrom attribute of the Genotype? Both of these options seem wrong to me. With the first, it would be unclear which Haplotype observation was which, and it might look like errant reporting. With the second, a homozygous genotype of *2/*2 shouldn't point to the same Haplotype more than once.
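The two options can be sketched with plain Python dicts shaped loosely like FHIR Observations. This is only an illustration: the resource ids are made up, the helper functions are hypothetical, and the dicts omit nearly everything a real Genotype or Haplotype profile instance would require.

```python
def haplotype_obs(obs_id, star_allele):
    """Minimal stand-in for a Haplotype Observation (not a complete resource)."""
    return {
        "resourceType": "Observation",
        "id": obs_id,
        "valueCodeableConcept": {"text": star_allele},
    }


def genotype_obs(obs_id, diplotype, haplotype_ids):
    """Minimal stand-in for a Genotype Observation derived from haplotypes."""
    return {
        "resourceType": "Observation",
        "id": obs_id,
        "valueCodeableConcept": {"text": diplotype},
        "derivedFrom": [{"reference": f"Observation/{i}"} for i in haplotype_ids],
    }


# Option 1: two separate but identical Haplotype observations (made-up ids).
hap_a = haplotype_obs("hap-a", "CYP2C19*2")
hap_b = haplotype_obs("hap-b", "CYP2C19*2")
option_1 = genotype_obs("geno-1", "CYP2C19*2/*2", [hap_a["id"], hap_b["id"]])

# Option 2: one Haplotype observation referenced twice.
hap_shared = haplotype_obs("hap-shared", "CYP2C19*2")
option_2 = genotype_obs("geno-2", "CYP2C19*2/*2", [hap_shared["id"]] * 2)


def distinct_refs(genotype):
    """Count distinct resources referenced by derivedFrom."""
    return len({ref["reference"] for ref in genotype["derivedFrom"]})


print(distinct_refs(option_1), distinct_refs(option_2))
```

In option 1 the two haplotype resources differ only by id, so a consumer cannot tell whether they are two gene copies or a duplicated report. In option 2 the derivedFrom list carries the same reference twice.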
If these Observations are not instances or occurrences of variation found in the specimen/subject but more of definitional representations, then a whole bunch of other presumptions seem to break.
I do think this is a really strong argument for why I've been saying that we need a way to represent the definitional form of variation independent of the observation occurrences. Since we don't have true coding systems and databases of all the variants that we can standardize and use (similar to diseases and phenotypes) we need an outlet to define these patient independent forms of variants so that we can then reference them in terms of how and when they are observed in the context of a sample/specimen.
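As a rough illustration of that separation, the sketch below models a patient-independent definition and a per-specimen observation as two distinct types. The class names, fields, and values are all hypothetical and do not correspond to any existing FHIR resource or data type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantDefinition:
    """Patient-independent description of a change (hypothetical type)."""
    gene: str
    site: str
    allele: str


@dataclass(frozen=True)
class VariantObservation:
    """One occurrence of a definition in a specimen, on a given gene copy."""
    definition: VariantDefinition
    specimen_id: str
    gene_copy: int  # 1 or 2


# A single definition (placeholder site and allele)...
site3_ref = VariantDefinition("CYP2C19", "site3", "ref")

# ...observed independently on each copy of the gene in the same specimen.
obs_copy1 = VariantObservation(site3_ref, "specimen-1", gene_copy=1)
obs_copy2 = VariantObservation(site3_ref, "specimen-1", gene_copy=2)

print(obs_copy1.definition is obs_copy2.definition)
print(obs_copy1 == obs_copy2)
```

Both observations share one definition while remaining distinct occurrences, which is the reuse that is awkward to express when only observation instances are available.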
Anyway, any further insights would be helpful.
Bob Dolin (May 08 2019 at 18:17):
Hi @Larry Babb ,
Another thought - If I'm running PharmCAT, I'm feeding all variants and relevant wildtypes into the algorithm, which is then computing a genotype. So in this case, haplotypes are derived from all the variants fed into the algorithm, and both haplotypes are derived from the same set of variants. From there, you'd have haplotype.method to tell you what algorithm was used.
I do think that in the case of GENE *2/*2, it would be incorrect to assert two derivedFrom links from the genotype to a single haplotype.
Larry Babb (May 09 2019 at 11:23):
Thanks @Bob Dolin. I appreciate the feedback.
However, it seems like the fundamental issue remains...
In order to use "derivedFrom" in Genotype or Haplotype, the observations that one higher order "composite" Observation uses to derive itself should be discrete/independent Observations in themselves. This means that we need to represent a homozygous genotype by deriving it from two independent but identical Observations that represent the Haplotype that is homozygous. Same goes for any Observations that are deriving themselves from ObsVariant profiles as well.
To me this is a fundamental issue that must be resolved in a manner that does not cause confusion to the producer or consumer of these Observations. Either that or the CG might consider removing/replacing the "derivedFrom" guidance, significantly re-defining and providing constraints on its use to support use cases for homozygosity and use of identical but truly independent sub observations, or finding a pathway to a small set of data types/resources that can be used to "define" variants so they may be referenced in observations such that it separates the definitional aspect of the resource from the observational aspect.
It is impressive that the CG has been able to get to this point with the DR and OBS profiles without having the data type and/or resource building blocks needed to define the incredibly complex and detailed aspects of genetics which all of these results are based on. Observation profiling seems to be falling into the same trap as the V2 Z-segment issues whereby folks use it to capture all the stuff that doesn't have a home anywhere else. It would be really great to not repeat that in FHIR to the extent we can.
Larry Babb (May 09 2019 at 12:00):
We did get a list of the precise sites that were measured to arrive at the various star allele meanings and we will be using that to create these more accurate representations of star alleles. I do think the entire use of star alleles (while very standard in reporting results) has some significant concerns from a computational use standpoint and thus should be addressed. I think it is better to clarify how the lab arrived at the star allele representation than to simply pass only the representation and let the recipient weed through the methodology or other narratives provided to try to decipher whether their patient was measured with the same accuracy and thoroughness as a competing lab's test. These tests may have implications based on population and other testing technology capabilities that could be very helpful in aiding a system and the physician in understanding exactly how reliable the information is.
In any case, we will do our best to pass this information, possibly by expressing the individual sites as variants with zygosity and bypassing the haplotype construction, going directly to the genotype/diplotype instead. I think this is much better aligned with what the lab is actually doing to make their best guess as to the phasing when it is not clear. Actually, in some cases they will report that the combination of site variants observed could be one of two diplotypes (since they can't be sure of the phasing). So I can make this work using this approach.
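The phasing ambiguity can be shown with a small Python sketch that enumerates which haplotype pairs are consistent with unphased per-site zygosity. The haplotype table is invented for illustration: the two sites, the 0/1 encoding, and the "*X" allele are assumptions, not real CYP2C19 definitions, and this is not how any particular lab or tool makes its calls.

```python
from itertools import combinations_with_replacement

# Hypothetical definitions: 1 means the alt allele is present at that site.
HAPLOTYPES = {
    "*1": (0, 0),
    "*2": (1, 0),
    "*3": (0, 1),
    "*X": (1, 1),  # made-up allele carrying both changes in cis
}


def candidate_diplotypes(site_alt_counts):
    """Return haplotype pairs whose summed alt counts match the unphased calls."""
    target = tuple(site_alt_counts)
    matches = []
    for a, b in combinations_with_replacement(sorted(HAPLOTYPES), 2):
        summed = tuple(x + y for x, y in zip(HAPLOTYPES[a], HAPLOTYPES[b]))
        if summed == target:
            matches.append(f"{a}/{b}")
    return matches


# Heterozygous at both sites, phase unknown: two diplotypes fit.
print(candidate_diplotypes((1, 1)))
# Homozygous alt at site 1, reference at site 2: only one diplotype fits.
print(candidate_diplotypes((2, 0)))
```

Under these assumed definitions, two heterozygous sites are consistent with either *1/*X (changes in cis) or *2/*3 (changes in trans), which is the "one of two diplotypes" situation described above.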
Bob Dolin (May 09 2019 at 15:02):
It's a good point @Larry Babb , that oftentimes the star allele is ambiguous. In our work, we're just feeding the variants+zygosity+phase information to a CDS engine, so we are only using the Variant profile. But perhaps there is a need to take a closer look at @Bob Milius's work on the Genotype List String, which can represent the ambiguity in HLA genotyping.
Larry Babb (May 13 2019 at 14:19):
@Bob Milius can you point me at this "Genotype List String" concept that @Bob Dolin is referring to? Is this an alternative to the Genotype/Haplotype/Variant "derivedFrom" approach that is described in the IG currently?
Bob Dolin (May 13 2019 at 14:58):
whoops. I meant to include that: https://www.ncbi.nlm.nih.gov/pubmed/23849068
Bob Milius (May 14 2019 at 20:59):
@Larry Babb https://glstring.org/syntax.html
This was created to describe allele and genotype ambiguities for HLA & KIR, but the syntax is nomenclature agnostic as long as the nomenclature doesn't collide with the glstring delimiters (^|+~/). Right now the code system supports HLA & KIR, but other namespaces/nomenclatures can be added. For use in HL7 FHIR, we use the code system http://glstring.org and a code of the form namespace#version#glstring. For example, here is a glstring code for an HLA-C genotype that contains an ambiguous allele:
<valueCodeableConcept>
  <coding>
    <system value="https://glstring.org"/>
    <version value="1.0"/>
    <code value="hla#3.31.0#HLA-C*06:02:01:01/HLA-C*06:02:01:10+HLA-C*07:02:01:03"/>
  </coding>
</valueCodeableConcept>
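A consumer could pull the three parts of such a code apart with a small helper. This is a minimal Python sketch that assumes only the namespace#version#glstring layout described above; the function name is made up for illustration.

```python
def split_glstring_code(code):
    """Split 'namespace#version#glstring' into its three parts.

    maxsplit=2 leaves the glstring intact after the first two '#' separators.
    """
    namespace, version, glstring = code.split("#", 2)
    return namespace, version, glstring


code = "hla#3.31.0#HLA-C*06:02:01:01/HLA-C*06:02:01:10+HLA-C*07:02:01:03"
namespace, version, glstring = split_glstring_code(code)
print(namespace)
print(version)
print(glstring)
```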
Bret H (Aug 21 2019 at 03:48):
Bob's glstrings are a very convenient shorthand for HLA. But don't be afraid to use a more complex representation that utilizes phase. Don't want to push you in either direction. But with all the options for sending genetic data, normalization will continue to be a problem.
Bob Milius (Aug 27 2019 at 18:10):
GLStrings can describe alleles being in phase using the "~" delimiter. The string
HLA-A*02:302~HLA-B*44:02+HLA-A*23:26~HLA-B*49:08
describes a genotype containing phased HLA-A and HLA-B alleles (made-up example).
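For a single genotype, the nesting can be unpacked with a few string splits. The Python sketch below handles only the three delimiters used in this thread ("+" between gene copies, "~" between phased alleles, "/" between ambiguous alleles) and assumes they nest in that order; it ignores the "^" and "|" delimiters and is not a full GL String parser.

```python
def parse_genotype(glstring):
    """Return copies -> phased loci -> ambiguous alleles as nested lists."""
    return [
        [locus.split("/") for locus in copy.split("~")]
        for copy in glstring.split("+")
    ]


phased = parse_genotype("HLA-A*02:302~HLA-B*44:02+HLA-A*23:26~HLA-B*49:08")
ambiguous = parse_genotype(
    "HLA-C*06:02:01:01/HLA-C*06:02:01:10+HLA-C*07:02:01:03"
)

for copy in phased:
    print(copy)
for copy in ambiguous:
    print(copy)
```

Each outer element is one chromosome copy. A copy with more than one locus carries phased alleles, and a locus with more than one allele is ambiguous.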
Last updated: Apr 12 2022 at 19:14 UTC