Stream: genomics

Topic: Need to include chromosome as a component to ObsVariant


Mullai Murugan (Oct 04 2019 at 22:11):

Though chromosome can be derived from other input, it will be helpful to call out chromosome as a component in ObsVariant. Can this be added, please?

Jamie Jones (Oct 04 2019 at 22:16):

It's on the genomic finding profile it inherits from. You should use the component 48001-2 "cytogenetic (chromosome) location", unless I'm mistaken.

Bret H (Oct 04 2019 at 22:22):

component:cytogenetic-location ΣI 0..* BackboneElement Cytogenetic (chromosome) location

Make sure to view the snapshot view as well as the differential. The elements inherited from more abstract profiles are not easily seen otherwise.

Mullai Murugan (Oct 04 2019 at 22:33):

Apologies if I am not understanding correctly, but wouldn't the name "cytogenetic (Chromosome) location" create confusion? In addition to cytogenetic location, what about also including Chromosome using LOINC 48000-4? The answer list for this code includes the type of data we want to display.

Jamie Jones (Oct 04 2019 at 22:36):

I personally like 48000-4 and LL2938-0. Slicing is currently 'open', so you should feel free to include that component for your use case. I think we should consider standardizing it in our profile.

Jamie Jones (Oct 04 2019 at 22:39):

The current approach is through RefSeq (and either 51958-7 or 48013-7) but it's not as obvious in my opinion

Kevin Power (Oct 04 2019 at 22:46):

Yes that feels like a miss. @Mullai Murugan can you log a tracker?

Bret H (Oct 05 2019 at 01:25):

you think we should change our component?

Bret H (Oct 05 2019 at 01:28):

@Mullai Murugan if it is just the value set: the component we have is unbound, and the LOINC code is NOM (nominal scale). You can use the coded values you want with the current component. @Kevin Power @James Jones

Bret H (Oct 05 2019 at 01:31):

LOINC 48000-4 is specifically limited to chromosome. The LOINC code we are using is 48001-2, which allows one to use chromosome regions as well as chromosome IDs. I think the greater flexibility is an important characteristic. You can use the current component with IDs such as Chromosome 1, etc. So I'd recommend using the existing component rather than duplicating the meaning or constraining away the possibility of chromosomal regions.
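
As a rough illustration of that suggestion, here is a minimal Python sketch of the existing component carrying either a bare chromosome or a chromosome plus band. The helper name `cytogenetic_location_component` is hypothetical, and sending the value as free text in a CodeableConcept is an assumption for illustration, not something the IG mandates.

```python
import json

LOINC = "http://loinc.org"


def cytogenetic_location_component(text: str) -> dict:
    """Build an Observation.component dict coded with LOINC 48001-2."""
    return {
        "code": {
            "coding": [
                {
                    "system": LOINC,
                    "code": "48001-2",
                    "display": "Cytogenetic (chromosome) location",
                }
            ]
        },
        # Assumption: the value is sent as plain text, since the thread
        # describes this component as unbound.
        "valueCodeableConcept": {"text": text},
    }


# The same component can carry either level of detail.
chromosome_only = cytogenetic_location_component("7")
chromosome_and_band = cytogenetic_location_component("7q31.2")

print(json.dumps(chromosome_and_band, indent=2))
```

Both values are valid against an unbound component, which is the flexibility described above; it also means a receiver cannot tell from the structure alone which level of detail it was sent.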

Mullai Murugan (Oct 05 2019 at 03:01):

@Bret H @Kevin Power @James Jones Isn't the cytogenetic location referring to the location in a chromosome? E.g. the CFTR gene is at 7q31.2, which indicates it is on chromosome 7, q arm, band 3, sub-band 1, and sub-sub-band 2. If this is correct, then for my understanding: are you recommending that we use the cytogenetic location but limit our result to just the chromosome name (using the appropriate value set from 48000-4)? I understand the duplication concern and I realize cytogenetic location offers more granularity; my only concern is that we might not be explicit with our data if we say cytogenetic location and provide only the chromosome name. Since the components in ObsVariant are mostly optional, it will be helpful for implementers like us to have both and decide which one is more appropriate. I have created the tracker item and referenced this chat, looking forward to feedback on the same. https://gforge.hl7.org/gf/project/fhir/tracker/?action=TrackerItemEdit&tracker_item_id=24906

Bret H (Oct 05 2019 at 04:23):

The current element allows one to choose to send chromosome or chromosome+band information. So I'd say the developer already has the choice. The field will need to be parsed unless stored as is, so in terms of precision that would be up to the underlying system to determine. The chromosome name+band info is a well-known standard, and easily parsed. Why would sending only 7 in the current element not be explicit?
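
To make the parsing claim concrete, here is a small Python sketch that splits a simple cytogenetic location such as 7q31.2 into chromosome, arm, and band. The pattern is an assumption covering only single-location strings for human chromosomes 1-22, X, and Y; ranges such as 7q31.2-q32 and full ISCN syntax are deliberately out of scope, and the names `CytoLocation` and `parse_cytogenetic_location` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class CytoLocation(NamedTuple):
    chromosome: str
    arm: Optional[str]
    band: Optional[str]


# Optional "chr"/"chromosome" prefix, then the chromosome name, then an
# optional arm (p or q) with an optional band such as 31.2.
_PATTERN = re.compile(
    r"^(?:chr(?:omosome)?\s*)?(1[0-9]|2[0-2]|[1-9]|X|Y)"
    r"(?:([pq])(\d+(?:\.\d+)?)?)?$",
    re.IGNORECASE,
)


def parse_cytogenetic_location(text: str) -> Optional[CytoLocation]:
    """Return the parsed location, or None if the text does not match."""
    match = _PATTERN.match(text.strip())
    if match is None:
        return None
    chromosome, arm, band = match.groups()
    return CytoLocation(chromosome.upper(), arm.lower() if arm else None, band)


print(parse_cytogenetic_location("7q31.2"))
print(parse_cytogenetic_location("Chromosome 1"))
print(parse_cytogenetic_location("not a location"))
```

A bare "7" parses to chromosome 7 with no arm or band, so a receiver can still recover the chromosome, but only by running every value through a parser like this one.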

Jamie Jones (Oct 21 2019 at 14:42):

linking to tangentially related discussion re: chromosomal representation granularity so we don't lose it: https://chat.fhir.org/#narrow/stream/179197-genomics/topic/CNV.20size

Bob Dolin (Oct 29 2019 at 22:51):

Hi @Larry Babb , @Mullai Murugan

I'm not sure I was completely following your All of Us presentation this morning, where it seemed the conversation briefly turned to whether it's better to communicate chromosome number + build, vs reference sequence. The comment I made on the call was that the former is ambiguous. I thought I would explain my reasoning a bit more here, if only to make sure my understanding is correct :wink: .

When you look at a VCF file, the #CHROM field doesn't represent a chromosome number, but rather, a pointer to a contig in the VCF header:

##fileformat=VCFv4.1
...
##contig=<ID=1,assembly=b37,length=249250621>
...
##contig=<ID=GL000191.1,assembly=b37,length=106433>
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample
1 14370 . G A . . . GT 1|0
GL000191.1 14370 . G A . . . GT 1|0

GenBank states that GL000191.1 is 'chromosome 1 unlocalized genomic contig'. So, if I simply tell you I have a variant on Chr 1, build37, and I don't tell you the RefSeq, how can I differentiate between these two VCF rows?
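
The point can be sketched in Python by resolving each row's #CHROM value against the contig lines in the header. This is a hand-rolled, minimal parser written for the snippet above, not a real VCF library; it assumes tab-delimited records and contig attributes without quoted commas, and the function names are hypothetical.

```python
import re

# The example from this post, with tabs between record columns.
VCF_TEXT = """\
##fileformat=VCFv4.1
##contig=<ID=1,assembly=b37,length=249250621>
##contig=<ID=GL000191.1,assembly=b37,length=106433>
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSample
1\t14370\t.\tG\tA\t.\t.\t.\tGT\t1|0
GL000191.1\t14370\t.\tG\tA\t.\t.\t.\tGT\t1|0
"""


def parse_contigs(vcf_text: str) -> dict:
    """Map each contig ID in the header to its key=value attributes."""
    contigs = {}
    for line in vcf_text.splitlines():
        match = re.match(r"##contig=<(.*)>$", line)
        if match:
            fields = dict(item.split("=", 1) for item in match.group(1).split(","))
            contigs[fields["ID"]] = fields
    return contigs


def records_with_contig(vcf_text: str) -> list:
    """Return (contig ID, position, contig length) for each data row."""
    contigs = parse_contigs(vcf_text)
    rows = []
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip header and blank lines
        chrom, pos = line.split("\t")[:2]
        rows.append((chrom, int(pos), contigs[chrom]["length"]))
    return rows


for row in records_with_contig(VCF_TEXT):
    print(row)
```

Both rows share position 14370 and the b37 assembly, so only the contig ID (and the sequence it points to) tells them apart.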

Also, my understanding is that GA4GH's refget API is designed to further reduce ambiguity, by introducing a unique identifier for each reference sequence, such as:

##contig=<ID=21,length=48129895,assembly=B37,md5=2979a6085bfe28e3ad6f552f361ed74d>
##contig=<ID=22,length=51304566,assembly=B37,md5=a718acaa6135fdca8357d5bfe94211dd>

So maybe in a future iteration of the IG, we need to make room for this checksum too.
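
For illustration, a checksum of this kind can be computed with Python's standard library. The normalization used here (strip whitespace, uppercase, then MD5 the ASCII bytes) is an assumption about how such identifiers are usually derived and should be checked against the refget specification; the function name `md5_sequence_digest` is hypothetical.

```python
import hashlib


def md5_sequence_digest(sequence: str) -> str:
    """Return the hex MD5 of a sequence after removing whitespace and uppercasing."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


# The same bases give the same digest regardless of case or line wrapping.
print(md5_sequence_digest("acgt\nACGT"))
print(md5_sequence_digest("ACGTACGT"))
```

Because the digest depends only on the sequence content, two parties can confirm they mean the same reference sequence without agreeing on a naming scheme.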

Larry Babb (Oct 30 2019 at 14:30):

Thanks @Bob Dolin .

I am not a bioinformatician (which is probably obvious). I do think in the case above one would be required to represent the second call "GL000191.1" using an accession and not an "assembly+chr#" approach. If I gave the impression that there was only one way to do it, then I misspoke. But, as with HGVS, folks use this assembly+chr# to convey the indirect reference to the accession, which is presumed to be in the primary assembly of the "assembly" portion of the pairing. I've always questioned the practice myself, but folks do it, probably because a large portion of calls are made against the accessions in the primary assembly mapped by chromosomes.

So, the question for the CG IG is whether it will support both approaches or mandate that RefSeq accessions are the only way to represent the reference. It then opens the door again to whether HGVS should be formalized to include the accession, or whether we should keep it as a nice-to-have and assume the critical component attributes that are embedded in the HGVS expression are pulled out into required attributes to represent the allele or other types of genetic variation that the sender may be attempting to share.

And I do think the refget (and VR) digests (checkSums) should have a place in the model. I think it will naturally present itself when we get a real variant model and set of data types created that will enable us to share variation data more consistently and precisely. I'm holding out hope this will occur when the IM work is finally refactored back into the CG IG spec and we convince the FMG that we actually need some core data types to deal with the complex and unique constructs within the clinical genomics domain.

Bret H (Nov 18 2019 at 16:40):

@Larry Babb to be complete, HGVS requires a reference, per the authors of the syntax. It is also best practice to include a reference.

Larry Babb (Nov 18 2019 at 17:45):

I agree. I think this was more of a question about whether or not the IG would constrain the HGVS expression to be complete. I believe the answer is no. The CG IG is trying to be a much more flexible model to allow folks to share variants in the variety of ways they do today.

Bret H (Nov 18 2019 at 17:46):

right. we can give guidance that best practice is to use correctly formatted HGVS. But implementers of the spec - senders or recipients - will have to write code to determine if the HGVS is acceptable.

Larry Babb (Nov 20 2019 at 14:15):

Getting back to the original thread related to the need for a specialized chromosome # only component versus a more flexible cytogenetic-location that could be a mix of chromosome and chromosome+band information...

@Bret H There are two very common practices for sharing computable representations of sequence variants.

  1. reference-assembly + chromosome#, genomic start, genomic stop, genomic ref, genomic alt
  2. reference sequence accession, genomic start, genomic stop, genomic ref, genomic alt

I prefer #2 above as it is more compact and there is no need to "derive" the reference sequence accession that was used as the basis for the sequence change. But, nevertheless, there are many folks that use the assembly+chr# to convey which reference sequence was used as the basis for the positions and change that the variant is made up of.
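
The two practices, and the "derive" step between them, could be modelled roughly as follows in Python. The class names and the small lookup table are hypothetical and cover only chromosomes 1 and 7; the example variant reuses the illustrative VCF row from earlier in the thread rather than any real finding.

```python
from dataclasses import dataclass

# Illustrative lookup only: RefSeq accessions for chromosomes 1 and 7 in the
# primary assemblies of GRCh37 and GRCh38. A real table would cover every
# chromosome and would need a policy for non-primary contigs.
PRIMARY_ASSEMBLY_ACCESSIONS = {
    ("GRCh37", "1"): "NC_000001.10",
    ("GRCh37", "7"): "NC_000007.13",
    ("GRCh38", "1"): "NC_000001.11",
    ("GRCh38", "7"): "NC_000007.14",
}


@dataclass(frozen=True)
class AssemblyChromVariant:
    """Practice 1: reference assembly plus chromosome number."""
    assembly: str
    chromosome: str
    start: int
    stop: int
    ref: str
    alt: str


@dataclass(frozen=True)
class AccessionVariant:
    """Practice 2: reference sequence accession."""
    accession: str
    start: int
    stop: int
    ref: str
    alt: str


def to_accession_variant(variant: AssemblyChromVariant) -> AccessionVariant:
    """Derive the accession form; raises KeyError if the pair is not in the table."""
    accession = PRIMARY_ASSEMBLY_ACCESSIONS[(variant.assembly, variant.chromosome)]
    return AccessionVariant(accession, variant.start, variant.stop, variant.ref, variant.alt)


example = AssemblyChromVariant("GRCh37", "1", 14370, 14370, "G", "A")
print(to_accession_variant(example))
```

The lookup works only because the chromosome is presumed to mean the primary-assembly sequence, which is the presumption questioned earlier in this thread.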

So, the Baylor HGSC lab uses the assembly+chr# approach to communicate this information in eMERGE. We would like to support it.

If we use the "cytogenetic-location" component, I could simply apply the chromosome value set referenced above, since this is an unbound CodeableConcept, and it would work. However, if we do that, we would be presuming a specific implementation detail that is useful only within the eMERGE pilot. Since chromosome# is arguably as useful a value as something like gene symbol, and maybe more so, it doesn't sit well.

The argument is really about precision. Having a component that is designed to carry a constrained value set is useful to all and requires no negotiation between sender and receiver. Passing chromosome via cytogenetic-location would require everyone to use some type of regular expression to validate the "text" value being passed and then parse out the chromosome (presuming it passed), or to inspect whether a coded value (from a value set that is not bound to the component) is a legitimate chromosome value from said value set. Or both. Or some other contrived method.

I get the desire for flexibility and reuse. It's a slippery slope either way. It comes back to some basic principles in how the group is determining what components are worthy of specification and which can be "generalized" and left to the individual implementers to negotiate amongst themselves.

If you think that the cytogenetic-location component is the preferred way to share some type of controlled values for chromosomes, could you please demonstrate how this should be done in a standard way that does not leave it up to implementers to vary? I'd be happy to follow a standard practice on this if it can be achieved. I just would rather not have to tell the various consumers of our message that they must investigate and parse out the chromosome # from a cytogenetic-location text field. Nor would I prefer to say that we all have to presume our cytogenetic-location value is bound to a value set of chromosomes but other sources using this component may not use it the same way.
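
The receiver-side difference can be sketched in Python. With a component bound to a chromosome value set, the check is a set-membership test on the coding; with an unbound component, a text-only value leaves the receiver to fall back on parsing. The code system URL and codes below are hypothetical placeholders, not the LOINC answer list, and the function name is made up for illustration.

```python
# Hypothetical code system and codes, for illustration only.
CHROMOSOME_SYSTEM = "http://example.org/CodeSystem/chromosome"
CHROMOSOME_CODES = {str(n) for n in range(1, 23)} | {"X", "Y"}


def chromosome_from_concept(concept: dict):
    """Return the first coding that names a chromosome in the value set, else None."""
    for coding in concept.get("coding", []):
        if (
            coding.get("system") == CHROMOSOME_SYSTEM
            and coding.get("code") in CHROMOSOME_CODES
        ):
            return coding["code"]
    return None


coded = {"coding": [{"system": CHROMOSOME_SYSTEM, "code": "7"}]}
text_only = {"text": "7q31.2"}

print(chromosome_from_concept(coded))      # a coding from the value set is present
print(chromosome_from_concept(text_only))  # no coding; the receiver would have to parse the text
```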

Kevin Power (Nov 20 2019 at 14:47):

I am torn on this one, but lean towards a new component that is for 'chromosome' only.

Patrick Werner (Nov 20 2019 at 15:12):

I think we should rename this to chromosome location. I don't see a need for two components capturing the same concept, only differing in "precision" of the chromosome location.

Larry Babb (Nov 20 2019 at 15:49):

@Patrick Werner But there is a difference between a chromosome and a chromosome location, similar to a gene and a gene location, or any other kind of concept where the thing itself is important and worthy of capturing as a precise thing. Parsing values and requiring specialized handling to derive the concept needed is a natural inclination, but seems to deviate from how other components were declared. I mean, you can get accession, position, ref and alt from an HGVS expression, so why have separate components for those as well? When I ask the domain experts I work with if they think a chromosome is the same as or different from a cytoband location or chromosome location, they think one is the basis for the other (like a whole-part association). The location is not the thing, it is a specific place within it (even though it may be the entire place represented by the start and end point within it).

Bret H (Nov 21 2019 at 11:36):

@Larry Babb the IG accommodates #2 and #1. But the question here is more about where to put '7.q2' versus 'Chromosome 7.' @Patrick Werner how are you using this field in Germany? As a display field or for computation?

Bret H (Nov 21 2019 at 11:42):

All - bear in mind the search queries that are likely to be performed if you are interested in a fully computable field. If the field is for display only, like a comment, then I wouldn't be too stringent. @Larry Babb I want to up-vote your comment 'could you please demonstrate how this should be done in a standard way that does not leave it up to implementers to vary'. Totally agree that what the IG has now is very flexible and has many places where the implementer could vary. The current state in genomic reporting is like this now. The structure we have is a good move forward and gives a degree of semantic interoperability. But you are right in saying that asking the industry of lab reporting to adopt a specific value set for specific fields would be more standard. If GA4GH, ClinGen, ClinVar, eMERGE, SunQuest, Lab Sequence, Myriad, Invitae, ARUP, etc. all used the same value sets as well as structure and field definitions, that would be ideal.

Jamie Jones (Jan 14 2020 at 22:18):

There seemed to be backing on the call today to include a component for chromosome identifier, based on LOINC 48000-4. Thoughts on cardinality and binding strength to the referenced value set? I put [0..1] and "preferred" on the disposition in JIRA but am open to others (specifically 0..* and "required", or leaving it 0..1 and "extensible" to handle edge cases).

Kevin Power (Jan 14 2020 at 23:52):

Yikes - most all our findings would never cross chromosomes, but of course there are some that do - so not sure what to say on cardinality. I think I am OK with doing 'required' as the binding strength, unless others disagree?

Jamie Jones (Jan 15 2020 at 16:03):

I know we say we're aimed at human genomics but I'd like to see us keep the chromosome list as "preferred" in the universal IG in case others down the line wish to branch off of it for veterinary or bacterial.

Looking more closely, I feel we should mirror the cardinality we selected for cytogenetic location, which is 0..* to cover edge cases.

Patrick Werner (Jan 15 2020 at 16:16):

+1 for: 0..* and for a required binding.

Bret H (Mar 15 2020 at 16:12):

We're going to need a profile for microorganisms. Is the Sequence Resource, ahem, Molecular Sequence still available? For example, the concept of diploid, haploid... rather, heterozygous and homozygous are diploid-centric. Microorganism sequences should perhaps be treated more like a reference sequence (as we've talked about using Molecular Sequence for)... but this would be off topic. For the topic here, BCR-ABL is a well-known translocation that involves two chromosomes. Our IG does not handle chromosomal rearrangements, with the very important exception of the field for ISCN terms. For clinical use this is consistent - for research use, I'm not as certain. So, I would suggest we provide guidance that the field in question here is used only for the case of variants on a single chromosome. Which would put me in the 1..1 cardinality camp.

Bret H (Mar 15 2020 at 16:15):

At some point we're limited to expression of what the common use cases warrant. Anything beyond that requires careful parsing.

Bob Milius (Mar 16 2020 at 19:20):

@Bret H

Is the Sequence Resource, ahem, Molecular Sequence still available?

I use MolecularSequence a lot. It allows me to report a sequence and all the variants to a reference within that sequence in a single resource. I much prefer the simplicity of that rather than having a separate resource for each variant. This should probably be a separate thread.


Last updated: Apr 12 2022 at 19:14 UTC