FHIR Chat · CoordinateSystem ValueSet

Stream: genomics

Topic: CoordinateSystem ValueSet

Jamie Jones (Feb 04 2019 at 17:31):

Hi all, starting a thread here for considering the CoordinateSystem ValueSet in our Variant profile, currently the internal textual set: "Interval | Variant method | Alignment method". See http://hl7.org/fhir/uv/genomics-reporting/2018May/ValueSet-genetic-coordinate-system.json, GF#19938, and the textual description from Bret:

"Base number of coordinate system either 0-based, with inclusive start and exclusive end (called interval), or 1-based, with inclusive start and end. However, two versions of 1-based are in common use. These systems are HGVS 1-based (called variant method) and VCF 1-based (called alignment method). In general, HGVS recommends right-justification and VCF recommends left-justification. However, these systems further address questions such as whether to place an insertion before or after the nucleotide, and, if always placed before the nucleotide, how to handle an insertion after the end of the sequence. Additionally, the systems deal with handling boundary effects of numbers between features. For more details see the HGVS and VCF guides."

The goal is to iron out the contents of this set and get it to Clem/LOINC for inclusion.

Jamie Jones (Feb 04 2019 at 18:07):

See also GF#16244, GF#19835, and GF#19836. It seems @Bob Milius had some great links about this topic that may be useful, if someone can find them (I tried but could not).

Jamie Jones (Feb 04 2019 at 20:13):

It seems this description doesn't quite match with the table we give on http://build.fhir.org/ig/HL7/genomics-reporting/genetic-coordinate-system.html

Bret H (Feb 05 2019 at 15:42):

Can you point out the specifics? I wrote both, so in theory there should be consistency

Jamie Jones (Feb 05 2019 at 17:03):

I think I am confused by the last entry in the table, which uses the words "Variant Method" to define the Alignment Method:
Alignment method | 1-based [inclusive start, inclusive end] - with Variant Method (HGVS-like)

Kevin Power (Feb 08 2019 at 23:41):

@Bret H - I think this will be a topic on Monday. Will you have a chance to respond here or be available to discuss Monday?

Bret H (Feb 09 2019 at 19:05):

Yep. That extra text is misplaced. @James Jones, after removing everything from "with Variant Method..." to the end, do you see any other problems?

Still looking for Bob's references.

Bob Dolin (Feb 10 2019 at 19:20):

Bret, I don't understand why alignment method, right vs. left justification, etc, are part of the definition of coordinate system. Isn't the main objective to just know that when I'm referring to, say, position 7-8 in the string 'ACTGGCATAG', which base(s) I'm referring to?

Bret H (Feb 11 2019 at 12:53):

From: Bob Milius <bmilius@NMDP.ORG>
Sent: Monday, November 19, 2018 10:33 AM
To: Bret Heale <Heale@imail.org>; McDonald, Clem (NIH/NLM/LHC) [E] <clemmcdonald@mail.nih.gov>; Amos, Liz (NIH/NLM) [E] <liz.amos@nih.gov>; Power,Kevin M. <Kevin.Power@Cerner.com>
Subject: Re: Question re: 16244

Brett,

No, both are inclusive, and the coordinates of the sequences themselves remain the same.

So, you could just use two coordinate systems for describing just the coordinates (1-based, inclusive start and end; 0-based, inclusive start, exclusive end, aka interval)

But the 1-based systems differ when describing variants.

When you describe an insertion with a 1-based system, do you put the insertion before or after the nucleotide?

If you decide it's always before the nucleotide, how to handle an insertion after the end of the sequence?

How do you handle boundary effects of numbers between features?

HGVS recommends right-justified, and VCF recommends left-justified.

There was a great discussion on this that Larry Babb pointed me to, but the link appears to be dead now. I've asked Larry if it's been moved.

Here's a figure from that now-dead link, which shows how variants are described in the three systems.

http://datamodel.clinicalgenome.org/development/images/variant_numbering_examples.png

Re changing the IG, there should be a QA period where we can make corrections.

Bob M

Bret H (Feb 11 2019 at 12:54):

pasted image

Bret H (Feb 11 2019 at 13:06):

@Bob Dolin I added the email from Bob Milius. The figure describes the difference. My role was to take the information from Bob M as agreed by Clem et al. We should discuss the differences in the call. My preference is to use only zero-based, with explicitly stated inclusive start and exclusive end. The machines on either end can convert to whatever numbering system is desired. But an explicit definition seems necessary to satisfy all people. This is very much related to the need for using canonical SPDI in HGVS. Folks are too creative...
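The "machines on either end can convert" remark can be sketched as below. This is a minimal illustration, not part of the IG: the function names are hypothetical, and it covers only the coordinate arithmetic between 0-based (inclusive start, exclusive end) and 1-based (inclusive start and end) ranges.

```python
from typing import Tuple


def to_zero_based_half_open(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based, inclusive-start/inclusive-end range to
    0-based, inclusive-start/exclusive-end."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def to_one_based_closed(start: int, end: int) -> Tuple[int, int]:
    """Convert a 0-based, inclusive-start/exclusive-end range to
    1-based, inclusive-start/inclusive-end."""
    if start < 0 or end <= start:
        # An empty half-open range (start == end) has no closed equivalent.
        raise ValueError("expected 0 <= start < end (non-empty range)")
    return start + 1, end


print(to_zero_based_half_open(2, 4))  # (1, 4)
print(to_one_based_closed(1, 4))      # (2, 4)
```

Only the start shifts by one; the end value is numerically the same in both systems. The empty-range case is rejected here because a position between two bases (for example an insertion point) cannot be written as a 1-based closed range without an extra convention, which is where the 1-based systems in the figure diverge.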

Bob Dolin (Feb 11 2019 at 14:18):

@Bret H Sorry, still not understanding. In the figure, both Alignment Format and Variant Format are 1-based fully closed (https://plastid.readthedocs.io/en/latest/concepts/coordinates.html). The differences aren't due to choice of coordinate system, but rather, choice of alignment and variant representation. About 2% of simple variants in 1000 Genomes VCF files have redundant (non-normalized) representation, but we can still say that the VCF files are 1-based fully closed. Wouldn't alignment method be a separate field?
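The point about redundant (non-normalized) representations can be illustrated with a small sketch. It assumes a made-up sequence and hypothetical helper names, handles only simple deletions given as 1-based fully closed ranges, and is not the normalization algorithm of any particular tool. It shows that left- and right-justified placements of the same deletion differ in position while the coordinate system stays the same.

```python
from typing import Tuple


def apply_deletion(seq: str, start: int, end: int) -> str:
    """Delete the 1-based, fully closed range [start, end] from seq."""
    return seq[:start - 1] + seq[end:]


def left_justify_deletion(seq: str, start: int, end: int) -> Tuple[int, int]:
    """Shift a deletion as far left as possible without changing the result."""
    # Moving left by one is safe when the base just before the deletion
    # equals the last deleted base.
    while start > 1 and seq[start - 2] == seq[end - 1]:
        start, end = start - 1, end - 1
    return start, end


def right_justify_deletion(seq: str, start: int, end: int) -> Tuple[int, int]:
    """Shift a deletion as far right as possible without changing the result."""
    # Moving right by one is safe when the first deleted base equals the
    # base just after the deletion.
    while end < len(seq) and seq[start - 1] == seq[end]:
        start, end = start + 1, end + 1
    return start, end


seq = "GCACACT"  # hypothetical sequence containing a short repeat
print(left_justify_deletion(seq, 4, 5))   # (2, 3)
print(right_justify_deletion(seq, 4, 5))  # (5, 6)
print(apply_deletion(seq, 2, 3), apply_deletion(seq, 5, 6))  # GCACT GCACT
```

Both placements are expressed in the same 1-based fully closed coordinates and yield the same resulting sequence, which is consistent with treating justification as a property separate from the coordinate system.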

Bret H (Feb 11 2019 at 15:55):

@Bob Dolin not sure what you mean. Look at the row in the figure for 2delT. The numbering for the two 1-based options does not come out the same. Can you talk me through that example?

Side-bar: Also, you'll note in your example that the authors of plastid choose, and describe, a specific numbering/alignment method. Again, my vote would be for a single representation. The benefits include a consistent expectation for coordinate system. Unless reporting a method or wanting to describe metadata about a repository, when is it a problem to report everything with zero-base inclusive end? (PS: half-open, open, inclusive, exclusive are all terms that different authors have chosen; there is not a standard, just practice.) You've been working on conversion from VCF and BAM more recently than I. I look forward to your comments.

Bret H (Feb 11 2019 at 15:59):

@Bob Dolin pps: that's a well written resource.

Bret H (Feb 11 2019 at 16:52):

@Larry Babb @Bob Milius we talked about this on the call today. In the past you both made important contributions. Will you be available to review the final disposition? Many thanks!

Bob Dolin (Feb 11 2019 at 22:12):

@Bret H I'm all for a single coordinate system, although I kinda recall in prior committee discussions that it would be a hard sell.

Regarding your 2delT question - In the figure, 2delT is represented in: [1] Alignment format (1-based): 1..2 AT A; and [2] Variant Format (1-based): 2..2 T -. Note that in both cases, the "T" is at position 2. What's different is that the Alignment method doesn't want nulls in the Alt allele, so it begins the Ref allele with the position before the deletion.
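The 2delT comparison can be reproduced with a short sketch. The reference string "ATCG" is an assumption (the thread only establishes that position 1 is A and position 2 is T), and `anchor_deletion` is a hypothetical helper, not a function from any VCF library. It rewrites a deletion given in the variant style (deleted range only, empty Alt) into the alignment style (Ref and Alt both start with the preceding base).

```python
from typing import Tuple


def anchor_deletion(ref_seq: str, start: int, end: int) -> Tuple[int, int, str, str]:
    """Rewrite a deletion of the 1-based closed range [start, end] so that
    neither allele is empty, by prepending the base before the deletion."""
    if start < 2:
        # A deletion at position 1 has no preceding base; that case needs a
        # separate convention and is deliberately not handled in this sketch.
        raise ValueError("no preceding base to anchor on")
    anchor = ref_seq[start - 2]              # base just before the deletion
    ref_allele = ref_seq[start - 2:end]      # anchor plus the deleted bases
    alt_allele = anchor                      # anchor only
    return start - 1, end, ref_allele, alt_allele


ref_seq = "ATCG"  # assumed reference; only "AT" is known from the thread
print(anchor_deletion(ref_seq, 2, 2))  # (1, 2, 'AT', 'A')
```

The input 2..2 and the output 1..2 are both 1-based and fully closed, and the deleted T sits at position 2 in each. Only the choice of which bases to include in Ref and Alt differs.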

I'm not an expert on alignment methods, but have seen some references describing a VCF approach (https://academic.oup.com/bioinformatics/article/31/13/2202/196142) and the NCBI SPDI approach (https://www.biorxiv.org/content/biorxiv/early/2019/02/05/537449.full.pdf). I'm okay with deferring on a LOINC code for alignment method at this time, because as you mentioned on the call, it's not clear anyone has a compelling need for it.

As for Coordinate system, here's a draft definition and value set:

A genomic coordinate is a position along a sequence. Coordinate systems can start counting from 1 ('1-based') or from 0 ('0-based'). HGVS, GFF, SAM, VCF, UCSC genome browser, and other applications employ a 1-based system. In a 1-based system, the end is "fully-closed", meaning that the end coordinate of an interval's range is the last position IN the range. BAM files, BED files, NCBI SPDI, UCSC database tables, and other applications employ a 0-based system. In a 0-based system, the end is "half-open", meaning that the end coordinate of an interval's range is the first position NOT IN the range. For example, assume we have the sequence 'ACGTAGTC'. Interval range 2..4 in a 1-based coordinate system refers to the bases 'CGT', whereas interval range 2..4 in a 0-based coordinate system refers to the bases 'GT'.
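The 'ACGTAGTC' example in the draft definition can be checked with a few lines of Python. The function names are made up for illustration; Python string slicing is itself 0-based and half-open, so the 0-based case maps directly onto a slice.

```python
def bases_one_based_closed(seq: str, start: int, end: int) -> str:
    """Bases in a 1-based range where the end is the last position IN the range."""
    return seq[start - 1:end]


def bases_zero_based_half_open(seq: str, start: int, end: int) -> str:
    """Bases in a 0-based range where the end is the first position NOT IN the range."""
    return seq[start:end]


seq = "ACGTAGTC"
print(bases_one_based_closed(seq, 2, 4))      # CGT
print(bases_zero_based_half_open(seq, 2, 4))  # GT
```

The same pair of numbers, 2..4, selects three bases under the 1-based reading and two under the 0-based reading, which is why the coordinate system has to be stated alongside start and end.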

Bret H (Feb 25 2019 at 16:16):

need a definition for interval range

Bob Freimuth (Mar 06 2019 at 19:18):

FWIW, the authoritative write-up on this topic, which we did for ClinGen, is here: http://dataexchange.clinicalgenome.org/allele/discussion/

Larry Babb (Sep 14 2019 at 02:03):

I'm restarting this thread to draw attention to the fact that the MolecularSequence.coordinateSystem does not use the same value set as the CG Variant profile component for coordinate-system (http://build.fhir.org/ig/HL7/genomics-reporting/obs-variant-definitions.html#Observation.component:coordinate-system).

It appears as if the thread above helped produce a good value set for the ObsVariant profile component, but there is no harmonization with the same concept in the MolecularSequence.coordinateSystem.

Should we have a gforge item to make sure this gets corrected?

Patrick Werner (Sep 14 2019 at 02:28):

Yes, that would be nice. MolecularSequence didn't get much love recently. There also was a discussion at some point about whether we still need it or whether we can replace it with the IG/profiled Observation. I'd like to continue the discussion soon.

Larry Babb (Sep 14 2019 at 20:22):

I don't think it's reasonable to replace it with the obsVariant profile. In the eMERGE pilot we've come to the recent conclusion that the MolecularSequence is the most computationally sound representation of genomic variation. Since there are no "definitional" representations of genomic structures, variants and otherwise, we are going to depend on it to provide the build, chr, refseqId, ref, alt, start, end information that is the closest to the standards coming out of the GA4GH VR work.

I do think it would be helpful to implementers if there was a single path for dealing with computationally precise variation for use in either case level or variant level assertion models.

To be clear, the profiles use componentized sets of LOINC codes to create CodeableConcept attributes for things like start/end ranges, with their dependency on a specific sequence (be it protein, transcript or genomic), and only loosely couple other components that may or may not contribute to the 'definition' of a variant. I don't think that is a clinically reliable and computationally sound form of sharing genetic variation data.

While I get the flexibility and consideration to allow the various groups around the world to share whatever sets of attributes they use either for human convenience or dare I say "computational identifiers", I view these profiles as a first shot at trying to solve two issues with one "bag" of attributes in a semi-reasonable shape of variation. It may start the ball rolling to allow folks to "find a place" to pass their favorite piece of variant information but it does not rise to the level of a computationally sound standard that I would want my health information to be represented in. I most definitely wouldn't want any CDS algorithms to rely on it.

Last updated: Apr 12 2022 at 19:14 UTC
