Stream: genomics
Topic: Structural Variants Spanning Multiple Genomic "Regions"
Rachel Kutner (Oct 28 2020 at 20:57):
How do we expect structural variants (e.g. large-scale genomic mutations >>50 bp) spanning multiple regions of the genome to be reported/structured?
Specifically, variants that are still single variants (e.g. wouldn't make sense as multiple separate variants related by a common therapeutic/diagnostic implication), but where one may need to record information about how each functional region is affected by the mutation.
An example would be a Gene Fusion where a chromosome region is inverted, resulting in two genes being fused. In this case, gene 1 should be recorded in association with its breakpoint (e.g. EGFR exon 20) and the other gene recorded with its own breakpoint (ALK 16) (examples not researched for clinical relevance). This could happen /external/ to genes as well, where two chromosomes are part of a large-scale mutation and each chromosome needs to have its breakpoint recorded.
Another example could (possibly) be a large-scale chromosomal mutation where a lab wants to note that key regions (introns/exons/genes/etc) are included in/affected by the deletion in certain ways (whole or partial deletion).
How do we support this type of result in FHIR? Is there the ability to have "related" subsets of information within a single reported variant?
Additionally, are there other types of information that we're aware of that might need to be included in these "pieces" beyond chromosome, gene, breakpoint, and region?
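One way to picture the "related subsets of information within a single reported variant" asked about above is a single variant record that owns a list of per-region pieces. The following is only a minimal sketch in plain Python: the class and field names (`AffectedRegion`, `StructuralVariant`, `effect`, and so on) are hypothetical and are not FHIR resources or profile elements.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AffectedRegion:
    """One 'piece' of a structural variant (hypothetical field names)."""
    chromosome: Optional[str] = None  # left unset when the report does not give it
    gene: Optional[str] = None        # None when the breakpoint lies outside any gene
    breakpoint: Optional[str] = None  # free text in this sketch, e.g. "exon 20"
    effect: Optional[str] = None      # e.g. "whole deletion", "partial deletion"


@dataclass
class StructuralVariant:
    """A single reported variant that carries any number of related pieces."""
    variant_type: str
    regions: List[AffectedRegion] = field(default_factory=list)

    def genes(self) -> List[str]:
        """Genes named by any piece, in the order they were recorded."""
        return [r.gene for r in self.regions if r.gene is not None]


# The fusion example from the message above (not researched for clinical relevance).
# Chromosomes are omitted because the message does not state them.
fusion = StructuralVariant(
    variant_type="gene fusion",
    regions=[
        AffectedRegion(gene="EGFR", breakpoint="exon 20"),
        AffectedRegion(gene="ALK", breakpoint="16"),
    ],
)
print(fusion.genes())
```

The same shape would cover the large-deletion example by adding one `AffectedRegion` per intron/exon/gene with `effect` filled in, which is why the list is not limited to two entries.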
Kevin Power (Oct 29 2020 at 14:56):
@Rachel Kutner - Great stuff. I don't have immediate reactions, other than that this is a great set of questions. Any chance you (or anyone) can point at some report examples for such use cases? I honestly can't remember if our example folder has any of these or not.
@Bob Freimuth - You will likely want to include these great questions/thoughts in the IM/Structural Variants discussion you asked about on the listserv today.
Bob Freimuth (Oct 29 2020 at 15:04):
Thanks for the tag, @Kevin Power . @Rachel Kutner - your timing is perfect. I just asked for this type of thing on our listserv. :-) I'll add this to the rest.
Rachel Kutner (Dec 22 2020 at 15:22):
@Bob Freimuth @Larry Babb @Jamie Jones Why do we have both Duplication and Copy Number Variant listed separately on our Structural Variants description?
image.png
I'm asking because I asked some contacts at MDA about how to represent Copy Number Variants and they stated that Deletions and Duplications are both types of Copy Number Variants. I could understand showing how to represent each different type, but Duplication and CNV in the chart are redundant (they have no differences in how they're represented).
I also am wondering why Allelic State is not relevant for CNV/Dup variants?
Larry Babb (Dec 22 2020 at 15:50):
@Rachel Kutner I have not been directly involved in the creation of the table and definitions above, but here's my understanding of the challenge with CNVs, Dupes, and Dels.
It is true that Dupes and Dels are "kinds" of CNVs. But I think it is fair to say that not all dupes and dels are CNVs. Some groups set an arbitrary threshold on the length of the sequence duplicated or deleted for it to constitute a CNV. Additionally, there's the conflation caused by the HGVS representation of Dupes being "a sequence change where, compared to a reference sequence, a copy of one or more nucleotides are inserted directly 3' of the original copy of that sequence."
One might argue that a deletion, insertion or duplication of any sequence in any molecule changes its structure (you might even be able to argue that a substitution has a notion of changing the structure). The GA4GH VRS group has been working very hard to address these issues and we're circling a solution that helps clarify the distinction between Structural or "systemic" changes from Sequence or "molecular" changes. If, for example, a change only impacts a single molecule then we refer to it as a molecular variant and note that the impact or variation is communicated to depict a change in that single molecule. As such you can imagine then the need for something like "allelic state" for molecular variants. Allelic state on a molecular variant provides "systemic" context. In the case of the allelic state of a DNA variant on a chromosome you could convey the genomic (or genotypic) representation of an allele (or molecular concept).
As you move into the SVs you will find it is typically more about the systemic state of the change(s). While it is true that you can express the molecular representation of a loss (Del) or gain (Dupe) of a region on that molecule the question becomes how to decide whether that Del or Dupe is representing the totality of copies in the genome or system within which the given Allele or molecule resides.
What really hampers folks from figuring out how to effectively and accurately convey this information is the practice of assuming a Deletion of a large region is a single copy loss of that region. As you astutely pointed out, there needs to also be an allelic state to determine if the Deletion on a sex chromosome or diploid chromosome is heterozygous or homozygous so one could infer whether the total copies remaining were 1 or 0, respectively. Not to mention: what if there was a dupe of the region on one copy and a del of the same or overlapping region on the other? Would this be a net copy of 2?
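The arithmetic in that paragraph can be sketched as follows. This is an illustration only: the function name `absolute_copy_number` and the per-copy encoding are assumptions made here, and the sketch assumes each chromosome copy starts with exactly one copy of the region and that the change on each copy is actually known, which (as discussed in this thread) is often not the case for dupes.

```python
from typing import Sequence


def absolute_copy_number(per_copy_changes: Sequence[int]) -> int:
    """Total copies of a region, given the net change on each chromosome copy.

    Each entry is the relative change on one copy of the chromosome:
    -1 for a deletion, +1 for one extra (duplicated) copy, 0 for unchanged.
    Every chromosome copy is assumed to start with exactly one copy of the region.
    """
    total = 0
    for change in per_copy_changes:
        if change < -1:
            raise ValueError("one chromosome copy cannot lose more than its single copy")
        total += 1 + change
    return total


print(absolute_copy_number([-1, 0]))   # heterozygous deletion
print(absolute_copy_number([-1, -1]))  # homozygous deletion
print(absolute_copy_number([+1, -1]))  # dupe on one copy, del on the other
```

Under these assumptions the dupe-plus-del case totals 2, the same number as an unchanged diploid region (`[0, 0]`), so an absolute count on its own would not distinguish the two.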
If you peruse the copy number variation entries in ClinVar you will see precisely marked CNV gains and losses (typically with ISCN x? nomenclature) that provide the total number of copies for the given regions of a certain size (I think they use 50 bp, but this is admittedly arbitrary). You will also see Dupes and Dels, identified either with methodologies that can confirm the location of the duplication or with ones where the submitter asserts that the location is not known. In these cases, the Dels are considered relative and molecular and thus would require an allelic state to derive an "absolute" copy number. The Dupes (depending on the methodology) use the HGVS nomenclature that implies that they are in tandem, even though the vast majority are not known to be that way (but some are). So the research and, dare I say, "clinical" folks assume that dupes (of a certain size) are similar to large dels in that they can be used to derive the total absolute copies in a genome (given the allelic state), even though they cannot determine for certain whether the copied region is in tandem or somewhere else in the genome.
Fun stuff.
The big takeaway for me (and the VRS team, I think) is finding a way to let the labs accurately represent what they found, rather than giving them a lossy method like HGVS that conveys information less precisely than what was observed. Using HGVS dupe notation is tricky at best. You can make a computational decision for your system to always treat HGVS dupes as "not in tandem" and representative of a single molecule change (even though you can't be certain the duplicated region may not be on the same instance of the molecule from which it was copied). If everyone does that it is probably the safest use of HGVS dupes. If not, then there will be many cases where there is too much presumed by the nomenclature itself.
Rachel Kutner (Dec 22 2020 at 16:01):
Interesting - thanks for the background @Larry Babb .
We mainly treat Structural vs Not Structural variants as "small" vs "large" and generally tell people to use "structural" when Genomic Coordinates or general Genomic Concepts (exons, genes, chromosomes) are most appropriate to represent the variant (to boil it down).
Can you explain more on the "Del/Dups are types of CNVs, but not all Deletions/Duplications ARE types of CNVs?" I'm trying to understand whether those values would be redundant or not and why.
Larry Babb (Dec 22 2020 at 16:40):
If you go to ClinVar and place "copy number gain"[Type of variation] in the search you will notice that ClinVar shows the full number of copies for each variant expressed, even though they also share the HGVS "DUP" syntax (typically with ambiguous or imprecise endpoints due to the use of microarray technology for testing). You should also look at the "copy number loss"[Type of variation] results; you will see the same thing, except with the HGVS deletions as the transformed ISCN syntax on the individual variant pages.
However, if you look for Duplications or Deletions and NOT copy number gain or loss in ClinVar you'll see many Dupes and Dels that you could argue are CNVs. And they are, but not all are (I assume).
If there's a deletion or duplication of one, two or more nucleotides in a sequence, would it constitute a CNV? I'm fairly certain that the CNV folks don't consider every Deletion and Duplication a CNV. And how does one discern if a Duplication is truly a tandem duplication or not, particularly when using microarray technology?
If you "tell people to use 'structural' when genomic coords or general genomic concepts (exons, genes, chrs) are most appropriate to represent the variant..." then it leaves quite a bit to the interpretation of that suggestion. The big question is: how do we provide a computationally reliable and consistent way to represent the data, so that every system that produces and consumes this data is provided an explicit means of knowing the difference between a sequence change, a tandem dupe/repeat, a molecular copy or loss, and a genomic copy or loss?
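To make "an explicit means of knowing the difference" concrete, one option is to require every record to carry an explicit kind rather than leaving consumers to infer it from size or syntax. The sketch below is hypothetical throughout: `ChangeKind`, `ReportedChange`, and the rule in `can_derive_absolute_copies` are illustrative names and a simplified policy chosen for this example, not part of VRS, HGVS, or FHIR.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ChangeKind(Enum):
    """The four cases named in the question above."""
    SEQUENCE_CHANGE = "sequence change"
    TANDEM_DUPLICATION = "tandem dupe/repeat"
    MOLECULAR_COPY_CHANGE = "molecular copy or loss"
    GENOMIC_COPY_CHANGE = "genomic copy or loss"


@dataclass(frozen=True)
class ReportedChange:
    kind: ChangeKind                     # must be stated by the producer, never guessed
    description: str
    allelic_state: Optional[str] = None  # hypothetical free-text field, e.g. "heterozygous"


def can_derive_absolute_copies(change: ReportedChange) -> bool:
    """True when a consumer has enough context to state total copies in the genome."""
    if change.kind is ChangeKind.GENOMIC_COPY_CHANGE:
        return True  # already expressed at the genome level
    if change.kind is ChangeKind.MOLECULAR_COPY_CHANGE:
        return change.allelic_state is not None  # needs systemic context
    return False  # in this sketch, other kinds are not treated as copy counts


rel_del = ReportedChange(ChangeKind.MOLECULAR_COPY_CHANGE, "large deletion")
print(can_derive_absolute_copies(rel_del))
```

A producer that cannot say which kind applies would have to say so explicitly, instead of the consumer assuming one from the notation.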
It is very possible for a given variant to be both a CNV and a sequence variant. However, I think of the CNVs as a genomic copy gain or loss and thus any large deletion or duplication that is meant to convey the concept of gain and loss of a region of the genome should provide the allelic state or some notion that would help convey the full impact on the genome.
Last point of interest - which may help a bit. When submitting CNVs to ClinVar you can put in the relative copy count or the absolute copy count. The relative copy count would be associated to a single copy of the molecule (or chromosome) and the absolute would be to the total copies of that region in the entire genome.
Larry Babb (Dec 22 2020 at 16:44):
I think it is pretty important to make sure the variant data is as precise and accurate as possible if we truly intend to use this data for Clinical Decision Support (alerting, etc...). If we make these broad assumptions at this level of the data then we will impact the integrity and use of clinical genetics in clinical care. Without serious consideration of the risk of the misrepresentation or misunderstanding of the data we could create errors that will set the use of genetics in practice back significantly.
Bob Freimuth (Jan 04 2021 at 17:10):
At the risk of getting a bit too philosophical, but echoing some of what Larry said:
When a genomics test is performed, a state is observed. Exactly what is detected will depend on the technology and method used to interrogate the specimen, which could impose limitations on downstream analysis.
Once a state is observed a series of interpretations can be made. Classifying a particular state as a CNV, for example, is an interpretation that requires the choice of a reference state. In general, the classification(s) assigned to a given variant are heavily dependent on the choice of reference. Few classifications are inherent to the variant itself and most are dependent on the chosen context. Furthermore, since a CNV can be rendered as a sequence variant, these classifications are not mutually exclusive and the use of one over the other comes down to which representation is most useful in a particular use case.
I think we should clearly separate the task of capturing "state" from that of recording interpretive statements. Ideally, both would be backed by well-defined models that enable important data to be represented as discrete and standardized forms, but the bar of unambiguous computability is much more important for state than it is for interpretation.
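A small sketch of that separation, assuming copy number is the thing observed: the state record holds only what was detected, and any classification is a separate record that names the reference it was made against. All names here (`ObservedState`, `Interpretation`, `classify_copy_number`, the region label) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservedState:
    """What the assay detected, with no classification attached."""
    region: str
    observed_copies: int


@dataclass(frozen=True)
class Interpretation:
    """A classification of a state, valid only for the stated reference."""
    state: ObservedState
    reference_copies: int
    classification: str


def classify_copy_number(state: ObservedState, reference_copies: int) -> Interpretation:
    """Classify an observed state relative to a chosen reference copy count."""
    if state.observed_copies > reference_copies:
        label = "copy number gain"
    elif state.observed_copies < reference_copies:
        label = "copy number loss"
    else:
        label = "no copy number change"
    return Interpretation(state, reference_copies, label)


state = ObservedState(region="hypothetical-region-1", observed_copies=2)
print(classify_copy_number(state, reference_copies=2).classification)
print(classify_copy_number(state, reference_copies=1).classification)
```

The same observed state is classified differently under the two references, which is the dependence on the choice of reference described above; the state record itself never changes.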
Joel Schneider (Jan 04 2021 at 18:23):
Some of today's discussion seemed to come close to the topic of lab reflex testing. Is that a topic we might want to explore further?
Bret H (Jan 26 2021 at 14:23):
A structural variant involves two or more 'genes', but what is a gene other than a locus? The Information Modeling sub-group run by @Bob Freimuth is whom I would look to for a computable definition that spans from the 'single-nucleotide' variant to large structural variation. I wonder if the term 'structural variant' has too much business logic associated with it (i.e. the operational definition depends on the team using it) to be useful in a standard. My suggestion is to look to the Information Modeling sub-group to offer a computable structure that accommodates variants that involve any number of genetic loci. Thanks
Bret H (Jan 26 2021 at 14:30):
for computation a diff from a standard would be nice. for the human being, a named variation feels easier. for a truly excellent system, the complexity is hidden behind identification of the problem - i.e. the human user sees actionable information (problem with X disease based on genomic analysis suggested actions are - click here for positional information/evidence). but ignore this philosophy for now. My first point stands, that *there should be a single computable structure for genetic variation positional information that is agnostic to the number of chromosomes/genomic loci involved (which is more or less what Bob F said above : ^ )*
Bret H (Jan 26 2021 at 14:30):
We can't go on forever inventing new types and expect continuity.
Last updated: Apr 12 2022 at 19:14 UTC