FHIR Chat · Representation of Repeat Expansions in FHIR · genomics

Stream: genomics

Topic: Representation of Repeat Expansions in FHIR


view this post on Zulip Rachel Kutner (Dec 03 2021 at 20:39):

Hey all, I just submitted the following tracker in Jira: https://jira.hl7.org/browse/FHIR-34418

There's a proposal attached with extra information, I gave a short summary in the main JIRA ticket.

Looking forward to discussing this further on the 20th.

view this post on Zulip Bret H (Dec 03 2021 at 21:05):

Awesome. I commented on the JIRA. The documentation is extremely informative and well done. I would be particularly interested in why the current mechanisms for phasing do not work. BTW, while I do believe the current components meet the use case, I would like to understand better why/if your team sees a gap.

It should be more than a desire to have a named component called 'repeat' versus something called 'copy-number'; on a molecular level they are the same (and the component names are not meant to be computable). At a higher level there are differences, but this is similar to the differences at a high level between an HLA-allele and a Star-allele. For these two the same underlying components are used - i.e. we do not have a component called HLA-allele and one called Star-allele.

It would be good to learn if it is the high-level category and classification that is desired to be communicated (might be worth exploring component:variation-code for that, or perhaps component:molecular-consequence or component:coding-change-type).

Here's the Sequence Ontology term to look at http://www.sequenceontology.org/browser/current_release/term/SO:0002165 (it fits in the extensible value set for component: molecular consequence).
@Rachel Kutner

The sender could hard-code the value in when sending a triplet repeat finding.

view this post on Zulip Arthur Hermann (Dec 03 2021 at 21:30):

Rachel Kutner said:

Hey all, I just submitted the following tracker in Jira: https://jira.hl7.org/browse/FHIR-34418

There's a proposal attached with extra information, I gave a short summary in the main JIRA ticket.

Looking forward to discussing this further on the 20th.

@Bob Freimuth - please take a look as well... Bret H has been responding quite a bit - but I would love to hear what you think from a modeling perspective (or if Bret H has that covered already)

view this post on Zulip Bret H (Dec 04 2021 at 00:06):

If the nested tandem repeat structures are added, we might not be able to align with the WG IM unless the IM model decides to adopt something like a 'structural' repeat based on repeats of alleles (in the IM sense and for CNV) versus a 'local sequence tandem repeat' where sequence repeats exist that do not encompass an entire gene (current limits for CNVs tend to be 1kb in size, but that is arbitrary).

As long as the start/end of the repeats are defined relative to a reference, then you're good to go. With the nested structure suggested it can simplify the representation of a tandem repeat (won't need to use haplotype, for example, when two types of trinucleotide repeats are occurrent).

view this post on Zulip Bret H (Dec 04 2021 at 00:15):

and just to say it - from a sequence level perspective sending the sequence change in Alt and a reference and a position would communicate the change. However, the desire to communicate the number of each type of repeat is what forces a more complex representation (either multiple variant profiles with haplotype or the nested structure proposed)

view this post on Zulip Bret H (Dec 04 2021 at 00:16):

I still like the idea of sending an SO code for tandem repeat to give the consumer an opportunity to know more clearly what they've just received.

view this post on Zulip Bob Dolin (Dec 04 2021 at 02:23):

@Bret H If the ref-allele field were changed to 0..*, and you include multiple ref-alleles, along with multiple locations and multiple copy numbers, does FHIR or JSON guarantee that you can determine which location and copy number goes with which ref-allele?

view this post on Zulip Bret H (Dec 04 2021 at 15:39):

@Bob Dolin It depends on what you are thinking. If we are talking about using multiple Variant profiles they would each have their own references.

Not sure what you're thinking.

view this post on Zulip Bret H (Dec 04 2021 at 15:42):

I assumed for repeats composed of multiple triplets that each triplet would have its reference included, or positional information provided. This would be in each repeat-pair if using the nested extension type of communication.

view this post on Zulip Bret H (Dec 04 2021 at 15:46):

The use of component:molecular-consequence can tip off the receiving system that it's got a triplet to deal with.

Here's the Sequence Ontology term to look at http://www.sequenceontology.org/browser/current_release/term/SO:0002165 (it fits in the extensible value set for component: molecular consequence).

Using multiple Variants would be more consistent in some ways with how other Variants arrive - meaning that the system can use the data without special programming to handle tandem repeats.

view this post on Zulip Bret H (Dec 04 2021 at 15:53):

A third way to look at it: just provide one Variant which covers the whole region, and don't call out the repeat units with an element but have them calculated from sequence data or parsed from a provided HGVS string.

A fourth way is to have the repeat numbers provided as an implication (with the sequence region provided in a Variant profile).

Looks like it is an opportunity to rethink CNV and other structural variation.

BTW In the IM group we've been talking about multi-loci and alleles - that would map nicely for tandem repeats, nearly 1 to 1, using multiple variant profiles and haplotype. But not so much to a nested tandem repeat extension. However, there might be a compelling reason to use the nested tandem repeat extension.

view this post on Zulip Kevin Power (Dec 06 2021 at 14:33):

Bob Dolin said:

Bret H If the ref-allele field were changed to 0..*, and you include multiple ref-alleles, along with multiple locations and multiple copy numbers, does FHIR or JSON guarantee that you can determine which location and copy number goes with which ref-allele?

Just to clarify Bob, simple example:

component[exact-start-end] = {10, 18}
component[ref-allele] = CAGCAGCAG
component[copy-number] = 3

component[exact-start-end] = {19, 30}
component[ref-allele] = GCAGCAGCAGCA
component[copy-number] = 4

If you are asking if we can assume the first three are related, and the second three are related - today, I think the answer is 'no', other than to document guidance recommending that when you send these, you make sure they are in the correct sequence.

view this post on Zulip Bret H (Dec 06 2021 at 15:30):

That's not quite what I meant. Obviously with two variant profiles the position and sequence reference for each repeat is explicit. I was saying that with the nested structure, to have the same level of precision you would need to add the sequence reference and position in each 'repeat pair' of the proposed structure.

sequence reference being an accessioned reference.

view this post on Zulip Kevin Power (Dec 06 2021 at 15:32):

Understood @Bret H - I was trying to clarify if that is what @Bob Dolin was asking.

view this post on Zulip Kevin Power (Dec 06 2021 at 18:18):

@Bob Dolin - Did your thumbs up mean that what I described is what you were asking?

view this post on Zulip Bob Dolin (Dec 06 2021 at 18:19):

yes @Kevin Power , thanks.

view this post on Zulip Kevin Power (Dec 06 2021 at 19:00):

I like to try and fit things into current models, but I don't want to force it either. At minimum, I want to limit the number of extensions we add.

To that end, another option would be to introduce some sort of 'component association' extension at the Observation.component level that would allow implementers to indicate when components should be treated as a set. So the two sets of components I said above could have an extension something like:

component[exact-start-end] = {10, 18}
component[exact-start-end].extension[component-set] = "repeat-expansion-set1"
component[ref-allele] = CAGCAGCAG
component[ref-allele].extension[component-set] = "repeat-expansion-set1"
component[copy-number] = 3
component[copy-number].extension[component-set] = "repeat-expansion-set1"

component[exact-start-end] = {19, 30}
component[exact-start-end].extension[component-set] = "repeat-expansion-set2"
component[ref-allele] = GCAGCAGCAGCA
component[ref-allele].extension[component-set] = "repeat-expansion-set2"
component[copy-number] = 4
component[copy-number].extension[component-set] = "repeat-expansion-set2"

It is still an extension, but only a single (and therefore a simple) extension. I admit its intention isn't as clear as the proposed complex extension from @Rachel Kutner and would require some very clear guidance on when and when not to use this extension. If this is the only use case for it, perhaps we name this extension very narrowly, calling the extension "repeat-expansion" to be clear that it should only be used for this use case. Or, we keep it more open/generic and rely on implementers to use it correctly.
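A minimal Python sketch of how a receiver might regroup flat components using such a label. The tuple layout and the function name `group_by_component_set` are hypothetical; the thread only proposes the idea of a 'component-set' extension, and nothing here reflects an actual FHIR library or a published profile.

```python
from collections import defaultdict

# Hypothetical flat components: (slice name, value, component-set label).
# The values mirror the example in the message above.
components = [
    ("exact-start-end", (10, 18), "repeat-expansion-set1"),
    ("ref-allele", "CAGCAGCAG", "repeat-expansion-set1"),
    ("copy-number", 3, "repeat-expansion-set1"),
    ("exact-start-end", (19, 30), "repeat-expansion-set2"),
    ("ref-allele", "GCAGCAGCAGCA", "repeat-expansion-set2"),
    ("copy-number", 4, "repeat-expansion-set2"),
]


def group_by_component_set(components):
    """Group components by their set label, independent of array order."""
    groups = defaultdict(dict)
    for name, value, set_label in components:
        if name in groups[set_label]:
            # Two components of the same kind in one set would be ambiguous.
            raise ValueError(f"duplicate {name!r} in set {set_label!r}")
        groups[set_label][name] = value
    return dict(groups)


# Grouping no longer depends on the order in which components arrive.
groups = group_by_component_set(list(reversed(components)))
print(groups["repeat-expansion-set1"])
```

Because the association is carried by the label rather than by position in the array, the sender would not need ordering guidance for this purpose, although the label says nothing about 5' to 3' order between sets.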

view this post on Zulip Bret H (Dec 07 2021 at 15:50):

@Kevin Power what do the positions refer to? Assuming you have a reference sequence with an accession id in the Variant profile.

view this post on Zulip Kevin Power (Dec 07 2021 at 15:55):

Sorry for my very narrow example, just trying to keep it focused on the specific attributes needed for the problem at hand. I proposed the positions would be the start and optionally the end of that specific repeat expansion.

view this post on Zulip Bret H (Dec 07 2021 at 18:16):

within?

view this post on Zulip Bob Dolin (Dec 07 2021 at 18:20):

@Bret H what are your thoughts around position? Imagine for instance we want to represent CAG[20]CTAG[15]. Let's say we were to represent it as two variants: CAG[20], CTAG[15]. If we include positional information for each variant, then we know 5' vs. 3'. But what exactly do we mean by 'position'? Taking CAG[20], if position is the location of the 'C', and ref-allele is 'CAG', and copy-number is 20, this is not the same as CAG[20]. Rather, it's saying that we are replacing one CAG with 20 CAGs.

view this post on Zulip Kevin Power (Dec 07 2021 at 18:27):

DOH! Just realized I totally got that part of my example wrong. It should have been 'alt-allele', not 'ref-allele'; sorry for the confusion. So maybe something like this:

component[exact-start-end] = {10, 18}
component[exact-start-end].extension[component-set] = "repeat-expansion-set1"
component[ref-allele] = CAG
component[ref-allele].extension[component-set] = "repeat-expansion-set1"
component[alt-allele] = CAGCAGCAG
component[alt-allele].extension[component-set] = "repeat-expansion-set1"
component[copy-number] = 3
component[copy-number].extension[component-set] = "repeat-expansion-set1"

component[exact-start-end] = {19, 30}
component[exact-start-end].extension[component-set] = "repeat-expansion-set2"
component[ref-allele] = GCA
component[ref-allele].extension[component-set] = "repeat-expansion-set2"
component[alt-allele] = GCAGCAGCAGCA
component[alt-allele].extension[component-set] = "repeat-expansion-set2"
component[copy-number] = 4
component[copy-number].extension[component-set] = "repeat-expansion-set2"

view this post on Zulip Bret H (Dec 07 2021 at 19:46):

Bob Dolin said:

Bret H what are your thoughts around position? Imagine for instance we want to represent CAG[20]CTAG[15]. Let's say we were to represent it as two variants: CAG[20], CTAG[15]. If we include positional information for each variant, then we know 5' vs. 3'. But what exactly do we mean by 'position'? Taking CAG[20], if position is the location of the 'C', and ref-allele is 'CAG', and copy-number is 20, this is not the same as CAG[20]. Rather, it's saying that we are replacing one CAG with 20 CAGs.

Not sure what you mean. CAG[20] as start with position 20 or did you want to say CAGCAGCAG..., that is CAG with 20 repeats?

view this post on Zulip Bob Dolin (Dec 07 2021 at 20:00):

sorry @Bret H , a poorly worded question that I'm still thinking through. But take for example CAG[20]CTAG[15]. Let's say that we have a reference sequence, CAG[3], starting at position 4, followed by CTAG[5], starting at position 13:

view this post on Zulip Bob Dolin (Dec 07 2021 at 20:01):

...CAGCAGCAGCTAGCTAGCTAGCTAG...

view this post on Zulip Bob Dolin (Dec 07 2021 at 20:02):

And let's say we want to treat CAG[20]CTAG[15] as two variants. The first variant, CAG[20], might look like this:
component start-end=4..12
ref-allele=CAG
copy-number=20
It's a bit different from CNVs in that what we're saying here is to replace 4..12 with 20 copies of the reference allele, whereas for CNVs, we're saying that there are 20 copies of the bases in 4..12.
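The two readings contrasted above can be sketched in Python. This is only an illustration under stated assumptions: coordinates are treated as 1-based and inclusive, the `TTT` flank is invented so that the CAG run starts at position 4, the reference string is copied as displayed in the message, and the CNV reading is simplified to a tandem duplication of the region. The function names are hypothetical.

```python
def replace_with_repeats(reference, start, end, unit, copies):
    """Replacement reading: swap reference[start..end] for `copies` repeat units."""
    return reference[:start - 1] + unit * copies + reference[end:]


def copy_region(reference, start, end, copies):
    """CNV-style reading: the bases in start..end occur `copies` times in tandem."""
    region = reference[start - 1:end]
    return reference[:start - 1] + region * copies + reference[end:]


FLANK = "TTT"  # hypothetical 5' flank so that the CAG run starts at position 4
reference = FLANK + "CAGCAGCAG" + "CTAGCTAGCTAGCTAG"

replaced = replace_with_repeats(reference, 4, 12, "CAG", 20)
copied = copy_region(reference, 4, 12, 20)

# The replacement reading yields 20 CAG units; the CNV-style reading yields
# 20 copies of the 9-base region, i.e. 60 CAG units.
print(len(replaced) - len(reference), len(copied) - len(reference))
```

The same three fields (start-end, ref-allele, copy-number) therefore describe different sequences depending on which reading the receiver applies, which is the ambiguity the message points out.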

view this post on Zulip Kevin Power (Dec 07 2021 at 21:25):

Maybe a basic question here - if we are thinking like a biologist/bioinformatician/clinician, is it Ok to think about something like CAG[20]CTAG[15] as two variants?

view this post on Zulip Bob Dolin (Dec 07 2021 at 22:09):

I'm not sure @Kevin Power , I was just thinking it would be easier - similar to where there are, say, two SNVs in a gene (that could potentially be grouped into a haplotype)

view this post on Zulip Kevin Power (Dec 07 2021 at 22:19):

@Bob Dolin - Yea, I do think our guidance for how to do this would be easier if we are OK with expressing it as >1 Variant observation with some way to tie them together (I almost said "group" them, but figured that has too much baggage right now :wink: )

view this post on Zulip Kevin Power (Dec 07 2021 at 22:26):

And, there is nothing saying that a biologically approved definition must exactly match our computational representation anyway. The closer we get to matching the better, but I would argue that even if every biologist in the world would agree that CAG[20]CTAG[15] is a single 'variant' that doesn't mean it has to fit into one of our Variant observations.

Before we go much further though, @Rachel Kutner -- what we are discussing here is certainly different than what you proposed. I think with the right guidance it could work, but do you agree? Perhaps we would need to work up (at least) a specific example for review.

view this post on Zulip Jamie Jones (Dec 07 2021 at 22:38):

It's too bad we can't just use the bracket notation as an alt allele...

view this post on Zulip Kevin Power (Dec 07 2021 at 22:41):

We certainly could, it is just a question of how computable and structured that is.

view this post on Zulip Bret H (Dec 07 2021 at 22:43):

I would say that Haplotype can tie together but sequence position would be needed to infer 5' to 3' relationship (with current constructs). Haplotype has a coded value allowing one to communicate the haplotype if there is a concept of the specific composed repeat one would like to use.

Looking at the VRS as an example,
https://vrs.ga4gh.org/en/stable/terms_and_model.html?highlight=repeated#repeatedsequenceexpression
the model uses the starting base of where the repeat is in a reference.

At any rate, positional information is communicated relative to a reference sequence not the alt sequence. @Kevin Power so there's no conflict between variants and numeric position.

If a complete patient composite sequence is needed Sequence could be used. OR send the whole repeat (with reference sequence and location) and perhaps communicate repeat subunit counts with something like the nested structure.

The goal of sending the repeat counts is to keep the receiving system from having to do a calculation. It's a display feature. But if sent in a structured way it'll get used.

Also, be good to have a consensus on composing variants. Multiple Variant profiles with Haplotype pulling together OR something new.

VRS is adding something new, but it still uses their model of Location (like the position elements in Variant profile) and sequence (actual nucleotides, like our alt element).

Tricky part is deciding if receivers are going to infer the order or not. Haplotype can put them together, but the position relative to a reference sequence (hopefully the same one) would be needed to infer the order.

Using the nested form, like what VRS is contemplating, gives you a mechanism for declaring an order (either as an ordered list or with a numeric order number).

view this post on Zulip Kevin Power (Dec 07 2021 at 23:01):

You know, I didn't like it at first, but just simply using the bracket notation in the ref and alt allele attributes is starting to grow on me. That is a pretty simple grammar.

view this post on Zulip Bret H (Dec 07 2021 at 23:08):

Would it be able to alleviate the need for a count attribute? yes
Would one still be able to provide position relative to a reference? yes
Would it communicate order? yes
Would it reduce the number of Variant profiles, which seems to be a goal? yes
Does it mean our data type needs to change from string? no...maybe...
Would we rename the CNV count field to make sure it is understood as being for largish structural copy communication? yes

I'm not a fan of overloading the string data type, but....probably worth considering more closely. It's possible that the string could be malformed or not be parsed correctly. If the count data is vital to clinical interpretation or communication for computable CDS, we need to consider how to protect users.
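A minimal Python sketch of what strict parsing of the bracket notation could look like, so that malformed strings are rejected rather than silently misread. The grammar assumed here (one or more `UNIT[COUNT]` pairs, units restricted to A, C, G, T) is a guess at the "simple grammar" mentioned above, not an agreed syntax, and the function names are hypothetical.

```python
import re

# Assumed grammar: one or more UNIT[COUNT] pairs, e.g. CAG[20]CTAG[15].
_UNIT = re.compile(r"([ACGT]+)\[([0-9]+)\]")


def parse_bracket_repeats(text):
    """Return [(unit, count), ...] in 5' to 3' order; raise ValueError if malformed."""
    units = []
    pos = 0
    while pos < len(text):
        match = _UNIT.match(text, pos)
        if match is None:
            raise ValueError(f"malformed repeat expression at offset {pos}: {text!r}")
        units.append((match.group(1), int(match.group(2))))
        pos = match.end()
    if not units:
        raise ValueError("empty repeat expression")
    return units


def expand(units):
    """Build the plain nucleotide string described by the parsed units."""
    return "".join(unit * count for unit, count in units)


parsed = parse_bracket_repeats("CAG[20]CTAG[15]")
print(parsed)
```

Under this assumed grammar the order of the list carries the order of the repeat blocks, and a receiver that cannot parse the string fails loudly instead of guessing at counts.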

view this post on Zulip Bret H (Dec 07 2021 at 23:08):

But if you're doing that, why not only send HGVS and ask receivers to rely on parsing HGVS? (we'd be inventing a bit of another syntactical format)

view this post on Zulip Bret H (Dec 07 2021 at 23:11):

HGVS syntax has a number of ways to represent composition, including where the components are not continuous.

view this post on Zulip Bret H (Dec 08 2021 at 16:47):

FYI from the VRS GA4GH group: https://github.com/ga4gh/vrs/issues/363#issuecomment-988334323

They use the location in the reference and do not use the repeats in the alt sequence to number location. The location is encompassing the entire affected space in the reference. It is implied to be a Substitution for the reference sequence data. (so if the reference had AAACTGCTGCTGAAA, 3, and the alt had CTGCTGCTGCTGCTG, 5, then the correct region start and stop for position in the reference sequence would be start at 4 and stop at 12 - which would give the correct 5 repeats only in the alt sequence if constructed).
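The worked example in the message above can be checked with a short Python sketch. It assumes the positions quoted in the message are 1-based and inclusive; the function name `substitute_region` is hypothetical and this is not the VRS API.

```python
def substitute_region(reference, start, stop, alt):
    """Replace reference[start..stop] (1-based, inclusive) with the alt sequence."""
    return reference[:start - 1] + alt + reference[stop:]


reference = "AAACTGCTGCTGAAA"   # three CTG repeats flanked by AAA
alt = "CTG" * 5                 # five CTG repeats

# The location spans the whole affected repeat region in the reference.
result = substitute_region(reference, 4, 12, alt)
print(result)
```

The location is expressed purely against the reference; the alt sequence is never used to number positions.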

view this post on Zulip Kevin Power (Dec 08 2021 at 17:19):

@Bret H - With all this input, any chance you could mock up a couple of different examples?

view this post on Zulip Bret H (Dec 08 2021 at 17:30):

Yes. You thinking: like EPIC's example with
version 1 using a couple of Variant profiles and Haplotype
version 2 with components set out like you describe
version 3 EPIC proposal-ish (promise only slight modification)
version 4 with brackets?

should we put in a version with full HGVS or would that be too much, seeing as that would be one Variant profile - no specific repeat information added in separate fields. i.e. obvious and not needed.

view this post on Zulip Kevin Power (Dec 08 2021 at 17:45):

Sounds about right to me @Bret H

Two things:
I think we will always have the full HGVS, but it should always be seen as something for visual display and (ideally) we would never require someone to parse it.
And yea, let's do a version with brackets - that WOULD require parsing, but I think we can keep it really simple and easy to parse. Still not sure if it is a good idea or not, but worth keeping on the list for now.

view this post on Zulip Bret H (Dec 10 2021 at 18:14):

FYI I am planning to have the mentioned examples up mid-next week.

view this post on Zulip Rachel Kutner (Dec 16 2021 at 15:31):

Sorry all, I've been busy and haven't had a chance to check back on this thread.

I think my main concern is the assumption that the clinical results will have exact molecular coordinates in order to indicate the order in which the tandem repeats occur. I haven't seen many examples of exact genomic coordinates in the lab results I've looked at, and this isn't necessarily considered "clinically relevant" information for these results, so a spec for sending these results requiring that information might not be practical.

We based our structure off of existing clinical reports and the HGVS definition for repeat expansion, but we rarely see the actual HGVS name represented in clinical reports (at least the several we have) and it isn't that common in research either.

Per the string solution - this is my main concern as well (from Bret):
"I'm not a fan of overloading the string data type, but....probably worth considering more closely. It's possible that the string could be malformed or not be parsed correctly. if the count data is vital to clinical interpretation or communication for computable CDS, need to consider how to protect users."

I'll bring this back to my coworker who is working on the proposal for Epic and get back on this.

@Jamie Jones - are we set up to discuss this during next Monday's meeting?

view this post on Zulip Jamie Jones (Dec 16 2021 at 15:52):

@Rachel Kutner Yep it is on the agenda--along with a 5-10 minute update on genomics operations from @Bob Dolin

view this post on Zulip Bret H (Dec 16 2021 at 19:46):

@Rachel Kutner it's been a long-standing goal to improve the quality of genetic data labs deliver. Genetic data, in this case the positional information, serves as a very valuable source of information for the recipient organization. It has uses in clinical decision support, identifying all instances of data on a patient or population basis, business intelligence (e.g. the number of tests which are reporting on overlapping positions - needed for a Chief Data Officer to help guide elimination of waste), population health assessment, and research, and the positional information is a valuable asset in its own right. Germline data is hyper-valuable as it can live as long as the patient does. Your clients would be well served to have the capacity to receive it. If the standard is built to accommodate positional information then they can ask the Labs for it and point to a specification. If we don't accommodate positional data then cost and waste can be higher....that's my ten cents.

view this post on Zulip Bret H (Dec 16 2021 at 19:51):

the labs don't necessarily have to send it, but building the capability into the spec I believe will pay off huge returns!

view this post on Zulip Bret H (Dec 20 2021 at 16:28):

example from FMR1 with 10 GCC repeats and 1 AGG repeat
reference has GCC[9]AGG[1] and the variation has GCC[10]AGG[1]

component[molecular-consequence] = SEQONT#SO:0002165

component[coding-hgvs] = "NM_002024.5:c.-128_-69GGC[10]GGA[1]"
component[genomic-ref-seq] = NC_000023.11

component[exact-start-end] = {147912051, 147912078}
component[exact-start-end] .extension[component-set] = 1 "repeat-expansion-set1"
component[ref-allele] = GGC GGC GGC GGC GGC GGC GGC GGC GGC
component[ref-allele] .extension[component-set] = 1 "repeat-expansion-set1"
component[alt-allele] = GGC GGC GGC GGC GGC GGC GGC GGC GGC GCC
component[alt-allele] .extension[component-set] = 1 "repeat-expansion-set1"
component[copy-number] = 10
component[copy-number] .extension[component-set] = 1 "repeat-expansion-set1"

component[exact-start-end] = {147912079, 147912081}
component[exact-start-end] .extension[component-set] = 2 "repeat-expansion-set2"
component[ref-allele] = AGG
component[ref-allele] .extension[component-set] = 2 "repeat-expansion-set2"
component[alt-allele] = AGG
component[alt-allele] .extension[component-set] = 2 "repeat-expansion-set2"
component[copy-number] = 1
component[copy-number] .extension[component-set] = 2 "repeat-expansion-set2"

view this post on Zulip Bret H (Dec 20 2021 at 16:57):

The above is not the EPIC-like version with a new nested component, btw; it's a little more conservative.

view this post on Zulip Jamie Jones (Dec 20 2021 at 17:13):

summary from today's call - suggestions came in against overloading component[copy-number] for this, and the group wants to pursue testing the viability of using HGVS in the short term via guidance on https://varnomen.hgvs.org/recommendations/DNA/variant/repeated/.

view this post on Zulip Kevin Power (Dec 20 2021 at 18:42):

Jamie Jones said:

summary from today's call - suggestions came in against overloading component[copy-number] for this, and the group wants to pursue testing the viability of using HGVS in the short term via guidance on https://varnomen.hgvs.org/recommendations/DNA/variant/repeated/.

Any chance someone might have a code to represent this slice, or do we just create another TBD code for it?

Who was going to be testing the viability of using HGVS? (sorry, I was multi-tasking during the call, perhaps I missed it)

view this post on Zulip John Li (Dec 20 2021 at 18:46):

@Kevin Power
I don't think there is a universal code to represent repeat expansion. We are using string instead of code for that in Epic.

view this post on Zulip Bret H (Dec 20 2021 at 20:14):

Use a Sequence Ontology term. The ontology is a better set of concepts and essentially this will be hard-coded by systems, as long as it is consistent.

view this post on Zulip Kevin Power (Dec 20 2021 at 22:57):

Is there a good one @Bret H ?

view this post on Zulip Bret H (Dec 20 2021 at 23:46):

Yep. The same as I would suggest for molecular consequence - but open for discussion (here's the concept - http://www.sequenceontology.org/browser/current_svn/term/SO:0002165 ). BTW we don't talk enough about the power that molecular consequence and DNA change type can give computationally (or with invariants) ; ^ ) They can be a means to clue the recipient system as to how it wants to handle the data without the system needing to calculate that the information is a tandem repeat (for example).

view this post on Zulip Bret H (Dec 20 2021 at 23:48):

but to be fair the standard is still in the early adoption phase.

view this post on Zulip Kevin Power (Dec 21 2021 at 00:03):

Hmm, does anyone have a concern with using a code for the 'answer' in one context, and the 'question' in another ? I suppose that should be OK, but seems a little weird to me.

view this post on Zulip Bob Dolin (Dec 21 2021 at 00:05):

@Kevin Power Can you give an example?

view this post on Zulip Bret H (Dec 21 2021 at 00:06):

Could run it past the curators of SO. But I get ya, it's been a nice rule of thumb in observations to keep LOINC for the 'question' and SNOMEDCT for the 'answer.' But in our case one could argue that the context of the use provides that. E.g. you would use a different query and not just be relying on the code and code system. The answer would be in answer slot and the question level in a question slot.

view this post on Zulip Bret H (Dec 21 2021 at 00:07):

query for component.code versus component.value+component.code
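A rough Python sketch of the two query shapes, using plain dictionaries instead of real FHIR resources. The `SEQONT` system label follows the shorthand used earlier in the thread; the `TBD` slice code and the string value are placeholders, and the helper names are hypothetical.

```python
# Placeholder codings as (system, code) pairs.
TANDEM_REPEAT = ("SEQONT", "SO:0002165")
MOLECULAR_CONSEQUENCE = ("TBD", "molecular-consequence")  # hypothetical slice code

components = [
    # SO code used as the 'answer' of the molecular-consequence component.
    {"code": MOLECULAR_CONSEQUENCE, "value": TANDEM_REPEAT},
    # Same SO code used as the 'question' of a repeat-expansion slice.
    {"code": TANDEM_REPEAT, "value": "GGC[10]GGA[1]"},
]


def with_code(components, code):
    """Query on component.code only: finds where the code is the 'question'."""
    return [c for c in components if c["code"] == code]


def with_code_and_value(components, code, value):
    """Query on component.code plus component.value: the code is the 'answer'."""
    return [c for c in components if c["code"] == code and c["value"] == value]


as_question = with_code(components, TANDEM_REPEAT)
as_answer = with_code_and_value(components, MOLECULAR_CONSEQUENCE, TANDEM_REPEAT)
print(len(as_question), len(as_answer))
```

Each query matches exactly one component here, which illustrates the point that the slot being queried, and not the code alone, distinguishes the two uses.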

view this post on Zulip Kevin Power (Dec 21 2021 at 00:08):

@Bob Dolin - I don't have an example before this one, so that is partly why I wanted to ask.

view this post on Zulip Kevin Power (Dec 21 2021 at 00:10):

As long as we document it in the IG, it would work. One risk would be that if someone did a query wrong, they might get a result (as opposed to finding nothing) that might lead to a misinterpretation of what they received.

view this post on Zulip Bret H (Dec 21 2021 at 00:11):

not sure you can do the query wrong, unless you're really super-confused. but guidance can help

view this post on Zulip Kevin Power (Dec 21 2021 at 00:13):

Maybe I am worried someone will be playing Jeopardy and phrase the answer in the form of a question? :smile: :smile:

view this post on Zulip Kevin Power (Dec 21 2021 at 00:14):

But either way, appropriate guidance should clear up possible confusion.

view this post on Zulip Bret H (Dec 21 2021 at 00:16):

good question. Hope others will comment soon : ^ )


Last updated: Apr 12 2022 at 19:14 UTC