FHIR Chat · MolecularSequence - rethink 'referenceSeq' · genomics

Stream: genomics

Topic: MolecularSequence - rethink 'referenceSeq'


view this post on Zulip Kevin Power (Feb 22 2022 at 17:05):

I feel like there are too many options to define a referenceSeq in MolecularSequence:

image.png

I can propose a few things, but anyone else agree/disagree?

view this post on Zulip Bob Freimuth (Feb 22 2022 at 20:10):

I agree there is potential for confusion there. I think the draft we assembled based on the IM work could be informative, but I'm biased. :wink:

view this post on Zulip Kevin Power (Feb 22 2022 at 20:21):

Yea, I was trying to think about how to best apply those principles myself @Bob Freimuth

I think my 'quick wins' for simplification would be:

  • Remove 'referenceSeqPointer' (doesn't seem like we would do this anytime soon)
  • Remove 'referenceSeqString' (probably not needed in many settings we want to use this)
  • Remove 'chromosome' (it seems this is not a typical way to express a ref seq)
  • Remove 'windowStart' and 'windowEnd' (unless @Bob Milius needs them for HLA use cases)

Welcome any/all feedback

view this post on Zulip Bob Milius (Feb 22 2022 at 20:29):

Some version of windowStart and windowEnd is important. If a URI is provided to point to the reference sequence (e.g., full gene), but the observation is based on a part of the sequence (e.g., part of intron 1, exon 1, part of intron 2), we need to know how much of the ref sequence was actually sequenced. This window into the reference sequence provides that.
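
A minimal sketch of the window idea described above, written as a Python dict rather than a real FHIR instance. The element names (referenceSeqId, windowStart, windowEnd, coordinateSystem) follow the R4 MolecularSequence resource; the identifier text and the coordinates are made up for illustration, and the inclusive/exclusive handling reflects my reading of the coordinateSystem definition rather than anything stated in this thread.

```python
# Hypothetical example: the reference is a full gene named by identifier,
# but only part of it was actually sequenced. All values are placeholders.
molecular_sequence = {
    "resourceType": "MolecularSequence",
    "coordinateSystem": 0,  # assumed: 0-based, start inclusive, end exclusive
    "referenceSeq": {
        "referenceSeqId": {"text": "example-full-gene-reference"},
        "windowStart": 1200,
        "windowEnd": 1950,
    },
}


def window_length(resource: dict) -> int:
    """Return how many reference positions fall inside the declared window."""
    ref = resource["referenceSeq"]
    start, end = ref["windowStart"], ref["windowEnd"]
    if resource["coordinateSystem"] == 0:
        return end - start      # half-open interval [start, end)
    return end - start + 1      # 1-based, both ends inclusive


print(window_length(molecular_sequence))  # 750
```

A receiver could use a helper like this to tell how much of the referenced gene the result actually covers, which is the question the window is meant to answer.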

view this post on Zulip Kevin Power (Feb 22 2022 at 20:56):

That is fair, we can keep those.

view this post on Zulip Bret H (Feb 22 2022 at 20:57):

umm...not sure why you'd remove the window information, kinda ambiguous without it. Chromosome and Build are informative but more useful to a human for readability. Strand and Orientation are to help in specifying exactly what is being referred to in the reference.

RefSeqString and ReferenceSeqPointer should not be removed if the only argument is that one does not see them being used soon. It is certainly a viable option that one might have sequenced a parent person and wish to relate a child's sequence to that parent sequence.

view this post on Zulip Kevin Power (Feb 22 2022 at 20:58):

If the window info is important, might be a consideration for IM model, eh @Bob Freimuth ?

view this post on Zulip Bret H (Feb 22 2022 at 20:58):

A resource should have a larger scope than a profile. If you want to constrain out items for a specific use case, a profile might be an approach.
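
To make the "constrain it in a profile" suggestion concrete, here is a rough Python sketch. It is not a real FHIR profile or validator: the set of prohibited elements simply mirrors the removal candidates listed earlier in the thread, and the function name and example instance are hypothetical.

```python
# Elements a hypothetical use-case profile might constrain out of referenceSeq
# while the base resource keeps them available for other use cases.
PROHIBITED_REFERENCE_ELEMENTS = {
    "referenceSeqPointer",
    "referenceSeqString",
    "chromosome",
}


def constrained_out_elements(resource: dict) -> list:
    """List referenceSeq elements present in the instance but prohibited by the profile."""
    reference_seq = resource.get("referenceSeq", {})
    return sorted(PROHIBITED_REFERENCE_ELEMENTS.intersection(reference_seq))


instance = {
    "resourceType": "MolecularSequence",
    "coordinateSystem": 0,
    "referenceSeq": {
        "referenceSeqId": {"text": "example-reference"},   # placeholder value
        "chromosome": {"text": "example-chromosome"},      # placeholder value
        "windowStart": 100,
        "windowEnd": 200,
    },
}

print(constrained_out_elements(instance))  # ['chromosome']
```

The base resource stays broad; only instances claiming conformance to the narrower profile would be checked this way.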

view this post on Zulip Bret H (Feb 22 2022 at 21:00):

Kevin Power said:

If the window info is important, might be a consideration for IM model, eh Bob Freimuth ?

be useful to readers to have the IM model referenced here for viewing. even if draft.

view this post on Zulip Kevin Power (Feb 22 2022 at 21:02):

Bret H said:

RefSeqString and ReferenceSeqPointer should not be removed if the only argument is that one does not see them being used soon. It is certainly a viable option that one might have sequenced a parent person and wish to relate a child's sequence to that parent sequence.

I am not sure MolSeq is ready to act as a 'reference sequence' at this point, but perhaps others disagree? That has been a problem with this resource from the beginning: it was just a large hodgepodge of whatever someone thought might be helpful, and we were left with something that was very difficult to understand and use.

view this post on Zulip Bret H (Feb 22 2022 at 21:03):

What's so hard to understand about referencing a version of MolecularSequence or having a standalone sequence similar to how reference sequences appear today in DB?

view this post on Zulip Kevin Power (Feb 22 2022 at 21:06):

Conceptually, not hard at all. Implementers needing it and willing to use it -- nothing yet. At least none that I know of.

view this post on Zulip Bret H (Feb 22 2022 at 21:06):

How would you represent a link to a comparison where a child's genome is compared to a parent's? Or a tumor sequence with the patient's germline? I understood that MolecularSequence was headed towards definitional?

view this post on Zulip Kevin Power (Feb 22 2022 at 21:09):

My opinion -- If we have those features, we need them with documented guidance and with real examples. So if someone is willing to help out there, I would support keeping them. And/Or, if I am the only one wanting to remove them for the sake of an easier to understand and use resource, I can easily be out voted :smile:

view this post on Zulip Kevin Power (Feb 22 2022 at 21:20):

Bret H said:

Kevin Power said:

If the window info is important, might be a consideration for IM model, eh Bob Freimuth ?

be useful to readers to have the IM model referenced here for viewing. even if draft.

Current IM model for MolSeq represented in ClinFHIR here:
http://clinfhir.com/logicalModeller.html#fdpbo

If you want to see more, @Bob Freimuth cleaned up a version that is much more of IM for FHIR here:
https://github.com/rrfreimuth/HL7-CG-Models/blob/main/CG_Resources_published-210923.zip

view this post on Zulip Bret H (Feb 22 2022 at 21:22):

Ok. You want a tumor comparison with germline example using MolecularSequence? Or do you mean you'd rather not have MolecularSequence at all?

I don't see a good reason to remove those fields for ReferenceSeqPointer and ReferenceSeqString. So far, the argument is that it might be confusing (but not on a conceptual level). It is a harder ask to request evidence of implementers needing and willing to use it. Hard to have the chicken-n-egg argument, again. As a WG we took a stand early on that having the ability to discretely describe positional information was needed - whether or not it was being used in HL7v2 messages often at the time. The IM also reflects this.
The two fields are handy tools for:
How would you represent a link to a comparison where a child's genome is compared to a parent's? Or a tumor sequence with the patient's germline?

Can we come up with an easy way to represent the above without them? Or, because the PDF representation does not today typically portray data in that fashion, would we advocate eliminating the options? I admit that they are pretty sophisticated mechanisms that provide an unprecedented level of precision. The idea that one's own germline is used as a reference to compare to a tumor is... But what do you use MolecularSequence for if you remove them? To me they are the defining elements of the resource.
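
A rough sketch of the tumor-versus-germline case using referenceSeqPointer, again as Python dicts rather than validated FHIR. The element names (observedSeq, referenceSeqPointer, windowStart, windowEnd) follow R4 MolecularSequence; the ids, the toy sequences, and the comparison helper are hypothetical. The helper assumes the tumor's observedSeq covers exactly the window and differs only by substitutions, which real data would not guarantee.

```python
germline = {
    "resourceType": "MolecularSequence",
    "id": "germline-example",
    "coordinateSystem": 0,
    "observedSeq": "ACGTACGTAC",   # toy sequence
}

tumor = {
    "resourceType": "MolecularSequence",
    "id": "tumor-example",
    "coordinateSystem": 0,
    "observedSeq": "ACGTTCGTAC",   # toy sequence
    "referenceSeq": {
        # The patient's own germline sequence acts as the reference.
        "referenceSeqPointer": {"reference": "MolecularSequence/germline-example"},
        "windowStart": 0,
        "windowEnd": 10,
    },
}


def substitutions(sample: dict, store: dict) -> list:
    """Compare a sample to the sequence its referenceSeqPointer resolves to."""
    ref_info = sample["referenceSeq"]
    reference = store[ref_info["referenceSeqPointer"]["reference"]]
    start, end = ref_info["windowStart"], ref_info["windowEnd"]
    ref_bases = reference["observedSeq"][start:end]
    obs_bases = sample["observedSeq"]
    return [
        (start + offset, ref_base, obs_base)
        for offset, (ref_base, obs_base) in enumerate(zip(ref_bases, obs_bases))
        if ref_base != obs_base
    ]


# A dict stands in for whatever would resolve references on a real server.
store = {"MolecularSequence/germline-example": germline}
print(substitutions(tumor, store))  # [(4, 'A', 'T')]
```

The same shape would apply to a child sequence pointing at a parent's sequence; only the pointer target changes.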

view this post on Zulip Bret H (Feb 22 2022 at 21:24):

(for a reference on HRD, where germline and tumor are compared to one another: https://www.ambrygen.com/providers/genetic-testing/16/oncology/tumornext-hrd)

view this post on Zulip Kevin Power (Feb 22 2022 at 21:26):

Bret H said:

Ok. You want a tumor comparison with germline example using MolecularSequence? Or do you mean you'd rather not have MolecularSequence at all?

I am saying I am a huge proponent of having guidance/examples for everything we have, down to the attribute level. (I will add that I feel that way nowadays anyway; I wasn't always this way.) I am perfectly fine keeping MolSeq.

view this post on Zulip Kevin Power (Feb 22 2022 at 21:32):

In short, we could build a lot of things to support a lot of use cases - but if we don't have an active need for them, and someone to provide good guidance + examples, and someone to help validate them, we end up with ill-defined and therefore poorly used artifacts - which is often worse than no artifacts at all.

view this post on Zulip Bret H (Feb 22 2022 at 21:59):

So, what's molecular sequence for then? originally it was the thing to use for genomics where one wanted to be precise, and computable, with the information conveyed.

Another use case. sending viral or bacterial isolated genomes or sequences.

view this post on Zulip Bret H (Feb 22 2022 at 21:59):

but that does not need a reference necessarily...

view this post on Zulip Kevin Power (Feb 22 2022 at 22:00):

I am pretty sure we have found plenty of ways to use it without those attributes :smile:

view this post on Zulip Bret H (Feb 22 2022 at 22:01):

name a few of the uses of MolecularSequence that cannot be replaced by using Observations? I think it's a useful resource.

view this post on Zulip Kevin Power (Feb 22 2022 at 22:02):

OK, fair, maybe not, 'plenty', but Bob M has several ways to use it for HLA

view this post on Zulip Bret H (Feb 22 2022 at 22:02):

And he's done versions of similar data using Haplotype and Genotype as well.

view this post on Zulip Bob Dolin (Feb 22 2022 at 22:03):

My understanding is the primary use case driving MolecularSequence is the case where you really want to send the observed sequence itself.

view this post on Zulip Bret H (Feb 22 2022 at 22:03):

yep. The referenceSeq in that use case is a useful attribute to reduce the space of bioinformatic calculations - allowing some level of specificity in the message. It's not strictly needed if you're willing to BLAST the sequence sent.

view this post on Zulip Bret H (Feb 22 2022 at 22:05):

oh, and helps to know the species too.

view this post on Zulip Kevin Power (Feb 22 2022 at 22:05):

OH sorry, I (like Bob D) thought you jumped to that question about MolSeq (yes or no).

If you are asking how HLA uses 'referenceSeq' - I would have to defer to Bob M. Hence why I wanted to start the conversation here.

view this post on Zulip Bret H (Feb 22 2022 at 22:06):

Take a look at the RefSeq format. It's mainly about versioning and where the sequence was observed.

view this post on Zulip Bret H (Feb 22 2022 at 22:07):

(but you could send an entire sequence in an observation too.....you could, not saying should).
the referenceSeq portion of the profile is super helpful bioinformatically.

view this post on Zulip Bret H (Feb 22 2022 at 22:09):

klunky, but with a reference to a named HLA allele in Haplotype with a variant profile indicating a region of the genome, one could convey some precision without referenceSeq. again, could not should.

view this post on Zulip Kevin Power (Feb 22 2022 at 22:20):

Since we voted today to keep the changes simple for R5 (given our very short timeline), I was proposing simple changes - proposing we remove attributes that I didn't think we used. I sure wasn't proposing we drop the entire 'referenceSeq' structure, as of course it is a very important concept that we need.

I don't know that we have time to completely "rethink" it (sorry, I shouldn't have used that in the title of the topic - I should have called it "simplify 'referenceSeq'") :smile:

view this post on Zulip Bob Freimuth (Feb 22 2022 at 22:37):

Bret H said:

Kevin Power said:

If the window info is important, might be a consideration for IM model, eh Bob Freimuth ?

be useful to readers to have the IM model referenced here for viewing. even if draft.

The last update to this part of the model can be found here (slide 3):
https://drive.google.com/file/d/16YmA6HB2mwhQn7Fmm7q1vv_OEyfXN_Eq/view?usp=sharing

The "window" bit discussed above sounds like what we've called ExtractedSequence, which is essentially a subsequence operation. I've tried to keep these classes focused very tightly on the bare minimum of attributes that are necessary to represent a sequence, and encapsulate them as needed to achieve what I hope is unambiguous semantics. Specifying a "window" outside of these classes could result in usage of those attributes in different ways, such as defining a region on a sequence that might be important for something; that would be an interval or location in the IM, which is entirely different.

view this post on Zulip Bob Freimuth (Feb 22 2022 at 23:13):

Bob Milius said:

Some version of windowStart and windowEnd is important. If a URI is provided to point to the reference sequence (e.g., full gene), but the observation is based on a part of the sequence (e.g., part of intron 1, exon 1, part of intron 2), we need to know how much of the ref sequence was actually sequenced. This window into the reference sequence provides that.

Please note that "these regions (windows) on this sequence" is a different statement than "the subsequence defined by this region on this sequence". The first is talking about regions as a subject, the latter references a sequence object. ExtractedSequence is for the latter. Interval/location can be used for the former, but if we're expressing the concept of interrogated region then we may develop a specialized attribute to capture that (so the semantics are unambiguous).
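
One way to picture the distinction drawn above is as two separate types. This is only an illustrative Python sketch: ExtractedSequence and interval/location are named in the thread, but the class shapes, field names, and 0-based half-open coordinates here are my assumptions, not the IM's actual definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sequence:
    """A literal sequence of residues."""
    residues: str


@dataclass(frozen=True)
class Interval:
    """A region ON a sequence: a statement about coordinates, not a sequence."""
    on: Sequence
    start: int  # assumed 0-based, inclusive
    end: int    # assumed exclusive


@dataclass(frozen=True)
class ExtractedSequence:
    """The subsequence DEFINED BY a region: itself usable as a sequence."""
    source: Sequence
    start: int
    end: int

    @property
    def residues(self) -> str:
        return self.source.residues[self.start:self.end]


gene = Sequence("ATGGCCATTGTAATGGGCCGC")   # toy sequence
region = Interval(gene, 3, 9)              # "this region on this sequence"
extracted = ExtractedSequence(gene, 3, 9)  # "the subsequence defined by this region"
print(extracted.residues)                  # GCCATT
```

Both objects carry the same coordinates, but only the extracted one yields residues; the interval merely describes where something is, which is why an "interrogated region" would need its own clearly named attribute.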

view this post on Zulip Bret H (Feb 23 2022 at 00:18):

Would interrogated region be a region in a reference sequence? :thinking:

view this post on Zulip Kevin Power (Feb 23 2022 at 14:38):

The IM work doesn't really touch on 'observational' things like this.

view this post on Zulip Bret H (Feb 23 2022 at 16:11):

Regarding:
Kevin Power said:

The IM work doesn't really touch on 'observational' things like this.

Bob F says:

Bob Freimuth said:

Bob Milius said:

Some version of windowStart and windowEnd is important. If a URI is provided to point to the reference sequence (e.g., full gene), but the observation is based on a part of the sequence (e.g., part of intron 1, exon 1, part of intron 2), we need to know how much of the ref sequence was actually sequenced. This window into the reference sequence provides that.

Please note that "these regions (windows) on this sequence" is a different statement than "the subsequence defined by this region on this sequence". The first is talking about regions as a subject, the latter references a sequence object. ExtractedSequence is for the latter. Interval/location can be used for the former, but if we're expressing the concept of interrogated region then we may develop a specialized attribute to capture that (so the semantics are unambiguous).

view this post on Zulip Bret H (Feb 23 2022 at 16:12):

specifically "Please note that "these regions (windows) on this sequence" is a different statement than "the subsequence defined by this region on this sequence". The first is talking about regions as a subject, the latter references a sequence object. ExtractedSequence is for the latter."

view this post on Zulip Jamie Jones (Feb 23 2022 at 16:17):

My understanding is the intended use of window is to assert that the instance is only concerned with the reference sequence between start and end. That is, any edits within the window are represented in the instance. Variants outside the window are not specified.
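
The reading of the window given above can be sketched as a small Python function. This is an illustration of that interpretation only, with a hypothetical function name, made-up positions, and an assumed 0-based half-open window.

```python
def status_at(position: int, window_start: int, window_end: int,
              variant_positions: set) -> str:
    """Classify a reference position under the 'window' interpretation."""
    if not (window_start <= position < window_end):
        return "not specified"      # outside the window: no claim is made
    if position in variant_positions:
        return "variant"            # an edit recorded in the instance
    return "matches reference"      # inside the window, no edit recorded


variants = {1305, 1711}             # made-up variant positions
for pos in (1100, 1305, 1500):
    print(pos, status_at(pos, 1200, 1950, variants))
```

The key point is the first branch: absence of a variant outside the window says nothing, whereas absence inside the window implies agreement with the reference.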

view this post on Zulip Kevin Power (Feb 23 2022 at 18:57):

Bret H said:

specifically "Please note that "these regions (windows) on this sequence" is a different statement than "the subsequence defined by this region on this sequence". The first is talking about regions as a subject, the latter references a sequence object. ExtractedSequence is for the latter."

I might be wrong here, but I guess my point is that the IM model today might have the building blocks to model the attributes needed to represent an 'interrogated region', but it has no attributes that would allow someone to deliver an instance of something that could be received and appropriately interpreted as 'this is the interrogated region for a particular study we just performed'.

Long story short, I am fine retracting my suggestion to simplify the referenceSeq backbone element. If we want to leave in the flexibility it affords today, then I might just ask that someone think about additional guidance we should provide.

view this post on Zulip Bret H (Feb 23 2022 at 20:55):

I wonder if a raw synthetic sequence could be represented with https://build.fhir.org/substancedefinition.html

see: image.png

https://build.fhir.org/substancenucleicacid.html

https://build.fhir.org/substanceprotein.html

the use case is more medicinal in those resources but how do they correspond to the IM model?

view this post on Zulip Bret H (Feb 23 2022 at 21:00):

The substance resources for nucleic acids and protein biochemical treatments have a specific need to be extremely explicit about the composition of the sequence, even defining base modifications. Do the IM model and MolecularSequence cover that? Could they make use of some of the same features? Or refer to a Substance profile when needed?

In the cases covered by Substance, there's no 'based-on' genetic reference mentioned. This means one would need to search in the genome or proteome to determine where the sequence was derived from. Can be done, I guess.

view this post on Zulip Bret H (Mar 01 2022 at 17:27):

(deleted)


Last updated: Apr 12 2022 at 19:14 UTC