Stream: genomics
Topic: Sequence / Observation comparison
Kevin Power (Aug 06 2018 at 22:30):
CG'ers - Since it is a constant topic nowadays, and we have the R4 dates looming to hit with any changes to the Sequence resource - I have put together a comparison document. We can start coming up with some alternatives and discussions, but wanted to see if this format made sense. Of course, welcome any and all feedback:
https://docs.google.com/spreadsheets/d/1z4DodoLYawW-s0jbFKQg_xpwir8rEORkNjMfemvqxE0/edit#gid=0
Gideon Giacomelli (Sep 11 2018 at 16:07):
We just discussed this topic in the weekly call, and there are quite a few open issues about how to proceed and come to a common understanding. For me, the purpose of the Sequence resource is not clearly understood. Should it represent sequence data only, or also mutations, which are already covered quite well by Observations right now? Anyway, it looks like this will be a topic to discuss in Baltimore.
Kevin Power (Sep 11 2018 at 17:30):
Thanks @Gideon Giacomelli - Sorry I missed the call today, but good to hear some conversation was started. I think of Sequence like this: It represents the Sequence data, but it has a variety of ways that it can represent the sequence such that a consumer should be able to create it. To answer your question about "mutations" - given a reference sequence and a list of variants, someone should be able to take that reference, apply the variants, and come up with the sequence. So, the list of 'variants' on the Sequence resource is really a shortcut to sending the full sequence.
I am fairly sure that is right, but welcome @Gil Alterovitz or @James Jones (or I suppose anyone listening on #genomics) to weigh in.
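A minimal sketch of the "take the reference, apply the variants, and come up with the sequence" idea. The `Variant` tuple, the 0-based half-open coordinates, and the `apply_variants` helper are all assumptions made for illustration; they are not the Sequence resource's actual structure or coordinate rules.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    """Hypothetical, simplified variant: replaces reference[start:end]."""
    start: int     # 0-based start offset in the reference (assumption)
    end: int       # exclusive end offset (assumption)
    observed: str  # allele observed in place of reference[start:end]


def apply_variants(reference: str, variants: List[Variant]) -> str:
    """Rebuild the observed sequence from a reference plus non-overlapping variants."""
    pieces = []
    cursor = 0
    for v in sorted(variants, key=lambda item: item.start):
        if v.start < cursor:
            raise ValueError("overlapping variants are not handled in this sketch")
        pieces.append(reference[cursor:v.start])  # unchanged reference stretch
        pieces.append(v.observed)                 # observed allele replaces the span
        cursor = v.end
    pieces.append(reference[cursor:])             # tail after the last variant
    return "".join(pieces)


reference = "ACGTACGT"
variants = [
    Variant(start=2, end=3, observed="T"),  # substitution G -> T
    Variant(start=5, end=7, observed=""),   # deletion of "CG"
]
observed = apply_variants(reference, variants)
```

With these toy inputs, `observed` is the reference with one base substituted and two bases deleted, which is the sense in which a variant list can stand in for the full sequence.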
Patrick Werner (Sep 11 2018 at 18:06):
I totally agree. As a consumer of a Genetic Diagnostic Report (e.g. a Practitioner in a hospital) I don't care about the Sequence, I only want to know the variants compared to the reference sequence. These will be described variants.
Sequences would be needed in Bioinformatics or inside a Lab i suppose.
Kevin Power (Sep 11 2018 at 19:00):
That is true, but we do need to consider other use cases where the sequence data is important.
Patrick Werner (Sep 11 2018 at 19:44):
I know, I just wanted to explain the clinical use case. I'm super interested in other use cases which have the need for the actual sequence itself.
Jamie Jones (Sep 14 2018 at 17:04):
Kevin is absolutely right here, sorry for the delayed response. The variant information on sequence was included as an alternate method to describe the sequence. Rather than building it from scratch you could say "It's this reference/other sequence, but with these variants". I do think this should be made clearer on the resource documentation itself, particularly now that there are dedicated variant profiles, etc.
Jamie Jones (Sep 18 2018 at 16:11):
When considering the desire to be able to encapsulate and send VCF file data to certain places in FHIR, I wonder if we should be looking to extend the sequence resource here rather than restrict it further and use the heavier observations and clunky attachments from other specs.
That may be totally out of scope for now, but we have the opportunity to keep all of the "specialized" genomics data we want to send in the custom resource in a concise way, rather than in bulky observations to users who won't need to look at it.
Kevin Power (Sep 18 2018 at 16:40):
I would like to propose we separate those two things - As we talked, I could see value in having a home for sending the VCF file basically as something simple like an attachment. Like @Bret H stated, the VCF is such a "common currency" in this field that I can see value in us supporting it in a very basic (and optional) way.
While the conversation started as "could a VCF replace Sequence?", I feel like we moved away from that being a possibility. There are still things that Sequence can do today that would not be covered by a VCF. So, we still need to grapple with the Sequence / Observation overlap discussion.
Is it worth our time to evaluate what that "bulky observation" approach might look like? Can we say for sure it is bulkier? I will have to admit I have let myself believe it is, for what seem like good reasons. But perhaps it is worth determining how much bulkier it is? And "bulkiness" wouldn't be the only criterion we should consider.
Lloyd McKenzie (Sep 18 2018 at 16:51):
Observation is absolutely bulkier and less elegant than Sequence. If that were the sole consideration, Observation profiles would never have been on the table. The issue is our ability to propagate data to (and more importantly through) systems that are not going to be designed with genetics in mind. If you imagine data coming from genomic specific software and then propagating information into a lab's generic reporting software, then sent as a diagnostic report to an EHR, then uploaded into a patient's Apple phone, then downloaded into another EHR where it'll be looked at by someone like Bob Dolin, then made available to a genetics-aware decision support system, what are the chances the decision support system is going to have the data it needs in each of the two scenarios: all data is represented using (bulkier) Observations or all data is represented using the more concise Sequence?
I think the answer is clear that data propagation is much more likely to be successful using Observations. We can choose to go with both approaches, where "key" information is made available through Observation and more detailed information is only shared using Sequence. The question then is "What information are Bob and the CDS system ok with not having access to?"
Kevin Power (Sep 18 2018 at 17:19):
The question then is "What information are Bob and the CDS system ok with not having access to?"
Or, is the appropriate framing - "What information are Bob and the CDS system ok with not having access to via Observation and taking extra steps to get from Sequence?"
I think it is clear that to get and utilize this data (even our Observation profiles) will take a genomics aware system.
Jamie Jones (Sep 18 2018 at 17:41):
I am confused as to where the breakdown in propagating data stored in a sequence resource happens in Lloyd's example. We seem to be assuming that all of the multiple (arbitrarily deeply) nested observation profiles hanging off a diagnosticReport will make it through okay, but not our custom FHIR resources they reference? What about references to Device profiles and other lower-maturity resources?
Patrick Werner (Sep 18 2018 at 17:54):
Device will be more common than Sequence as many Observations are specifying the used Device.
The DiagnosticReport and the Observations can be handled by most of the systems, the referenced Sequence won't make it through.
The question to me is if information from Sequence will be important at this level (HIS, iPhone, etc.) or would they work with only describedVariants and other Observation profiles anyway.
Kevin Power (Sep 18 2018 at 18:03):
I think in general, the "propagation" of data is at risk in the short term until other resources become normative or at least more mature. Since Observation is close to normative, it has a head start. And I agree with @Patrick Werner - things like Device are more likely to be more broadly implemented.
I think the comment by @James Jones about "arbitrarily deep" is really asking - Will systems support the relationships + all sorts of components in their "standard" implementation of Observation? If the answer to that is yes, then that system would meet the "pass through" test. It could at least store and maintain everything it is given, but likely not make much use of that data.
Jamie Jones (Sep 18 2018 at 18:42):
Displaying and navigating through them in a pleasant way is certainly going to benefit from further support, but I'm very surprised we are worried about any FHIR resources failing the "pass through" test, with their inherited base structure and standardized data types.
Either way, being able to pass the sequence/quality informatics data downstream for further analysis by researchers and regulators seems very valuable, even if it won't all be directly viewed in the clinic or by the patient.
Kevin Power (Sep 18 2018 at 18:56):
Displaying and navigating through them in a pleasant way is certainly going to benefit from further support, but I'm very surprised we are worried about any FHIR resources failing the "pass through" test, with their inherited base structure and standardized data types.
I might have misunderstood your earlier statement. In short, I don't know that our group needs to worry about this level of detail. We should base any of our thinking on the path to normative for the resources we are using and considering.
Either way, being able to pass the sequence/quality informatics data downstream for further analysis by researchers and regulators seems very valuable, even if it won't all be directly viewed in the clinic or by the patient.
There is value, but I still wonder where we draw the line with FHIR? What level of detail do we model in FHIR, versus when should something like GA4GH specs take over? I honestly don't know, but that boundary doesn't feel completely clear to me (yet).
Jamie Jones (Sep 18 2018 at 19:08):
@Kevin Power Your interpretation of what I said was certainly a follow-up I meant to imply, and I agree on both points--thanks for the discussion!
Bob Dolin (Sep 18 2018 at 19:10):
@Kevin Power I'm glad to hear you say that ('boundary doesn't feel completely clear to me'), because I feel the same way, and I feel that the boundary is likely to change as we learn to leverage more detailed genomic data. I'm beginning to think that we should also be considering a set of FHIR GACS Operations, because knowing what we need returned to us from a GACS can be thought of as requirements for Sequence resource or our Observation profiles.
Kevin Power (Sep 18 2018 at 19:36):
I'm beginning to think that we should also be considering a set of FHIR GACS Operations, because knowing what we need returned to us from a GACS can be thought of as requirements for Sequence resource or our Observation profiles
I would think a GACS would more likely want to implement something from GA4GH or other relevant spec, not FHIR? I would expect them to do something like htsget or other things found in the GA4GH Genomic Data Toolkit. I am really concerned about FHIR taking on that level of detail.
Perhaps some of our CG / GA4GH folks could weigh in? @Bob Freimuth @Larry Babb
Bob Dolin (Sep 18 2018 at 20:12):
That's a good question, and it may be that a clinical decision support application will need to support both FHIR and htsget. From a GACS, I anticipate (at least) needing information on variants, haplotypes, genotypes, and regions studied. I'm finding that it's a bit of a learning curve, to figure out how best to extract that information from BAM/VCF, whereas FHIR GACS Operators might be more intuitive and consumable by the clinical informatics community.
Lloyd McKenzie (Sep 18 2018 at 20:20):
I don't know that "extra steps" will be an option. The chances of you having access to the original lab or system where the sequence data existed if you're the decision-making system or the secondary EHR is pretty low. I know in theory you could have the URL and query it RESTfully - but for that to work, you'd also have to have authorization on the original server, and that's typically hard to come by.
Lloyd McKenzie (Sep 18 2018 at 20:20):
My expectation is that if the Sequence doesn't propagate through a system, then nothing downstream will have any mechanism to access that discrete data.
Lloyd McKenzie (Sep 18 2018 at 20:23):
I agree that "normative" is one of the considerations for adoption. But even if Sequence were going normative this round too, I think its adoption would suffer because it's such a specialized space. Everyone needs Observations - and the base structure is pretty easy to manage and maps well to persistence structures almost every system has. Sequence is a lot tougher for folks to wrap their heads around. And few, if any, systems will have dedicated persistence layers that know how to manage that sort of structure.
Kevin Power (Sep 18 2018 at 21:29):
whereas FHIR GACS Operators might be more intuitive and consumable by the clinical informatics community.
@Bob Dolin - They will only be intuitive and consumable if we have the time to put into them to make them that way.
Kevin Power (Sep 18 2018 at 21:46):
Everyone needs Observations - and the base structure is pretty easy to manage and maps well to persistence structures almost every system has. Sequence is a lot tougher for folks to wrap their heads around. And few, if any, systems will have dedicated persistence layers that know how to manage that sort of structure.
@Lloyd McKenzie - While I agree with the premise that everyone needs Observations, I won't say Sequence is a lot tougher for folks to wrap their heads around. Our current IG is not for the faint of heart, and it will take a great deal of time and research to really understand and appropriately implement. It could be argued that if we took an approach of building first-class resources for our data, it could be significantly easier for people to understand. That of course comes at the cost of needing every system to, from scratch, implement resources that are different.
It still feels like we can find something in between. I still don't have a feel for where that balance is.
Lloyd McKenzie (Sep 18 2018 at 21:50):
The IG only needs to be understood by those producing the content and those doing the decision support. To all of the intervening systems, it's just a bunch of Observations like any other. They don't necessarily need to understand the details. And a generic Observation rendering tool should still get it right-enough for anyone who just wants to look at it.
Kevin Power (Sep 18 2018 at 21:50):
BTW - A conversation that is related to some of our discussion for those that want to follow along.
Lloyd McKenzie (Sep 18 2018 at 21:51):
Sequence is tougher because it requires domain-specific knowledge. Observation is just a name-value pair (or a collection of them) which tends to be easier for people to "get".
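A rough sketch of the "name-value pair (or a collection of them)" view of Observation, and of how a generic tool might render one without any genomics knowledge. The element names (`code`, `component`, `valueString`, `valueCodeableConcept`, `valueQuantity`) follow the base Observation structure; the labels and values in the sample resource are made up for illustration and are not taken from the IG.

```python
from typing import Any, Dict, List, Tuple


def display_text(concept: Dict[str, Any]) -> str:
    """Best human-readable label for a CodeableConcept-shaped dict."""
    if concept.get("text"):
        return concept["text"]
    for coding in concept.get("coding", []):
        if coding.get("display"):
            return coding["display"]
        if coding.get("code"):
            return coding["code"]
    return "(unlabelled)"


def display_value(node: Dict[str, Any]) -> str:
    """Render whichever of a few value[x] choices is present, without interpreting it."""
    if "valueString" in node:
        return node["valueString"]
    if "valueCodeableConcept" in node:
        return display_text(node["valueCodeableConcept"])
    if "valueQuantity" in node:
        quantity = node["valueQuantity"]
        return f'{quantity.get("value")} {quantity.get("unit", "")}'.strip()
    return "(no value)"


def name_value_pairs(observation: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Flatten an Observation and its components into (name, value) pairs."""
    pairs = [(display_text(observation.get("code", {})), display_value(observation))]
    for component in observation.get("component", []):
        pairs.append((display_text(component.get("code", {})), display_value(component)))
    return pairs


# Illustrative resource only: labels and values are placeholders.
sample_observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {"text": "Variant assessment"},
    "valueCodeableConcept": {"text": "Present"},
    "component": [
        {"code": {"text": "Gene studied"}, "valueCodeableConcept": {"text": "EXAMPLE-GENE"}},
        {"code": {"text": "DNA change"}, "valueString": "c.123A>G"},
    ],
}
pairs = name_value_pairs(sample_observation)
```

The renderer never needs to know what a variant is; whether that output is "right-enough" for a clinician is a separate question.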
Lloyd McKenzie (Sep 18 2018 at 21:52):
Also, an investment in Observation can meet lots of use-cases. An investment in Sequence only makes sense if you're doing lots of genetic stuff.
Kevin Power (Sep 18 2018 at 21:53):
And a generic Observation rendering tool should still get it right-enough for anyone who just wants to look at it.
This probably should be the case now that I read this, but I am not sure I agree with this (specifically the "right-enough" comment). I would have to think about how a generic viewing tool looking at our observations would work.
Kevin Power (Sep 18 2018 at 22:18):
If our IG requires a lab that wants to share all variants to send them as Observation instances (WES ~ 4-5k, WGS ~ 4-5m), I see implementors saying "nope, sorry, my underlying data storage for observations was never intended for that". Receiving docs may also say "nope, sorry, my observation viewer will blow up" and might just go back to "where is my PDF please?"
I will admit it - I say that as an implementor whose underlying "observation infrastructure" won't handle that volume. Can it handle small(ish) subsets of that? Sure, with some tweaks, it would be fine. However, I do not feel using Observations to deliver 4k (or 4m) VUSs will ever work.
Hence why I go back to framing like this:
1) Observations as used for genomic data that is clinically (actionable/relevant/important/???)
2) The CG workgroup should not define standards for bioinformatics in FHIR (or any product for that matter)
3) Sequence (or something else) (perhaps in combination with an optional VCF as an attachment) can express something in-between
The lines between those groupings are still too fuzzy to me, but should ask - Does this group agree / disagree with those statements?
Bob Dolin (Sep 18 2018 at 22:24):
@Kevin Power I think I like your heuristics, but can you expand on number 2? I do think there are opportunities for GA4GH and HL7 CG to work together, to define, say, a translation from (relevant portions of) SAM/VCF into FHIR objects.
Kevin Power (Sep 18 2018 at 22:45):
I don't feel our group should be in the business of standards like BAM, FASTQ, htsget, BED, etc... anything that already has a home somewhere else. Could we take a page from Imaging and learn how ImagingStudy wraps around DICOM? Does Sequence ~= ImagingStudy, and DICOM ~= bioinformatics standards? Maybe, I think we even mentioned that in previous discussions. If that is what you mean @Bob Dolin then perhaps there is something there.
I just want us to be careful that we don't force FHIR into something where it is not needed.
Lloyd McKenzie (Sep 19 2018 at 02:53):
I agree with #1 and #2. I'm not clear on the value of #3. What is the "in-between" use-case? Who needs that data? And how is it going to propagate?
Patrick Werner (Sep 19 2018 at 08:29):
I agree with #1 and #2. I also see the idea behind #3: capturing metadata of raw data, similar to Imaging Study. The actual raw-data could then be attached as an Attachment.
Lloyd McKenzie (Sep 19 2018 at 13:30):
What metadata is worth capturing that we're fine not propagating?
Patrick Werner (Sep 19 2018 at 13:52):
We want to propagate all relevant metadata. I meant that I can see the need for sending Sequence metadata and embedding/linking the actual sequence file.
Patrick Werner (Sep 19 2018 at 13:53):
As a profiled observation or as a sequence.
Kevin Power (Sep 19 2018 at 13:53):
@Patrick Werner - We should be careful when considering "raw-data" as an Attachment. I am fairly OK with the VCF, but assuming you mean things like BAM, FASTQ, I am not OK with that. I like the approach with Sequence.repository and referring to that raw data, but I don't think we want to use a FHIR Attachment.
@Lloyd McKenzie - To be clear, by "not propagating", you mean "not propagating through a system that ONLY implements Observation, DiagnosticReport, RiskAssessment, Device, etc ..." correct? If so, I don't think that is a fair assessment. If we held everything to that rule, no new resource would ever be created would it?
Let's start building a list of "in-between" use cases. I can provide two:
1) @Bob Dolin's notion of "Region Coverage and Quality"
2) Delivering all detected variants
Patrick Werner (Sep 19 2018 at 13:57):
@Kevin Power yes i meant VCF Files, not FASTQ etc
Kevin Power (Sep 19 2018 at 13:59):
Oy, and a 3rd "in-between" example (I say Oy because it is the most obvious and the real intention of Sequence today):
3) Represent the actual sequence (ACTG....) that was observed
John Moehrke (Sep 19 2018 at 14:14):
Where there is a desire to use an existing file format, use DocumentReference. It can provide the metadata for any arbitrary file that has a mime-type. This is used for some proprietary formats such as Word. It is also used for generic forms like non-FHIR XML schema, or even CSV.
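A small sketch of what wrapping an existing file in a DocumentReference could look like, built as a plain Python dict. The `vcf_document_reference` helper, the example URL, and the patient reference are hypothetical; the `text/vcf` content type is an assumption, since the thread does not name a mime-type for VCF.

```python
import json
from typing import Any, Dict


def vcf_document_reference(url: str, title: str, patient_ref: str) -> Dict[str, Any]:
    """Build a minimal DocumentReference that points at an externally stored file."""
    return {
        "resourceType": "DocumentReference",
        "status": "current",
        "subject": {"reference": patient_ref},
        "content": [
            {
                "attachment": {
                    "contentType": "text/vcf",  # assumed mime-type for VCF
                    "url": url,                 # where the file actually lives
                    "title": title,
                }
            }
        ],
    }


doc_ref = vcf_document_reference(
    url="https://example.org/files/sample-1.vcf",  # hypothetical location
    title="Variant calls for sample 1",
    patient_ref="Patient/example",
)
doc_ref_json = json.dumps(doc_ref, indent=2)
```

The resource carries only metadata and a pointer; the file itself stays in whatever format and location it already has.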
John Moehrke (Sep 19 2018 at 14:17):
The ImagingStudy resource is a very thin metadata layer for the deeper knowledge that can be probed and retrieved using DICOM, where the http RESTful flavor of DICOM is preferred. The structure of ImagingStudy is the basic structure of DICOM. Study->Series->Instance is a progressively deeper structure, where all actual data lives at the instance level, but is structured within series which are structured within a study. There is also a degenerate form that just lists the key images for a purpose (like a specific Observation).
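To make the Study->Series->Instance nesting concrete, here is a sketch that walks a pared-down ImagingStudy held as a Python dict. The `series`, `instance`, and `uid` element names reflect my reading of the resource and should be treated as assumptions; the UIDs and the `walk_instances` helper are invented for illustration.

```python
from typing import Any, Dict, Iterator, Tuple

# Pared-down, illustrative resource: the UIDs are made up.
imaging_study: Dict[str, Any] = {
    "resourceType": "ImagingStudy",
    "status": "available",
    "subject": {"reference": "Patient/example"},
    "series": [
        {
            "uid": "1.2.840.99999.1",
            "instance": [{"uid": "1.2.840.99999.1.1"}, {"uid": "1.2.840.99999.1.2"}],
        },
        {
            "uid": "1.2.840.99999.2",
            "instance": [{"uid": "1.2.840.99999.2.1"}],
        },
    ],
}


def walk_instances(study: Dict[str, Any]) -> Iterator[Tuple[str, str]]:
    """Yield (series uid, instance uid) for every instance in the study."""
    for series in study.get("series", []):
        for instance in series.get("instance", []):
            yield series["uid"], instance["uid"]


instance_index = list(walk_instances(imaging_study))
```

The FHIR side holds only identifiers and light metadata; the pixel data would still be fetched through DICOM using those identifiers.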
Lloyd McKenzie (Sep 19 2018 at 14:34):
@Kevin Power Resources reflect the data elements and data structures systems already deal with and support. It's certainly true that the existence of resources might entice some system developers to enhance the capabilities of their systems to support a resource. But the typical pattern is that systems simply expose the data they already have and manage using the resource appropriate to that type of data. As a result, I expect we'll only see most systems surfacing support for the Sequence (and RiskAssessment and other more esoteric resources) if they already have a specific interest in and support for those concepts as distinct notions in their systems. Introducing a net new resource is considerably more work - and, in most cases, will at best happen after the low-hanging fruit of "already supported" concepts has been addressed and will be subject to a financial business case assessment.
Patrick Werner (Sep 19 2018 at 14:36):
And, in the worst case, implementers see Sequence, decide against it & rebuild Sequence as a profile on Observation.
Lloyd McKenzie (Sep 19 2018 at 14:37):
Now, it may be that the demand for genomics data will be sufficiently high that applications across the whole propagation pathway will prioritize introducing support for Sequence and we'll see wide support for it over a period of a couple of years rather than 10-15+. Our decision should be informed by a realistic assessment of what labs, EHRs and organizations like Apple are committed to doing in the near-ish term.
Lloyd McKenzie (Sep 19 2018 at 14:37):
The worst case is that implementers throw away Sequence data.
Lloyd McKenzie (Sep 19 2018 at 14:38):
Converting Sequence to Observation is a much more positive outcome - but is only a likely outcome if we define exactly what that conversion should look like and - ideally - provide tools for doing so.
John Moehrke (Sep 19 2018 at 14:40):
A revolt against putting the sequence within Observation as unwelcome bloat is just as likely... All of these are good reasons why we go through "Standard for Trial Use" to see what works vs what we think might work.
John Moehrke (Sep 19 2018 at 14:41):
so in STU it is good to have alternatives for the market to "Trial Use"
Patrick Werner (Sep 19 2018 at 14:44):
What if we do both? Sequence Resource including Mappings to a profiled Observation? Too much work?
Lloyd McKenzie (Sep 19 2018 at 14:55):
A full mapping might be one path forward - and perhaps a standard operation that can handle the conversion. That then gives implementers the choice of how to access and share the data - and it gives us a chance to evaluate what implementers are actually willing to do.
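Purely as a thought experiment, a Sequence-to-Observation conversion could be sketched like this. No such mapping is defined in this thread: the component labels, the placeholder `code`, and both helper functions are hypothetical, and only the input field names (`patient`, `variant`, `start`, `end`, `referenceAllele`, `observedAllele`) are meant to mirror the Sequence resource.

```python
from typing import Any, Dict, List


def component(label: str, value_key: str, value: Any) -> Dict[str, Any]:
    """One Observation.component with a text-only (placeholder) code."""
    return {"code": {"text": label}, value_key: value}


def variant_to_observation(variant: Dict[str, Any], subject_ref: str) -> Dict[str, Any]:
    """Hypothetical mapping of one Sequence.variant entry to an Observation."""
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {"text": "Sequence variant (placeholder, not a real code)"},
        "subject": {"reference": subject_ref},
        "component": [
            component("start", "valueInteger", variant["start"]),
            component("end", "valueInteger", variant["end"]),
            component("referenceAllele", "valueString", variant["referenceAllele"]),
            component("observedAllele", "valueString", variant["observedAllele"]),
        ],
    }


def sequence_to_observations(sequence: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Convert every variant on a Sequence into its own Observation."""
    subject_ref = sequence["patient"]["reference"]
    return [variant_to_observation(v, subject_ref) for v in sequence.get("variant", [])]


# Illustrative input only: positions and alleles are made up.
sample_sequence = {
    "resourceType": "Sequence",
    "coordinateSystem": 0,
    "patient": {"reference": "Patient/example"},
    "variant": [
        {"start": 100, "end": 101, "referenceAllele": "G", "observedAllele": "T"},
        {"start": 250, "end": 252, "referenceAllele": "CA", "observedAllele": "C"},
    ],
}
observations = sequence_to_observations(sample_sequence)
```

Even a toy version like this makes the size difference visible: one compact variant entry becomes a full Observation with four components.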
Kevin Power (Sep 19 2018 at 14:59):
I do think it would be interesting to at least sketch out what a sequence profile of Observation would look like. Perhaps we could do a session on that at the WGM?
Bob Dolin (Sep 19 2018 at 15:10):
As a bit of an aside, from what I understand about htsget, it doesn't simply return a VCF, but rather, just the relevant portion of VCF (or SAM, etc) - so, a much smaller package. It may be worth discussing how we envision, say, a clinical decision support application interacting with a GACS, and how it will leverage FHIR and/or htsget.
Patrick Werner (Sep 19 2018 at 15:15):
I do think it would be interesting to at least sketch out what a sequence profile of Observation would look like. Perhaps we could do a session on that at the WGM?
Genomics sessions at the WGM are Mon-Wed, so Thursday at the WGM would be an option
Kevin Power (Sep 19 2018 at 15:18):
I am leaving Wed evening, so Thurs would leave me out. I would be OK taking one of our current "FHIR Reconciliation" blocks Tuesday or Wednesday and discussing. Perhaps we could even have a breakout during the Connectathon to draft something, then review with the broader group Tues or Wed?
Patrick Werner (Sep 19 2018 at 15:25):
sounds great
Bob Freimuth (Sep 19 2018 at 15:48):
I would think a GACS would more likely want to implement something from GA4GH or other relevant spec, not FHIR? I would expect them to do something like htsget or other things found in the GA4GH Genomic Data Toolkit. I am really concerned about FHIR taking on that level of detail.
This is a scope question. If bioinformatics tooling is implemented as part of a clinical genomics workflow, should the scope of FHIR (which is clinical in nature) expand to include it? If so, how far upstream should we go?
My current thinking restricts the core of FHIR genomics to clinically reported results (and relevant metadata). If there is need to support the exchange of file formats used by upstream systems, then we could find a way to connect to and leverage GA4GH (or other) specs to handle the details.
Bob Dolin (Sep 19 2018 at 16:10):
I support this notion too, with some caveats. Let's say, for instance, I have an 'in-between' use case, which isn't addressed by an existing bioinformatics API. An advantage of a FHIR GACS Operation is that we define exactly what data we need, and leave it to the GACS to do the translation. My own feeling is that the scope of FHIR Genomics is a bit broader than clinically reported results, and should include vetted and approved clinical genomics use cases. In other words, if there is genomics data that is used by a clinical application, it is in scope for FHIR Genomics.
Kevin Power (Sep 19 2018 at 17:55):
Where an existing file format is desired to use, then use DocumentReference. It can provide the metadata to any arbitrary file that has a mime-type. This is used for some proprietary formats such as WORD. It is also used for generic forms like non-FHIR xml schema, or even CSV.
@John Moehrke - If we consider a VCF, DocumentReference doesn't feel quite right. In our case, we will have a DiagnosticReport to track much of the context (subject, encounter ...) and metadata (status, category, ...), so Attachment still feels right to me. But I will admit, as discussed in the other thread, the boundaries between some of those options are pretty fuzzy to me.
Lloyd McKenzie (Sep 19 2018 at 18:02):
VCF is one of those data types that we would previously have conveyed using Observation.valueAttachment with a reference to a Binary. Further discussion about whether DocumentReference should/should not be used for such things should happen in the existing Implementer's thread.
Jamie Jones (Sep 19 2018 at 19:23):
We've gotten into some slightly varied topics here and I'm partially at fault for talking about VCF data, but the question of "where we draw the line" with GA4GH and embedding raw bioinformatics data seems to be a concern for future enhancement requests and isn't directly related to the overlap between sequence and observation as it is now.
Toward that end, we accepted a resolution to https://gforge.hl7.org/gf/project/fhir/tracker/?action=TrackerItemEdit&tracker_item_id=16252&start=0 on July 31st, and stated the committee is "considering options for modifying the resource and anticipates potential changes being brought forward in an upcoming ballot". After reading that and not seeing any other related trackers I'm a bit unsure of the immediate pressure to resolve this overlap and/or reevaluate the domain before balloting again, but here goes:
Apart from the metadata, the overlap seems to be in the ability to describe variants in both places (although it's been said that in Sequence those fields are intended for defining the sequence, and are optional). My suggestion would be to have the IG state that we use observation profiles for describing variants where possible, where they can then be interpreted as parts of complex variants or haplo/genotypes. This would line up with the variant fields being restricted in the previous "definitional" sequence profile that we discontinued.
Kevin Power (Sep 19 2018 at 22:15):
I think this discussion has been good for exploring several things, so short of a concrete comment / proposal from anyone, I could see this very broad discussion being broken down like this:
1 - Log a tracker to update the IG similar to what @James Jones recommended - more clearly documenting when one should use Sequence.variant compared to our IG profiles (perhaps it could include other guidance as well).
2 - Log a tracker for us to consider allowing a VCF to be delivered.
3 - Log a tracker to consider removing Sequence.structuralVariant (I made a comment on the spreadsheet as to why I think we should consider it).
4 - Maybe not a tracker, but have someone volunteer to draft guidance/thoughts on boundaries between FHIR and GA4GH (or other bioinformatic standards).
5 - Not a tracker, but I would still like to work up a draft of what a "sequence profile of Observation" would look like to compare to what we have today and present it to the group. I had suggested we review it at the WGM, but that might be premature.
Other thoughts?
Lloyd McKenzie (Sep 19 2018 at 22:18):
I'm happy to undertake Sequence profile of Observation, but no chance of it being done before (or at) the WGM.
Bob Dolin (Sep 19 2018 at 22:51):
@Kevin Power I might suggest broadening 3 a bit - there may be other revisions we want to consider for Sequence (such as the representation of quality). I'm happy to help with 4.
Kevin Power (Sep 19 2018 at 23:35):
Perhaps #3 can stay as is, but we always welcome additional trackers to make things better. I think your statement also overlaps a bit with #4 anyway, so welcome any drafting you would want to do.
Jamie Jones (Dec 04 2018 at 17:24):
Do we want to reopen some of these discussions re: VCF here or break out a new substream?
Kevin Power (Dec 04 2018 at 17:26):
New please
Last updated: Apr 12 2022 at 19:14 UTC