Stream: genomics
Topic: Representation of pipeline methodology
Bob Dolin (May 28 2020 at 23:18):
I think we've talked about this here and there, how to capture the 'method' by which a variant was called.
I'm not a software developer, but let me go out on a limb here and mention a discussion I had today with a colleague who works in a molecular pathology lab. They use a workflow manager (in this case, Nextflow) that references containers (in this case, Docker), and as a result, different groups can process the same input (e.g. a BAM or FASTQ file) and obtain the same results (e.g. a VCF). Further, since their Nextflow pipeline is stored in Bitbucket, they essentially have a canonical reference to their pipeline.
Granted, this only addresses the bioinformatics portion of the analysis, but does this sound like something we may want to explore capturing in the FHIR Genomics guide?
Jamie Jones (May 28 2020 at 23:25):
This sort of pipeline reference in Observation.method should certainly be possible, but I think it would be hard to identify an example code system for our guidance.
Bob Dolin (May 28 2020 at 23:26):
What about a data type of 'identifier'?
Kevin Power (May 28 2020 at 23:28):
Might make more sense in Provenance?
http://build.fhir.org/provenance.html
Jamie Jones (May 28 2020 at 23:31):
Recent O&O discussions have included confirming whether Device is in scope for this as well. We had briefly explored Provenance in the past.
Bob Dolin (May 28 2020 at 23:32):
I could see using Provenance.activity for this, although I think Jamie's question still applies (whether a pipeline is a CodeableConcept vs. an Identifier).
Kevin Power (May 28 2020 at 23:32):
Bob Dolin said:
What about a data type of 'identifier'?
Where would you put this, and what sort of identifier are you asking about?
Kevin Power (May 28 2020 at 23:33):
This is a really big question, and one that will require a fair amount of thought and discussion :slight_smile:
Bob Dolin (May 28 2020 at 23:33):
@Kevin Power For identifier, I was thinking of how we'd point to a specific version of something in GitHub. For instance, let's say my pipeline is version 2 of https://github.com/CRG-CNAG/CalliNGS-NF/
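To make the CodeableConcept-versus-Identifier question concrete, here is a minimal Python sketch of the two shapes for the same pinned pipeline. The `v2` version label, the `CalliNGS-NF` code, the use of the repository URL as `system`, and the helper names are all assumptions for illustration, not guidance from the IG.

```python
import json

# Repository named in the thread; the "v2" label is a hypothetical version tag.
PIPELINE_REPO = "https://github.com/CRG-CNAG/CalliNGS-NF/"
PIPELINE_VERSION = "v2"


def pipeline_as_identifier(repo: str, version: str) -> dict:
    """Shape the pipeline reference like a FHIR Identifier (system + value)."""
    return {"system": repo, "value": version}


def pipeline_as_codeable_concept(repo: str, version: str) -> dict:
    """Shape the same reference like a FHIR CodeableConcept.

    Treating the repository URL as the coding system is an assumption;
    the thread does not identify an agreed code system for pipelines.
    """
    return {
        "coding": [{"system": repo, "version": version, "code": "CalliNGS-NF"}],
        "text": f"CalliNGS-NF {version}",
    }


identifier = pipeline_as_identifier(PIPELINE_REPO, PIPELINE_VERSION)
concept = pipeline_as_codeable_concept(PIPELINE_REPO, PIPELINE_VERSION)
print(json.dumps({"identifier": identifier, "method": concept}, indent=2))
```

Either shape carries the same two facts (where the pipeline lives and which version was run); the difference is whether a receiver is expected to treat the value as a code from a terminology or as an opaque pointer.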
Jamie Jones (May 28 2020 at 23:39):
I think providing guidance to use Observation.method would be the most consistent with the design principles of our IG currently, though it isn't very robust
Bob Dolin (May 28 2020 at 23:40):
What I like about it is the ability to reference a very complex set of applications and configurations, all with a single URL.
Bob Milius (May 29 2020 at 14:41):
How are methods reported in current lab reports? We need to at least capture that.
I see two different kinds of methods. The first are the static things, like a standard methodology that is documented in the lab and is used by many different lab reports. Maybe a kit name and version, or a protocol defined at the lab, including the sequencing platform (MiSeq, PacBio, etc.), that could be used as a definitional reference. Maybe an accession # to a test in the Genetic Testing Registry? The second kind of method information is the set of parameters used in a particular analysis, which may be unique to that report. I'm looking at an XML export from an HLA analysis software package, and it includes:
<AnalysisSoftware>
  <Software>NGSengine</Software>
  <Version>2.16.2.16030</Version>
</AnalysisSoftware>
<AnalysisSettings>
  <AmpliconMethod>NGSgo</AmpliconMethod>
  <AnalysisRegion>Amplicon</AnalysisRegion>
  <IgnoreRegions>Default</IgnoreRegions>
  <PhasingMethod MinimumReadDepth="20" AlleleRatioThreshold="20" NoiseThreshold="14">Cluster</PhasingMethod>
  <ReadTrimmer Applied="Yes">
    <Settings RelativeThreshold="70" AbsoluteThreshold="20" WindowSize="10" />
  </ReadTrimmer>
  <ApplyCWDList>true</ApplyCWDList>
</AnalysisSettings>
I suspect the AnalysisSettings are likely unique to a particular software package (e.g., ApplyCWDList), so trying to capture settings in a structured way that could be used for multiple software tools would be difficult at best. But some of these might be applicable to other tools, certainly Software and Version.
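As a rough illustration of that split, the following Python sketch pulls the portable fields (Software and Version) out of the export and merely lists the tool-specific settings. The `<Export>` wrapper element and the function names are assumptions: the fragment above has two top-level elements, so it needs some root before a standard XML parser will accept it.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the export quoted above.
FRAGMENT = """
<AnalysisSoftware>
  <Software>NGSengine</Software>
  <Version>2.16.2.16030</Version>
</AnalysisSoftware>
<AnalysisSettings>
  <AmpliconMethod>NGSgo</AmpliconMethod>
  <AnalysisRegion>Amplicon</AnalysisRegion>
  <IgnoreRegions>Default</IgnoreRegions>
  <PhasingMethod MinimumReadDepth="20" AlleleRatioThreshold="20" NoiseThreshold="14">Cluster</PhasingMethod>
  <ReadTrimmer Applied="Yes">
    <Settings RelativeThreshold="70" AbsoluteThreshold="20" WindowSize="10" />
  </ReadTrimmer>
  <ApplyCWDList>true</ApplyCWDList>
</AnalysisSettings>
"""


def parse_fragment(fragment: str) -> ET.Element:
    """Wrap the fragment in a hypothetical <Export> root and parse it."""
    return ET.fromstring(f"<Export>{fragment}</Export>")


def software_and_version(fragment: str) -> tuple:
    """Return the fields that are likely portable across tools."""
    root = parse_fragment(fragment)
    return (
        root.findtext("AnalysisSoftware/Software"),
        root.findtext("AnalysisSoftware/Version"),
    )


def tool_specific_settings(fragment: str) -> list:
    """List the setting names that are probably unique to this package."""
    root = parse_fragment(fragment)
    return [child.tag for child in root.find("AnalysisSettings")]


print(software_and_version(FRAGMENT))
print(tool_specific_settings(FRAGMENT))
```

The setting names are returned as opaque strings on purpose: the sketch does not try to give them a shared structure, which is the part expected to be difficult across tools.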
Arthur Hermann (May 29 2020 at 15:52):
I think this is a very interesting idea, but I want us to step back and make sure this is in our purview, and I would definitely like to prioritize it against the other work we need to do on the current IG. As has been pointed out, the data coming directly from a bioinformatics pipeline can take many shapes depending on the machine and pipeline the lab is using. My first question is why do we want to move that far upstream? Who benefits from adding this to the IG? Secondly, if we do decide we should include this area, we should work closely with O&O on their device work and, as Bob M suggests, develop this in a very open manner, since I expect the data would be very different from lab to lab.
Bob Dolin (May 29 2020 at 20:58):
Thanks @Arthur Hermann . It's a good point, to figure out if this is needed. And to be honest, I don't think I have a need for it. Rather, it kinda stems from a use case I've heard occasionally, that given, say, a FASTQ or BAM file, it should be possible for two groups to come up with the same VCF file. Lately I see a number of groups formalizing their pipelines using some type of workflow definition language + containers, suggesting that if we do need to address this, we could potentially wrap up all the complexity in a simple URL.
Kevin Power (May 29 2020 at 23:52):
The RelatedArtifact extension is a great place to send a simple URL
Jamie Jones (Jun 01 2020 at 14:27):
@Kevin Power I agree, but RA extension isn't defined on variant at this time, just on implications
Kevin Power (Jun 01 2020 at 15:05):
True, but that seems like an easy addition. Of course, easy to misuse/overuse.
Bob Dolin (Jun 01 2020 at 16:03):
@Kevin Power Not sure I'm understanding how the pipeline methodology would be a relatedArtifact. Is this because Method and Provenance.activity don't currently support representation as a URL?
Kevin Power (Jun 01 2020 at 16:27):
Bob Dolin said:
..., suggesting that if we do need to address this, we could potentially wrap up all the complexity in a simple URL.
I just wanted to mention that if you were looking for a way to deliver a 'simple URL', RelatedArtifact is an option, especially in the short term.
Kevin Power (Jun 01 2020 at 19:52):
... except for the point that @Jamie Jones made about us not having the extension listed on the Variant profile. Though I don't think that would prevent someone from adding the extension themselves. Again, this is a short term answer until we can come up with a solid proposal.
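A minimal Python sketch of that short-term option, attaching a pipeline URL to an observation through an extension that carries a RelatedArtifact. The extension URL below is a placeholder, not the IG's canonical URL, and the observation is trimmed to two fields, so this is not a complete or valid Variant resource; the function name and display text are also assumptions.

```python
import json

# Placeholder only: the real canonical extension URL must come from the IG.
RELATED_ARTIFACT_EXT_URL = "http://example.org/fhir/StructureDefinition/related-artifact"


def add_pipeline_url(observation: dict, pipeline_url: str) -> dict:
    """Return a copy of the observation with a RelatedArtifact extension added."""
    extension = {
        "url": RELATED_ARTIFACT_EXT_URL,
        "valueRelatedArtifact": {
            "type": "documentation",
            "url": pipeline_url,
            "display": "Variant-calling pipeline definition",
        },
    }
    updated = dict(observation)
    # Keep any extensions already present and append the new one.
    updated["extension"] = list(observation.get("extension", [])) + [extension]
    return updated


# Trimmed stand-in for a variant observation.
variant = {"resourceType": "Observation", "status": "final"}
variant_with_pipeline = add_pipeline_url(
    variant, "https://github.com/CRG-CNAG/CalliNGS-NF/"
)
print(json.dumps(variant_with_pipeline, indent=2))
```

The function copies rather than mutates its input, so the same base observation can be reused while different pipeline references are tried out.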
Bret H (Jun 08 2020 at 13:45):
Uff. Darn good topic and forward thinking. The URL approach seems good, but I can't tell: is there a suggested standard to follow for presenting information at the endpoint of the URL? Also, there has been work done on device definition (https://build.fhir.org/device.html). Do we consider software a device?
Last updated: Apr 12 2022 at 19:14 UTC