Stream: genomics
Topic: Attaching VCF files
Jamie Jones (Dec 04 2018 at 17:33):
Opening up discussion here about sending VCF files. Main considerations brought up in today's call notes:
1. Decide where we should support attaching the file
2. Decide what resource should be used to model the attachment
I logged GF#19373 but would love to hear more about potential use cases we may want to cover with the attachment.
John Moehrke (Dec 04 2018 at 18:42):
DocumentReference would be the metadata, the vcf file would be in a Binary.
Bob Dolin (Dec 04 2018 at 19:02):
to be sure, what is the use case for sending VCF files? Reason I ask is that it might be useful for us to also define metadata (e.g. build, reference sequence) that accompanies the VCF file.
Kevin Power (Dec 04 2018 at 19:03):
How much of that metadata is included in the VCF itself?
Bob Milius (Dec 04 2018 at 19:32):
fyi, there was a group looking at developing a 'clinical-grade' vcf. I don't think they actually developed one, but they published a paper about the effort which might help inform how we use it: 'Principles and Recommendations for Standardizing the Use of the Next-Generation Sequencing Variant File in Clinical Settings' https://jmd.amjpathol.org/article/S1525-1578(17)30106-X/fulltext
Elliot Silver (Dec 04 2018 at 19:35):
How much of that metadata is included in the VCF itself?
Even if most of it is in the VCF file, consider how much of the metadata you want to be able to search on and access without going into the actual file.
Bob Milius (Dec 04 2018 at 19:39):
What metadata beyond what is captured in our Observation profiles and MolecularSequence resource is still needed? Isn't the VCF mostly the derivedFrom thing where these data came from?
Bob Dolin (Dec 04 2018 at 19:46):
Main metadata I was thinking we need (since it is inconsistently represented in VCF header) is build and reference sequence
Bob Dolin (Dec 04 2018 at 19:46):
Here is a short VCF example:
Bob Dolin (Dec 04 2018 at 19:46):
##fileformat=VCFv4.2
##reference=hg38
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##contig=<ID=chr6,length=170805979,assembly=b38>
##contig=<ID=chr10,length=133797422,assembly=b38>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
chr10 94761665 rs7902257 G A . PASS AF=0.5;AN=2 GT 0|1
chr10 94842866 rs3758581 A G 534.73 PASS AF=0.5;AN=2 GT 0|1
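The build and reference declarations in a header like the one above can be pulled out with a few lines of code. The following is only a rough sketch: the function name `vcf_header_metadata` is made up for illustration, it assumes tab-separated data rows, and it only handles simple `##contig` lines without quoted values.

```python
import re

def vcf_header_metadata(vcf_text):
    """Collect fileformat, reference and contig declarations from a VCF header."""
    meta = {"fileformat": None, "reference": None, "contigs": []}
    for line in vcf_text.splitlines():
        if not line.startswith("##"):
            break  # meta-information lines end at the #CHROM column header
        key, _, value = line[2:].partition("=")
        if key == "fileformat":
            meta["fileformat"] = value
        elif key == "reference":
            meta["reference"] = value
        elif key == "contig":
            # value looks like <ID=chr6,length=170805979,assembly=b38>
            # Simplification: quoted values containing commas are not handled.
            meta["contigs"].append(dict(re.findall(r"(\w+)=([^,>]+)", value)))
    return meta

example = """##fileformat=VCFv4.2
##reference=hg38
##contig=<ID=chr6,length=170805979,assembly=b38>
##contig=<ID=chr10,length=133797422,assembly=b38>
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA12878
chr10\t94761665\trs7902257\tG\tA\t.\tPASS\tAF=0.5;AN=2\tGT\t0|1
"""

meta = vcf_header_metadata(example)
print(meta["reference"], [c["ID"] for c in meta["contigs"]])
```

Note that in this example the build shows up twice, as `hg38` in `##reference` and as `b38` in each `##contig` line, which illustrates the inconsistency mentioned above.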
John Moehrke (Dec 04 2018 at 19:46):
it is not uncommon for some metadata to be extracted out of the file. duplication is not a bad thing. However the metadata in DocumentReference is merely there to enable discoverability. It is not intended to carry everything in the attached file. The presumption, when one uses a DocumentReference, is that the attached file is desired to be in the form given. I presumed from the discussion that it is desirable to maintain the VCF file format. Thus DocumentReference is the right resource to provide for discoverability and ease of Referencing in FHIR resources.
See the mapping tab on DocumentReference for various examples of where a file format (e.g. CDA) has some elements copied from inside the CDA to populate the elements of DocumentReference.
John Moehrke (Dec 04 2018 at 19:49):
The other advantage of DocumentReference is that it is aligned with XDS/XCA document sharing infrastructure from IHE. So by specifying how a VCF is carried in DocumentReference, you also have specified how it can be carried on a Health Information Exchange such as CommonWell, CareQuality, HEX, and other exchanges well beyond the USA.
Bob Dolin (Dec 04 2018 at 21:34):
Hmmm, the more I think about it, the more I wonder what the use case is. Reference sequence might be a bit tricky in some cases, because it applies to individual rows of the VCF. The recipient may need to understand the 'contigs' (see ##contig row in sample VCF file) in order to, say, map a VCF row into HGVS.
Kevin Power (Dec 05 2018 at 15:29):
The use case I have is pretty simple - the ordering clinician has expertise in genetics, and would like the lab to provide the VCF. The clinician uses local tools that understand VCF files to do a deeper dive.
The VCF is basically an output of the lab test, and I think that is why most people (myself included) tend to lean towards thinking of it as an 'attachment' to the report or perhaps an observation. It does seem that DocumentReference (which we can point at from Observation.derivedFrom() in case anyone hadn't noticed) is a reasonable home for it, though perhaps a bit 'heavy' - I don't know that we have the same sort of requirements from a mapping perspective like @John Moehrke referenced, but that doesn't mean we shouldn't use it.
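As a rough illustration of pointing at the VCF from `Observation.derivedFrom`, here is a sketch using plain dictionaries. The helper name `link_observation_to_vcf` and the resource id are hypothetical, and the Observation content is a placeholder rather than anything from the genomics IG profiles.

```python
def link_observation_to_vcf(observation, document_reference_id):
    """Return a copy of the Observation with a derivedFrom entry for the VCF."""
    linked = dict(observation)
    # derivedFrom is a list of References; keep any entries already present.
    linked["derivedFrom"] = list(observation.get("derivedFrom", [])) + [
        {"reference": f"DocumentReference/{document_reference_id}"}
    ]
    return linked

# Placeholder Observation, not a conformant genomics profile instance.
observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {"text": "Genetic variant finding"},
}

linked = link_observation_to_vcf(observation, "vcf-docref-example")
print(linked["derivedFrom"])
```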
John Moehrke (Dec 05 2018 at 19:03):
The mapping is simply a way to get consistent filling of the metadata given the content of the VCF. It is certainly not mandatory. It would be good to register one or more FormatCode vocabulary. I would be happy to help with this.
Kevin Power (Dec 06 2018 at 00:38):
Thanks John, appreciate the offer. Through a different thread, we have also asked for input from O&O - so we will gather their input as well.
I will note the metadata out of the VCF is less about the elements I see mapped from CDA -> DocumentReference (author, authenticator, context, etc ...). While perhaps some of that is important, it will (should) be captured in our DiagnosticReport profile that we also have as part of our IG. When we are talking about metadata, we are talking about a subset of the various name/value pairs in the header rows of the VCF file - things that @Bob Dolin mentioned, like reference sequence(s), and genomic build version. I am likely just missing it, but where would you put that sort of data?
John Moehrke (Dec 06 2018 at 17:41):
I think if you need to expose those kinds of things, then you are entering the space of needing to define your own Resource. This is not to say that you can't add extensions, just that when the number of extensions gets large or very specific that should be seen as an indicator that you don't have a general file exposure issue. This said, one possibility is to define these extensions in an Implementation Guide, and from that experience it will become clearer if a standalone resource is better or if a few extensions are all that is needed. Experience is always more helpful than theory.
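To make the extension option concrete, here is a sketch of what a couple of IG-defined extensions on a DocumentReference might look like. The extension URLs under `example.org` and the helper name `add_vcf_extensions` are invented for illustration; no such extensions are defined in this thread.

```python
# Hypothetical base URL; a real IG would publish its own StructureDefinitions.
EXT_BASE = "http://example.org/fhir/StructureDefinition"

def add_vcf_extensions(doc_ref, build, reference_sequences):
    """Return a copy of the DocumentReference with build/reference extensions."""
    extended = dict(doc_ref)
    extensions = list(doc_ref.get("extension", []))
    extensions.append({"url": f"{EXT_BASE}/vcf-genome-build", "valueString": build})
    # One extension per reference sequence, since a VCF can cover several contigs.
    for ref_seq in reference_sequences:
        extensions.append(
            {"url": f"{EXT_BASE}/vcf-reference-sequence", "valueString": ref_seq}
        )
    extended["extension"] = extensions
    return extended

doc_ref = {"resourceType": "DocumentReference", "status": "current"}
extended = add_vcf_extensions(doc_ref, "GRCh38", ["chr6", "chr10"])
print(len(extended["extension"]))
```

If the list of extensions keeps growing in practice, that would be the signal described above that a dedicated resource may be the better fit.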
John Moehrke (Dec 06 2018 at 17:43):
It may still be the right thing to do to have multiple levels of indirection. Given that you want to continue to use VCF file, then it seems proper to use a minimal DocumentReference pointing at Binary holding your VCF.
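A minimal DocumentReference pointing at a Binary could look roughly like the sketch below, built as plain dictionaries. It assumes R4-style element names (`Binary.data`, `DocumentReference.content.attachment`), uses `text/plain` as a stand-in content type since no VCF-specific type was agreed here, and the function name and ids are hypothetical.

```python
import base64
import json

def vcf_as_document_reference(vcf_bytes, binary_id="vcf-example"):
    """Wrap raw VCF bytes in a Binary plus a minimal DocumentReference."""
    binary = {
        "resourceType": "Binary",
        "id": binary_id,
        "contentType": "text/plain",  # assumption: placeholder content type
        "data": base64.b64encode(vcf_bytes).decode("ascii"),
    }
    doc_ref = {
        "resourceType": "DocumentReference",
        "status": "current",
        "description": "VCF file produced by the lab",
        "content": [
            {
                "attachment": {
                    "contentType": binary["contentType"],
                    "url": f"Binary/{binary_id}",  # the file itself lives in the Binary
                    "title": "example.vcf",
                }
            }
        ],
    }
    return binary, doc_ref

vcf = b"##fileformat=VCFv4.2\n##reference=hg38\n"
binary, doc_ref = vcf_as_document_reference(vcf)
print(json.dumps(doc_ref, indent=2))
```

The DocumentReference carries only enough metadata for discovery, and the VCF stays byte-for-byte intact in the Binary.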
Kevin Power (Dec 06 2018 at 23:13):
What do others think of using the DocumentReference resource?
Patrick Werner (Dec 07 2018 at 08:16):
From a technical perspective I would agree; semantically, I don't think a VCF file is a document. To me it is an Observation. See this thread in O&O: https://chat.fhir.org/#narrow/stream/4-implementers/subject/Observation.20value.20attachment
I would be in favor of Media which is an Observation with Binary Content. It can be referenced from DiagnosticReport or Observation.
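For comparison, the Media alternative might be sketched as follows, again with plain dictionaries. This assumes R4-style elements (`Media.status`, `Media.content` as an Attachment, `DiagnosticReport.media.link`); the helper names, ids, and the `text/plain` content type are placeholders.

```python
import base64

def vcf_as_media(vcf_bytes, media_id="vcf-media-example"):
    """Carry the VCF inline in a Media resource's content attachment."""
    return {
        "resourceType": "Media",
        "id": media_id,
        "status": "completed",
        "content": {
            "contentType": "text/plain",  # assumption: placeholder content type
            "data": base64.b64encode(vcf_bytes).decode("ascii"),
            "title": "example.vcf",
        },
    }

def attach_media_to_report(report, media_id):
    """Return a copy of the DiagnosticReport that links to the Media."""
    updated = dict(report)
    updated["media"] = list(report.get("media", [])) + [
        {"comment": "VCF output of the test", "link": {"reference": f"Media/{media_id}"}}
    ]
    return updated

media = vcf_as_media(b"##fileformat=VCFv4.2\n")
# Placeholder report, not a conformant genomics report profile instance.
report = {"resourceType": "DiagnosticReport", "status": "final", "code": {"text": "Genomics report"}}
updated = attach_media_to_report(report, media["id"])
print(updated["media"][0]["link"]["reference"])
```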
John Moehrke (Dec 07 2018 at 13:43):
Interesting... the name of the resource, DocumentReference, is confusing you to think it is for formal clinical documents; yet the scope description for that resource is clear it is broader. The use of Document is the same as for HTTP, which is simply a set of bits that go together (closer to 'file')... Yet, the name of the resource, Media, is not impeding your understanding of some edge features... I would be happy to have Media renamed to ObservationMedia, so that it is clearly kept to the scope you identified. However even that scope, functionally if not by formal definition, overlaps with DocumentReference's stated scope. -- That said, I would agree with that conclusion given the current spec. (grumble, grumble...)
Jamie Jones (Dec 12 2018 at 16:39):
It sounds like we are still a bit undecided on point 2, what data structure is best to reference the VCF, and we will be devoting some time to this at the WGM.
Does anyone have strong opinions on point 1, where we should support attaching it? Obvious options to me would be some combination/subset of MolecularSequence, Computable Genetic Finding, and GenomicsReport.
Kevin Power (Dec 12 2018 at 18:02):
I think those are the right options, and I don't think any are wrong. But I would vote in this order:
1) Computable Genetic Finding - nice to have the option to allow the 'findings' to point at "the" VCF
2) GenomicsReport - typically would be a single VCF, so aligns OK here
3) MolecularSequence - OK, but an option is available via the repository
Does it make sense to have a new profile on Observation to deliver the VCF tied to a single observation? I think the decision will be driven by what meta data we want to associate.
Andrew Patterson (Dec 13 2018 at 03:04):
We attach genomic files (fastq, bam, vcf) to Specimen.. in that when we take a 'genomics' order from a lab we give them extensions on the Specimen to indicate locations of fastq and vcf (computationally derived from the specimen).. I realise that that is not particularly a mainstream use case though
Andrew Patterson (Dec 13 2018 at 03:05):
This is very much from the 'order' end.. not the report end
Last updated: Apr 12 2022 at 19:14 UTC