Stream: data extraction services
Topic: character-range extension
Philipp Daumke (Nov 23 2018 at 20:55):
Hi all,
for expressing the character range (=text span) of a text mining annotation within a fhir Provenance resource, we (mainly @Kai Heitmann and @Simone Heckmann) suggest the following format:
"extension": [ { "url": "http://fhir.de/StructureDefinition/character-range/0.5", "valueRange": [ { "low": {"value": 12}, "high": {"value": 24} } ] } ]
Morten Ernebjerg (Nov 26 2018 at 08:46):
Might it be an idea to include an optional field (snippet, context or similar) for providing the range of text (with a buffer on either side) from which the concept in question was extracted? E.g., if the generated resource is a Condition stating that the patient has pneumonia, one might have something like this inside the extension on the corresponding Provenance:
... context: "(...)completed. It seems the patient suffers from pneumonia. This could(...)" ...
I will defer to NLP pros to say whether this is useful and feasible, but it might be worth considering for the following reasons:
1. It would give a (human) user the chance to evaluate how trustworthy the extraction is without retrieving further resources, opening documents etc.
2. It could be useful in contexts where it may be difficult to identify the correct text range if the text in question is not in a plain text file (this came up during the NLP session at the DevDays). E.g., it may be a PDF produced from a scan on which OCR was run, and if the PDF includes company logos with text etc., it may be hard to tell exactly which character is number 234.
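A minimal Python sketch of how such a context value could be produced. The function name `context_snippet`, the `buffer` parameter, and the "(...)" truncation marker are assumptions for illustration (the marker simply mirrors the example above); offsets here follow Python slicing, i.e. zero-based with an exclusive end.

```python
def context_snippet(text: str, start: int, end: int, buffer: int = 20) -> str:
    """Return text[start:end] plus up to `buffer` characters on each side.

    `start` is inclusive and `end` is exclusive (Python slice convention).
    A "(...)" marker is added on each side where the document was cut off.
    """
    if not 0 <= start <= end <= len(text):
        raise ValueError("span lies outside the text")
    left = max(0, start - buffer)
    right = min(len(text), end + buffer)
    prefix = "(...)" if left > 0 else ""
    suffix = "(...)" if right < len(text) else ""
    return f"{prefix}{text[left:right]}{suffix}"


note = "Exam completed. It seems the patient suffers from pneumonia. This could be viral."
start = note.index("pneumonia")
print(context_snippet(note, start, start + len("pneumonia"), buffer=12))
```

Keeping the buffer small limits how much of the source document is copied into the Provenance, which may matter when the document and the extracted resource have different access rules.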
André Sander (Nov 25 2019 at 15:41):
I'd like to revisit this discussion: why not use a commonly used format like http://2011.bionlp-st.org/home/file-formats and transform it into a FHIR-class-like structure? For text annotations it has the information Philipp suggested, and it also contains the actual context.
Simone Heckmann (Nov 27 2019 at 09:15):
I like the idea of modeling this based on existing specifications, however, I am not sure whether the definition
The start-offset is the index of the first character of the annotated span in the text (".txt" file), i.e. the number of characters in the document preceding it. The end-offset is the index of the first character after the annotated span. Thus, the character in the end-offset position is not included in the annotated span.
is really unambiguous enough. E.g. it doesn't really say how whitespace is handled when counting the positions. Does anyone see that as an issue?
Another question I am still struggling with is how to deal with the txt (OCR extract) in relation to the original document (PDF).
Where do we keep the txt and how do we link it to the pdf...?
A probably obvious thing would be to put it in DocumentReference.text (Narrative), because that would allow for the additional functionality to run a full text search on the Document using the _text parameter.
The character-range extension would then by default count the characters in the Narrative of the DocumentReference, not the Document in the Attachment...
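To make the ambiguity concern concrete, here is a small Python sketch showing that the "same" span gets different offsets depending on how line endings and Unicode normalization are handled. The sample strings and the helper `find_span` are made up for illustration; the point is only that an offset is meaningful relative to one exact text representation.

```python
import unicodedata


def find_span(text: str, needle: str) -> tuple[int, int]:
    """Return (start, end) with an exclusive end, counted in Unicode code points."""
    start = text.index(needle)
    return start, start + len(needle)


# Same content, different line-ending conventions.
unix_text = "Dx:\nType II Diabetes"
windows_text = unix_text.replace("\n", "\r\n")
print(find_span(unix_text, "Diabetes"))     # offsets in the LF version
print(find_span(windows_text, "Diabetes"))  # shifted by one in the CRLF version

# Same content, different Unicode normalization forms.
composed = unicodedata.normalize("NFC", "Dr. M\u00fcller: Diabetes")
decomposed = unicodedata.normalize("NFD", composed)
print(find_span(composed, "Diabetes"))
print(find_span(decomposed, "Diabetes"))    # shifted by one (u + combining diaeresis)
```

One way to avoid this, under these assumptions, would be for an implementation guide to state which stored text the offsets refer to and how it is normalized.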
Simone Heckmann (Nov 27 2019 at 10:31):
I have created an Extension based on the results of the discussion so far:
https://simplifier.net/semantischeanalyse/annotationcharacterrange
It has
- a range element of data type Range with a low/high element, both inclusive as per the FHIR definition of Range (in the bionlp description, the upper boundary is exclusive!)
- a snippet element of type String that includes the snippet of annotated text, which may also include buffer characters on both ends.
Comments welcome...
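The difference between the two boundary conventions can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes zero-based offsets on both sides (the bionlp text defines the start offset as the number of preceding characters; whether the extension counts from 0 or 1 is not stated here).

```python
def bionlp_to_inclusive(start: int, end_exclusive: int) -> tuple[int, int]:
    """Convert a bionlp-style span (exclusive end) to inclusive low/high."""
    if start < 0 or end_exclusive <= start:
        raise ValueError("span must cover at least one character")
    return start, end_exclusive - 1


def inclusive_to_bionlp(low: int, high: int) -> tuple[int, int]:
    """Convert inclusive low/high back to a bionlp-style span."""
    if low < 0 or high < low:
        raise ValueError("high must not be smaller than low")
    return low, high + 1


text = "Patient has confirmed Type II Diabetes since 2016."
start = text.index("Type II Diabetes")
end = start + len("Type II Diabetes")  # exclusive, as in bionlp

low, high = bionlp_to_inclusive(start, end)
# With inclusive bounds the slice needs high + 1.
assert text[low:high + 1] == "Type II Diabetes"
assert inclusive_to_bionlp(low, high) == (start, end)
```

An off-by-one in either direction silently drops or adds a character, so it seems worth stating the convention explicitly wherever the offsets are exchanged.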
Simone Heckmann (Nov 27 2019 at 10:36):
The Extension context is Provenance.target, so if multiple target resources are created from one common source (i.e. a Document), the extensions are specific to each target.
Simone Heckmann (Nov 28 2019 at 11:04):
I created an example instance of Provenance documenting the extraction of a Condition from a Document:
https://simplifier.net/semantischeanalyse/provenance-example
<Provenance>
  <target>
    <extension url="http://dmi.de/fhir/StructureDefinition/annotation-character-range">
      <extension url="range">
        <valueRange>
          <low>
            <value value="16" />
          </low>
          <high>
            <value value="27" />
          </high>
        </valueRange>
      </extension>
      <extension url="snippet">
        <valueString value="... confirmed Type II Diabetes ..." />
      </extension>
    </extension>
    <reference value="Condition/12345" />
  </target>
  <recorded value="2019-11-28T13:28:17.239+02:00" />
  <activity>
    <coding>
      <system value="http://terminology.hl7.org/CodeSystem/v3-DataOperation" />
      <code value="CREATE" />
      <!-- CREATE: Fundamental operation in an Information System (IS) that results only in the act of bringing an object into existence.-->
    </coding>
  </activity>
  <agent>
    <type>
      <coding>
        <system value="http://terminology.hl7.org/CodeSystem/provenance-participant-type" />
        <code value="assembler" />
        <!--assembler: A device that operates independently of an author on custodian's algorithms for data extraction of existing information for purpose of generating a new artifact.-->
      </coding>
    </type>
    <role>
      <coding>
        <system value="http://dicom.nema.org/resources/ontology/DCM" />
        <code value="110150" />
        <display value="Application" />
      </coding>
    </role>
    <who>
      <reference value="Device/23456" />
      <!--reference to a Device Resource describing the details (manufacturer, software version etc) of the application extracting the data-->
    </who>
  </agent>
  <entity>
    <role value="source" />
    <what>
      <reference value="DocumentReference/345678" />
      <!--reference to the Document from which the target Resource has been extracted-->
    </what>
  </entity>
</Provenance>
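For consumers of such a Provenance, a rough Python sketch of reading the extension back out with the standard library. The function name `read_character_ranges` is hypothetical, the XML is a trimmed copy of the example above, and the sketch assumes (like the example) that no XML namespace is declared; real FHIR XML normally carries the `http://hl7.org/fhir` namespace, which would need to be handled in the paths.

```python
import xml.etree.ElementTree as ET

EXT_URL = "http://dmi.de/fhir/StructureDefinition/annotation-character-range"

# Trimmed copy of the example instance (target only, no namespace).
PROVENANCE_XML = """
<Provenance>
  <target>
    <extension url="http://dmi.de/fhir/StructureDefinition/annotation-character-range">
      <extension url="range">
        <valueRange>
          <low><value value="16" /></low>
          <high><value value="27" /></high>
        </valueRange>
      </extension>
      <extension url="snippet">
        <valueString value="... confirmed Type II Diabetes ..." />
      </extension>
    </extension>
    <reference value="Condition/12345" />
  </target>
</Provenance>
"""


def read_character_ranges(xml_text: str) -> list[dict]:
    """Collect target reference, low/high and snippet for each target."""
    root = ET.fromstring(xml_text)
    results = []
    for target in root.findall("target"):
        for ext in target.findall("extension"):
            if ext.get("url") != EXT_URL:
                continue
            low = ext.find("extension[@url='range']/valueRange/low/value")
            high = ext.find("extension[@url='range']/valueRange/high/value")
            snippet = ext.find("extension[@url='snippet']/valueString")
            results.append({
                "target": target.find("reference").get("value"),
                "low": int(low.get("value")),
                "high": int(high.get("value")),
                # The snippet is treated as optional in this sketch.
                "snippet": snippet.get("value") if snippet is not None else None,
            })
    return results


for entry in read_character_ranges(PROVENANCE_XML):
    print(entry)
```

The sketch does no validation: a target whose extension lacks low or high would raise an error rather than be skipped.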
Simone Heckmann (Nov 28 2019 at 11:05):
Question: should the high/low values have a unit? Or should we rather use two integer extensions instead of Range...?
Simone Heckmann (Nov 28 2019 at 11:12):
tagging Security WG, since they own the Provenance Resource...
@John Moehrke @Alexander Mense @k connor @Chris Shawn
John Moehrke (Nov 28 2019 at 15:43):
Simone Heckmann said:
"I like the idea of modeling this based on existing specifications, however, I am not sure whether the definition
The start-offset is the index of the first character of the annotated span in the text (".txt" file), i.e. the number of characters in the document preceding it. The end-offset is the index of the first character after the annotated span. Thus, the character in the end-offset position is not included in the annotated span.
is really unambiguous enough. E.g. it doesn't really say how whitespace is handled when counting the positions. Does anyone see that as an issue?
Another question I am still struggling with is how to deal with the txt (OCR extract) in relation to the original document (PDF).
Where do we keep the txt and how do we link it to the pdf...?
A probably obvious thing would be to put it in DocumentReference.text (Narrative), because that would allow for the additional functionality to run a full text search on the Document using the _text parameter.
The character-range extension would then by default count the characters in the Narrative of the DocumentReference, not the Document in the Attachment..."
There is a CR that is asking whether the _text search on a DocumentReference should be a text search across the attachment too (provided the server can execute this in some capacity based on its understanding of the attachment layout).
John Moehrke (Nov 28 2019 at 15:58):
Simone Heckmann said:
"I have created an Extension based on the results of the discussion so far:
https://simplifier.net/semantischeanalyse/annotationcharacterrange
It has
- a range element of data type Range with a low/high element, both inclusive as per the FHIR definition of Range (in the bionlp description, the upper boundary is exclusive!)
- a snippet element of type String that includes the snippet of annotated text, which may also include buffer characters on both ends.
Comments welcome..."
There is a CR approved but not yet applied: J#21284
John Moehrke (Nov 28 2019 at 15:59):
beyond that... in Provenance we have only addressed using reference to point at a whole resource. So, can't really help on trying to point at specific locations within a narrative. That seems well beyond core need.
Simone Heckmann (Nov 28 2019 at 16:08):
The CR you mentioned addresses referencing specific elements in the target resource, whereas the extension above connects the target to a snippet from the source which led to the creation of the target. The two have different scopes, but they can work perfectly well in combination, e.g. to express that Condition.code has been created from the snippet "...suffers from Type II Diabetes...", while Condition.onsetDateTime was created from the snippet "...diagnosed in December 2016...".
So, what you're saying is that the extensions should go into an Implementation guide rather than in core extensions?
BTW: I'm definitely not arguing to make them core elements!
John Moehrke (Nov 28 2019 at 19:01):
something that starts in an implementation guide can migrate into the core as it proves useful to a broader community. More importantly, I think this kind of functionality likely needs lots of explanation which is easier to do in an IG. I am not outright rejecting it, I am not that powerful anyway.. just asserting that it feels to me as not core.
Sean Finan (Dec 04 2019 at 21:48):
I cannot help feeling like there is some force fitting here. While a Range does have 2 quantitative values, I don't think that it is a good fit for text character indices. A Range defines upper and lower bounds, within which some value should exist. This does not fit text character indices, which are absolute and not simply within some range. In addition, Range uses the General Purpose Datatype SimpleQuantity. This is not ideal for text character indices, which are always of Primitive Type unsignedInt. In my view, Range is not exact enough to be used for character indices. In addition, consider the Rule for Range: "If present, low SHALL have a lower value than high." That doesn't work for a single-character token, for which indices are not a vector.
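As an illustration of the alternative raised earlier (two integer extensions instead of Range), here is a hedged Python sketch that builds the extension as FHIR JSON with two `valueUnsignedInt` children. The URL and the child names `start` and `end` are hypothetical and not part of any published StructureDefinition; both offsets are assumed inclusive. The second function applies a literal reading of the Range rule quoted above to show where a single-character token runs into trouble.

```python
from typing import Optional

# Hypothetical canonical URL, used here for illustration only.
SPAN_URL = "http://example.org/fhir/StructureDefinition/character-span"


def character_span_extension(start: int, end: int,
                             snippet: Optional[str] = None) -> dict:
    """Build a complex extension with two valueUnsignedInt children.

    Both offsets are treated as inclusive, so start == end marks a
    single-character token.
    """
    if start < 0 or end < start:
        raise ValueError("offsets must satisfy 0 <= start <= end")
    children = [
        {"url": "start", "valueUnsignedInt": start},
        {"url": "end", "valueUnsignedInt": end},
    ]
    if snippet is not None:
        children.append({"url": "snippet", "valueString": snippet})
    return {"url": SPAN_URL, "extension": children}


def satisfies_quoted_range_rule(low: int, high: int) -> bool:
    """Literal reading of the quoted rule: low must be lower than high."""
    return low < high


# A single-character token at offset 42 is fine as two integers...
single = character_span_extension(42, 42, snippet="x")
print(single)
# ...but fails the quoted Range rule when expressed as low/high.
print(satisfies_quoted_range_rule(42, 42))
```

Plain unsigned integers also sidestep the unit question, since there is no Quantity to attach a unit to.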
Last updated: Apr 12 2022 at 19:14 UTC