Stream: implementers
Topic: String attribute leading space
Jorn Duwel (Oct 07 2020 at 11:55):
Hi!
The datatype spec says the following:
According to XML schema, leading and trailing whitespace in the value attribute is ignored for the types boolean, integer, decimal, base64Binary, instant, uri, date, dateTime, oid, and uri. Note that this means that the schema aware XML libraries give different attribute values to non-schema aware libraries when reading the XML instances. For this reason, the value attribute for these types SHOULD not have leading and trailing spaces. String values should only have leading and trailing spaces if they are part of the content of the value. In JSON and Turtle whitespace in string values is always significant. Primitive types other than string SHALL NOT have leading or trailing whitespace.
(Emphasis added)
The XML representation spec says the following:
Implementers SHOULD trim leading and trailing whitespace before writing and SHOULD trim leading and trailing whitespace when reading attribute values
These two quotes seem to contradict one another. When adding a leading (or trailing) space to a string value, can one expect it to be present in an XML representation of that string value?
Lloyd McKenzie (Oct 07 2020 at 13:03):
@Grahame Grieve @Ewout Kramer
Grahame Grieve (Oct 07 2020 at 19:27):
there's no way to say. you cannot have any expectations; the problem is in xml itself.
Jorn Duwel (Oct 08 2020 at 08:11):
@Grahame Grieve, thank you for your reply. I actually would have the expectation that if my string data is valid following the datatype specification, it would be parsed correctly. If there is no way to guarantee that because of the representation method, shouldn't this ambiguous situation be reflected in the specification somehow?
Rik Smithies (Oct 08 2020 at 08:46):
Does it help that this only affects schema aware processing? If you don't need and don't use the PSVI, is this an issue?
Grahame Grieve (Oct 08 2020 at 09:24):
I thought that the comments were reflecting the problem into the specification
Jorn Duwel (Oct 08 2020 at 16:32):
@Rik Smithies I am not sure if this issue has to do with schema aware processing (I am aware of the difficulty with whitespace and schema). We are having an issue where a sending party sends a string " III" (with a relevant leading space). The receiving party parses it as "III" via XML. Both are pointing to the quoted parts of the specification, both are right in their way, but nevertheless a difference between the sent and received data now exists. Based on the specification as it is I cannot tell who is right or wrong.
For our implementation a possible way of solving this would be a solution where we do not use any leading spaces as relevant information in strings, but perhaps it could be discussed if the specification could identify or prevent this issue in the future.
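The mismatch Jorn describes can be reproduced with a minimal sketch. It assumes Python's xml.etree.ElementTree, a bare string element without the FHIR namespace, and two hypothetical helpers (write_string, read_string); the trim flag stands in for a receiver that follows the XML format page's SHOULD-trim advice.

```python
import xml.etree.ElementTree as ET

def write_string(value):
    """Serialize a string value into a value attribute, exactly as given."""
    element = ET.Element("string", {"value": value})
    return ET.tostring(element, encoding="unicode")

def read_string(xml_text, trim):
    """Read the value attribute back; optionally trim it as the XML page suggests."""
    raw = ET.fromstring(xml_text).get("value")
    return raw.strip() if trim else raw

wire = write_string(" III")                  # sender keeps the leading space
kept = read_string(wire, trim=False)         # receiver that preserves whitespace
trimmed = read_string(wire, trim=True)       # receiver that trims on read
print(repr(kept), repr(trimmed))
```

Both receivers can point at a sentence in the specification, yet only the first one returns what the sender wrote.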
Rik Smithies (Oct 08 2020 at 16:57):
hi Jorn - The issue is documented in the FHIR spec though? Is that not identifying the issue? What would you like to see? (Other than a fix, which might not be possible).
The warning talks about non-string types not preserving spaces. Why would that affect string handling? What data type is your " III"?
Grahame Grieve (Oct 08 2020 at 22:34):
the spec says "you can, but you might have problems".
Grahame Grieve (Oct 08 2020 at 22:34):
and you are
Marc de Graauw (Oct 12 2020 at 07:54):
It's not PSVI, it's in the core XML spec which predates Schema by a lot. Since conformant XML processors should implement the XML spec, they will drop leading/trailing spaces in attribute content. (Not sure this is an XML problem, one might as well say it's FHIR's problem since they want to store strings with leading spaces in attribute content, not element content. I realize that this remark is utterly unhelpful, since no-one will want to change that now.) The datatype spec might add a warning: "Don't expect leading/trailing spaces to persist in XML serialization." I don't think a fix is possible indeed.
Rik Smithies (Oct 12 2020 at 08:06):
hi Marc - that is interesting. The datatype spec does already warn about this though doesn't it? What more do we think should be added? It may want to not say "XML schema" but instead "the XML standard".
I still don't understand why this affects what seems to be a string datatype attribute though for @Jorn Duwel. Is this not limited to the datatypes mentioned?
Marc de Graauw (Oct 12 2020 at 08:18):
Hi Rik, the datatypes spec does not really warn against leading space in string datatypes, it allows those (and this is fine for JSON). The schema remark is a bit vague - does this refer to the FHIR XML Schema for datatypes? There is no "XML Schema" for XML itself, so if that's intended "the XML standard" or "specification" would be better indeed.
Rik Smithies (Oct 12 2020 at 08:20):
"the spec does not really warn against leading space in string datatypes"
and is that a problem? I take it this applies to all datatypes in all XML attributes?
In fact if it's the base XML standard, that doesn't even have datatypes.
Marc de Graauw (Oct 12 2020 at 08:33):
FHIR datatypes says value attributes other than string SHALL NOT have leading/trailing spaces. For string, it says they can have leading/trailing spaces if that's part of the content. For XML, this will break. We know this now, so we could warn. And it's not about XML datatypes, but core attribute processing: https://www.w3.org/TR/xml/#AVNormalize : "...the XML processor must ... discarding any leading and trailing space"
Marc de Graauw (Oct 12 2020 at 08:36):
I can make a Jira issue, unless people see reason not to.
Vassil Peytchev (Oct 12 2020 at 14:44):
The link doesn't simply say "...the XML processor must ... discarding any leading and trailing space" - there is a discussion about attribute being a string (CDATA), or tokenized sequence (NMTOKEN) which would be DTD definitions (as opposed to XML schema), and there are examples of leading white space. It seems that leading white space can be preserved if explicitly represented as  
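The difference between CDATA and tokenized attributes can be tried out with a small sketch. It assumes Python's xml.etree.ElementTree (which parses with Expat) and a made-up element r whose attributes c and t are declared in an internal DTD subset; none of these names come from FHIR.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: "c" is declared CDATA, "t" is declared NMTOKENS.
# Both attributes carry the same literal text in the start tag.
DOC = """<!DOCTYPE r [
  <!ATTLIST r c CDATA    #IMPLIED
              t NMTOKENS #IMPLIED>
]>
<r c="  x  y " t="  x  y "/>"""

def read_attributes(xml_text):
    """Parse the document and return the normalized values of c and t."""
    root = ET.fromstring(xml_text)
    return root.get("c"), root.get("t")

cdata_value, tokenized_value = read_attributes(DOC)
print(repr(cdata_value))      # CDATA: spaces are kept as written
print(repr(tokenized_value))  # NMTOKENS: edges trimmed, runs of spaces collapsed
```

Only the attribute with a tokenized declaration loses its leading and trailing spaces; an attribute that is CDATA (or undeclared, which is treated as CDATA) keeps them.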
Vassil Peytchev (Oct 12 2020 at 15:05):
XML schema adds more to the above, but essentially, if an XML attribute is declared to be of type string, then white space must be preserved. If it is not, then it is the lack of schema awareness that unnecessarily removes the leading white space, not because of XML schema.
See: "for string the value of whiteSpace is preserve"; and https://www.w3.org/TR/2004/REC-xmlschema-1-20041028/structures.html#d0e1654
Marc de Graauw (Oct 12 2020 at 15:46):
Yes, you're right, @Vassil Peytchev and XML + Schema does allow it. But FHIR datatypes (http://build.fhir.org/datatypes.html#string) says: "Leading and Trailing whitespace is allowed, but SHOULD be removed when using the XML format".
Vassil Peytchev (Oct 12 2020 at 17:27):
And digging further, the fhir-base.xsd defines a derived type for the actual attribute, string-primitive. Then the following XML schema rule applies:
For all ·atomic· datatypes other than string (and types ·derived· by ·restriction· from it) the value of whiteSpace is collapse and cannot be changed by a schema author.
collapse is what prohibits leading (or trailing) white space. The only purpose of string-primitive is to define the minLength to 1, and I am wondering if that is necessary (even though it is shown as an example in the XML schema spec). If we change the following
<xs:complexType name="string">
  <xs:annotation>
    <xs:documentation xml:lang="en">A sequence of Unicode characters</xs:documentation>
    <xs:documentation xml:lang="en">Note that FHIR strings SHALL NOT exceed 1MB in size</xs:documentation>
    <xs:documentation xml:lang="en">If the element is present, it must have either a @value, an @id, or extensions</xs:documentation>
  </xs:annotation>
  <xs:complexContent>
    <xs:extension base="Element">
      <xs:attribute name="value" type="string-primitive" use="optional"/>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>
to
<xs:complexType name="string">
  <xs:annotation>
    <xs:documentation xml:lang="en">A sequence of Unicode characters</xs:documentation>
    <xs:documentation xml:lang="en">Note that FHIR strings SHALL NOT exceed 1MB in size</xs:documentation>
    <xs:documentation xml:lang="en">If the element is present, it must have either a @value, an @id, or extensions</xs:documentation>
  </xs:annotation>
  <xs:complexContent>
    <xs:extension base="Element">
      <xs:attribute name="value" type="xs:string" use="optional"/>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>
the JSON and XML representations will not have this discrepancy.
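For reference, the three values of the XML Schema whiteSpace facet mentioned in this message (preserve, replace, collapse) can be sketched as a plain function. This is an illustration of the facet definitions only; apply_white_space is a hypothetical helper, and it says nothing about which facet value a given processor actually applies to string-primitive.

```python
import re

def apply_white_space(value: str, facet: str) -> str:
    """Normalize a lexical value according to an XML Schema whiteSpace facet value."""
    if facet == "preserve":
        # No normalization at all.
        return value
    # "replace": every tab, line feed and carriage return becomes a space.
    replaced = re.sub(r"[\t\n\r]", " ", value)
    if facet == "replace":
        return replaced
    if facet == "collapse":
        # After "replace", collapse runs of spaces and drop leading/trailing spaces.
        return re.sub(r" +", " ", replaced).strip(" ")
    raise ValueError(f"unknown whiteSpace facet value: {facet!r}")

for facet in ("preserve", "replace", "collapse"):
    print(facet, repr(apply_white_space(" III", facet)))
```

Under preserve and replace the leading space in " III" survives; only collapse removes it.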
Grahame Grieve (Oct 13 2020 at 06:22):
so you see why I said it's a schema related issue. And it will work for some conformant XML processors, and not for others. Which is why we decided a long time ago to not prohibit it but just to warn about it.
Marc de Graauw (Oct 13 2020 at 07:52):
@Vassil Peytchev "...(the) purpose of string-primitive is to define the minLength to 1, and I am wondering if that is necessary..." - string-primitive implements: (http://build.fhir.org/xml.html): "Attributes cannot be empty. Either they are absent, or they are present with at least one character of non-whitespace content."
@Grahame Grieve "...it will work for some conformant XML processors, and not for others" - Yes, I tested yesterday with Expat, and leading whitespace is preserved there.
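A check along the lines of the Expat test can be sketched in Python, whose xml.etree.ElementTree parser is built on Expat. The element below is a simplified stand-in (no FHIR namespace, no schema or DTD), and read_value is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def read_value(xml_text):
    """Return the value attribute as the parser reports it."""
    return ET.fromstring(xml_text).get("value")

# A literal leading space in an undeclared attribute.
leading_space = read_value('<string value=" III"/>')
# A literal tab is normalized to a space by attribute-value normalization.
literal_tab = read_value('<string value="\tIII"/>')
# A tab written as a character reference is passed through unchanged.
char_ref_tab = read_value('<string value="&#9;III"/>')

print(repr(leading_space), repr(literal_tab), repr(char_ref_tab))
```

In this setup the leading space is preserved; what changes is only the kind of whitespace character when it is written literally rather than as a character reference.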
Vassil Peytchev (Oct 13 2020 at 14:41):
My question was whether it is necessary to use a schema construct to enforce the rule that attributes are either "absent or they are present with at least one character of non-whitespace content."
Can that rule be enforced via schematron or directly at the FHIR processing level?
Grahame Grieve (Oct 13 2020 at 20:56):
I don't know what's 'necessary' since schema doesn't enforce all validations. It does what it can. Why should it not do this one?
Vassil Peytchev (Oct 14 2020 at 03:09):
Maybe I am missing the background of this point. From the FHIR specification perspective, without regards to the wire format, what is the desired property of the string data type:
1. Strings shall not have leading or trailing whitespace
2. Strings can have leading but no trailing whitespace
3. Strings can have leading or trailing whitespace
For each of 1, 2, or 3 I am assuming that there is the desired property of at least one non-whitespace character.
Grahame Grieve (Oct 14 2020 at 03:18):
desired by who?
Grahame Grieve (Oct 14 2020 at 03:18):
the simplest thing to say would be
"String values SHALL not have leading or trailing blanks"
Grahame Grieve (Oct 14 2020 at 03:19):
but of course, we can't say that.
various databases already have strings with leading and trailing blanks. Also, markdown is a string
Grahame Grieve (Oct 14 2020 at 03:20):
the next thing we could say is 'leading and trailing whitespace SHALL be respected', but there's lots of leading XML parsers out there that don't. And these are well into production too
Grahame Grieve (Oct 14 2020 at 03:20):
so we have to allow both leading and trailing whitespace, but we warn implementers about the pitfalls
Vassil Peytchev (Oct 14 2020 at 03:29):
In this case the pitfalls are at least partially (mostly) due to the FHIR schema which states that string values shall not have leading or trailing spaces. If the data type of the fhir:string/@value attribute were of type xs:string, there would be fewer processors that would not respect the white space.
Grahame Grieve (Oct 14 2020 at 05:06):
where does it say that?
Vassil Peytchev (Oct 14 2020 at 16:36):
In my misreading of the XML Schema spec. I am going to try some examples on a few platforms to see if there is a possible improvement in the schema definition to respect white space.
Vassil Peytchev (Oct 15 2020 at 19:17):
So far I have confirmed that .NET (core 3.1), Java OpenJDK 11 (default Xerces parser implementation), and Java JAXB all preserve leading spaces when parsing an attribute. This was tested with and without schema validation. Earlier, there was a confirmation that Expat also preserves leading white space. What other leading XML parsers would need to be tested? nodejs?
@Jorn Duwel, is there more specific information on the use case where the leading space is not preserved?
Vassil Peytchev (Oct 15 2020 at 19:23):
At the very least, the sentence
Implementers SHOULD trim leading and trailing whitespace before writing and SHOULD trim leading and trailing whitespace when reading attribute values (for XML schema conformance)
is not accurate regarding XML schema conformance.
Jorn Duwel (Oct 19 2020 at 13:55):
@Vassil Peytchev , I was out of office for a few days. I will try to get more information on the parser used. Thanks for your efforts btw.
Grahame Grieve (Oct 20 2020 at 03:56):
is not accurate regarding XML schema conformance
I think it is. We could perhaps clarify that tools are not necessarily conformant here. Where it really gets complicated is with digital signatures
Vassil Peytchev (Oct 20 2020 at 04:20):
I am looking at that point next, but I still want to find those production parsers that strip spaces.
The schema can actually enforce the "at least one non-whitespace character" rule using a regex pattern (similar to the markdown one).
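A minimal sketch of such a pattern, written in Python for easy testing. The pattern is hypothetical and is not the one used in fhir-base.xsd; re.fullmatch is used because XML Schema patterns are implicitly anchored to the whole value.

```python
import re

# Hypothetical pattern: any characters, at least one of which is non-whitespace.
# Leading and trailing whitespace stay allowed.
AT_LEAST_ONE_NON_WS = re.compile(r"[\s\S]*\S[\s\S]*")

def is_valid_string_value(value: str) -> bool:
    """True if the value contains at least one non-whitespace character."""
    return AT_LEAST_ONE_NON_WS.fullmatch(value) is not None

for candidate in (" III", "III ", "   ", ""):
    print(repr(candidate), is_valid_string_value(candidate))
```

A pattern of this shape rejects empty and whitespace-only values without requiring minLength and without forbidding leading or trailing spaces.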
Last updated: Apr 12 2022 at 19:14 UTC