Stream: implementers
Topic: Unicode Regex Question
Grahame Grieve (Jul 29 2018 at 01:50):
I'm trying to improve the string regex. For the non-unicode characters, this is the correct regex:
Grahame Grieve (Jul 29 2018 at 01:50):
[\S\r\n\t ]*
Grahame Grieve (Jul 29 2018 at 01:51):
that is, any non-whitespace character, or end of line or tab or space
Grahame Grieve (Jul 29 2018 at 01:52):
but I can't be convinced from reading the doco what that means for unicode whitespace characters
Grahame Grieve (Jul 29 2018 at 01:52):
maybe this is better:
Grahame Grieve (Jul 29 2018 at 01:53):
[\r\n\t\x{0020}-\x{FFFF}]*
Grahame Grieve (Jul 29 2018 at 01:53):
or
Grahame Grieve (Jul 29 2018 at 01:53):
[\r\n\t\u0020-\uFFFF]*
Grahame Grieve (Jul 29 2018 at 01:53):
depending on your flavor.... ;-(
Grahame Grieve (Jul 29 2018 at 01:54):
I think that the latter is the intent, but is there a correct standard syntax?
Grahame Grieve (Jul 29 2018 at 01:54):
all very messy....
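To make the difference between the two candidates concrete, here is a minimal sketch in Python 3, whose `re` module happens to accept the `\uXXXX` spelling. It shows one flavor's behavior only: Python treats `\S` as Unicode-aware by default, and other engines may not. The names `SHORTHAND`, `EXPLICIT`, `accepts` and `samples` are made up for illustration.

```python
import re

# The two candidate patterns from the thread, in Python's \uXXXX spelling.
SHORTHAND = re.compile(r"[\S\r\n\t ]*")
EXPLICIT = re.compile(r"[\r\n\t\u0020-\uFFFF]*")


def accepts(pattern, text):
    """Return True if the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


# Hypothetical sample strings chosen to probe the edges of each pattern.
samples = {
    "plain ASCII": "a b\tc\r\n",
    "no-break space U+00A0": "a\u00a0b",
    "control char U+0001": "a\u0001b",
    "emoji above U+FFFF": "a\U0001F600b",
}

if __name__ == "__main__":
    for label, text in samples.items():
        print(label, accepts(SHORTHAND, text), accepts(EXPLICIT, text))
```

In this flavor the two patterns disagree on three of the four samples: the shorthand version rejects U+00A0 (it counts as whitespace but is not in the listed exceptions) and accepts the control character and the emoji, while the explicit range does the opposite.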
Michel Rutten (Jul 30 2018 at 09:33):
According to the .NET documentation, \S (non-whitespace) is a shortcut for [^\f\n\r\t\v\x85\p{Z}], where \p{Z} represents a Unicode general category that "matches any separator character":
https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#non-white-space-character-s
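As a rough illustration of what that character class covers, the sketch below approximates it in Python 3. Python's standard `re` module has no `\p{...}` support, so the category test is done with `unicodedata` instead. This is an approximation of the quoted definition, not .NET's actual implementation, and the names `EXPLICIT_WHITESPACE` and `is_whitespace_per_quoted_class` are hypothetical.

```python
import unicodedata

# Characters the quoted definition lists explicitly: \f \n \r \t \v \x85
EXPLICIT_WHITESPACE = set("\f\n\r\t\v\x85")


def is_whitespace_per_quoted_class(ch):
    """Approximate the quoted class: the listed characters plus category Z*."""
    if ch in EXPLICIT_WHITESPACE:
        return True
    # \p{Z} covers the separator categories Zs, Zl and Zp.
    return unicodedata.category(ch).startswith("Z")


if __name__ == "__main__":
    for ch in [" ", "\t", "\x85", "\u00a0", "\u2028", "a"]:
        print(f"U+{ord(ch):04X}", is_whitespace_per_quoted_class(ch))
```

Under this reading, the plain space, U+00A0 and U+2028 count as whitespace through `\p{Z}`, while tab and U+0085 count only because they are listed by name.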
Grahame Grieve (Jul 30 2018 at 10:33):
\x85? weird
Grahame Grieve (Jul 30 2018 at 10:33):
does that help though?
Michel Rutten (Jul 30 2018 at 11:03):
According to MSDN, \x85 represents "The ellipsis or NEXT LINE (NEL) character (…), \u0085.":
https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions
Michel Rutten (Jul 30 2018 at 11:04):
Never encountered this one in the wild
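One possible reading of the "ellipsis or NEXT LINE" wording, offered as a guess rather than anything the documentation states: the code point U+0085 is a control character, while the byte 0x85 in the Windows-1252 code page decodes to the horizontal ellipsis, U+2026. A short Python 3 check of both facts (the variable names are illustrative only):

```python
import unicodedata

# U+0085 as a Unicode code point: a control character (NEXT LINE).
nel = "\u0085"
nel_category = unicodedata.category(nel)

# The single byte 0x85 interpreted as Windows-1252 instead.
ellipsis_char = b"\x85".decode("cp1252")
ellipsis_name = unicodedata.name(ellipsis_char)

if __name__ == "__main__":
    print("U+0085 category:", nel_category)
    print("U+0085 is whitespace for Python:", nel.isspace())
    print("cp1252 byte 0x85 ->", f"U+{ord(ellipsis_char):04X}", ellipsis_name)
```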
John Silva (Jul 30 2018 at 17:37):
What about non-.NET implementations of RegEx parsers? This website has a RegEx tester (https://regex101.com/) and they list \s as:
\s matches any whitespace character (equal to [\r\n\t\f\v ])
[The ellipsis might be auto-generated by MS-Word when you type three '.' together; I think Word does something similar when you use single quotes, it turns them into left apostrophe and right apostrophe.]
Grahame Grieve (Jul 30 2018 at 18:39):
so it seems to me that regex is just basically broken when it comes to unicode.
John Silva (Jul 30 2018 at 18:53):
Not sure that's true -- some simple Google searches seem to indicate there is more RegEx Unicode support than we typically think of (or have to deal with).
This MS page talks about the difference between how .NET and JavaScript (ECMAScript) handle \s:
https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#white-space-character-s
If ECMAScript-compliant behavior is specified, \s is equivalent to [ \f\n\r\t\v]. For information on ECMAScript regular expressions, see the "ECMAScript Matching Behavior" section in Regular Expression Options.
Then there's this page that talks about a bunch of 'fancy' RegEx patterns for Unicode support:
https://www.regular-expressions.info/unicode.html
Then the question becomes: how much Unicode RegEx support do FHIR (or FHIR test servers) want/need?
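The split between an ASCII-only `\s` and a Unicode-aware `\s` can be reproduced inside a single engine. The sketch below uses Python 3, where the `re.ASCII` flag restricts `\s` to `[ \t\n\r\f\v]`. It says nothing about how .NET or JavaScript behave; it only illustrates the two definitions side by side. `UNICODE_WS`, `ASCII_WS`, `classify` and `candidates` are names invented for this example.

```python
import re

UNICODE_WS = re.compile(r"\s")          # default for str patterns: Unicode-aware
ASCII_WS = re.compile(r"\s", re.ASCII)  # restricted to [ \t\n\r\f\v]


def classify(ch):
    """Return (matches Unicode-aware \\s, matches ASCII-only \\s) for one character."""
    return (
        UNICODE_WS.fullmatch(ch) is not None,
        ASCII_WS.fullmatch(ch) is not None,
    )


# Hypothetical probe characters: tab, space, NEL, no-break space,
# line separator, ideographic space.
candidates = ["\t", " ", "\u0085", "\u00a0", "\u2028", "\u3000"]

if __name__ == "__main__":
    for ch in candidates:
        print(f"U+{ord(ch):04X}", classify(ch))
```

Tab and space match under both definitions. The other four characters match only the Unicode-aware `\s`, which is the gap a pattern built on `\S` inherits.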
Grahame Grieve (Jul 30 2018 at 18:56):
so it appears to me that the implementations are split between \p and \x
Grahame Grieve (Jul 30 2018 at 18:56):
and the specific question is this particular regex, not a general question
Last updated: Apr 12 2022 at 19:14 UTC