FHIR Chat · Unicode Regex Question · implementers

Stream: implementers

Topic: Unicode Regex Question


Grahame Grieve (Jul 29 2018 at 01:50):

I'm trying to improve the string regex. For non-Unicode characters, this is the correct regex:

Grahame Grieve (Jul 29 2018 at 01:50):

[\S\r\n\t ]*

Grahame Grieve (Jul 29 2018 at 01:51):

that is, any non-whitespace character, or end of line, tab, or space

Grahame Grieve (Jul 29 2018 at 01:52):

but I can't tell from reading the docs what that means for Unicode whitespace characters
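To make the question concrete, here is a minimal sketch in Python (chosen only for illustration; other engines may differ). It compiles the pattern above twice, once with Python's default Unicode-aware `\S` and once with `re.ASCII`, and checks a few hypothetical sample strings. The helper name `accepts` and the samples are assumptions, not part of the thread.

```python
import re

# The pattern from the post above, compiled two ways.
PATTERN = r"[\S\r\n\t ]*"
unicode_aware = re.compile(PATTERN)          # str patterns: \S is Unicode-aware by default
ascii_only = re.compile(PATTERN, re.ASCII)   # \S means "not one of [ \t\n\r\f\v]"

# Hypothetical sample inputs, chosen to probe whitespace handling.
samples = {
    "plain ASCII": "hello world",
    "no-break space U+00A0": "a\u00a0b",
    "line separator U+2028": "a\u2028b",
    "vertical tab U+000B": "a\x0bb",
}

def accepts(regex, text):
    """Return True if the whole string matches the compiled pattern."""
    return regex.fullmatch(text) is not None

for label, text in samples.items():
    print(label, accepts(unicode_aware, text), accepts(ascii_only, text))
```

In Python, the Unicode-aware version rejects U+00A0 and U+2028 because they count as whitespace and are not listed explicitly, while the ASCII version accepts them. Both reject the vertical tab.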

Grahame Grieve (Jul 29 2018 at 01:52):

maybe this is better:

Grahame Grieve (Jul 29 2018 at 01:53):

[\r\n\t\x{0020}-\x{FFFF}]*

Grahame Grieve (Jul 29 2018 at 01:53):

or

Grahame Grieve (Jul 29 2018 at 01:53):

[\r\n\t\u0020-\uFFFF]*

Grahame Grieve (Jul 29 2018 at 01:53):

depending on your flavor.... ;-(

Grahame Grieve (Jul 29 2018 at 01:54):

I think that the latter is the intent, but is there a correct standard syntax?
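As a rough illustration of the explicit-range variant, the sketch below uses Python's `re` module, which accepts the `\uXXXX` spelling in string patterns. The names `STRING_RE` and `is_valid` are hypothetical. Note that Python matches by code point, so characters above U+FFFF fall outside the range; an engine that works on UTF-16 code units (for example JavaScript without the `u` flag) would see such a character as two surrogates, each inside the range.

```python
import re

# Explicit-range variant from the thread, in the \uXXXX spelling.
STRING_RE = re.compile(r"[\r\n\t\u0020-\uFFFF]*")

def is_valid(text):
    """Return True if every character is CR, LF, TAB, or in U+0020..U+FFFF."""
    return STRING_RE.fullmatch(text) is not None

print(is_valid("a\u00a0b"))     # no-break space lies inside the range
print(is_valid("a\u2028b"))     # line separator lies inside the range
print(is_valid("a\x0bb"))       # vertical tab is below U+0020 and not listed
print(is_valid("\U0001F600"))   # above U+FFFF when matching by code point
```

Unlike the `\S`-based pattern, this form does not depend on how an engine defines whitespace, only on how it spells and interprets code point escapes.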

Grahame Grieve (Jul 29 2018 at 01:54):

all very messy....

Michel Rutten (Jul 30 2018 at 09:33):

According to the .NET documentation, \S (non-whitespace) is shorthand for [^\f\n\r\t\v\x85\p{Z}], where \p{Z} is a Unicode general category that "matches any separator character":
https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#non-white-space-character-s
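Python's standard `re` module does not support `\p{...}`, so the following is only an approximation of the class quoted above, built on `unicodedata`. The function name `in_dotnet_whitespace_class` is hypothetical, and the sketch assumes that `\p{Z}` corresponds to the general categories Zs, Zl, and Zp.

```python
import unicodedata

# Characters listed explicitly in the quoted .NET definition:
# form feed, line feed, carriage return, tab, vertical tab, U+0085.
EXPLICIT = set("\f\n\r\t\v\x85")

def in_dotnet_whitespace_class(ch):
    # Approximates membership in the class that \S negates:
    # the explicit characters plus any general category starting with "Z".
    return ch in EXPLICIT or unicodedata.category(ch).startswith("Z")

for ch in [" ", "\u00a0", "\u2028", "\x85", "\u200b", "a"]:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), in_dotnet_whitespace_class(ch))
```

One thing this makes visible: U+200B (zero width space) has category Cf rather than Z, so under this definition it would count as non-whitespace.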

Grahame Grieve (Jul 30 2018 at 10:33):

\x85? weird

Grahame Grieve (Jul 30 2018 at 10:33):

does that help though?

Michel Rutten (Jul 30 2018 at 11:03):

According to MSDN, \x85 represents "The ellipsis or NEXT LINE (NEL) character (…), \u0085.":
https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions

Michel Rutten (Jul 30 2018 at 11:04):

Never encountered this one in the wild

John Silva (Jul 30 2018 at 17:37):

What about non-.NET implementations of RegEx parsers? This website has a RegEx tester (https://regex101.com/) and it lists \s as:
\s matches any whitespace character (equal to [\r\n\t\f\v ])

[The ellipsis might be auto-generated by MS-Word when you type three '.' together; I think Word does something similar when you use single quotes, it turns them into left apostrophe and right apostrophe.]

Grahame Grieve (Jul 30 2018 at 18:39):

so it seems to me that regex is just basically broken when it comes to unicode.

John Silva (Jul 30 2018 at 18:53):

Not sure that's true -- some simple Google searches seem to indicate there is more RegEx Unicode support than we typically think of (or have to deal with).

This MS page talks about the difference between how .NET and JavaScript (ECMAScript) handle \s:
https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#white-space-character-s

If ECMAScript-compliant behavior is specified, \s is equivalent to [ \f\n\r\t\v]. For information on ECMAScript regular expressions, see the "ECMAScript Matching Behavior" section in Regular Expression Options.
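The same kind of split can be observed inside a single engine. As a sketch, Python's `re.ASCII` flag restricts `\s` to the six characters in the ECMAScript-mode definition quoted above, while the default mode also matches Unicode whitespace. The helper `whitespace_code_points` is a hypothetical name, and the scan is limited to the Basic Multilingual Plane as an assumption to keep it short.

```python
import re

UNICODE_WS = re.compile(r"\s")          # default for str patterns: Unicode whitespace
ASCII_WS = re.compile(r"\s", re.ASCII)  # restricted to [ \t\n\r\f\v]

def whitespace_code_points(regex, limit=0x10000):
    """Collect code points below `limit` that the given \\s pattern matches."""
    return {cp for cp in range(limit) if regex.fullmatch(chr(cp))}

ascii_set = whitespace_code_points(ASCII_WS)
unicode_set = whitespace_code_points(UNICODE_WS)

# Code points that only the Unicode-aware \s treats as whitespace.
print(sorted(f"U+{cp:04X}" for cp in unicode_set - ascii_set))
```

Any character in that difference is accepted or rejected by a `\S`-based pattern depending purely on the mode, which is the portability problem being discussed.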

Then there's this page that talks about a bunch of 'fancy' RegEx patterns for Unicode support:

https://www.regular-expressions.info/unicode.html

Then the question becomes: how much Unicode RegEx support does FHIR (or FHIR test servers) want or need to provide?

Grahame Grieve (Jul 30 2018 at 18:56):

so it appears to me that the implementations are split between \p and \x

Grahame Grieve (Jul 30 2018 at 18:56):

and the specific question is this particular regex, not a general question


Last updated: Apr 12 2022 at 19:14 UTC