FHIR Chat · Unicode Regex Question

Michel Rutten (Jul 30 2018 at 09:33):

According to the dotNET documentation, \S (non-whitespace) is a shortcut for [^\f\n\r\t\v\x85\p{Z}], where \p{Z} is the Unicode general category that "matches any separator character":
https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#non-white-space-character-s
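
For anyone who wants to poke at that class outside .NET, here is a minimal Python sketch. It assumes the class quoted above is accurate, and it approximates \p{Z} with `unicodedata.category`, since Python's standard `re` module has no \p{...} syntax. The name `is_dotnet_whitespace` is made up for illustration.

```python
import unicodedata

# Explicit characters listed in the quoted class, before the \p{Z} part.
EXPLICIT_WHITESPACE = set("\f\n\r\t\v\x85")


def is_dotnet_whitespace(ch: str) -> bool:
    """Approximate the character class quoted from the .NET docs.

    Hypothetical helper: \\p{Z} is emulated by checking that the Unicode
    general category starts with "Z" (Zs, Zl, or Zp).
    """
    if len(ch) != 1:
        raise ValueError("expected a single character")
    return ch in EXPLICIT_WHITESPACE or unicodedata.category(ch).startswith("Z")


if __name__ == "__main__":
    for ch in [" ", "\t", "\x85", "\u00a0", "\u2028", "\u200b", "a"]:
        print(f"U+{ord(ch):04X}", is_dotnet_whitespace(ch))
```

One thing this makes visible: U+200B (ZERO WIDTH SPACE) is in category Cf, so it falls outside the class despite its name.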

Grahame Grieve (Jul 30 2018 at 10:33):

\x85? weird

Grahame Grieve (Jul 30 2018 at 10:33):

does that help though?

Michel Rutten (Jul 30 2018 at 11:03):

According to MSDN, \x85 represents "The ellipsis or NEXT LINE (NEL) character (…), \u0085.":
https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions
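
A possible explanation for the "ellipsis" wording, offered as a guess rather than anything the docs state: byte 0x85 is the horizontal ellipsis in the Windows-1252 code page, while the Unicode code point U+0085 is the NEL control character. A quick Python check of both readings:

```python
import unicodedata

raw = b"\x85"

# Read as Windows-1252, byte 0x85 is the horizontal ellipsis (U+2026).
as_cp1252 = raw.decode("cp1252")

# Read as Latin-1, bytes map 1:1 to code points, so we get U+0085 (NEL).
as_latin1 = raw.decode("latin-1")

print(f"cp1252:  U+{ord(as_cp1252):04X} {unicodedata.name(as_cp1252)}")
print(f"latin-1: U+{ord(as_latin1):04X} category {unicodedata.category(as_latin1)}")
```

So the same byte value can mean either character depending on the encoding, which may be where the two descriptions got merged.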

Michel Rutten (Jul 30 2018 at 11:04):

Never encountered this one in the wild

John Silva (Jul 30 2018 at 17:37):

What about non-dotNet implementations of RegEx parsers? This website has a RegEx tester (https://regex101.com/) and they list \s as:
\s matches any whitespace character (equal to [\r\n\t\f\v ])

[The ellipsis might be auto-generated by MS-Word when you type three '.' together; I think Word does something similar when you use single quotes, it turns them into left apostrophe and right apostrophe.]
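
As one data point for a non-.NET engine, here is a small Python sketch. It relies on Python's standard `re` module, where \s on `str` patterns is Unicode-aware by default and the `re.ASCII` flag restricts it to [ \t\n\r\f\v]. The sample characters and the `whitespace_report` helper are my own picks for illustration, not something from the regex under discussion.

```python
import re

SAMPLES = {
    "SPACE": " ",
    "TAB": "\t",
    "NEL (U+0085)": "\x85",
    "NO-BREAK SPACE (U+00A0)": "\u00a0",
    "LINE SEPARATOR (U+2028)": "\u2028",
}

UNICODE_WS = re.compile(r"\s")          # default behaviour for str patterns
ASCII_WS = re.compile(r"\s", re.ASCII)  # only [ \t\n\r\f\v]


def whitespace_report(samples):
    """Return {label: (matches_unicode, matches_ascii)} for each sample."""
    return {
        label: (bool(UNICODE_WS.fullmatch(ch)), bool(ASCII_WS.fullmatch(ch)))
        for label, ch in samples.items()
    }


if __name__ == "__main__":
    for label, (uni, asc) in whitespace_report(SAMPLES).items():
        print(f"{label:26} unicode={uni!s:5} ascii={asc}")
```

In this engine the ASCII-only mode lines up with the regex101 definition, while the default mode also accepts NEL and the Unicode separators, so the answer depends on the engine and on its flags.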

Grahame Grieve (Jul 30 2018 at 18:39):

so it seems to me that regex is just basically broken when it comes to unicode.

John Silva (Jul 30 2018 at 18:53):

Not sure that's true -- some simple Google searches seem to indicate there is more RegEx Unicode support than we typically think of (or have to deal with).

This MS page talks about the difference between how .NET and JavaScript (ECMAScript) handle \s:
https://docs.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#white-space-character-s

"If ECMAScript-compliant behavior is specified, \s is equivalent to [ \f\n\r\t\v]. For information on ECMAScript regular expressions, see the "ECMAScript Matching Behavior" section in Regular Expression Options."

Then there's this page that talks about a bunch of 'fancy' RegEx patterns for Unicode support:

https://www.regular-expressions.info/unicode.html

Then the question becomes: how much Unicode RegEx support does FHIR (or FHIR test servers) want/need to provide?

Grahame Grieve (Jul 30 2018 at 18:56):

so it appears to me that the implementations are split between \p and \x

Grahame Grieve (Jul 30 2018 at 18:56):

and the specific question is this particular regex, not a general question

Last updated: Apr 12 2022 at 19:14 UTC

Stream: implementers

Topic: Unicode Regex Question
