FHIR Chat · funky letters

Stream: implementers

Topic: funky letters

Jens Villadsen (May 11 2020 at 18:16):

image.png @Grahame Grieve I'd like one more medal, please ;)

Jens Villadsen (May 11 2020 at 18:17):

found on http://hl7.org/fhir/valueset-security-labels.html

Jens Villadsen (May 11 2020 at 18:18):

make that two:

image.png

Jens Villadsen (May 11 2020 at 18:18):

three, actually:
image.png

Jens Villadsen (May 11 2020 at 18:19):

make that four, actually
image.png

Jens Villadsen (May 11 2020 at 18:20):

HL7v3 FTW

Jens Villadsen (May 11 2020 at 18:24):

this is getting too easy: http://hl7.org/fhir/v3/ActReason/cs.html

image.png

Jens Villadsen (May 11 2020 at 18:25):

image.png

Jens Villadsen (May 11 2020 at 21:44):

@Grahame Grieve I expect it once again to be 'consistent with the magnitude of your discovery' :joy:

Grahame Grieve (May 12 2020 at 00:25):

here. have a medal:

image.png

Grahame Grieve (May 12 2020 at 00:33):

looks to me like all these issues come from the underlying v3 source and @Ted Klein all the ones I checked are erroneous in UTG too

Jens Villadsen (May 12 2020 at 06:51):

image.png

Ted Klein (May 12 2020 at 19:58):

Sigh. This is a VERY well known and LONG TERM issue with the v3 material. Due to the weak process and tooling for v3 maintenance before UTG, text from Microsoft Word documents was read by the tools and inserted directly into the database. Unfortunately, Microsoft Word has these irritating objects called 'smart quotes', which, when read by the Java code and imported into the XML, were dutifully brought in and render as the funky things you see. We had a number of regular submitters of MANY new code systems and codes in V3 who for YEARS submitted proposals with these problems. We had no automatic tooling to detect or fix them, so it was up to me to manually recognize them, and for the original submitter to report them back in the manual QA review to be fixed later (and this was often skipped or done less well than hoped). The upshot is that the v3 content is rife with this garbage, randomly sprinkled throughout. When we imported the V3 coremif to UTG with Java software, guess what? Garbage in, garbage out. So...they are in UTG as well. Now the new process will be able to both prevent this from happening in the future and allow these to be fixed by those who see them. This is exactly the kind of content scrubbing I hope the UTG crowd-source availability of changes will enable. So...feel free to sign up for the Pilot and fix them as you see them and are irritated by them - have at it!
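
A minimal sketch of the likely mechanism, assuming the smart quotes were stored as UTF-8 bytes and later decoded as Windows-1252 somewhere in the pipeline (the thread does not confirm which step did this). The function names `misdecode` and `repair` are illustrative, not part of any HL7 tooling.

```python
def misdecode(text: str) -> str:
    """Encode as UTF-8, then (wrongly) decode the bytes as Windows-1252."""
    # errors="replace" because cp1252 has no character for some bytes
    # (for example 0x9D, the last byte of a closing double smart quote).
    return text.encode("utf-8").decode("cp1252", errors="replace")


def repair(garbled: str) -> str:
    """Undo the mistake. Raises if a byte was lost, so nothing is guessed."""
    return garbled.encode("cp1252").decode("utf-8")


if __name__ == "__main__":
    samples = {
        "opening double quote": "\u201c",
        "closing double quote": "\u201d",
        "opening single quote": "\u2018",
        "closing single quote": "\u2019",
    }
    for name, char in samples.items():
        # ascii() keeps the output printable on any console encoding.
        print(name, ascii(misdecode(char)))
```

Under this assumption each smart quote becomes three characters, and the closing double quote loses its last byte, which would explain why it shows up with a trailing "?" or nothing at all. The round trip through `repair` only works where no byte was lost.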

Jens Villadsen (May 12 2020 at 21:03):

I don't know what that means. Do you expect anything else from me besides reporting it and me expecting someone to fix it?

Grahame Grieve (May 12 2020 at 21:08):

I think it means that your expectation may not be fulfilled :sad:

Vassil Peytchev (May 12 2020 at 21:15):

Jens Villadsen said:

I don't know what that means. Do you expect anything else from me besides reporting it and me expecting someone to fix it?

I think this:

So...feel free to sign up for the Pilot and fix them as you see them and are irritated by them - have at it!

It would be nice to know what the Pilot is, and how to sign up for it.

Lloyd McKenzie (May 12 2020 at 23:39):

Rather than taking a whack-a-mole approach and filing individual change requests for each place we see this happening, could we not (now that the stuff lives in a publicly accessible Git repository with a nice text-based source format) have someone write a little script to go through all the files, identify the needed fixes, and submit a pull request that would fix all the issues everywhere at once?

Lloyd McKenzie (May 12 2020 at 23:40):

It seems that fixing this sort of systemic issue would be better handled this way than using the normal "code system by code system" change process. We could subject the script that made the changes to review by a couple of folks and scan the resulting diff and if we're comfortable with it, just merge it as a technical correction.

Jose Costa Teixeira (May 13 2020 at 05:56):

is it not possible just to do it manually? If it's just work that needs to be done once, it should be easy to spot the funky letters...

Frank Oemig (May 13 2020 at 08:13):

Before writing complex scripts, why not start by pulling the repository and finding out how many places we are talking about?

Jose Costa Teixeira (May 13 2020 at 08:24):

we can actually make a script to identify how many places we are talking about :grinning_face_with_smiling_eyes:
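
Such a counting script could look roughly like this. It is a sketch under assumptions: it treats "â" followed by "€" (and at most one more character) as the signature of a mis-decoded smart quote, and it only looks at `.xml` and `.json` files. The names `count_in_text` and `count_in_tree` are made up for illustration.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed signature: U+00E2 U+20AC plus at most one following character.
SUSPECT = re.compile("\u00e2\u20ac.?")


def count_in_text(text: str) -> Counter:
    """Count each distinct suspect sequence found in one string."""
    return Counter(SUSPECT.findall(text))


def count_in_tree(root: str, suffixes=(".xml", ".json")) -> Counter:
    """Sum the counts over every matching file below root."""
    totals = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            totals.update(count_in_text(text))
    return totals
```

Calling `count_in_tree` on a local checkout returns a `Counter` keyed by the garbled sequence, which gives both the total and the list of distinct variants before anyone commits to a replacement.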

Jose Costa Teixeira (May 13 2020 at 08:25):

my first idea is :
1 download,
2 open notepad++, do a global search,
3 see if it's possible to do a global replace

Jens Villadsen (May 13 2020 at 09:00):

@Vassil Peytchev - I'd rather not spend more time on this since I've pinpointed where the bugs are and what should be done. I could create a range of change requests as pointed out by @Lloyd McKenzie - but as I understand, there's no reason to do so - as it is a systemic bug. Can't the source be fixed and then we can all carry on?

Grahame Grieve (May 13 2020 at 09:01):

@Ted Klein this is really a policy question. the actual source fix is not hard. It's a search and replace in n++

Ted Klein (May 13 2020 at 11:37):

All of these are possible and not terribly difficult. Frank's point is a good one - we don't know where all these are, nor do we know all the different ones - there are at least 4 or 5 of them of various kinds scattered about. Someone could fix all of them at once for sure...but I do not have the list of all the bad characters, nor the list of where they appear. This is also not a funded effort so someone needs to decide they have the time to just do it.

Jose Costa Teixeira (May 13 2020 at 11:38):

where is the repo?

Jens Villadsen (May 13 2020 at 11:38):

Jose Costa Teixeira said:

where is the repo?

that's the spirit

Grahame Grieve (May 13 2020 at 11:39):

https://github.com/HL7/UTG

Ted Klein (May 13 2020 at 11:40):

The characters that originally came in and were auto-inserted (in some cases years ago) were the ones for the Microsoft open-and-close single and double smart quotes and bullet symbols. There might be some from other symbols in the original submitted text, like section marks, copyright marks and registered trademark symbols, and I know years ago I fixed at least one that was a single-character arrow symbol. There may be others.
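
To get a search list for those other symbols, one could generate their garbled forms rather than guess them. This is a sketch that assumes the same UTF-8-read-as-Windows-1252 mistake applies to them; the code points chosen (in particular which arrow) are guesses, and `CANDIDATES` and `garbled_form` are hypothetical names.

```python
# Assumed code points for the symbols mentioned above; adjust as needed.
CANDIDATES = {
    "bullet": "\u2022",
    "section mark": "\u00a7",
    "copyright": "\u00a9",
    "registered": "\u00ae",
    "right arrow": "\u2192",  # which arrow it was is not stated
}


def garbled_form(char: str) -> str:
    """Return what the character looks like after the assumed mis-decoding."""
    return char.encode("utf-8").decode("cp1252", errors="replace")


# Maps each symbol name to the string one would search for in the source.
SEARCH_TABLE = {name: garbled_form(ch) for name, ch in CANDIDATES.items()}

if __name__ == "__main__":
    for name, garbled in SEARCH_TABLE.items():
        # ascii() keeps the output printable on any console encoding.
        print(name, ascii(garbled))
```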

Jens Villadsen (May 13 2020 at 11:40):

Grahame Grieve said:

https://github.com/HL7/UTG

and PRs are welcome?

Grahame Grieve (May 13 2020 at 11:40):

the ones that started the thread are UTF mis-encodings

Grahame Grieve (May 13 2020 at 11:41):

sure. PRs welcome

Ted Klein (May 13 2020 at 11:41):

All changes made since the initial release must be documented with Provenance resources for the history.

Grahame Grieve (May 13 2020 at 11:41):

even fixing UTF encoding mistakes? I think that's going too far

Ted Klein (May 13 2020 at 11:41):

sigh. THAT is a policy question for sure.

Ted Klein (May 13 2020 at 11:43):

There are some vocab cochairs that believe that ANY change to ANY text in concept description must not only be documented in history but also trigger a major number increment in the code system version. I lost that argument yesterday and will bring it back on the Vocab WG call tomorrow.

Jens Villadsen (May 13 2020 at 11:43):

Ted Klein said:

There are some vocab cochairs that believe that ANY change to ANY text in concept description must not only be documented in history but also trigger a major number increment in the code system version. I lost that argument yesterday and will bring it back on the Vocab WG call tomorrow.

maybe it's time to find new co-chairs then

Grahame Grieve (May 13 2020 at 11:44):

let me know when to join. Making major release numbers trivially irrelevant doesn't actually help maintain your semantic integrity

Grahame Grieve (May 13 2020 at 11:44):

but Jens has a point.

Ted Klein (May 13 2020 at 11:45):

Note that although folks see these and are irritated by it, these characters have been published for YEARS in the coremif, so there is less urgency to fix them immediately than one might imagine IMHO.

Jens Villadsen (May 13 2020 at 11:48):

sure - everything has a priority.

image.png

But what was done in the past may not always be an argument for sticking to it in the future

Ted Klein (May 13 2020 at 11:49):

Agreed. I would just like to fix these damn things once and for all. I think we have a little time before the next release by the end of this week to sneak things in before the formal processes really kick in. Vocab call tomorrow: https://confluence.hl7.org/pages/viewpage.action?pageId=82910534

Jens Villadsen (May 13 2020 at 11:51):

So search and replace in e.g. vscode/emacs/n++/whatever in https://github.com/HL7/UTG on â€œ / â€ to what? What were the intended characters?

Ted Klein (May 13 2020 at 11:51):

Of course not, Jens...but there are HUNDREDS of small things that we are working on fixing, and this is just one of them. If a fix were in hand this minute, we could get them done today. But once my calls start in about 2 hours, I am out of time for things like this today.

Jose Costa Teixeira (May 13 2020 at 12:37):

https://www.i18nqa.com/debug/utf8-debug.html

Jose Costa Teixeira (May 13 2020 at 12:44):

I count 400 occurrences, I think that makes it 50 medals.

Jose Costa Teixeira (May 13 2020 at 12:49):

and at least notepad++ can make even the complicated changes
(for example when â€˜ is followed by an a and then the accent is taken to the letter, showing as â€ã)

Jose Costa Teixeira (May 13 2020 at 12:50):

â€œ = opening double quote  --> "
â€? = closing double quote  --> "
â€™ = closing single quote /apostrophe --> '
â€˜ = opening single quote  --> '
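
The mapping above could be applied with something like the following sketch. The keys are written as escapes so the script itself cannot be mangled by a bad encoding. Two assumptions: the closing double quote is taken to appear with a replacement character (U+FFFD) in third position, which may not match every file, and the function is meant for decoded text, since a raw `"` dropped into an XML attribute value would still need escaping. `QUOTE_FIXES` and `fix_quotes` are illustrative names.

```python
QUOTE_FIXES = {
    "\u00e2\u20ac\u0153": '"',  # opening double quote
    "\u00e2\u20ac\ufffd": '"',  # closing double quote (assumed U+FFFD form)
    "\u00e2\u20ac\u2122": "'",  # closing single quote / apostrophe
    "\u00e2\u20ac\u02dc": "'",  # opening single quote
}


def fix_quotes(text: str) -> tuple[str, int]:
    """Replace the known garbled sequences; return the text and the fix count."""
    fixed = text
    total = 0
    for bad, good in QUOTE_FIXES.items():
        total += fixed.count(bad)
        fixed = fixed.replace(bad, good)
    return fixed, total
```

Returning the count alongside the text makes it easy to compare against a prior tally and to spot sequences the table does not cover.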

Jens Villadsen (May 13 2020 at 13:22):

@Jose Costa Teixeira are you gonna do the PR then?

Jose Costa Teixeira (May 13 2020 at 13:24):

If the governance allows (IMO the fact that these things were there for so long is a sign that this is obviously under the radar or below the "important change" threshold) and if nobody else wants to, I will

Jose Costa Teixeira (May 13 2020 at 13:24):

I just took a long cigarette break to see how difficult it was. It isn't.

Jose Costa Teixeira (May 13 2020 at 13:25):

let me know if I should do it (making the branch and PR will take longer than the replacement)

Jens Villadsen (May 13 2020 at 13:27):

image.png

@Jose Costa Teixeira - saving the world in healthcare - one commit at a time

Ted Klein (May 13 2020 at 14:01):

Not all the fixes are identical. The garbage that was originally double quotes (open or close) needs to be replaced with a double quote, the ones that were single quotes (open and close) need to be replaced with single quotes, and the ones that were bullets need either removal or replacement with structured lists (<li> in markup, etc). Not sure if there are others. If all we needed was simple deletion I would have done that automatically a long time ago. The original things were approved by vote in harmonization.

Jose Costa Teixeira (May 13 2020 at 14:07):

Yes, so I think you will advise when to fix what, @Ted Klein, and then I (or others) can jump in.

Jose Costa Teixeira (May 13 2020 at 14:07):

I'd suggest to slice this in a few chunks.

opening and closing double and single quotes and apostrophes (simple, automated, I guess easy to approve, can be a quick improvement)
bullets (may be much more impactful - but there is no rush)
whatever else may appear

Ted Klein (May 13 2020 at 14:50):

Yes exactly what we should do

Jose Costa Teixeira (May 13 2020 at 15:43):

@Ted Klein Permission to create a PR for phase 1, for at least the single and double quotes?
(I can do it sometime this week)

Ted Klein (May 13 2020 at 16:01):

Jose - great! Couple of items: the patch number in <version> of the corrected resources needs to be incremented by 1 (e.g. <version value="2.0.0"/> must be updated to <version value="2.0.1"/>). Grahame disagrees, but I am feeling that even such minuscule changes should be documented with Provenance resource instances for the history tracking. Ideally, a new additional bundle in input/sourceOfTruth/history of "hxutg1-1-6"; if you look at 1-0-5 all the naming conventions are pretty obvious and algorithmic - these bundles were all created by scripts for the release. If we can fix the single and double quotes (should be 95% of them) with this, then let's do it. Provenance.recorded.reason.text can be "Fixed bad UTF characters from the coremif V3 source." You can be author, Vocabulary WG as custodian. That will do it, properly and completely.
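
The patch increment could be scripted along these lines. This is only a sketch: it assumes the version is written exactly as `<version value="MAJOR.MINOR.PATCH"/>` and bumps just the first such element in the file. `bump_patch` is a hypothetical helper, and it does not create the Provenance bundle.

```python
import re

# Matches <version value="X.Y.Z"/> and captures the three numbers.
VERSION = re.compile(r'(<version value=")(\d+)\.(\d+)\.(\d+)("\s*/>)')


def bump_patch(xml_text: str) -> str:
    """Increment the patch number of the first <version> element found."""

    def replace(match: re.Match) -> str:
        major, minor = match.group(2), match.group(3)
        patch = int(match.group(4)) + 1
        return f"{match.group(1)}{major}.{minor}.{patch}{match.group(5)}"

    return VERSION.sub(replace, xml_text, count=1)
```

Text without a matching `<version>` element is returned unchanged, so the function is safe to run over files that were not touched by the quote fix.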

Ted Klein (May 13 2020 at 20:26):

I just noticed one thing when looking at this in UTG: the example that started this thread is NOT from UTG content, but from FHIR R4 content, and the value set has NOT been migrated to UTG. Thus my comments about the need for version and Provenance may be invalid, as those are specific to the governance policy of the content at http://terminology.hl7.org, and likely differ from what is done with the things only in FHIR.

Grahame Grieve (May 13 2020 at 20:32):

the code system from which that content comes has

Jose Costa Teixeira (May 16 2020 at 17:53):

ok a few issues:

Jose Costa Teixeira (May 16 2020 at 17:54):

this is not in the resources, but in java code and in a big file called something.coremif - is there any version that should be replaced there?

Jose Costa Teixeira (May 16 2020 at 17:55):

Don't we have to escape these quotes and double quotes?
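
The thread does not answer this directly. As a sketch of why it matters: when the text sits in an XML attribute (as in the `<version value="2.0.0"/>` style shown earlier), a plain double quote inside the value has to be written as `&quot;`, and an XML library does that on its own, whereas a raw text search and replace does not. The element name `definition` and both helper names are illustrative.

```python
import xml.etree.ElementTree as ET


def element_with_value(tag: str, value: str) -> str:
    """Serialize <tag value="..."/>, letting ElementTree do the escaping."""
    elem = ET.Element(tag, {"value": value})
    return ET.tostring(elem, encoding="unicode")


def value_of(xml_text: str) -> str:
    """Parse the element back and return the unescaped attribute value."""
    return ET.fromstring(xml_text).get("value")


if __name__ == "__main__":
    text = 'the "intended" reading isn\'t lost'
    serialized = element_with_value("definition", text)
    print(serialized)            # double quotes appear as &quot;
    print(value_of(serialized))  # original text comes back unchanged
```

So a fix done through an XML-aware tool should need no manual escaping, while a fix done in a text editor on attribute values would need `&quot;` for double quotes. Single quotes are fine inside a double-quoted attribute.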

Grahame Grieve (May 16 2020 at 21:33):

don't need to worry about the coremif in the main build - that's being taken out. Need to worry about the source in UTG (https://github.com/HL7/UTG)

Jose Costa Teixeira (May 16 2020 at 21:50):

The java source?

Jose Costa Teixeira (May 16 2020 at 21:51):

The funky letters are in java files and the coremif

Jose Costa Teixeira (May 16 2020 at 21:51):

Most in coremif

Grahame Grieve (May 16 2020 at 22:00):

coremif is going away. ignore it. And look in UTG.

Grahame Grieve (May 16 2020 at 22:00):

but where in the java?

Jose Costa Teixeira (May 17 2020 at 04:12):

the only places where I find these characters are in the coremif and in .java files (I see now they are in comments, at least most of them are) e.g.
UTG\framework\java\implementations\java\org.hl7.fhir.dstu2016may\src\org\hl7\fhir\dstu2016may\model\codesystems\V3ActCode.java
line 5053

Grahame Grieve (May 17 2020 at 09:22):

ok. generated code. You can fix those if you want. It does turn up in auto-complete etc

Last updated: Apr 12 2022 at 19:14 UTC
