FHIR Chat · funky letters · implementers

Stream: implementers

Topic: funky letters


view this post on Zulip Jens Villadsen (May 11 2020 at 18:16):

image.png @Grahame Grieve I'd like one more medal, please ;)

view this post on Zulip Jens Villadsen (May 11 2020 at 18:17):

found on http://hl7.org/fhir/valueset-security-labels.html

view this post on Zulip Jens Villadsen (May 11 2020 at 18:18):

make that two:

image.png

view this post on Zulip Jens Villadsen (May 11 2020 at 18:18):

three, actually:
image.png

view this post on Zulip Jens Villadsen (May 11 2020 at 18:19):

make that four, actually
image.png

view this post on Zulip Jens Villadsen (May 11 2020 at 18:20):

HL7v3 FTW

view this post on Zulip Jens Villadsen (May 11 2020 at 18:24):

this is getting too easy: http://hl7.org/fhir/v3/ActReason/cs.html

image.png

view this post on Zulip Jens Villadsen (May 11 2020 at 18:25):

image.png

view this post on Zulip Jens Villadsen (May 11 2020 at 21:44):

@Grahame Grieve I expect it once again to be 'consistent with the magnitude of your discovery' :joy:

view this post on Zulip Grahame Grieve (May 12 2020 at 00:25):

here. have a medal:

image.png

view this post on Zulip Grahame Grieve (May 12 2020 at 00:33):

looks to me like all these issues come from the underlying v3 source and @Ted Klein all the ones I checked are erroneous in UTG too

view this post on Zulip Jens Villadsen (May 12 2020 at 06:51):

image.png

view this post on Zulip Ted Klein (May 12 2020 at 19:58):

Sigh. This is a VERY well known and LONG TERM issue with the v3 material. Due to the weak process and tooling for v3 maintenance before UTG, text from Microsoft Word documents was read by the tools and inserted directly into the database. Unfortunately, Microsoft Word has these irritating objects, 'smart quotes', which, when read by the java code and imported to the xml, were dutifully brought in, and the ASCII renders as the funky things you see. We had a number of regular submitters of MANY new code systems and codes in V3 who for YEARS submitted proposals with these problems. We had no automatic tooling to detect or fix them, so it was up to me to manually recognize them, and for the original submitter to report them back in the manual QA review to be fixed later (and this was often skipped or done less well than hoped). The upshot is that the v3 content is rife with this garbage, randomly sprinkled throughout. When we imported the V3 coremif to UTG with java software, guess what? Garbage in, garbage out. So...they are in UTG as well. Now the new process will be able to both prevent this from happening in the future, and allow these to be fixed by those who see them. This is exactly the kind of content scrubbing I hope the UTG crowd-source availability of changes will enable. So...feel free to sign up for the Pilot and fix them as you see them and are irritated by them - have at it!

view this post on Zulip Jens Villadsen (May 12 2020 at 21:03):

I don't know what that means. Do you expect anything else from me besides reporting it and me expecting someone to fix it?

view this post on Zulip Grahame Grieve (May 12 2020 at 21:08):

I think it means that your expectation may not be fulfilled :sad:

view this post on Zulip Vassil Peytchev (May 12 2020 at 21:15):

Jens Villadsen said:

I don't know what that means. Do you expect anything else from me besides reporting it and me expecting someone to fix it?

I think this:

So...feel free to sign up for the Pilot and fix them as you see them and are irritated by them - have at it!

It would be nice to know what the Pilot is, and how to sign up for it.

view this post on Zulip Lloyd McKenzie (May 12 2020 at 23:39):

Rather than taking a whack-a-mole approach and filing individual change requests for each place we see this happening, could we not (now that the stuff lives in a publicly accessible Git repository with a nice text-based source format) have someone write a little script to go through all the files, identify the needed fixes, and submit a pull request that would fix all the issues everywhere at once?

view this post on Zulip Lloyd McKenzie (May 12 2020 at 23:40):

It seems that fixing this sort of systemic issue would be better handled this way than using the normal "code system by code system" change process. We could subject the script that made the changes to review by a couple of folks and scan the resulting diff and if we're comfortable with it, just merge it as a technical correction.

view this post on Zulip Jose Costa Teixeira (May 13 2020 at 05:56):

is it not possible just to do it manually? If it's just work that needs to be done once, it should be easy to spot the funky letters...

view this post on Zulip Frank Oemig (May 13 2020 at 08:13):

Before writing complex scripts, why not start by pulling the repository and identifying how many places we are talking about?

view this post on Zulip Jose Costa Teixeira (May 13 2020 at 08:24):

we can actually make a script to identify how many places we are talking :grinning_face_with_smiling_eyes:
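A counting pass of this kind could look roughly like the following minimal Python sketch. It assumes the damaged text appears as sequences starting with "â€" (UTF-8 smart punctuation read as Windows-1252); the function name `count_suspects` and the file suffixes are hypothetical choices and are not taken from any UTG tooling.

```python
from collections import Counter
from pathlib import Path

# Assumption: every suspect sequence starts with these two characters ("â€"),
# which is what the UTF-8 lead bytes E2 80 look like when read as Windows-1252.
SUSPECT_PREFIX = "\u00e2\u20ac"


def count_suspects(root, suffixes=(".xml", ".json", ".java")):
    """Return a Counter mapping file path -> number of suspect sequences."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        # errors="replace" keeps the scan going on files that are not valid UTF-8.
        text = path.read_text(encoding="utf-8", errors="replace")
        hits = text.count(SUSPECT_PREFIX)
        if hits:
            counts[str(path)] = hits
    return counts
```

Run against a local clone, `count_suspects(".")` gives per-file counts and `sum(counts.values())` gives the total, which is enough to judge whether a manual or scripted fix is more practical.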

view this post on Zulip Jose Costa Teixeira (May 13 2020 at 08:25):

my first idea is :
1 download,
2 open notepad++, do a global search,
3 see if it's possible to do a global replace

view this post on Zulip Jens Villadsen (May 13 2020 at 09:00):

@Vassil Peytchev - I'd rather not spend more time on this since I've pinpointed where the bugs are and what should be done. I could create a range of change requests as pointed out by @Lloyd McKenzie - but as I understand, there's no reason to do so - as it is a systemic bug. Can't the source be fixed and then we can all carry on?

view this post on Zulip Grahame Grieve (May 13 2020 at 09:01):

@Ted Klein this is really a policy question. the actual source fix is not hard. It's a search and replace in n++

view this post on Zulip Ted Klein (May 13 2020 at 11:37):

All of these are possible and not terribly difficult. Frank's point is a good one - we don't know where all these are, nor do we know all the different ones - there are at least 4 or 5 of them of various kinds scattered about. Someone could fix all of them at once for sure...but I do not have the list of all the bad characters, nor the list of where they appear. This is also not a funded effort so someone needs to decide they have the time to just do it.

view this post on Zulip Jose Costa Teixeira (May 13 2020 at 11:38):

where is the repo?

view this post on Zulip Jens Villadsen (May 13 2020 at 11:38):

Jose Costa Teixeira said:

where is the repo?

that's the spirit

view this post on Zulip Grahame Grieve (May 13 2020 at 11:39):

https://github.com/HL7/UTG

view this post on Zulip Ted Klein (May 13 2020 at 11:40):

The characters that originally came in and were auto-inserted (in some cases years ago) were the ones for the Microsoft open-and-close single and double smart quotes and bullet symbols. There might be some from other symbols in the original submitted text, like section marks, copyright marks and registered trademark symbols, and I know years ago I fixed at least one that was a single-character arrow symbol. There may be others.

view this post on Zulip Jens Villadsen (May 13 2020 at 11:40):

Grahame Grieve said:

https://github.com/HL7/UTG

and PR's are welcome?

view this post on Zulip Grahame Grieve (May 13 2020 at 11:40):

the ones that started the thread are UTF mis-encodings
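To illustrate what a mis-encoding of this kind looks like, here is a small Python sketch. It assumes the common case where UTF-8 bytes were decoded as Windows-1252; the thread does not confirm which wrong encoding was actually involved, and the helper name `misread_as_cp1252` is made up for the example.

```python
def misread_as_cp1252(text):
    """Encode text as UTF-8, then decode those bytes as Windows-1252.

    Bytes that Windows-1252 leaves undefined (such as 0x9D) cannot be
    decoded, so errors="replace" turns them into U+FFFD.
    """
    return text.encode("utf-8").decode("cp1252", errors="replace")


# Opening/closing double quotes, then opening/closing single quotes.
for ch in ("\u201c", "\u201d", "\u2018", "\u2019"):
    print(repr(ch), "->", repr(misread_as_cp1252(ch)))
```

Each curly quote is three bytes in UTF-8, so it turns into three characters beginning with "â€". The closing double quote ends in byte 0x9D, which has no Windows-1252 mapping, which is why it tends to show up with a replacement character or with its last character missing.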

view this post on Zulip Grahame Grieve (May 13 2020 at 11:41):

sure. PRs welcome

view this post on Zulip Ted Klein (May 13 2020 at 11:41):

all changes made since the initial release must be documented with Provenance resources for the history.

view this post on Zulip Grahame Grieve (May 13 2020 at 11:41):

even fixing UTF encoding mistakes? I think that's going too far

view this post on Zulip Ted Klein (May 13 2020 at 11:41):

sigh. THAT is a policy question for sure.

view this post on Zulip Ted Klein (May 13 2020 at 11:43):

There are some vocab cochairs that believe that ANY change to ANY text in concept description must not only be documented in history but also trigger a major number increment in the code system version. I lost that argument yesterday and will bring it back on the Vocab WG call tomorrow.

view this post on Zulip Jens Villadsen (May 13 2020 at 11:43):

Ted Klein said:

There are some vocab cochairs that believe that ANY change to ANY text in concept description must not only be documented in history but also trigger a major number increment in the code system version. I lost that argument yesterday and will bring it back on the Vocab WG call tomorrow.

maybe it's time to find new co-chairs then

view this post on Zulip Grahame Grieve (May 13 2020 at 11:44):

let me know when to join. Making major release numbers trivially irrelevant doesn't actually help maintain your semantic integrity

view this post on Zulip Grahame Grieve (May 13 2020 at 11:44):

but Jens has a point.

view this post on Zulip Ted Klein (May 13 2020 at 11:45):

Note that although folks see these and are irritated by it, these characters have been published for YEARS in the coremif, so there is less urgency to fix them immediately than one might imagine IMHO.

view this post on Zulip Jens Villadsen (May 13 2020 at 11:48):

sure - everything has a priority.

image.png

But what was done in the past may not always be an argument for sticking to it in the future

view this post on Zulip Ted Klein (May 13 2020 at 11:49):

Agreed. I would just like to fix these damn things once and for all. I think we have a little time before the next release by the end of this week to sneak things in before the formal processes really kick in. Vocab call tomorrow: https://confluence.hl7.org/pages/viewpage.action?pageId=82910534

view this post on Zulip Jens Villadsen (May 13 2020 at 11:51):

So search and replace in e.g. vscode/emacs/n++/whatever in https://github.com/HL7/UTG on “ / â€ to what? What were the intended characters?

view this post on Zulip Ted Klein (May 13 2020 at 11:51):

of course not Jens...but there are HUNDREDS of small things that we are working on fixing, and this is just one of them. If a fix were in hand this minute, we could get them done today. But once my calls start in about 2 hours, I am out of time for things like this today.

view this post on Zulip Jose Costa Teixeira (May 13 2020 at 12:37):

https://www.i18nqa.com/debug/utf8-debug.html

view this post on Zulip Jose Costa Teixeira (May 13 2020 at 12:44):

I count 400 occurrences, I think that makes it 50 medals.

view this post on Zulip Jose Costa Teixeira (May 13 2020 at 12:49):

and at least notepad++ can make even the complicated changes
(for example when ‘ is followed by an a and then the accent is taken to the letter, showing as â€ã)

view this post on Zulip Jose Costa Teixeira (May 13 2020 at 12:50):

“ = opening double quote  --> "
” = closing double quote  --> "
’ = closing single quote / apostrophe --> '
‘ = opening single quote  --> '

view this post on Zulip Jens Villadsen (May 13 2020 at 13:22):

@Jose Costa Teixeira are you gonna do the PR then?

view this post on Zulip Jose Costa Teixeira (May 13 2020 at 13:24):

If the governance allows (IMO the fact that these things were there for so long is a sign that this is obviously under the radar or below the "important change" threshold) and if nobody else wants to, I will

view this post on Zulip Jose Costa Teixeira (May 13 2020 at 13:24):

I just took a long cigarette break to see how difficult it was. It isn't.

view this post on Zulip Jose Costa Teixeira (May 13 2020 at 13:25):

let me know if I should do it (making the branch and PR will take longer than the replacement)

view this post on Zulip Jens Villadsen (May 13 2020 at 13:27):

image.png

@Jose Costa Teixeira - saving the world in healthcare - one commit at a time

view this post on Zulip Ted Klein (May 13 2020 at 14:01):

Not all the fixes are identical. The garbage that was originally double quotes (open or close) needs to be replaced with double quotes, the ones that were single quotes (open and close) need to be replaced with single quotes, the ones that were bullets need either removal or replacement with structured lists (<li> in markup, etc). Not sure if there are others. If all we needed was simple deletion I would have done that automatically a long time ago. The original things were approved by vote in harmonization.

view this post on Zulip Jose Costa Teixeira (May 13 2020 at 14:07):

Yes, so I think you will advise when to fix what, @Ted Klein, and then I (or others) can jump in.

view this post on Zulip Jose Costa Teixeira (May 13 2020 at 14:07):

I'd suggest to slice this in a few chunks.

  1. opening and closing double and single quotes and apostrophes (simple, automated, I guess easy to approve, can be a quick improvement)
  2. bullets (may be much more impactful - but there is no rush)
  3. whatever else may appear

view this post on Zulip Ted Klein (May 13 2020 at 14:50):

Yes exactly what we should do

view this post on Zulip Jose Costa Teixeira (May 13 2020 at 15:43):

@Ted Klein Permission to create a PR for phase 1, for at least the single and double quotes?
(I can do it sometime this week)

view this post on Zulip Ted Klein (May 13 2020 at 16:01):

Jose - great! Couple of items: the patch number in <version> of the corrected resources needs to be incremented by 1 (e.g. <version value="2.0.0"/> must be updated to <version value="2.0.1"/>). Grahame disagrees, but I am feeling that even such minuscule changes should be documented with Provenance resource instances for the history tracking. Ideally, a new additional bundle in input/sourceOfTruth/history of "hxutg1-1-6"; if you look at 1-0-5 all the naming conventions are pretty obvious and algorithmic - these bundles were all created by scripts for the release. If we can fix the single and double quotes (should be 95% of them) with this, then let's do it. Provenance.recorded.reason.text can be "Fixed bad UTF characters from the coremif V3 source." You can be author, Vocabulary WG as custodian. That will do it, properly and completely.
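The patch-number increment described above could be scripted along these lines. This is a minimal Python sketch, assuming the version is a plain three-part number in a `<version value="x.y.z"/>` element and that the first such element in the file is the one to change; `bump_patch` is a hypothetical helper, and the Provenance bundle is not covered here.

```python
import re

# Matches <version value="MAJOR.MINOR.PATCH"/> with optional space before "/>".
VERSION_RE = re.compile(r'(<version value=")(\d+)\.(\d+)\.(\d+)("\s*/>)')


def bump_patch(xml_text):
    """Increment the patch number of the first matching <version> element."""

    def repl(match):
        major, minor = match.group(2), match.group(3)
        patch = int(match.group(4)) + 1
        return f"{match.group(1)}{major}.{minor}.{patch}{match.group(5)}"

    # count=1 so nested or later <version> elements are left untouched.
    return VERSION_RE.sub(repl, xml_text, count=1)
```

For example, `bump_patch('<version value="2.0.0"/>')` returns `'<version value="2.0.1"/>'`. Text without a matching element is returned unchanged, so files that use a different version format would need separate handling.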

view this post on Zulip Ted Klein (May 13 2020 at 20:26):

I just noticed one thing when looking at this in UTG: the example that started this thread is NOT from UTG content, but from FHIR R4 content, and the value set has NOT been migrated to UTG. Thus my comments about the need for version and Provenance may be invalid, as those are specific to the governance policy of the content at http://terminology.hl7.org, and likely differ from what is done with the things only in FHIR.

view this post on Zulip Grahame Grieve (May 13 2020 at 20:32):

the code system from which that content comes has

view this post on Zulip Jose Costa Teixeira (May 16 2020 at 17:53):

ok a few issues:

view this post on Zulip Jose Costa Teixeira (May 16 2020 at 17:54):

  1. this is not in the resources, but in java code and in a big file called something.coremif - is there any version that should be replaced there?

view this post on Zulip Jose Costa Teixeira (May 16 2020 at 17:55):

  2. Don't we have to escape these quotes and double quotes?

view this post on Zulip Grahame Grieve (May 16 2020 at 21:33):

don't need to worry about the coremif in the main build - that's being taken out. Need to worry about the source in UTG (https://github.com/HL7/UTG)

view this post on Zulip Jose Costa Teixeira (May 16 2020 at 21:50):

The java source?

view this post on Zulip Jose Costa Teixeira (May 16 2020 at 21:51):

The funky letters are in java files and the coremif

view this post on Zulip Jose Costa Teixeira (May 16 2020 at 21:51):

Most in coremif

view this post on Zulip Grahame Grieve (May 16 2020 at 22:00):

coremif is going away. ignore it. And look in UTG.

view this post on Zulip Grahame Grieve (May 16 2020 at 22:00):

but where in the java?

view this post on Zulip Jose Costa Teixeira (May 17 2020 at 04:12):

the only places where I find these characters are in the coremif and in .java files (I see now they are in comments, at least most of them are) e.g.
UTG\framework\java\implementations\java\org.hl7.fhir.dstu2016may\src\org\hl7\fhir\dstu2016may\model\codesystems\V3ActCode.java
line 5053

view this post on Zulip Grahame Grieve (May 17 2020 at 09:22):

ok. generated code. You can fix those if you want. It does turn up in auto-complete etc


Last updated: Apr 12 2022 at 19:14 UTC