FHIR Chat · Term server is down... · IG creation

Stream: IG creation

Topic: Term server is down...


view this post on Zulip Eric Haas (May 24 2021 at 22:03):

Root directory: /scratch/ig-build-temp-R5AB25/repo                               (00:02.0820)
Core Package hl7.fhir.r4.core#4.0.1
Installing hl7.fhir.r4.core#4.0.1 to the package cache
  Fetching:....................................................................................................
  Installing: .................................................................................................... done.
Terminology Cache is at /scratch/ig-build-temp-R5AB25/repo/input-cache/txcache. Trimming now (00:16.0772)
Connect to Terminology Server at http://tx.fhir.org                              (00:16.0775)
Publishing Content Failed: Unable to connect to terminology server. Error = Error fetching the server's capability statement: connect timed out (00:33.0961)

view this post on Zulip Lloyd McKenzie (May 24 2021 at 22:22):

@Rob Hausam @Mark Iantorno @Grahame Grieve

view this post on Zulip Grahame Grieve (May 24 2021 at 23:17):

server looks ok to me

view this post on Zulip Oliver Egger (May 28 2021 at 08:25):

term server looks down again:
org.hl7.fhir.exceptions.FHIRException: Unable to connect to terminology server. Error = Error fetching the server's capability statement: timeout

anyone up for a restart?

view this post on Zulip Rob Hausam (May 28 2021 at 12:30):

@Oliver Egger The server appears to be up. Are you still having issues?

view this post on Zulip Pétur Valdimarsson (May 28 2021 at 12:53):

I can report intermittent problems here (Sweden) as well. So far all but one build has failed due to timeouts to http://tx.fhir.org. The user interface for it behaves the same way, mixing timeouts with delayed responses. The last attempt was while writing this message.

view this post on Zulip Michaela Ziegler (May 28 2021 at 12:53):

still having issues with connecting to the terminology server

view this post on Zulip Grahame Grieve (May 28 2021 at 14:36):

it's coming back up

view this post on Zulip Rob Hausam (May 28 2021 at 15:34):

Hopefully that restart helped. Let us know if you are still seeing issues.

view this post on Zulip Jose Costa Teixeira (May 30 2021 at 22:06):

Seems to be down for me

view this post on Zulip Lloyd McKenzie (May 30 2021 at 22:11):

@Rob Hausam @Grahame Grieve @Mark Iantorno

view this post on Zulip Rob Hausam (May 30 2021 at 22:12):

I'll take a look.

view this post on Zulip Rob Hausam (May 30 2021 at 22:21):

Server is back up now.

view this post on Zulip Barbro Vessman (Jun 08 2021 at 10:17):

I have issues publishing: image.png

view this post on Zulip Rob Hausam (Jun 08 2021 at 11:29):

Restarting the service now.

view this post on Zulip Barbro Vessman (Jun 08 2021 at 12:45):

Thank you very much @Rob Hausam . Now it works!

view this post on Zulip Chris Moesel (Jun 16 2021 at 22:05):

Terminology server seems to be down again:

org.hl7.fhir.exceptions.FHIRException: Unable to connect to terminology server. Error = Error fetching the server's capability statement: timeout

view this post on Zulip Grahame Grieve (Jun 16 2021 at 22:20):

coming back up

view this post on Zulip Matthew Tiller (Jun 29 2021 at 16:00):

Is the terminology server down again?

view this post on Zulip Rob Hausam (Jun 29 2021 at 16:14):

Restarting the service.

view this post on Zulip Matthew Tiller (Jun 29 2021 at 16:36):

thank you sir

view this post on Zulip Rob Hausam (Jun 29 2021 at 17:00):

Apologies, but I'm going to restart the server now to install some software - expect it to be down for about 10 minutes.

view this post on Zulip Rob Hausam (Jun 29 2021 at 17:25):

Back up now.

view this post on Zulip Sarah Gaunt (Jun 30 2021 at 07:44):

Think it's down again:
Publishing Content Failed: Unable to connect to terminology server. Error = Error fetching the server's capability statement: connect timed out (01:40.0962)

view this post on Zulip Rob Hausam (Jun 30 2021 at 07:45):

It will be back up soon.

view this post on Zulip Rob Hausam (Jun 30 2021 at 07:45):

I'm finishing up with loading some content.

view this post on Zulip Rob Hausam (Jun 30 2021 at 07:57):

Should be back up now.

view this post on Zulip Sarah Gaunt (Jun 30 2021 at 08:10):

Thanks @Rob Hausam

view this post on Zulip Roeland Luykx (Jul 05 2021 at 05:53):

The terminology server is down. Is there a regular time when the server is not online due to maintenance?
Often when I want to build in the morning (CET), the server is not online...

view this post on Zulip Torben M. Hagensen (Jul 05 2021 at 06:43):

Can anyone please restart the server

view this post on Zulip Diana_Ovelgoenne (Jul 05 2021 at 07:18):

x2

view this post on Zulip Roeland Luykx (Jul 05 2021 at 08:03):

@Rob Hausam

view this post on Zulip Christian Nau (Jul 05 2021 at 09:42):

Seems still to be down.
Is there a workaround, to be able to build local IG packages?

view this post on Zulip Christian Nau (Jul 05 2021 at 09:43):

@Rob Hausam @Grahame Grieve @Mark Iantorno can someone please restart the server? :)

view this post on Zulip Roeland Luykx (Jul 05 2021 at 09:57):

@Christian Nau yes, with the parameter -tx n/a

view this post on Zulip Christian Nau (Jul 05 2021 at 10:25):

Thank you @Roeland Luykx !!

view this post on Zulip Diana_Ovelgoenne (Jul 05 2021 at 11:31):

despite using -tx n/a I get the error Publishing Content Failed: Attempt to use Terminology server when no Terminology server is available

view this post on Zulip Roeland Luykx (Jul 05 2021 at 11:33):

@Diana_Ovelgoenne that will certainly happen if your build needs the terminology server to be available... let's hope the server is back soon!

view this post on Zulip Mark Iantorno (Jul 05 2021 at 13:03):

on it now

view this post on Zulip Mark Iantorno (Jul 05 2021 at 13:05):

just restarted it, give it a couple min

view this post on Zulip Rob Hausam (Jul 05 2021 at 13:09):

Just saw this. Thanks, @Mark Iantorno. I should be able to get back to finishing setting up the monitor today.

view this post on Zulip Mark Iantorno (Jul 05 2021 at 13:12):

it's up again

view this post on Zulip Sarah Gaunt (Jul 08 2021 at 21:55):

Seems to be down again:

Connect to Terminology Server at http://tx.fhir.org                              (00:11.0648)
Publishing Content Failed: Unable to connect to terminology server. Error = Error fetching the server's capability statement: connect timed out

view this post on Zulip Sarah Gaunt (Jul 08 2021 at 21:56):

And I also get that same error someone mentioned above even when I use -tx n/a
Publishing Content Failed: Attempt to use Terminology server when no Terminology server is available

view this post on Zulip Rob Hausam (Jul 08 2021 at 21:58):

The monitor service is running. So that should take care of it - I'm checking if it actually is or not.

view this post on Zulip Sarah Gaunt (Jul 08 2021 at 21:58):

Thanks @Rob Hausam , re-running to see if it works.

view this post on Zulip Sarah Gaunt (Jul 08 2021 at 21:59):

No, still failing - will wait.

view this post on Zulip Rob Hausam (Jul 08 2021 at 22:00):

So actually the monitor wasn't started as a service the last time - so it wasn't running, but it is now. So this wasn't a perfect test, but it should be up shortly.

view this post on Zulip Rob Hausam (Jul 08 2021 at 22:10):

it's back up now

view this post on Zulip Lloyd McKenzie (Jul 08 2021 at 22:12):

Sounds like we need a monitor for the monitor service... ;)

view this post on Zulip Sarah Gaunt (Jul 08 2021 at 22:20):

Works now @Rob Hausam thanks!

view this post on Zulip Diana_Ovelgoenne (Jul 12 2021 at 10:10):

Server is down again @Rob Hausam

view this post on Zulip Mark Iantorno (Jul 12 2021 at 11:56):

Just restarted it

view this post on Zulip Martin Morrey (Jul 12 2021 at 12:13):

Or @Mark Iantorno ? Would be good to get this working again as soon as possible. Thanks!

view this post on Zulip Mark Iantorno (Jul 12 2021 at 12:38):

Yeah, I restarted it. It should be up now

view this post on Zulip Mark Iantorno (Jul 12 2021 at 12:39):

if you're ever wondering if it's up or down, you can quickly check by looking at the indicators at the top right of https://validator.fhir.org/

view this post on Zulip Mark Iantorno (Jul 12 2021 at 12:40):

there are two indicators, one for terminology and one for packages2
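For a command-line check that does not depend on the validator page, one option is to request the server's capability statement directly, since that is the call whose failure produces the errors quoted in this thread. The following is a minimal sketch, not an official tool: the /r4 base path, the 15-second timeout, and the function name tx_is_up are assumptions, and CURL is a hypothetical hook that lets the command be swapped out.

```shell
#!/bin/sh
# CURL is a hypothetical override hook; by default the real curl is used.
CURL="${CURL:-curl}"

# Returns 0 if the capability statement can be fetched, non-zero otherwise.
# The default base URL (http://tx.fhir.org/r4) is an assumption.
tx_is_up() {
  base="${1:-http://tx.fhir.org/r4}"
  # [base]/metadata is the standard FHIR capability statement endpoint.
  "$CURL" -sSf --max-time 15 \
    -H 'Accept: application/fhir+json' \
    "$base/metadata" > /dev/null 2>&1
}
```

It could be used as: if tx_is_up; then echo "tx server is up"; else echo "tx server is down"; fi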

view this post on Zulip Martin Morrey (Jul 12 2021 at 12:48):

That's great. Thank-you :smile:

view this post on Zulip Rob Hausam (Jul 12 2021 at 13:56):

The server monitor and its auto-restart capability aren't (yet) working quite as expected in all situations. But with a few additional code tweaks I expect that they will be there soon.

view this post on Zulip Diana_Ovelgoenne (Jul 13 2021 at 06:34):

Server is down again @Mark Iantorno @Rob Hausam

view this post on Zulip Martin Morrey (Jul 13 2021 at 10:42):

Still down @Mark Iantorno . image.png

view this post on Zulip Janaka Peiris (Jul 13 2021 at 10:47):

Is there a way to bypass the tx server? It seems to be down from time to time.

view this post on Zulip Michaela Ziegler (Jul 13 2021 at 10:57):

with the IG publisher: add -tx in your command line
https://confluence.hl7.org/pages/viewpage.action?pageId=35718627#IGPublisherDocumentation-Runningincommandlinemode

view this post on Zulip Mark Kramer (Jul 13 2021 at 11:31):

@Michaela Ziegler can you clarify what argument you might use for the -tx to avoid the tx server?

view this post on Zulip Diana_Ovelgoenne (Jul 13 2021 at 11:38):

-tx n/a, but I found out last week that if your IG has bindings, it doesn't matter whether you pass the parameter: the Publisher will still try to connect to the terminology server

view this post on Zulip Mark Kramer (Jul 13 2021 at 11:43):

It would be nice if there was a mode where it would only go to the txcache.

view this post on Zulip Mark Iantorno (Jul 13 2021 at 12:04):

You're upset the tx server is down, I'm thrilled you're using the monitor on validator.fhir.org

view this post on Zulip Mark Iantorno (Jul 13 2021 at 12:05):

just restarted it

view this post on Zulip Diana_Ovelgoenne (Jul 13 2021 at 12:06):

@Mark Iantorno checking it on validator all day long :smile: still, that doesn't help to bring it up :frown: we need someone in Europe to be able to restart it too.

view this post on Zulip Mark Iantorno (Jul 13 2021 at 12:07):

yeah, when Grahame is back we have good coverage. He's just away right now

view this post on Zulip Mark Iantorno (Jul 13 2021 at 12:07):

Thanks for your patience

view this post on Zulip Rob Hausam (Jul 13 2021 at 13:03):

Yes, thanks. There is still hope for the monitor, too. :) I'll get with @Gino Canessa again and see if we can get the logic updates worked out.

view this post on Zulip David Hay (Jul 18 2021 at 06:55):

Just out of interest, what prevents any terminology server from being used in the IG Publisher? I did try using Ontoserver - just for fun - and it seemed to work ok (though it was by no means an exhaustive test)...

view this post on Zulip Lloyd McKenzie (Jul 18 2021 at 13:59):

Nothing whatsoever. HL7 publications need to use the HL7 server unless they get FMG permission otherwise, but others are free to use whatever server they wish.

view this post on Zulip John Moehrke (Jul 19 2021 at 20:14):

the validator.fhir.org says that the terminology server is green... but my IG build says different.

view this post on Zulip Rob Hausam (Jul 19 2021 at 20:42):

I'm checking.

view this post on Zulip Rob Hausam (Jul 19 2021 at 21:06):

back up

view this post on Zulip John Moehrke (Jul 19 2021 at 21:07):

thanks

view this post on Zulip Sarah Gaunt (Jul 20 2021 at 02:47):

Looks like it's down again...

view this post on Zulip Lloyd McKenzie (Jul 20 2021 at 02:48):

@Mark Iantorno @Rob Hausam

view this post on Zulip Rob Hausam (Jul 20 2021 at 02:49):

Yes, I saw that. The monitor is working on restarting it. :)

view this post on Zulip Sarah Gaunt (Jul 20 2021 at 02:50):

Is that what the pink means?

view this post on Zulip Rob Hausam (Jul 20 2021 at 02:53):

I'm not sure if that's supposed to be pink - or just a pale red. :) It will be back up soon.

view this post on Zulip Sarah Gaunt (Jul 20 2021 at 02:59):

Thanks @Rob Hausam

view this post on Zulip Rob Hausam (Jul 20 2021 at 03:07):

It's taking longer than usual, as it's restarted multiple times before completely finishing the previous restart process. But it shouldn't be too much longer now.

view this post on Zulip Rob Hausam (Jul 20 2021 at 03:28):

The service has come back up a few times now but then it almost immediately seems to hang again. I'm going to restart the entire machine and see if that gets it back to normal (it should).

view this post on Zulip Sarah Gaunt (Jul 20 2021 at 03:57):

No worries - took it as a sign to get off my a$$ and do a workout!

view this post on Zulip Rob Hausam (Jul 20 2021 at 03:58):

It looks like you may need to get a really good workout today! ;)

view this post on Zulip Sarah Gaunt (Jul 20 2021 at 03:59):

Maybe will walk the dogs now then as I see it's still not behaving!

view this post on Zulip Rob Hausam (Jul 20 2021 at 04:00):

Even after a full machine restart, it's still not responding correctly. :(

view this post on Zulip Rob Hausam (Jul 20 2021 at 04:12):

It seems to be running normally now - finally!

view this post on Zulip Rob Hausam (Jul 20 2021 at 04:14):

Oops, spoke too soon - trying again.

view this post on Zulip Giorgio Cangioli (Jul 20 2021 at 07:01):

It seems it is down...

view this post on Zulip Peter Jordan (Jul 20 2021 at 07:23):

Couldn't resist this... https://www.youtube.com/watch?v=LODkVkpaVQA

view this post on Zulip Rob Hausam (Jul 20 2021 at 07:49):

Yes, it's still down. And I'm still trying to work on it. It's not coming back, even with restarts, in the typical way that it has previously. :( The server VM isn't running out of disk space or RAM or other resources - so I'm at a bit of a loss at the moment as to why it's behaving this way.

view this post on Zulip Rob Hausam (Jul 20 2021 at 08:57):

For some reason now we're getting an exception when the FHIR server is trying to launch the 'Telnet Server', and the launch fails. I've pretty much exhausted everything that I am able to do for now (especially at this time during the night/morning). I'm reaching out to @Jose Costa Teixeira, @Grahame Grieve, @Mark Iantorno and @Gino Canessa and we'll see what we can do to get this functioning again - as soon as possible!

view this post on Zulip Mark Iantorno (Jul 20 2021 at 12:24):

I have restarted the service and tx.fhir.org is working as expected

view this post on Zulip Mark Iantorno (Jul 20 2021 at 12:25):

please let me know if it goes down again

view this post on Zulip David deRoode (Jul 20 2021 at 12:49):

... statement: connect timed out @Mark Iantorno

view this post on Zulip Igor Sirkovich (Jul 20 2021 at 12:55):

@Mark Iantorno , I keep getting "Unable to connect to terminology server. Error = Error fetching the server's capability statement: connect timed out (00:50.0163)"

view this post on Zulip Mark Iantorno (Jul 20 2021 at 12:55):

Yeah turns out that there is actually something more complicated going on

view this post on Zulip Igor Sirkovich (Jul 20 2021 at 12:55):

Also, hapi.fhir.org has been down since last night - I'm not sure if this is related

view this post on Zulip Lloyd McKenzie (Jul 20 2021 at 15:14):

@Igor Sirkovich - should be no relationship, but @James Agnew, FYI

view this post on Zulip Rob Hausam (Jul 20 2021 at 15:51):

Several of us are working to see if we can get this fixed - but at the moment there is no immediate solution. One option for now may be to direct the IG Publisher to a different terminology server. @David Hay has recently tried it using Ontoserver, and apparently that seemed to work. I may give that a try myself and see how that works. Theoretically that should be fine, as long as all of the terminology content that you need is on the server - which may be problematic in some cases.

view this post on Zulip John Moehrke (Jul 20 2021 at 16:26):

do we want to try a distributed denial of service against that server? (Aka, we all try local and ci builds at the same time)?

view this post on Zulip Rob Hausam (Jul 20 2021 at 16:58):

Hopefully it won't lead to anything close to DDoS, but launching the IG Publisher this way using Ontoserver seems to work:

java -jar <path to ig publisher>/publisher.jar -ig . -tx https://r4.ontoserver.csiro.au/fhir

I was able to build the IPS IG with what appears to be the same QA output that I had before with tx.fhir.org. I didn't see a way to specify a different tx server using the _genonce.sh or _genonce.bat scripts as currently written (but if needed I'm sure it should be pretty easy to rewrite them to support that).

view this post on Zulip Max Masnick (Jul 20 2021 at 17:28):

For _genonce.sh, this command will work with a slight modification to the script (see below): ./_genonce.sh -tx https://r4.ontoserver.csiro.au/fhir

But you first need to modify _genonce.sh to comment out lines 5-13 as the script tries to access tx.fhir.org to check for internet access:

# curl -sSf tx.fhir.org > /dev/null

# if [ $? -eq 0 ]; then
#   echo "Online"
#   txoption=""
# else
#   echo "Offline"
#   txoption="-tx n/a"
# fi

view this post on Zulip Rob Hausam (Jul 20 2021 at 17:28):

Yes, that makes sense.

view this post on Zulip Max Masnick (Jul 20 2021 at 17:29):

I wonder if we should switch to using http://captive.apple.com to check for internet access (that's the URL that Apple devices use to see if they are on a network that can resolve public internet addresses)

view this post on Zulip Rob Hausam (Jul 20 2021 at 17:31):

Yes, I think that would also make sense - and completely separate the "internet check" from the tx server specification.

view this post on Zulip Max Masnick (Jul 20 2021 at 17:38):

This change is proposed in https://github.com/HL7/ig-publisher-scripts/pull/5

If anyone wants to switch to using captive.apple.com in their scripts, you can grab the fixed scripts from here until this is merged in.

view this post on Zulip Chris Moesel (Jul 20 2021 at 18:18):

Sorry, a little late to this conversation, but... I think the intent of the check is not just to see if it can access the internet (despite the message it prints to the console), but to check to see if it can access the terminology server. It's immediately followed by this:

if [ $? -eq 0 ]; then
    echo "Online"
    txoption=""
else
    echo "Offline"
    txoption="-tx n/a"
fi

The -tx n/a means "don't use a terminology server". So if we use captive.apple.com, then if the internet is available but the terminology server is down, it will try to use the terminology server during the build (since it doesn't go into the else clause). I'm not sure this is what we want.

view this post on Zulip Brian Kaney (Jul 20 2021 at 19:06):

One thing we may want to do is have an optional override via an ENV var in these scripts. It would be nice to be able to define alternatives or internal mirrors for the terminology server.
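The points above (keep checking the terminology server itself, but allow an override) could be combined roughly as follows. This is only a sketch of the idea, not the change proposed in the pull request: TX_SERVER is a hypothetical environment variable name, choose_txoption is a hypothetical helper, and CURL is a hypothetical hook for swapping out the command.

```shell
#!/bin/sh
# CURL is a hypothetical override hook; by default the real curl is used.
CURL="${CURL:-curl}"

# Prints the -tx option to pass to the IG Publisher.
# TX_SERVER is a hypothetical variable, not part of the current scripts.
choose_txoption() {
  if [ -n "${TX_SERVER:-}" ]; then
    # Explicit override: use the alternative or mirror server as given.
    echo "-tx $TX_SERVER"
  elif "$CURL" -sSf --max-time 10 tx.fhir.org > /dev/null 2>&1; then
    # Default server reachable: no option, so the publisher uses its default.
    echo ""
  else
    # Default server unreachable: same fallback as the existing script.
    echo "-tx n/a"
  fi
}
```

A script would then set txoption=$(choose_txoption) and pass $txoption unquoted to the java -jar ... publisher.jar -ig . command. Note that, as reported earlier in the thread, -tx n/a still fails for IGs that need the terminology server.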

view this post on Zulip Sarah Gaunt (Jul 20 2021 at 21:16):

Getting confused here! I need the CI build to work - is there still a problem there? Or is the above discussing how to fix it for the CI build?

view this post on Zulip Lloyd McKenzie (Jul 20 2021 at 21:19):

Still discussing work-arounds.

view this post on Zulip Rob Hausam (Jul 21 2021 at 00:38):

Here's an update on where we are with the tx.fhir.org terminology service. We've managed to get the server up and running again today - but not on a continuous and consistent basis. Looking at it with @Mark Iantorno and then further with @Jose Costa Teixeira and @Gino Canessa, we determined that one issue was that the database was able to grab too much of the available RAM, and we were then able to limit that which helped to solve a part of the problem. It didn't solve the underlying issue that the server process itself is using excessive and progressive amounts of RAM that eventually exceed the available limits, resulting in periodic and now quite frequent server crashes and need for restarts. Fortunately, Gino has been able to make some further updates to the monitor logic so that it can handle those situations more gracefully and can successfully restart the server (in all or at least nearly all cases). We don't yet know why the service memory use seems to have increased so significantly quite recently, so that still needs more investigation and ultimate mitigation.

But with all of that, we've now reached a semi-stable situation where we may have a temporary path forward. What we're seeing now is that when the service is restarted it is able to run for a variable amount of time, mostly from about 10 to 30 minutes (again, not sure why there is that variability, but there is). Once the server hits the memory limit and crashes, that will be detected and the service will be restarted, and the restart process seems to consistently take about 7 minutes. So the overall server uptime is fluctuating between about 59% and 81%. That's not good at all, but it might provide enough availability for us to work with for the most part until we can come up with a better long-term solution. If that kind of availability isn't sufficient in some cases, then for now the best alternative will probably be to explore using an alternative terminology server (like Ontoserver) for the IG builds (at least for the local ones) - I'm assuming that the CI build may need to stay with tx.fhir.org, as long as it remains usable at all (@Josh Mandel?).

Let me or any of us know if you have more questions, and particularly good suggestions. Thanks for everyone's patience!
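The uptime range quoted in this update follows from the run and restart times it gives: uptime is roughly the minutes up divided by the minutes up plus the restart time. A small sketch of that arithmetic (the function name uptime_pct is made up for illustration):

```shell
#!/bin/sh
# Percentage of time the service is up, given:
#   $1 = minutes it runs before crashing
#   $2 = minutes a restart takes
uptime_pct() {
  awk -v up="$1" -v down="$2" \
    'BEGIN { printf "%.0f\n", 100 * up / (up + down) }'
}

# uptime_pct 10 7   prints 59  (10 / 17)
# uptime_pct 30 7   prints 81  (30 / 37)
```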

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 00:44):

Thanks for the update and working so hard on this @Rob Hausam !!

view this post on Zulip Michael Lawley (Jul 21 2021 at 00:49):

For the record, we are very happy for people to use https://r4.ontoserver.csiro.au/fhir as an alternative to tx.fhir.org
It's an open endpoint, so if it doesn't contain the resources you need, then you're free to upload them

view this post on Zulip Josh Mandel (Jul 21 2021 at 00:49):

Thanks @Rob Hausam (we can certainly point the CI build to another tx server if it's helpful.)

view this post on Zulip Rob Hausam (Jul 21 2021 at 00:50):

Thank you @Michael Lawley! Maybe we should consider doing that, @Josh Mandel?

view this post on Zulip Josh Mandel (Jul 21 2021 at 00:54):

Sure. Let me know what tweaks you want to how we invoke the publisher in the IG build pipeline (-tx argument as above?)

view this post on Zulip Rob Hausam (Jul 21 2021 at 00:58):

Yes, -tx https://r4.ontoserver.csiro.au/fhir should do it. I'm not sure what content tweaks we may need, but I believe the FHIR 4.0.1 content should all already be there? Not sure about the CI content, though, that seems unlikely to be there now, and I'm not sure how to update and maintain what's needed for that (other than manually when people need it, as @Michael Lawley said)?

view this post on Zulip Rob Hausam (Jul 21 2021 at 00:58):

I think we might as well give it a try.

view this post on Zulip Michael Lawley (Jul 21 2021 at 01:08):

Let me know when you do - I'll keep a close eye on our dashboard to see the impact

view this post on Zulip Peter Jordan (Jul 21 2021 at 01:12):

Something else that might be considered is to simplify the $validate-code operation requests that are being made to terminology servers such as tx.fhir.org. Aside from some specific requirements from the build process, I'm not sure that it's necessary to use POSTs with relatively large payloads, including custom parameters that aren't required for simple ValueSet-based validation, where a simple GET would suffice. At least, perhaps have a parameter that indicates simple validation only?
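For illustration, a simple GET-based $validate-code request of the kind suggested here might look like the sketch below. It uses only the standard url, system and code parameters of the ValueSet $validate-code operation; it is not what the IG Publisher actually sends. The function name validate_code_get and the CURL hook are hypothetical.

```shell
#!/bin/sh
# CURL is a hypothetical override hook; by default the real curl is used.
CURL="${CURL:-curl}"

# Validate a code against a value set with a plain GET.
#   $1 = server base URL   $2 = value set canonical URL
#   $3 = code system URL   $4 = code
validate_code_get() {
  base="$1"; valueset="$2"; system="$3"; code="$4"
  # -G turns the --data-urlencode pairs into a URL-encoded query string.
  "$CURL" -sS -G --max-time 30 \
    -H 'Accept: application/fhir+json' \
    --data-urlencode "url=$valueset" \
    --data-urlencode "system=$system" \
    --data-urlencode "code=$code" \
    "$base/ValueSet/\$validate-code"
}
```

For example: validate_code_get https://r4.ontoserver.csiro.au/fhir http://hl7.org/fhir/ValueSet/doc-typecodes http://loinc.org 55751-2. The response is a Parameters resource whose result parameter is true or false.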

view this post on Zulip Josh Mandel (Jul 21 2021 at 01:15):

OK, we'll see how this goes. (Added a variable to control the TX server in IG build, and set this to https://r4.ontoserver.csiro.au/fhir in the current trigger.)

view this post on Zulip Rob Hausam (Jul 21 2021 at 01:21):

Thanks, @Josh Mandel. I am curious what the volume and impact of this will be, @Michael Lawley. Everyone, please let us know how this change is working for you - particularly if you see issues (and hopefully there won't be many, if any, of those - or they will be easy to solve).

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 01:26):

Just redelivered my last payload from yesterday, will keep you posted...

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 01:29):

It worked, it worked! Haven't checked QA yet, but happy it ran without crashing!

view this post on Zulip Michael Lawley (Jul 21 2021 at 01:29):

How was speed from your perspective?

view this post on Zulip Josh Mandel (Jul 21 2021 at 01:31):

Looks like my first test in the auto-build pipeline built successfully against https://r4.ontoserver.csiro.au/fhir

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 01:32):

Well, I'm not actually sure it was hitting your server, @Michael Lawley - I think this was something I sent right after @Rob Hausam said the tx.fhir.org was back up and before moving to the au one.

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 01:33):

It took 42 mins.... I just assumed I hadn't hit the rebuild button properly or something, as it took so long.

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 01:33):

I should have another one soon. Unless maybe it's my IG that is killing everything! :fear:

view this post on Zulip Rob Hausam (Jul 21 2021 at 01:36):

Yes, it looks like your current one for case-reporting says Connect to Terminology Server at http://tx.fhir.org, @Sarah Gaunt.

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 01:42):

Yes. Not sure why the IG is taking so long all of a sudden. I sent another payload after you switched over and it's not finished yet. Will have to do some detective work.

view this post on Zulip Rob Hausam (Jul 21 2021 at 01:43):

A lot of them really do take a long time - IPS has been, too.

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 01:45):

Yeah, but yesterday it was taking less than 10 mins... Not 40+... I must have changed something that has made that happen.

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 01:52):

IPS is taking 7 mins - that's not slow. :)

view this post on Zulip Lloyd McKenzie (Jul 21 2021 at 01:54):

I expect that the calls to the remote tx server take a reasonable amount longer than the ones to the local server.

view this post on Zulip Lloyd McKenzie (Jul 21 2021 at 01:54):

Can the core build also be adjusted to use ontoserver for now? There's no hope of a core build completing in < 45 minutes...

view this post on Zulip Rob Hausam (Jul 21 2021 at 01:55):

@Sarah Gaunt I agree - that's pretty fast. Maybe it's Michael's server that's speeding it up! I have a lot of errors now in this older version of IPS - but it looks like they are most likely from the latest IG Publisher updates, rather than the terminology server.

view this post on Zulip Michael Lawley (Jul 21 2021 at 01:55):

BTW, are local tx caches being cleared beforehand?

view this post on Zulip Rob Hausam (Jul 21 2021 at 01:55):

@Lloyd McKenzie I was thinking the same. But maybe @Josh Mandel already did that?

view this post on Zulip Michael Lawley (Jul 21 2021 at 01:56):

Main errors that I suspect people might see are missing content, and that we default SNOMED to the AU Edition

view this post on Zulip Josh Mandel (Jul 21 2021 at 01:57):

Lloyd McKenzie: Can the core build also be adjusted to use ontoserver for now?

That's a question for @Mark Iantorno

view this post on Zulip Josh Mandel (Jul 21 2021 at 01:57):

Michael Lawley: BTW, are local tx caches being cleared beforehand?

In the auto-ig-builder pipeline, there is no local state saved across builds.

view this post on Zulip Josh Mandel (Jul 21 2021 at 01:58):

(Some IGs do check a "cache" folder into the repo, which always seems odd to me -- so there's that.)

view this post on Zulip Rob Hausam (Jul 21 2021 at 01:58):

@Michael Lawley That is something that I think people will want to/need to do. But I didn't do that this time for the IPS build. The SNOMED CT AU edition might be an issue for US IGs, but they should be declaring the US Edition, which I think you do support? And for universal ones like IPS the AU edition also includes the International Edition, so I don't think it should be a problem?

view this post on Zulip Michael Lawley (Jul 21 2021 at 01:58):

I'm not seeing any significant load yet

view this post on Zulip Rob Hausam (Jul 21 2021 at 01:59):

I'm stepping away (literally) for a few minutes, but it looks like things are pretty good so far.

view this post on Zulip Josh Mandel (Jul 21 2021 at 01:59):

(We have between 0 and 1 IG building in the pipeline over the past 30min, so _shouldn't_ see too much load from that source.)

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 02:00):

Yes, I declare the US version of SNOMED so it shouldn't be an issue for me anyway.

view this post on Zulip Michael Lawley (Jul 21 2021 at 02:00):

We have the International Edition, but not a US one

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 02:01):

Ah... Will see what happens then!

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 02:10):

Fixed my speed issue - it didn't like some MD links I'd tried to put in a field description. Not sure why that added an extra 35 mins on, but I'll take it. And running on your server @Michael Lawley it took just over 8 mins.

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 02:10):

I am getting errors like: The code 26643006 exists in the CodeSystem, but the display "Oral Route" is incorrect (from https://r4.ontoserver.csiro.au/fhir) for 'http://snomed.info/sct#26643006' now that weren't there before, but I'm going to ignore them for now!

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 02:13):

Weird that it doesn't like that though - both the international and US version have the same term for that conceptId: https://browser.ihtsdotools.org/?perspective=full&conceptId1=404684003&edition=MAIN/2021-01-31&release=&languages=en

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 02:13):

Could be the capital on Route maybe?

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 02:16):

I actually think it's catching valid terminology issues that weren't being caught before.

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 02:19):

Maybe it's case sensitive where tx.fhir.org was not? So far all the ones I've checked have been case issues. (And by "all" I mean "two"! :-) )

view this post on Zulip Rob Hausam (Jul 21 2021 at 02:37):

Interesting - and maybe so. Your IG does specify the US Edition, @Sarah Gaunt?

view this post on Zulip Peter Jordan (Jul 21 2021 at 02:44):

@Michael Lawley does that International Edition use the US or UK English Language Reference Set?

view this post on Zulip Michael Lawley (Jul 21 2021 at 02:50):

Yes, @Sarah Gaunt it will be the case. We're quite conservative here because changing case for certain things is bad.

view this post on Zulip Michael Lawley (Jul 21 2021 at 02:53):

US English should be our default for SCT Int

view this post on Zulip Mark Iantorno (Jul 21 2021 at 03:04):

In terms of getting the core tooling to use the other server. This is possible, but does the ontoserver actually handle all the functionality necessary?

view this post on Zulip Mark Iantorno (Jul 21 2021 at 03:06):

Can someone who runs the onto server have a quick call with me to outline the server endpoints and functionality?

view this post on Zulip Jim Steel (Jul 21 2021 at 03:15):

Sure

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 03:17):

@Rob Hausam Yes, it specifies the US Edition of SNOMED.

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 03:22):

Definitely getting more terminology errors than before. Haven't checked them all out yet. I think stuff like the following might be because of the US vs AUS thing: The code '55751-2' from code system 'http://loinc.org' was not found in value set http://hl7.org/fhir/ValueSet/doc-typecodes (from https://r4.ontoserver.csiro.au/fhir) That's a US public health code: https://loinc.org/55751-2/ and I _think_ we are using it correctly...

Or maybe it's not checking all the codes (there are only the first 1000 listed in the value set) http://hl7.org/fhir/R4/valueset-doc-typecodes.html?

view this post on Zulip Jim Steel (Jul 21 2021 at 03:28):

The problem with that ValueSet (from Ontoserver's perspective) is that:

  • its definition is basically "all codes where SCALE_TYP = Doc"
  • SCALE_TYP is a code-typed property (according to https://www.hl7.org/fhir/loinc.html), but
  • Doc is the label, not the code
  • it's possible it should be "all codes where SCALE_TYP = LP32888-7"
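
(Editorial sketch) The two readings of the filter can be written out as a ValueSet `compose`, built here as a Python dict. The helper name `loinc_property_include` is hypothetical, and this is only a guess at the shape of such a definition, not the published one. The two filters sit in separate `include` entries because filters inside a single `include` must all hold, whereas separate includes are unioned.

```python
import json


def loinc_property_include(prop, value):
    """Build one compose.include entry selecting LOINC codes by a property filter."""
    return {
        "system": "http://loinc.org",
        "filter": [{"property": prop, "op": "=", "value": value}],
    }


doc_typecodes_compose = {
    "include": [
        # Reading 1: the filter value is the label.
        loinc_property_include("SCALE_TYP", "Doc"),
        # Reading 2: the filter value is the LOINC part code.
        loinc_property_include("SCALE_TYP", "LP32888-7"),
    ]
}

print(json.dumps(doc_typecodes_compose, indent=2))
```

A server that treats SCALE_TYP strictly as a code-typed property would only match on the second include, while a server that matches on labels would only need the first.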

view this post on Zulip Jim Steel (Jul 21 2021 at 03:30):

@Rob Hausam Is that different from how HAPI/tx.fhir.org interpret LOINC properties?

view this post on Zulip Michael Lawley (Jul 21 2021 at 03:34):

I have just updated R4's copy of http://hl7.org/fhir/ValueSet/doc-typecodes to also include a filter as @Jim Steel suggests and the expansion now includes 55751-2

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 03:34):

Sweet - thanks!

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 03:35):

Getting this on all the jurisdiction elements: Code System URI 'urn:iso:std:iso:3166' is unknown so the code cannot be validated

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 03:36):

Also: Code System URI 'http://unitsofmeasure.org' is unknown so the code cannot be validated

view this post on Zulip Rob Hausam (Jul 21 2021 at 03:36):

Has that completely answered the question? Technically, Jim is correct that the value should actually be the LP32888-7 code, rather than the label 'Doc'. The official LOINC server, also based on HAPI, I'm pretty sure now handles it all that way, using the LP codes (I worked on that with them for a while, but haven't checked it lately).

view this post on Zulip Rob Hausam (Jul 21 2021 at 03:37):

I wondered if UCUM was going to be covered, in the same way that it is in tx.fhir.org?

view this post on Zulip Michael Lawley (Jul 21 2021 at 03:37):

I was just about to suggest that a tracker is needed to correct the ValueSet definition.

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 03:37):

That iso:3166 doesn't matter - that's not an error - just an informational message.

view this post on Zulip Jim Steel (Jul 21 2021 at 03:54):

I put up a copy of iso 3166 and bcp47 (languages)

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 04:40):

Not sure why this is failing The value provided ('application/xml') is not in the value set http://hl7.org/fhir/ValueSet/mimetypes|4.0.1 (http://hl7.org/fhir/ValueSet/mimetypes), and a code is required from this value set) (error message = Error from server: Error:org.hl7.fhir.r5.model.CodeableConcept@408a7a97 )

I originally just had "xml" which failed, so changed it to "application/xml" which is still failing.

view this post on Zulip Michael Lawley (Jul 21 2021 at 05:24):

That ValueSet depends on Code system urn:ietf:bcp:13 which we don't have

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 05:27):

Any idea on this Observation.value.ofType(CodeableConcept) (l285/c27) error Error from server: Error:org.hl7.fhir.r5.model.CodeableConcept@19fe697c - getting quite a few like that which weren't failing before.

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 05:30):

Look like they are mostly SNOMED codes that are giving that error. Maybe something to do with the fact that I'm using the US version...

view this post on Zulip Sarah Gaunt (Jul 21 2021 at 05:31):

e.g. Condition.code (l18/c11) error Error from server: Error:org.hl7.fhir.r5.model.CodeableConcept@78f21c53
from

<code>
        <coding>
            <system value="http://snomed.info/sct"/>
            <code value="82272006"/>
            <display value="Common cold (disorder)"/>
        </coding>
    </code>

view this post on Zulip Max Masnick (Jul 21 2021 at 09:49):

Josh Mandel said:

(Some IGs do check a "cache" folder into the repo, which always seems odd to me -- so there's that.)

One reason to do this is that the output of the IG publisher changes depending on whether the cache/ folder is "hot" or not: https://github.com/HL7/fhir-ig-publisher/issues/231

view this post on Zulip Max Masnick (Jul 21 2021 at 10:16):

Chris Moesel said:

...
The -tx n/a means "don't use a terminology server". So if we use captive.apple.com, then if the internet is available but the terminology server is down, it will try to use the terminology server during the build (since it doesn't go into the else clause). I'm not sure this is what we want.

Thanks for catching this, Chris. I closed out that pull request and opened a new issue describing what I think the logic should be. Does this look right to everyone?
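
(Editorial sketch) One possible reading of the concern above: decide on `-tx n/a` by probing the terminology server itself, not just general internet connectivity. This is not the logic from the pull request or the new issue, neither of which is reproduced here. The function name `tx_args` and its two boolean inputs are hypothetical.

```python
def tx_args(internet_up, tx_up):
    """Return extra publisher arguments based on two reachability checks.

    internet_up: result of a generic connectivity probe (e.g. a captive-portal URL)
    tx_up:       result of probing the terminology server itself
    """
    if internet_up and tx_up:
        # Nothing extra: let the tool use its default terminology server.
        return []
    # Either we are offline or the terminology server is down:
    # tell the tool not to use a terminology server at all.
    return ["-tx", "n/a"]


print(tx_args(internet_up=True, tx_up=True))    # []
print(tx_args(internet_up=True, tx_up=False))   # ['-tx', 'n/a']
print(tx_args(internet_up=False, tx_up=False))  # ['-tx', 'n/a']
```

The middle case is the one Chris describes: with only a generic connectivity probe, the build would still try to use a terminology server that is down.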

view this post on Zulip Michael Lawley (Jul 21 2021 at 10:24):

SNOMED US (20210301) is now in Ontoserver.

view this post on Zulip Michael Lawley (Jul 21 2021 at 10:26):

@Sarah Gaunt I don't know where that ugly error message is coming from though - the Ontoserver code doesn't reference anything in the org.hl7.fhir.r5.model package. I can only imagine it's coming out of the validator itself. I'll try to dig into our logs to diagnose.

view this post on Zulip Rob Hausam (Jul 21 2021 at 13:09):

@Michael Lawley Are you seeing any noticeable load yet (from the CI builds or otherwise)?

view this post on Zulip Rob Hausam (Jul 21 2021 at 13:10):

I have tx.fhir.org up again now and it's getting what appears to be a normal stream of activity.

view this post on Zulip David Pyke (Jul 21 2021 at 13:14):

So, assuming it stays up, can we somehow set ontoserver as the failover for when something like this happens again (or just any time tx goes down)?

view this post on Zulip Rob Hausam (Jul 21 2021 at 13:15):

That seems like it would be a reasonable short to medium term goal - not sure how much effort it would take to make that happen.
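
(Editorial sketch) A failover of this kind could, in principle, probe each candidate's capability statement at `[base]/metadata` and pick the first server that answers. This is a minimal illustration, not how the IG publisher or the auto-build actually selects a server. The function names `is_up` and `first_available` are hypothetical, and whether a fallback server gives acceptable results is a separate question.

```python
import http.client
import urllib.request


def is_up(base_url, timeout=10):
    """Return True if GET [base]/metadata answers with HTTP 200 within the timeout."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/metadata",
        headers={"Accept": "application/fhir+json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and HTTP error statuses.
        return False


def first_available(servers, probe=is_up):
    """Return the first server for which the probe succeeds, or None."""
    for url in servers:
        if probe(url):
            return url
    return None


# Candidate order is illustrative; both URLs appear earlier in this thread.
CANDIDATES = ["http://tx.fhir.org", "https://r4.ontoserver.csiro.au/fhir"]
```

Calling `first_available(CANDIDATES)` performs real network requests, so the probe is injectable to allow testing without a network.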

view this post on Zulip Michael Lawley (Jul 21 2021 at 13:23):

I'm not seeing any real load at all

view this post on Zulip Michael Lawley (Jul 21 2021 at 13:33):

By that I mean nothing that is affecting response time. We did see a distinct change in # requests 2.5 hrs ago:
image.png

view this post on Zulip AbdulMalik Shakir (Jul 21 2021 at 13:37):

Sarah Gaunt said:

Definitely getting more terminology errors than before. Haven't checked them all out yet. I think stuff like the following might be because of the US vs AUS thing: The code '55751-2' from code system 'http://loinc.org' was not found in value set http://hl7.org/fhir/ValueSet/doc-typecodes (from https://r4.ontoserver.csiro.au/fhir) That's a US public health code: https://loinc.org/55751-2/ and I _think_ we are using it correctly...

Or maybe it's not checking all the codes (there are only the first 1000 listed in the value set) http://hl7.org/fhir/R4/valueset-doc-typecodes.html?

@Sarah Gaunt I'm having a similar issue "The code '64297-5' from code system 'http://loinc.org' was not found in value set http://hl7.org/fhir/ValueSet/doc-typecodes". This is a new error not previously encountered. '64297-5' is a valid LOINC code for Death certificate. Is the doc-typecodes value set incomplete?

view this post on Zulip Mark Iantorno (Jul 21 2021 at 13:40):

Anyone using the validator or publisher can specify the terminology server they want to use on the command line when they use the tool. I am hesitant to actually go into the project and switch the dependency to, or add a dependency on a closed source, private server. Having it as a choice for users is one thing, but using it as a default backup or primary for all users is another thing. I have a meeting with @Rob Hausam today to discuss some options and details for the terminology server, so I can get a better idea of how it's set up, and I'm hoping to have a meeting with @Jim Steel later this week to talk more about the onto server.

view this post on Zulip Grahame Grieve (Jul 21 2021 at 22:03):

Ontoserver is not a viable replacement for tx.fhir.org at this time. It's on my list to find time to document the issues, but FMG can't approve IGs based on it at this time

view this post on Zulip Grahame Grieve (Jul 21 2021 at 22:04):

apologies for the fact that tx.fhir.org suddenly fell over; I've been reading the transcript and I still can't figure out what actually happened. And sad that it happened while I was off line for a week

view this post on Zulip Lloyd McKenzie (Jul 21 2021 at 22:05):

@Grahame Grieve - FMG can't approve IGs to go to ballot, to go to publication or either?

view this post on Zulip Lloyd McKenzie (Jul 21 2021 at 22:05):

We're hitting a point where approval to ballot is going to be pretty much non-optional...

view this post on Zulip Lloyd McKenzie (Jul 21 2021 at 22:06):

We could possibly hold off on non-ballot publications for a while unless they had some pressing urgency

view this post on Zulip Rob Hausam (Jul 21 2021 at 22:06):

It's not at all surprising that Ontoserver isn't a full replacement (certainly at present). But it was of help temporarily.

view this post on Zulip Max Masnick (Jul 22 2021 at 12:50):

Is it safe to switch build.fhir.org from Ontoserver back to tx.fhir.org? I'm seeing some QA errors which are resolved locally by switching back to tx.fhir.org.

view this post on Zulip Josh Mandel (Jul 22 2021 at 12:59):

Happy to try it. I could also add a flag on the webhook to make this configurable, but I'm not sure we want to expose that much flexibility.

view this post on Zulip Max Masnick (Jul 22 2021 at 13:13):

Rob Hausam said:

I have tx.fhir.org up again now and it's getting what appears to be a normal stream of activity.

I think ↑ means that tx.fhir.org is up again?

It would be great if we had something like https://www.basecampstatus.com/index.html so we could see if there were issues over time in addition to the traffic lights on https://validator.fhir.org

view this post on Zulip Max Masnick (Jul 22 2021 at 13:13):

(That status page is from https://www.atlassian.com/software/statuspage, which I think we can use for free)

view this post on Zulip Max Masnick (Jul 22 2021 at 13:15):

In any case it looks like it's up now, but I don't know if it's stable

view this post on Zulip Rob Hausam (Jul 22 2021 at 13:36):

I think we need to try using tx.fhir.org again. But since the code hasn't changed (yet), it's certainly possible (likely?) that we will see the same thing that we did before - at least at some point. @Ted Klein is running another UTG build right now on Ontoserver, to check a solution for an error that he has been getting since we made the switch. It would be good for that to complete first.

view this post on Zulip Rob Hausam (Jul 22 2021 at 13:37):

@Josh Mandel

view this post on Zulip Josh Mandel (Jul 22 2021 at 13:55):

The way the switchover is implemented, it does not affect any builds currently in progress. It will just affect the next build submitted to the auto build pipeline. The configuration for a given build does not change mid-flight.

view this post on Zulip Josh Mandel (Jul 22 2021 at 13:56):

Anyway, let me know when you want me to switch back over to tx.fhir.org and I will do so.

view this post on Zulip Rob Hausam (Jul 22 2021 at 14:00):

I was pretty sure you had it configured that way. The UTG build still failed. We can switch it now, and I will keep a close eye on it. But if it does get out of hand again we may need to switch it back. I could also check first with the HQ folks and see if they are ready to be able to give us more resources on the server. But I'm thinking that maybe we should switch back now to the way that it was before first, to verify that we still are running into the same problems - before we invest in the additional resources. So I would say let's go ahead and do the switch now.

view this post on Zulip Rob Hausam (Jul 22 2021 at 14:03):

At this point the server has been running for 25 hours straight without a further problem - we'll see what happens.

view this post on Zulip Josh Mandel (Jul 22 2021 at 14:05):

OK, pushed the config update. (Re: server resources, I think adjustments can be made in Google Cloud Console directly if we wanted to allocate more RAM to the tx server.)

view this post on Zulip Rob Hausam (Jul 22 2021 at 14:07):

Thanks. Yes, I agree we could make the adjustments there. I don't believe I have any access for that - but presumably you do?

view this post on Zulip Josh Mandel (Jul 22 2021 at 15:33):

I do. We can discuss in the tx/internal chat.

view this post on Zulip John Moehrke (Jul 22 2021 at 18:31):

seems tx.hl7.org has been far more stable.. i think that it correlates with @Rob Hausam building IPS elsewhere... :-)

view this post on Zulip John Moehrke (Jul 22 2021 at 18:31):

:-)

view this post on Zulip Jean Duteau (Jul 28 2021 at 17:07):

looks down again... all of the CI builds are failing

view this post on Zulip David Pyke (Jul 28 2021 at 17:20):

Paging @Rob Hausam. Please pick up the white courtesy phone

view this post on Zulip Jose Costa Teixeira (Jul 28 2021 at 17:24):

@Gino Canessa the machine doesn't go bing!

view this post on Zulip David Pyke (Jul 28 2021 at 17:34):

giphy.webp

view this post on Zulip Gino Canessa (Jul 28 2021 at 17:36):

Yes, Mark is looking at it.

view this post on Zulip Gino Canessa (Jul 28 2021 at 17:42):

Ok, it's starting up. Should be online in a few minutes (10-ish).

view this post on Zulip Rob Hausam (Jul 28 2021 at 17:47):

Yeah. I picked up the phone a little late - but others were able to. :)

view this post on Zulip Barbro Vessman (Aug 23 2021 at 14:20):

Hello, term server seems to be down

view this post on Zulip Rob Hausam (Aug 23 2021 at 14:22):

Yes, it should be restarting in a moment.

view this post on Zulip Rob Hausam (Aug 23 2021 at 14:33):

It's back up now.

view this post on Zulip Barbro Vessman (Aug 23 2021 at 14:39):

Thank you @Rob Hausam !

view this post on Zulip Ramandeep Dhanoa (Aug 26 2021 at 20:03):

Hello, is it possible that the term server is down? I am getting this error "Attempt to use Terminology server when no Terminology server is available"

view this post on Zulip Lloyd McKenzie (Aug 26 2021 at 20:07):

@Rob Hausam @Grahame Grieve

view this post on Zulip Rob Hausam (Aug 26 2021 at 20:11):

@Ramandeep Dhanoa It is up now. And according to the monitor it has been steadily up for approx. the past 2 hours. Have you tried building again?

view this post on Zulip Ramandeep Dhanoa (Aug 26 2021 at 21:50):

I see, I will try to debug if something is messed up locally. Thanks @Rob Hausam

view this post on Zulip David Simons (Sep 07 2021 at 07:13):

image.png

Getting Error sending HTTP Post/Put Payload: tx.fhir.org:80 failed to respond from the hl7validator...

What does the 'pink' status mean, btw? (compared to red)

HTTP Caching of POST/PUT calls is not trivial, right?
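
(Editorial sketch) Part of what makes caching POST calls harder than caching GETs is that the request body, not just the URL, determines the answer, so a cache key has to cover both. A minimal in-memory illustration follows. The class name `PostCache` and the `send` callable are hypothetical, and this is not the validator's own terminology cache.

```python
import hashlib


class PostCache:
    """Cache POST responses keyed by a hash of the URL plus the request body."""

    def __init__(self, send):
        self._send = send    # callable (url, body_bytes) -> response
        self._store = {}

    def post(self, url, body):
        key = hashlib.sha256(url.encode("utf-8") + b"\n" + body).hexdigest()
        if key not in self._store:
            # Cache miss: forward the request and remember the response.
            self._store[key] = self._send(url, body)
        return self._store[key]


# Usage with a stand-in sender that just records what it was asked.
calls = []
cache = PostCache(lambda url, body: calls.append((url, body)) or "response")
cache.post("http://tx.fhir.org/r4/ValueSet/$validate-code", b"<Parameters/>")
cache.post("http://tx.fhir.org/r4/ValueSet/$validate-code", b"<Parameters/>")
print(len(calls))  # 1: the second identical request was served from the cache
```

A real cache would also need an expiry policy and a stable serialization of the body, since two semantically identical requests can differ byte for byte.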

view this post on Zulip Rob Hausam (Sep 07 2021 at 12:37):

@David Simons I'm not certain what that pink color means, either. But according to the monitor the service is ok, and it seems to be responding normally to me locally and it looks ok on the console output. Are you seeing issues with it on your end?

view this post on Zulip David Simons (Sep 07 2021 at 12:42):

Rob Hausam said:

Are you seeing issues with it on your end?

Thank you @Rob Hausam - we kept getting the above Errors for the last day - responding intermittently. currently tx.fhir.org seems to be responding again to our hl7validator calls

We really appreciate being able to use tx.fhir.org - yet the significant downtime is also forcing us to look into alternatives - which is not trivial though.

I'd rather see this addressed _behind_ the tx.fhir.org endpoint - with scalability and availability and caching measures.

view this post on Zulip Rob Hausam (Sep 07 2021 at 13:41):

@David Simons We completely agree with you on addressing the issue(s) on the server itself, behind the endpoint. @Grahame Grieve has been working on that, and I think that's made some improvements (regarding memory utilization, etc.). But you're certainly right that the level of reliability isn't yet where we all want it to be. And I'm pretty sure that Grahame has more that he is planning to do that should further improve it.

view this post on Zulip David Simons (Sep 07 2021 at 14:14):

Rob Hausam said:

I'm not certain what that pink color means, either.

Thanks - not a major issue.
Maybe use icons like Consumer Reports does - incl. arrows next to colors to distinguish

image.png

view this post on Zulip Grahame Grieve (Sep 07 2021 at 19:01):

it's up and been up continually, so I'm not sure what the issue is. If you want to watch it, #tx.fhir.org/notification

view this post on Zulip David Pyke (Sep 07 2021 at 19:20):

That channel is private...

view this post on Zulip Grahame Grieve (Sep 07 2021 at 19:22):

oh. I'll change that

view this post on Zulip David Simons (Sep 08 2021 at 11:26):

Thank you @Grahame Grieve - certainly want to help to find the root cause - and we will double-check if it is anything between our hl7validator and the tx.fhir.org endpoint.
It seems to be intermittent - and more time-out related than unreachable.
Will gather some more data for us to share.

That said, https://validator.fhir.org/ is showing pink/red as I type this:
image.png
and getting this error right now:
image.png

We _really_ appreciate all the work you guys do - and our feedback is intended to be constructive!

===
from other thread - for completeness
image.png

view this post on Zulip Grahame Grieve (Sep 08 2021 at 11:31):

that's running the test cases?

view this post on Zulip David Simons (Sep 08 2021 at 14:56):

Grahame Grieve said:

that's running the test cases?

Hi @Grahame Grieve - no actually - that's running the hl7validator against our company's Innersource (internal) set of FHIR profiles - to validate them against the FHIR standard and any referenced (internal+external) profiles and terminologies.
And yes, this particular snippet is an example.xml Resource instance to test a structuredefinition.xml profile class.

view this post on Zulip David Simons (Sep 08 2021 at 15:03):

We're testing to see if it is related to network connectivity from our build pipeline on GitHub to tx.fhir.org - by running also from other local machines - and see how that compares...
I can access http://tx.fhir.org/ from my local browser, so that's good.

view this post on Zulip David Simons (Sep 08 2021 at 15:32):

This is what I mean with intermittently - only with some example Resources, the below are all very similar actually, so not all calls to tx fail, and not always the same ones, and seemingly only with stu3, not r4 - continuing to debug here...

2021-09-08T15:06:34.9123045Z Load FHIR v3.0 from hl7.fhir.r3.core#3.0.2 - 4017 resources (00:04.0228)
2021-09-08T15:06:36.0568771Z Load hl7.terminology#2.1.0 - 3767 resources (00:01.0144)
2021-09-08T15:06:38.0131260Z Terminology server http://tx.fhir.org - Version 1.9.382 (00:01.0956)
2021-09-08T15:06:38.5476364Z Load /usr/data/com.philips.fhir.stu3.common - 296 resources (00:00.0534)
2021-09-08T15:06:42.1180117Z Get set... go (00:03.0570)
2021-09-08T15:06:42.1191164Z Validating
2021-09-08T15:06:46.1211188Z Validate ACTBld.example.xml 00:03.0999
2021-09-08T15:06:48.0684432Z Validate ACTBldWithMultipleCodes.example.xml 00:01.0945
2021-09-08T15:06:51.2082604Z Validate APTTPPP.example.xml 00:03.0139
2021-09-08T15:06:56.3539933Z Validate BMIVitalSign.example.xml 00:05.0145
2021-09-08T15:06:57.7532603Z Validate BMIVitalSignWithMultipleCodes.example.xml 00:01.0394
2021-09-08T15:07:02.8003955Z Validate BNPSerPlsCnc.example.xml 00:05.0050
2021-09-08T15:07:04.0186035Z Validate BUNSerPlsCnc.example.xmltx.fhir.org:80 failed to respond (231ms / 2Kb for POST org.hl7.fhir.dstu3.model.ValueSet/$validate-code)
2021-09-08T15:07:05.2268610Z 00:02.0426
2021-09-08T15:07:14.2196766Z Validate BloodPressureVitalSign.example.xml 00:08.0992
2021-09-08T15:07:18.5999249Z Validate BodyTemperatureVitalSign.example.xml 00:04.0379
2021-09-08T15:07:22.3745163Z Validate CKMBSerPlmCnc.example.xml 00:03.0775
2021-09-08T15:07:26.4069934Z Validate CRPSerPlsCnc.example.xml 00:04.0032
2021-09-08T15:07:28.1204818Z Validate CalciumSerPlsCnc.example.xmltx.fhir.org:80 failed to respond (232ms / 563b for POST org.hl7.fhir.dstu3.model.CodeSystem/$validate-code)
2021-09-08T15:07:29.1415234Z 00:02.0734
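
(Editorial sketch) To see which operations fail across many build logs, lines like the ones above can be counted with a small script. The regular expression is derived only from the two failure lines shown here and may not match other message variants. The function name `summarize_failures` is hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "failed to respond (231ms / 2Kb for POST <operation>)"
FAILED = re.compile(r"failed to respond \((\d+)ms / \S+ for POST (\S+)\)")


def summarize_failures(lines):
    """Count 'failed to respond' occurrences per POSTed operation."""
    ops = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            ops[match.group(2)] += 1
    return ops


sample = [
    "Validate BNPSerPlsCnc.example.xml 00:05.0050",
    "Validate BUNSerPlsCnc.example.xmltx.fhir.org:80 failed to respond "
    "(231ms / 2Kb for POST org.hl7.fhir.dstu3.model.ValueSet/$validate-code)",
    "Validate CalciumSerPlsCnc.example.xmltx.fhir.org:80 failed to respond "
    "(232ms / 563b for POST org.hl7.fhir.dstu3.model.CodeSystem/$validate-code)",
]

for operation, count in summarize_failures(sample).items():
    print(count, operation)
```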

view this post on Zulip David Simons (Sep 08 2021 at 16:26):

Going back in our build-logs - the first occurrence I find is 23-AUG-2021, but there has been no change in tx.fhir.org version 1.9.378 and hl7validator 5.4.6 around that time... The errors have changed from Error sending HTTP Post/Put Payload: Read timed out for 'http://loinc.org#8287-5' then, to Error sending HTTP Post/Put Payload: tx.fhir.org:80 failed to respond lately, with tx.fhir.org version 1.9.382 and hl7validator 5.4.6 today

view this post on Zulip David Simons (Sep 08 2021 at 16:44):

Last observation for now is that I am now also seeing errors for r4-based runs, which, surprisingly, I had not seen before...

*FAILURE*: 2 errors, 0 warnings, 0 notes
  Error @ Observation.code.coding[0] (line 20, col13) : Error performing operation 'validate-code: unexpected end of stream on http://tx.fhir.org/...' (parameters = "") for 'http://loinc.org#14682-9'
  Error @ Observation.value.ofType(Quantity) (line 30, col18) : Error performing operation 'validate-code: unexpected end of stream on http://tx.fhir.org/...' (parameters = "") for 'http://unitsofmeasure.org#umol/L'

view this post on Zulip Grahame Grieve (Sep 08 2021 at 19:16):

failed to respond in 231ms? That seems... ambitious.

view this post on Zulip Grahame Grieve (Sep 08 2021 at 19:17):

how long do you give it?

view this post on Zulip David Simons (Sep 08 2021 at 19:35):

The key question is indeed what entity is giving up so quickly - we only run the hl7validator that is making these calls... :) (which has a 15s-30s timeout, right?)
Does it inherit some settings or get overruled from our GitHub Actions/Runners environment perhaps? Will look into that tomorrow - maybe that environment was changed by an admin, without us knowing... Indeed looks like something keeps cutting the connection off - just too soon at times.

view this post on Zulip Grahame Grieve (Sep 08 2021 at 20:34):

No. looking at the server... the server is having problems with phantom connections, and it's refusing to make more than 50 connections at once. That may be the problem here. I'm not sure how to deal with systems making connections that are never going to do anything

view this post on Zulip David Simons (Sep 09 2021 at 08:00):

OK - we are already reducing the concurrent builds here - throttling back as much as possible - to avoid overload.
Maybe it is caused by java HTTPClient bugs in the hl7validator leaving stale/phantom connections - but that's guessing.
Will also reach out to our github selfhosted-runners admin to see if they recently put in a network constraint that is hitting us.
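
(Editorial sketch) On the client side, throttling and tolerating intermittent failures could look roughly like this: a semaphore caps concurrent calls and a retry loop backs off exponentially. The names `call_with_retry`, `throttled`, and `MAX_CONCURRENT` are hypothetical, the limit of 4 is an arbitrary example, and none of this reflects what the hl7validator does internally.

```python
import threading
import time


def call_with_retry(fn, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on connection-level errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...


MAX_CONCURRENT = 4  # hypothetical cap, chosen to stay well under a server-side limit
_slots = threading.BoundedSemaphore(MAX_CONCURRENT)


def throttled(fn):
    """Run fn() with retries while holding one of the limited concurrency slots."""
    with _slots:
        return call_with_retry(fn)
```

Here `fn` would be whatever performs a single terminology request. The `sleep` parameter is injectable so the backoff can be tested without real delays.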

view this post on Zulip Grahame Grieve (Sep 09 2021 at 09:20):

no I think it's other clients. It'll go for a day and suddenly it's getting hammered for a few minutes. It might survive or it might not.

view this post on Zulip Grahame Grieve (Sep 09 2021 at 09:20):

I'm still investigating

view this post on Zulip Grahame Grieve (Sep 09 2021 at 09:21):

but you know, investigating occasional failed connections is not the same as investigating 'down'

view this post on Zulip David Simons (Sep 09 2021 at 12:46):

do you need my IP addresses to confirm it is other users? or did you mean other client applications than hl7validator

view this post on Zulip Rob Hausam (Sep 09 2021 at 12:51):

I've been seeing very similar (seemingly rather random) patterns of query timeouts and intermittent failures. I think it's unlikely to be unique to @David Simons or your environment. @Grahame Grieve

view this post on Zulip David Simons (Sep 09 2021 at 14:04):

Interestingly enough - when it happens - randomly/intermittently - in 99% of the cases for us the timeouts occur during validation runs with -version 3.0. Hardly happens with -version 4.0, fwiw...
image.png

view this post on Zulip Grahame Grieve (Sep 09 2021 at 19:16):

code is exactly the same for both end-points

view this post on Zulip David Simons (Sep 09 2021 at 19:43):

Grahame Grieve said:

code is exactly the same for both end-points

on the tx.fhir.org server you mean - but I was referring to the hl7validator client - where afaik the TXClient code is quite different between 1/dstu2/stu3 and r4/r5, unfortunately. Thanks for your help!

view this post on Zulip Grahame Grieve (Sep 09 2021 at 20:13):

well, it is, but that's about to change. see https://github.com/hapifhir/org.hl7.fhir.core/pull/599

view this post on Zulip David Simons (Sep 10 2021 at 12:43):

ok, we're upgrading to https://github.com/hapifhir/org.hl7.fhir.core/releases/tag/5.5.3 :)

view this post on Zulip David Simons (Sep 10 2021 at 15:15):

David Simons said:

ok, we're upgrading to https://github.com/hapifhir/org.hl7.fhir.core/releases/tag/5.5.3 :)

As an update @Grahame Grieve - upgrading to the hl7validator 5.5.3 seems to have improved things a lot - maybe even resolved.
Haven't seen time-outs yet - will keep monitoring.

Upgrading the HTTP Client code for STU3 may have been the key.

view this post on Zulip David Simons (Sep 13 2021 at 08:43):

Thanks again @Grahame Grieve and team - looking much more stable for us now.

image.png

view this post on Zulip David Simons (Sep 13 2021 at 08:45):

PS: You may also want to update https://validator.fhir.org to 5.5.3
image.png
image.png

view this post on Zulip Grahame Grieve (Sep 13 2021 at 18:21):

@Mark Iantorno

view this post on Zulip Mark Iantorno (Sep 14 2021 at 14:34):

I will look when I get back. Just moved yesterday and my house is all in boxes. I'm back on Friday and will check it out.

view this post on Zulip Mark Iantorno (Sep 14 2021 at 14:35):

Thanks for using the online validator btw
I have a bit of work going in soon to add line numbers to the editor online when entering JSON and xml

view this post on Zulip Rick Geimer (Oct 12 2021 at 16:53):

Down again for me

view this post on Zulip Grahame Grieve (Oct 12 2021 at 18:29):

seems to be back now

view this post on Zulip Sarah Gaunt (Oct 14 2021 at 23:46):

Think it is down again: Publishing Content Failed: Unable to connect to terminology server. Error = Error fetching the server's capability statement: connect timed out (00:30.0682)

view this post on Zulip Sarah Gaunt (Oct 15 2021 at 02:20):

Looks like it's back - thanks to whoever fixed it!

view this post on Zulip Rob Hausam (Oct 15 2021 at 02:39):

I think Grahame may have been working on it.

view this post on Zulip Barbro Vessman (Oct 15 2021 at 09:50):

Term server seems to be down...

view this post on Zulip Rob Hausam (Oct 15 2021 at 10:56):

It should be back up now. I think Grahame has been working on it. It's had a lot of restarts so far this morning (my time). Let's see how it does now.

view this post on Zulip Barbro Vessman (Oct 15 2021 at 11:43):

Thank you, now it works! :sun_face:

view this post on Zulip Rick Geimer (Oct 18 2021 at 15:54):

Down again

view this post on Zulip Lloyd McKenzie (Oct 18 2021 at 16:03):

@Rob Hausam @Mark Iantorno

view this post on Zulip Rob Hausam (Oct 18 2021 at 16:29):

It should be back up now (automatically).

view this post on Zulip Rick Geimer (Oct 20 2021 at 16:47):

(deleted)

view this post on Zulip Grahame Grieve (Oct 20 2021 at 16:52):

working for me

view this post on Zulip Rob Hausam (Oct 20 2021 at 16:53):

I just restarted it.


Last updated: Apr 12 2022 at 19:14 UTC