Stream: IG creation
Topic: Term server is down...
Eric Haas (May 24 2021 at 22:03):
Root directory: /scratch/ig-build-temp-R5AB25/repo (00:02.0820)
Core Package hl7.fhir.r4.core#4.0.1
Installing hl7.fhir.r4.core#4.0.1 to the package cache
Fetching:....................................................................................................
Installing: .................................................................................................... done.
Terminology Cache is at /scratch/ig-build-temp-R5AB25/repo/input-cache/txcache. Trimming now (00:16.0772)
Connect to Terminology Server at http://tx.fhir.org (00:16.0775)
Publishing Content Failed: Unable to connect to terminology server. Error = Error fetching the server's capability statement: connect timed out (00:33.0961)
Lloyd McKenzie (May 24 2021 at 22:22):
@Rob Hausam @Mark Iantorno @Grahame Grieve
Grahame Grieve (May 24 2021 at 23:17):
server looks ok to me
Oliver Egger (May 28 2021 at 08:25):
term server looks down again:
org.hl7.fhir.exceptions.FHIRException: Unable to connect to terminology server. Error = Error fetching the server's capability statement: timeout
anyone up for a restart?
Rob Hausam (May 28 2021 at 12:30):
@Oliver Egger The server appears to be up. Are you still having issues?
Pétur Valdimarsson (May 28 2021 at 12:53):
I can report intermittent problems here (Sweden) as well. So far all but 1 build have failed due to timeouts connecting to http://tx.fhir.org. The user interface for it behaves the same way, mixing timeouts with delayed responses. The last attempt was while writing this message.
Michaela Ziegler (May 28 2021 at 12:53):
still having issues with connecting to the terminology server
Grahame Grieve (May 28 2021 at 14:36):
it's coming back up
Rob Hausam (May 28 2021 at 15:34):
Hopefully that restart helped. Let us know if you are still seeing issues.
Jose Costa Teixeira (May 30 2021 at 22:06):
Seems to be down for me
Lloyd McKenzie (May 30 2021 at 22:11):
@Rob Hausam @Grahame Grieve @Mark Iantorno
Rob Hausam (May 30 2021 at 22:12):
I'll take a look.
Rob Hausam (May 30 2021 at 22:21):
Server is back up now.
Barbro Vessman (Jun 08 2021 at 10:17):
I have issues publishing: image.png
Rob Hausam (Jun 08 2021 at 11:29):
Restarting the service now.
Barbro Vessman (Jun 08 2021 at 12:45):
Thank you very much @Rob Hausam . Now it works!
Chris Moesel (Jun 16 2021 at 22:05):
Terminology server seems to be down again:
org.hl7.fhir.exceptions.FHIRException: Unable to connect to terminology server. Error = Error fetching the server's capability statement: timeout
Grahame Grieve (Jun 16 2021 at 22:20):
coming back up
Matthew Tiller (Jun 29 2021 at 16:00):
Is the terminology server down again?
Rob Hausam (Jun 29 2021 at 16:14):
Restarting the service.
Matthew Tiller (Jun 29 2021 at 16:36):
thank you sir
Rob Hausam (Jun 29 2021 at 17:00):
Apologies, but I'm going to restart the server now to install some software - expect it to be down for about 10 minutes.
Rob Hausam (Jun 29 2021 at 17:25):
Back up now.
Sarah Gaunt (Jun 30 2021 at 07:44):
Think it's down again:
Publishing Content Failed: Unable to connect to terminology server. Error = Error fetching the server's capability statement: connect timed out (01:40.0962)
Rob Hausam (Jun 30 2021 at 07:45):
It will be back up soon.
Rob Hausam (Jun 30 2021 at 07:45):
I'm finishing up with loading some content.
Rob Hausam (Jun 30 2021 at 07:57):
Should be back up now.
Sarah Gaunt (Jun 30 2021 at 08:10):
Thanks @Rob Hausam
Roeland Luykx (Jul 05 2021 at 05:53):
The terminology server is down. Is there a regular time when the server is not online due to maintenance?
Regularly, when I want to build in the morning (CET), the server is not online...
Torben M. Hagensen (Jul 05 2021 at 06:43):
Can anyone please restart the server
Diana_Ovelgoenne (Jul 05 2021 at 07:18):
x2
Roeland Luykx (Jul 05 2021 at 08:03):
@Rob Hausam
Christian Nau (Jul 05 2021 at 09:42):
Seems still to be down.
Is there a workaround, to be able to build local IG packages?
Christian Nau (Jul 05 2021 at 09:43):
@Rob Hausam @Grahame Grieve @Mark Iantorno can someone please restart the server? :)
Roeland Luykx (Jul 05 2021 at 09:57):
@Christian Nau yes, with the parameter -tx n/a
Christian Nau (Jul 05 2021 at 10:25):
Thank you @Roeland Luykx !!
Diana_Ovelgoenne (Jul 05 2021 at 11:31):
despite using -tx n/a I get the error Publishing Content Failed: Attempt to use Terminology server when no Terminology server is available
Roeland Luykx (Jul 05 2021 at 11:33):
@Diana_Ovelgoenne that will certainly happen if your build needs the terminology server to be available... let's hope the server is back soon!
Mark Iantorno (Jul 05 2021 at 13:03):
on it now
Mark Iantorno (Jul 05 2021 at 13:05):
just restarted it, give it a couple min
Rob Hausam (Jul 05 2021 at 13:09):
Just saw this. Thanks, @Mark Iantorno. I should be able to get back to finishing setting up the monitor today.
Mark Iantorno (Jul 05 2021 at 13:12):
it's up again
Sarah Gaunt (Jul 08 2021 at 21:55):
Seems to be down again:
Connect to Terminology Server at http://tx.fhir.org (00:11.0648)
Publishing Content Failed: Unable to connect to terminology server. Error = Error fetching the server's capability statement: connect timed out
Sarah Gaunt (Jul 08 2021 at 21:56):
And I also get that same error someone mentioned above even when I use -tx n/a
Publishing Content Failed: Attempt to use Terminology server when no Terminology server is available
Rob Hausam (Jul 08 2021 at 21:58):
The monitor service is running. So that should take care of it - I'm checking if it actually is or not.
Sarah Gaunt (Jul 08 2021 at 21:58):
Thanks @Rob Hausam , re-running to see if it works.
Sarah Gaunt (Jul 08 2021 at 21:59):
No, still failing - will wait.
Rob Hausam (Jul 08 2021 at 22:00):
So actually the monitor wasn't started as a service the last time - so it wasn't running, but it is now. So this wasn't a perfect test, but it should be up shortly.
Rob Hausam (Jul 08 2021 at 22:10):
it's back up now
Lloyd McKenzie (Jul 08 2021 at 22:12):
Sounds like we need a monitor for the monitor service... ;)
Sarah Gaunt (Jul 08 2021 at 22:20):
Works now @Rob Hausam thanks!
Diana_Ovelgoenne (Jul 12 2021 at 10:10):
Server is down again @Rob Hausam
Mark Iantorno (Jul 12 2021 at 11:56):
Just restarted it
Martin Morrey (Jul 12 2021 at 12:13):
Or @Mark Iantorno ? Would be good to get this working again as soon as possible. Thanks!
Mark Iantorno (Jul 12 2021 at 12:38):
Yeah, I restarted it. It should be up now
Mark Iantorno (Jul 12 2021 at 12:39):
if you're ever wondering if it's up or down, you can quickly check by looking at the indicators at the top right of https://validator.fhir.org/
Mark Iantorno (Jul 12 2021 at 12:40):
there are two indicators, one for terminology and one for packages2
Martin Morrey (Jul 12 2021 at 12:48):
That's great. Thank-you :smile:
Rob Hausam (Jul 12 2021 at 13:56):
The server monitor and its auto-restart capability isn't (yet) working quite as expected in all situations. But with a few additional code tweaks I expect that it will be there soon.
Diana_Ovelgoenne (Jul 13 2021 at 06:34):
Server is down again @Mark Iantorno @Rob Hausam
Martin Morrey (Jul 13 2021 at 10:42):
Still down @Mark Iantorno . image.png
Janaka Peiris (Jul 13 2021 at 10:47):
Is there a way to bypass the tx server? It seems to be down from time to time.
Michaela Ziegler (Jul 13 2021 at 10:57):
with the IG publisher: add -tx in your command line
https://confluence.hl7.org/pages/viewpage.action?pageId=35718627#IGPublisherDocumentation-Runningincommandlinemode
Mark Kramer (Jul 13 2021 at 11:31):
@Michaela Ziegler can you clarify what argument you might use for the -tx to avoid the tx server?
Diana_Ovelgoenne (Jul 13 2021 at 11:38):
-tx n/a but I found out last week that if your IG has Bindings, then it doesn't matter if you put the parameter, the Publisher will still try to connect to the terminology server
Mark Kramer (Jul 13 2021 at 11:43):
It would be nice if there was a mode where it would only go to the txcache.
Mark Iantorno (Jul 13 2021 at 12:04):
You're upset the tx server is down, I'm thrilled you're using the monitor on validator.fhir.org
Mark Iantorno (Jul 13 2021 at 12:05):
just restarted it
Diana_Ovelgoenne (Jul 13 2021 at 12:06):
@Mark Iantorno checking it on validator all day long :smile: still that doesn't help to bring it up :frown: we need someone in Europe to be able to restart it too.
Mark Iantorno (Jul 13 2021 at 12:07):
yeah, when Grahame is back we have good coverage. He's just away right now
Mark Iantorno (Jul 13 2021 at 12:07):
Thanks for your patience
Rob Hausam (Jul 13 2021 at 13:03):
Yes, thanks. There is still hope for the monitor, too. :) I'll get with @Gino Canessa again and see if we can get the logic updates worked out.
David Hay (Jul 18 2021 at 06:55):
Just out of interest, what prevents any other terminology server from being used in the IG Publisher? I did try using Ontoserver - just for fun - and it seemed to work ok (though it was by no means an exhaustive test)...
Lloyd McKenzie (Jul 18 2021 at 13:59):
Nothing whatsoever. HL7 publications need to use the HL7 server unless they get FMG permission otherwise, but others are free to use whatever server they wish.
John Moehrke (Jul 19 2021 at 20:14):
the validator.fhir.org says that the terminology server is green... but my IG build says different.
Rob Hausam (Jul 19 2021 at 20:42):
I'm checking.
Rob Hausam (Jul 19 2021 at 21:06):
back up
John Moehrke (Jul 19 2021 at 21:07):
thanks
Sarah Gaunt (Jul 20 2021 at 02:47):
Looks like it's down again...
Lloyd McKenzie (Jul 20 2021 at 02:48):
@Mark Iantorno @Rob Hausam
Rob Hausam (Jul 20 2021 at 02:49):
Yes, I saw that. The monitor is working on restarting it. :)
Sarah Gaunt (Jul 20 2021 at 02:50):
Is that what the pink means?
Rob Hausam (Jul 20 2021 at 02:53):
I'm not sure if that's supposed to be pink - or just a pale red. :) It will be back up soon.
Sarah Gaunt (Jul 20 2021 at 02:59):
Thanks @Rob Hausam
Rob Hausam (Jul 20 2021 at 03:07):
It's taking longer than usual, as it's restarted multiple times before completely finishing the previous restart process. But it shouldn't be too much longer now.
Rob Hausam (Jul 20 2021 at 03:28):
The service has come back up a few times now but then it almost immediately seems to hang again. I'm going to restart the entire machine and see if that gets it back to normal (it should).
Sarah Gaunt (Jul 20 2021 at 03:57):
No worries - took it as a sign to get off my a$$ and do a workout!
Rob Hausam (Jul 20 2021 at 03:58):
It looks like you may need to get a really good workout today! ;)
Sarah Gaunt (Jul 20 2021 at 03:59):
Maybe will walk the dogs now then as I see it's still not behaving!
Rob Hausam (Jul 20 2021 at 04:00):
Even after a full machine restart, it's still not responding correctly. :(
Rob Hausam (Jul 20 2021 at 04:12):
It seems to be running normally now - finally!
Rob Hausam (Jul 20 2021 at 04:14):
Oops, spoke too soon - trying again.
Giorgio Cangioli (Jul 20 2021 at 07:01):
It seems it is down...
Peter Jordan (Jul 20 2021 at 07:23):
Couldn't resist this... https://www.youtube.com/watch?v=LODkVkpaVQA
Rob Hausam (Jul 20 2021 at 07:49):
Yes, it's still down. And I'm still trying to work on it. It's not coming back, even with restarts, in the typical way that it has previously. :( The server VM isn't running out of disk space or RAM or other resources - so I'm at a bit of a loss at the moment as to why it's behaving this way.
Rob Hausam (Jul 20 2021 at 08:57):
For some reason now we're getting an exception when the FHIR server is trying to launch the 'Telnet Server', and the launch fails. I've pretty much exhausted everything that I am able to do for now (especially at this time during the night/morning). I'm reaching out to @Jose Costa Teixeira, @Grahame Grieve, @Mark Iantorno and @Gino Canessa and we'll see what we can do to get this functioning again - as soon as possible!
Mark Iantorno (Jul 20 2021 at 12:24):
I have restarted the service and tx.fhir.org is working as expected
Mark Iantorno (Jul 20 2021 at 12:25):
please let me know if it goes down again
David deRoode (Jul 20 2021 at 12:49):
tatement: connect timed out @Mark Iantorno
Igor Sirkovich (Jul 20 2021 at 12:55):
@Mark Iantorno , I keep getting "Unable to connect to terminology server. Error = Error fetching the server's capability statement: connect timed out (00:50.0163)"
Mark Iantorno (Jul 20 2021 at 12:55):
Yeah turns out that there is actually something more complicated going on
Igor Sirkovich (Jul 20 2021 at 12:55):
Also, hapi.fhir.org has been down since last night - I'm not sure if this is related
Lloyd McKenzie (Jul 20 2021 at 15:14):
@Igor Sirkovich - should be no relationship, but @James Agnew, FYI
Rob Hausam (Jul 20 2021 at 15:51):
Several of us are working to see if we can get this fixed - but at the moment there is no immediate solution. One option for now may be to direct the IG Publisher to a different terminology server. @David Hay has recently tried it using Ontoserver, and apparently that seemed to work. I may give that a try myself and see how that works. Theoretically that should be fine, as long as all of the terminology content that you need is on the server - which may be problematic in some cases.
John Moehrke (Jul 20 2021 at 16:26):
do we want to try a distributed denial of service against that server? (Aka, we all try local and ci builds at the same time)?
Rob Hausam (Jul 20 2021 at 16:58):
Hopefully it won't lead to anything close to DDoS, but launching the IG Publisher this way using Ontoserver seems to work:
java -jar <path to ig publisher>/publisher.jar -ig . -tx https://r4.ontoserver.csiro.au/fhir
I was able to build the IPS IG with what appears to be the same QA output that I had before with tx.fhir.org. I didn't see a way to specify a different tx server using the _genonce.sh or _genonce.bat scripts as currently written (but if needed I'm sure it should be pretty easy to rewrite them to support that).
Max Masnick (Jul 20 2021 at 17:28):
For _genonce.sh, this command will work with a slight modification to the script (see below): ./_genonce.sh -tx https://r4.ontoserver.csiro.au/fhir
But you first need to modify _genonce.sh to comment out lines 5-13, as the script tries to access tx.fhir.org to check for internet access:
# curl -sSf tx.fhir.org > /dev/null
# if [ $? -eq 0 ]; then
# echo "Online"
# txoption=""
# else
# echo "Offline"
# txoption="-tx n/a"
# fi
Rob Hausam (Jul 20 2021 at 17:28):
Yes, that makes sense.
Max Masnick (Jul 20 2021 at 17:29):
I wonder if we should switch to using http://captive.apple.com to check for internet access (that's the URL that Apple devices use to see if they are on a network that can resolve public internet addresses)
Rob Hausam (Jul 20 2021 at 17:31):
Yes, I think that would also make sense - and completely separate the "internet check" from the tx server specification.
Max Masnick (Jul 20 2021 at 17:38):
This change is proposed in https://github.com/HL7/ig-publisher-scripts/pull/5
If anyone wants to switch to using captive.apple.com in their scripts, you can grab the fixed scripts from here until this is merged in.
Chris Moesel (Jul 20 2021 at 18:18):
Sorry, a little late to this conversation, but... I think the intent of the check is not just to see if it can access the internet (despite the message it prints to the console), but to check to see if it can access the terminology server. It's immediately followed by this:
if [ $? -eq 0 ]; then
  echo "Online"
  txoption=""
else
  echo "Offline"
  txoption="-tx n/a"
fi
The -tx n/a means "don't use a terminology server". So if we use captive.apple.com, then if the internet is available but the terminology server is down, it will try to use the terminology server during the build (since it doesn't go into the else clause). I'm not sure this is what we want.
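A minimal sketch of the distinction described here, assuming the script probes the terminology server itself rather than a generic internet endpoint. The function names `tx_is_up` and `choose_txoption` are hypothetical and not part of `_genonce.sh`. The `/metadata` path is where a FHIR server serves its capability statement, which is the request the publisher reports as failing; the base URL you pass in is up to you.

```shell
#!/bin/sh
# Hypothetical helpers; not part of the published _genonce.sh.

# Return 0 if the server at base URL $1 answers a capability statement
# request within 10 seconds, non-zero otherwise.
tx_is_up() {
  curl -sSf --max-time 10 -o /dev/null "$1/metadata"
}

# Print the publisher option to use for the terminology server at $1:
# nothing when it is reachable, "-tx n/a" when it is not.
choose_txoption() {
  if tx_is_up "$1"; then
    echo ""
  else
    echo "-tx n/a"
  fi
}
```

With this shape, something like `txoption=$(choose_txoption https://r4.ontoserver.csiro.au/fhir)` would only fall back to `-tx n/a` when that specific server does not respond, regardless of whether the wider internet is reachable.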
Brian Kaney (Jul 20 2021 at 19:06):
One thing we may want to do is have an optional override via an ENV var in these scripts. It would be nice to be able to define alternatives or internal mirrors for the terminology server.
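One way such an override might look, as a sketch only. `TX_SERVER` is a hypothetical variable name and `tx_args` a hypothetical helper; neither exists in the current scripts. The `-tx` flag is the publisher option already used earlier in this thread.

```shell
#!/bin/sh
# TX_SERVER is a hypothetical variable name; the scripts do not define it.

# Print the -tx arguments for the publisher: the override when TX_SERVER is
# set and non-empty, otherwise nothing so the publisher uses its default.
tx_args() {
  if [ -n "${TX_SERVER:-}" ]; then
    echo "-tx $TX_SERVER"
  fi
}
```

A script could then append `$(tx_args)` to its publisher command line, so that running it with `TX_SERVER=https://r4.ontoserver.csiro.au/fhir` in the environment points the build at Ontoserver or at an internal mirror without editing the script.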
Sarah Gaunt (Jul 20 2021 at 21:16):
Getting confused here! I need the CI build to work - is there still a problem there? Or is the above discussing how to fix it for the CI build?
Lloyd McKenzie (Jul 20 2021 at 21:19):
Still discussing work-arounds.
Rob Hausam (Jul 21 2021 at 00:38):
Here's an update on where we are with the tx.fhir.org terminology service. We've managed to get the server up and running again today - but not on a continuous and consistent basis. Looking at it with @Mark Iantorno and then further with @Jose Costa Teixeira and @Gino Canessa, we determined that one issue was that the database was able to grab too much of the available RAM, and we were then able to limit that which helped to solve a part of the problem. It didn't solve the underlying issue that the server process itself is using excessive and progressive amounts of RAM that eventually exceed the available limits, resulting in periodic and now quite frequent server crashes and need for restarts. Fortunately, Gino has been able to make some further updates to the monitor logic so that it can handle those situations more gracefully and can successfully restart the server (in all or at least nearly all cases). We don't yet know why the service memory use seems to have increased so significantly quite recently, so that still needs more investigation and ultimate mitigation.
But with all of that, we've now reached a semi-stable situation where we seem maybe to have a temporary path forward. What we're seeing now is that when the service is restarted it is able to run for a variable amount of time, which is seeming mostly to be from about 10 to 30 minutes (again not sure why there is that variability, but there is). Once the server hits the memory limit and crashes, that will be detected and the service will be restarted, and the restart process seems to consistently take about 7 minutes. So the overall server uptime is fluctuating between about 59% to 81%. That's not good at all, but it might provide enough availability for us to work with for the most part until we can come up with a better long term solution. If that kind of availability isn't sufficient in some cases, then for now probably the best alternative will be to explore using an alternative terminology server (like Ontoserver) for the IG builds (at least for the local ones) - I'm assuming that the CI build probably may need to stay with tx.fhir.org, as long as it remains usable at all (@Josh Mandel?).
Let me or any of us know if you have more questions, and particularly good suggestions. Thanks for everyone's patience!
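The 59% to 81% uptime range quoted above follows from the run and restart times in the same message: uptime is run time divided by run time plus restart time. A small sketch of that arithmetic, where `uptime_pct` is a name chosen here for illustration:

```shell
#!/bin/sh
# Print the uptime percentage (rounded to a whole number) for a service
# that runs for $1 minutes and then takes $2 minutes to restart.
uptime_pct() {
  awk -v up="$1" -v down="$2" 'BEGIN { printf "%.0f\n", 100 * up / (up + down) }'
}

uptime_pct 10 7   # shortest run between crashes: prints 59
uptime_pct 30 7   # longest run between crashes: prints 81
```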
Sarah Gaunt (Jul 21 2021 at 00:44):
Thanks for the update and working so hard on this @Rob Hausam !!
Michael Lawley (Jul 21 2021 at 00:49):
For the record, we are very happy for people to use https://r4.ontoserver.csiro.au/fhir as an alternative to tx.fhir.org
It's an open endpoint, so if it doesn't contain the resources you need, then you're free to upload them
Josh Mandel (Jul 21 2021 at 00:49):
Thanks @Rob Hausam (we can certainly point the CI build to another tx server if it's helpful.)
Rob Hausam (Jul 21 2021 at 00:50):
Thank you @Michael Lawley! Maybe we should consider doing that, @Josh Mandel?
Josh Mandel (Jul 21 2021 at 00:54):
Sure. Let me know what tweaks you want to how we invoke the publisher in the IG build pipeline (-tx argument as above?)
Rob Hausam (Jul 21 2021 at 00:58):
Yes, -tx https://r4.ontoserver.csiro.au/fhir should do it. I'm not sure what content tweaks we may need, but I believe the FHIR 4.0.1 content should all already be there? Not sure about the CI content, though, that seems unlikely to be there now, and I'm not sure how to update and maintain what's needed for that (other than manually when people need it, as @Michael Lawley said)?
Rob Hausam (Jul 21 2021 at 00:58):
I think we might as well give it a try.
Michael Lawley (Jul 21 2021 at 01:08):
Let me know when you do - I'll keep a close eye on our dashboard to see the impact
Peter Jordan (Jul 21 2021 at 01:12):
Something else that might be considered is to simplify the $validate-code operation requests that are being made to Terminology Servers, such as tx.fhir.org. Aside from some specific requirements from the Build process, I'm not sure that it's necessary to use POSTS with relatively large payloads, including custom parameters that aren't required for simple ValueSet based validation where only a simple GET would suffice. At least, perhaps have a parameter that indicates simple validation only?
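For reference, a simple GET-based call of the kind suggested here might look like the sketch below. `validate_code_get` is a hypothetical helper name; `url`, `system` and `code` are standard input parameters of the FHIR `$validate-code` operation on ValueSet. Whether a GET like this covers what the build process needs is exactly the open question raised above.

```shell
#!/bin/sh
# Hypothetical helper: validate one code against a value set with a plain GET.
# $1 = server base URL, $2 = value set canonical URL, $3 = code system, $4 = code
validate_code_get() {
  curl -sS -G \
    -H 'Accept: application/fhir+json' \
    --data-urlencode "url=$2" \
    --data-urlencode "system=$3" \
    --data-urlencode "code=$4" \
    "$1/ValueSet/\$validate-code"
}
```

For example, `validate_code_get https://r4.ontoserver.csiro.au/fhir http://hl7.org/fhir/ValueSet/doc-typecodes http://loinc.org 55751-2` would ask the server whether that LOINC code is in the value set, with the answer returned as a Parameters resource.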
Josh Mandel (Jul 21 2021 at 01:15):
OK, we'll see how this goes. (Added a variable to control the TX server in IG build, and set this to https://r4.ontoserver.csiro.au/fhir in the current trigger.)
Rob Hausam (Jul 21 2021 at 01:21):
Thanks, @Josh Mandel. I am curious what the volume and impact of this will be, @Michael Lawley. Everyone, please let us know how this change is working for you - particularly if you see issues (and hopefully there won't be many, if any, of those - or they will be easy to solve).
Sarah Gaunt (Jul 21 2021 at 01:26):
Just redelivered my last payload from yesterday, will keep you posted...
Sarah Gaunt (Jul 21 2021 at 01:29):
It worked, it worked! Haven't checked QA yet, but happy it ran without crashing!
Michael Lawley (Jul 21 2021 at 01:29):
How was speed from your perspective?
Josh Mandel (Jul 21 2021 at 01:31):
Looks like my first test in the auto-build pipeline built successfully against https://r4.ontoserver.csiro.au/fhir
Sarah Gaunt (Jul 21 2021 at 01:32):
Well, I'm not actually sure it was hitting your server, @Michael Lawley - I think this was something I sent right after @Rob Hausam said the tx.fhir.org was back up and before moving to the au one.
Sarah Gaunt (Jul 21 2021 at 01:33):
It took 42 mins.... I just assumed I hadn't hit the rebuild button properly or something, as it took so long.
Sarah Gaunt (Jul 21 2021 at 01:33):
I should have another one soon. Unless maybe it's my IG that is killing everything! :fear:
Rob Hausam (Jul 21 2021 at 01:36):
Yes, it looks like your current one for case-reporting says Connect to Terminology Server at http://tx.fhir.org, @Sarah Gaunt.
Sarah Gaunt (Jul 21 2021 at 01:42):
Yes. Not sure why the IG is taking so long all of a sudden. I sent another payload after you switched over and it's not finished yet. Will have to do some detective work.
Rob Hausam (Jul 21 2021 at 01:43):
A lot of them really do take a long time - IPS has been, too.
Sarah Gaunt (Jul 21 2021 at 01:45):
Yeah, but yesterday it was taking less than 10 mins... Not 40+... I must have changed something that has made that happen.
Sarah Gaunt (Jul 21 2021 at 01:52):
IPS is taking 7 mins - that's not slow. :)
Lloyd McKenzie (Jul 21 2021 at 01:54):
I expect that the calls to the remote tx server take a reasonable amount longer than the ones to the local server.
Lloyd McKenzie (Jul 21 2021 at 01:54):
Can the core build also be adjusted to use ontoserver for now? There's no hope of a core build completing in < 45 minutes...
Rob Hausam (Jul 21 2021 at 01:55):
@Sarah Gaunt I agree - that's pretty fast. Maybe it's Michael's server that's speeding it up! I have a lot of errors now in this older version of IPS - but it looks like they are most likely from the latest IG Publisher updates, rather than the terminology server.
Michael Lawley (Jul 21 2021 at 01:55):
BTW, are local tx caches being cleared beforehand?
Rob Hausam (Jul 21 2021 at 01:55):
@Lloyd McKenzie I was thinking the same. But maybe @Josh Mandel already did that?
Michael Lawley (Jul 21 2021 at 01:56):
Main errors that I suspect people might see are missing content, and that we default SNOMED to the AU Edition
Josh Mandel (Jul 21 2021 at 01:57):
Can the core build also be adjusted to use ontoserver for now?
That's a question for @Mark Iantorno
Josh Mandel (Jul 21 2021 at 01:57):
Michael Lawley: BTW, are local tx caches being cleared beforehand?
In the auto-ig-builder pipeline, there is no local state saved across builds.
Josh Mandel (Jul 21 2021 at 01:58):
(Some IGs do check a "cache" folder into the repo, which always seems odd to me -- so there's that.)
Rob Hausam (Jul 21 2021 at 01:58):
@Michael Lawley That is something that I think people will want to/need to do. But I didn't do that this time for the IPS build. The SNOMED CT AU edition might be an issue for US IGs, but they should be declaring the US Edition, which I think you do support? And for universal ones like IPS the AU edition also includes the International Edition, so I don't think it should be a problem?
Michael Lawley (Jul 21 2021 at 01:58):
I'm not seeing any significant load yet
Rob Hausam (Jul 21 2021 at 01:59):
I'm stepping away (literally) for a few minutes, but it looks like things are pretty good so far.
Josh Mandel (Jul 21 2021 at 01:59):
(We have between 0 and 1 IG building in the pipeline over the past 30min, so _shouldn't_ see too much load from that source.)
Sarah Gaunt (Jul 21 2021 at 02:00):
Yes, I declare the US version of SNOMED so it shouldn't be an issue for me anyway.
Michael Lawley (Jul 21 2021 at 02:00):
We have the International Edition, but not a US one
Sarah Gaunt (Jul 21 2021 at 02:01):
Ah... Will see what happens then!
Sarah Gaunt (Jul 21 2021 at 02:10):
Fixed my speed issue - it didn't like some MD links I'd tried to put in a field description. Not sure why that added an extra 35 mins on, but I'll take it. And running on your server @Michael Lawley it took just over 8 mins.
Sarah Gaunt (Jul 21 2021 at 02:10):
I am getting errors like: The code 26643006 exists in the CodeSystem, but the display "Oral Route" is incorrect (from https://r4.ontoserver.csiro.au/fhir) for 'http://snomed.info/sct#26643006' now that weren't there before, but I'm going to ignore them for now!
Sarah Gaunt (Jul 21 2021 at 02:13):
Weird that it doesn't like that though - both the international and US version have the same term for that conceptId: https://browser.ihtsdotools.org/?perspective=full&conceptId1=404684003&edition=MAIN/2021-01-31&release=&languages=en
Sarah Gaunt (Jul 21 2021 at 02:13):
Could be the capital on Route maybe?
Sarah Gaunt (Jul 21 2021 at 02:16):
I actually think it's catching valid terminology issues that weren't being caught before.
Sarah Gaunt (Jul 21 2021 at 02:19):
Maybe it's case sensitive where tx.fhir.org was not? So far all the ones I've checked have been case issues. (And by "all" I mean "two"! :-) )
Rob Hausam (Jul 21 2021 at 02:37):
Interesting - and maybe so. Your IG does specify the US Edition, @Sarah Gaunt?
Peter Jordan (Jul 21 2021 at 02:44):
@Michael Lawley does that International Edition use the US or UK English Language Reference Set?
Michael Lawley (Jul 21 2021 at 02:50):
Yes, @Sarah Gaunt it will be the case. We're quite conservative here because changing case for certain things is bad.
Michael Lawley (Jul 21 2021 at 02:53):
US English should be our default for SCT Int
Mark Iantorno (Jul 21 2021 at 03:04):
In terms of getting the core tooling to use the other server. This is possible, but does the ontoserver actually handle all the functionality necessary?
Mark Iantorno (Jul 21 2021 at 03:06):
Can someone who runs the onto server have a quick call with me to outline the server endpoints and functionality?
Jim Steel (Jul 21 2021 at 03:15):
Sure
Sarah Gaunt (Jul 21 2021 at 03:17):
@Rob Hausam Yes, it specifies the US Edition of SNOMED.
Sarah Gaunt (Jul 21 2021 at 03:22):
Definitely getting more terminology errors than before. Haven't checked them all out yet. I think stuff like the following might be because of the US vs AUS thing: The code '55751-2' from code system 'http://loinc.org' was not found in value set http://hl7.org/fhir/ValueSet/doc-typecodes (from https://r4.ontoserver.csiro.au/fhir)
That's a US public health code: https://loinc.org/55751-2/ and I _think_ we are using it correctly...
Or maybe it's not checking all the codes (there are only the first 1000 listed in the value set) http://hl7.org/fhir/R4/valueset-doc-typecodes.html?
Jim Steel (Jul 21 2021 at 03:28):
The problem with that ValueSet (from Ontoserver's perspective) is that:
- its definition is basically "all codes where SCALE_TYP = Doc"
- SCALE_TYP is a code-typed property (according to https://www.hl7.org/fhir/loinc.html), but
- Doc is the label, not the code
- it's possible it should be "all codes where SCALE_TYP = LP32888-7"
Jim Steel (Jul 21 2021 at 03:30):
@Rob Hausam Is that different from how HAPI/tx.fhir.org interpret LOINC properties?
Michael Lawley (Jul 21 2021 at 03:34):
I have just updated R4's copy of http://hl7.org/fhir/ValueSet/doc-typecodes to also include a filter as @Jim Steel suggests and the expansion now includes 55751-2
Sarah Gaunt (Jul 21 2021 at 03:34):
Sweet - thanks!
Sarah Gaunt (Jul 21 2021 at 03:35):
Getting this on all the jurisdiction elements: Code System URI 'urn:iso:std:iso:3166' is unknown so the code cannot be validated
Sarah Gaunt (Jul 21 2021 at 03:36):
Also: Code System URI 'http://unitsofmeasure.org' is unknown so the code cannot be validated
Rob Hausam (Jul 21 2021 at 03:36):
Has that completely answered the question? Technically, Jim is correct that the value should actually be the LP32888-7 code, rather than the label 'Doc'. The official LOINC server, also based on HAPI, I'm pretty sure now handles it all that way, using the LP codes (I worked on that with them for a while, but haven't checked it lately).
Rob Hausam (Jul 21 2021 at 03:37):
I wondered if UCUM was going to be covered, in the same way that it is in tx.fhir.org?
Michael Lawley (Jul 21 2021 at 03:37):
I was just about to suggest that a tracker is needed to correct the ValueSet definition.
Sarah Gaunt (Jul 21 2021 at 03:37):
That iso:3166 doesn't matter - that's not an error - just an informational message.
Jim Steel (Jul 21 2021 at 03:54):
I put up a copy of iso 3166 and bcp47 (languages)
Sarah Gaunt (Jul 21 2021 at 04:40):
Not sure why this is failing The value provided ('application/xml') is not in the value set http://hl7.org/fhir/ValueSet/mimetypes|4.0.1 (http://hl7.org/fhir/ValueSet/mimetypes), and a code is required from this value set) (error message = Error from server: Error:org.hl7.fhir.r5.model.CodeableConcept@408a7a97 )
I originally just had "xml" which failed, so changed it to "application/xml" which is still failing.
Michael Lawley (Jul 21 2021 at 05:24):
That ValueSet depends on Code system urn:ietf:bcp:13 which we don't have
Sarah Gaunt (Jul 21 2021 at 05:27):
Any idea on this Observation.value.ofType(CodeableConcept) (l285/c27) error Error from server: Error:org.hl7.fhir.r5.model.CodeableConcept@19fe697c - getting quite a few like that which weren't failing before.
Sarah Gaunt (Jul 21 2021 at 05:30):
Look like they are mostly SNOMED codes that are giving that error. Maybe something to do with the fact that I'm using the US version...
Sarah Gaunt (Jul 21 2021 at 05:31):
e.g. Condition.code (l18/c11) error Error from server: Error:org.hl7.fhir.r5.model.CodeableConcept@78f21c53
from
<code>
<coding>
<system value="http://snomed.info/sct"/>
<code value="82272006"/>
<display value="Common cold (disorder)"/>
</coding>
</code>
Max Masnick (Jul 21 2021 at 09:49):
Josh Mandel said:
(Some IGs do check a "cache" folder into the repo, which always seems odd to me -- so there's that.)
One reason to do this is that the output of the IG publisher changes depending on whether the cache/ folder is "hot" or not: https://github.com/HL7/fhir-ig-publisher/issues/231
Max Masnick (Jul 21 2021 at 10:16):
Chris Moesel said:
...
The -tx n/a means "don't use a terminology server". So if we use captive.apple.com, then if the internet is available but the terminology server is down, it will try to use the terminology server during the build (since it doesn't go into the else clause). I'm not sure this is what we want.
Thanks for catching this, Chris. I closed out that pull request and opened a new issue describing what I think the logic should be. Does this look right to everyone?
Michael Lawley (Jul 21 2021 at 10:24):
SNOMED US (20210301) is now in Ontoserver.
Michael Lawley (Jul 21 2021 at 10:26):
@Sarah Gaunt I don't know where that ugly error message is coming from though - the Ontoserver code doesn't reference anything in the org.hl7.fhir.r5.model package. I can only imagine it's coming out of the validator itself. I'll try to dig into our logs to diagnose
Rob Hausam (Jul 21 2021 at 13:09):
@Michael Lawley Are you seeing any noticeable load yet (from the CI builds or otherwise)?
Rob Hausam (Jul 21 2021 at 13:10):
I have tx.fhir.org up again now and it's getting what appears to be a normal stream of activity.
David Pyke (Jul 21 2021 at 13:14):
So, assuming it stays up, can we somehow set ontoserver as the failover for when something like this happens again (or just any time tx goes down)?
Rob Hausam (Jul 21 2021 at 13:15):
That seems like it would be a reasonable short to medium term goal - not sure how much effort it would take to make that happen.
Michael Lawley (Jul 21 2021 at 13:23):
I'm not seeing any real load at all
Michael Lawley (Jul 21 2021 at 13:33):
By that I mean nothing that is affecting response time. We did see a distinct change in # requests 2.5 hrs ago:
image.png
AbdulMalik Shakir (Jul 21 2021 at 13:37):
Sarah Gaunt said:
Definitely getting more terminology errors than before. Haven't checked them all out yet. I think stuff like the following might be because of the US vs AUS thing:
The code '55751-2' from code system 'http://loinc.org' was not found in value set http://hl7.org/fhir/ValueSet/doc-typecodes (from https://r4.ontoserver.csiro.au/fhir)
That's a US public health code: https://loinc.org/55751-2/ and I _think_ we are using it correctly...Or maybe it's not checking all the codes (there are only the first 1000 listed in the value set) http://hl7.org/fhir/R4/valueset-doc-typecodes.html?
@Sarah Gaunt I'm having a similar issue: "The code '64297-5' from code system 'http://loinc.org' was not found in value set http://hl7.org/fhir/ValueSet/doc-typecodes". This is a new error not previously encountered. '64297-5' is a valid LOINC code for Death certificate. Is the doc-typecodes value set incomplete?
Mark Iantorno (Jul 21 2021 at 13:40):
Anyone using the validator or publisher can specify the terminology server they want to use on the command line when they use the tool. I am hesitant to actually go into the project and switch the dependency to, or add a dependency on a closed source, private server. Having it as a choice for users is one thing, but using it as a default backup or primary for all users is another thing. I have a meeting with @Rob Hausam today to discuss some options and details for the terminology server, so I can get a better idea of how it's set up, and I'm hoping to have a meeting with @Jim Steel later this week to talk more about the onto server.
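As a small illustration of choosing the terminology server per run, here is a sketch that assembles a validator command line. The -tx and -version options are the ones mentioned in this thread; the helper name validator_command and the jar and file names are placeholders for whatever is used locally.

```python
def validator_command(jar, source, fhir_version, tx=None):
    """Assemble a validator command line, optionally overriding the terminology server."""
    cmd = ["java", "-jar", jar, source, "-version", fhir_version]
    if tx is not None:
        cmd += ["-tx", tx]
    return cmd


# Placeholder jar and file names; the server URL is the one quoted earlier in the thread.
cmd = validator_command(
    "validator_cli.jar",
    "example.xml",
    "4.0",
    tx="https://r4.ontoserver.csiro.au/fhir",
)
print(" ".join(cmd))
```

Leaving tx unset keeps the tool's default server, so the override stays an explicit choice for each user.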
Grahame Grieve (Jul 21 2021 at 22:03):
Ontoserver is not a viable replacement for tx.fhir.org at this time. It's on my list to find time to document the issues, but FMG can't approve IGs based on it at this time
Grahame Grieve (Jul 21 2021 at 22:04):
apologies for the fact that tx.fhir.org suddenly fell over; I've been reading the transcript and I still can't figure out what actually happened. And sad that it happened while I was off line for a week
Lloyd McKenzie (Jul 21 2021 at 22:05):
@Grahame Grieve - FMG can't approve IGs to go to ballot, to go to publication or either?
Lloyd McKenzie (Jul 21 2021 at 22:05):
We're hitting a point where approval to ballot is going to be pretty much non-optional...
Lloyd McKenzie (Jul 21 2021 at 22:06):
We could possibly hold off on non-ballot publications for a while unless they had some pressing urgency
Rob Hausam (Jul 21 2021 at 22:06):
It's not at all surprising that Ontoserver isn't a full replacement (certainly at present). But it was of help temporarily.
Max Masnick (Jul 22 2021 at 12:50):
Is it safe to switch build.fhir.org from Ontoserver back to tx.fhir.org? I'm seeing some QA errors which are resolved locally by switching back to tx.fhir.org.
Josh Mandel (Jul 22 2021 at 12:59):
Happy to try it. I could also add a flag on the webhook to make this configurable, but I'm not sure we want to expose that much flexibility.
Max Masnick (Jul 22 2021 at 13:13):
Rob Hausam said:
I have tx.fhir.org up again now and it's getting what appears to be a normal stream of activity.
I think ↑ means that tx.fhir.org is up again?
It would be great if we had something like https://www.basecampstatus.com/index.html so we could see if there were issues over time in addition to the traffic lights on https://validator.fhir.org
Max Masnick (Jul 22 2021 at 13:13):
(That status page is from https://www.atlassian.com/software/statuspage, which I think we can use for free)
Max Masnick (Jul 22 2021 at 13:15):
In any case it looks like it's up now, but I don't know if it's stable
Rob Hausam (Jul 22 2021 at 13:36):
I think we need to try using tx.fhir.org again. But since the code hasn't changed (yet), it's certainly possible (likely?) that we will see the same thing that we did before - at least at some point. @Ted Klein is running another UTG build right now on Ontoserver, to check a solution for an error that he has been getting since we made the switch. It would be good for that to complete first.
Rob Hausam (Jul 22 2021 at 13:37):
@Josh Mandel
Josh Mandel (Jul 22 2021 at 13:55):
The way the switchover is implemented, it does not affect any builds currently in progress. It will just affect the next build submitted to the auto build pipeline. The configuration for a given build does not change mid-flight.
Josh Mandel (Jul 22 2021 at 13:56):
Anyway, let me know when you want me to switch back over to tx.fhir.org and I will do so.
Rob Hausam (Jul 22 2021 at 14:00):
I was pretty sure you had it configured that way. The UTG build still failed. We can switch it now, and I will keep a close eye on it. But if it does get out of hand again we may need to switch it back. I could also check first with the HQ folks and see if they are ready to be able to give us more resources on the server. But I'm thinking that maybe we should switch back now to the way that it was before first, to verify that we still are running into the same problems - before we invest in the additional resources. So I would say let's go ahead and do the switch now.
Rob Hausam (Jul 22 2021 at 14:03):
At this point the server has been running for 25 hours straight without a further problem - we'll see what happens.
Josh Mandel (Jul 22 2021 at 14:05):
OK, pushed the config update. (Re: server resources, I think adjustments can be made in Google Cloud Console directly if we wanted to allocate more RAM to the tx server.)
Rob Hausam (Jul 22 2021 at 14:07):
Thanks. Yes, I agree we could make the adjustments there. I don't believe I have any access for that - but presumably you do?
Josh Mandel (Jul 22 2021 at 15:33):
I do. We can discuss in the tx/internal chat.
John Moehrke (Jul 22 2021 at 18:31):
seems tx.hl7.org has been far more stable.. i think that it correlates with @Rob Hausam building IPS elsewhere... :-)
John Moehrke (Jul 22 2021 at 18:31):
:-)
Jean Duteau (Jul 28 2021 at 17:07):
looks down again... all of the CI builds are failing
David Pyke (Jul 28 2021 at 17:20):
Paging @Rob Hausam. Please pick up the white courtesy phone
Jose Costa Teixeira (Jul 28 2021 at 17:24):
@Gino Canessa the machine doesn't go bing!
David Pyke (Jul 28 2021 at 17:34):
Gino Canessa (Jul 28 2021 at 17:36):
Yes, Mark is looking at it.
Gino Canessa (Jul 28 2021 at 17:42):
Ok, it's starting up. Should be online in a few minutes (10-ish).
Rob Hausam (Jul 28 2021 at 17:47):
Yeah. I picked up the phone a little late - but others were able to. :)
Barbro Vessman (Aug 23 2021 at 14:20):
Hello, term server seems to be down
Rob Hausam (Aug 23 2021 at 14:22):
Yes, it should be restarting in a moment.
Rob Hausam (Aug 23 2021 at 14:33):
It's back up now.
Barbro Vessman (Aug 23 2021 at 14:39):
Thank you @Rob Hausam !
Ramandeep Dhanoa (Aug 26 2021 at 20:03):
Hello, is it possible that the term server is down? I am getting this error "Attempt to use Terminology server when no Terminology server is available"
Lloyd McKenzie (Aug 26 2021 at 20:07):
@Rob Hausam @Grahame Grieve
Rob Hausam (Aug 26 2021 at 20:11):
@Ramandeep Dhanoa It is up now. And according to the monitor it has been steadily up for approx. the past 2 hours. Have you tried building again?
Ramandeep Dhanoa (Aug 26 2021 at 21:50):
I see, I will try to debug if something is messed up locally. Thanks @Rob Hausam
David Simons (Sep 07 2021 at 07:13):
Getting "Error sending HTTP Post/Put Payload: tx.fhir.org:80 failed to respond" from the hl7validator...
What does the 'pink' status mean, btw? (compared to red)
HTTP Caching of POST/PUT calls is not trivial, right?
Rob Hausam (Sep 07 2021 at 12:37):
@David Simons I'm not certain what that pink color means, either. But according to the monitor the service is ok, and it seems to be responding normally to me locally and it looks ok on the console output. Are you seeing issues with it on your end?
David Simons (Sep 07 2021 at 12:42):
Rob Hausam said:
Are you seeing issues with it on your end?
Thank you @Rob Hausam - we kept getting the above errors for the last day, with the server responding only intermittently. Currently tx.fhir.org seems to be responding again to our hl7validator calls.
We really appreciate being able to use tx.fhir.org - yet the significant downtime is also forcing us to look into alternatives - which is not trivial though.
I'd rather see this addressed _behind_ the tx.fhir.org endpoint - with scalability and availability and caching measures.
Rob Hausam (Sep 07 2021 at 13:41):
@David Simons We completely agree with you on addressing the issue(s) on the server itself, behind the endpoint. @Grahame Grieve has been working on that, and I think that's made some improvements (regarding memory utilization, etc.). But you're certainly right that the level of reliability isn't yet where we all want it to be. And I'm pretty sure that Grahame has more that he is planning to do that should further improve it.
David Simons (Sep 07 2021 at 14:14):
Rob Hausam said:
I'm not certain what that pink color means, either.
Thanks - not a major issue.
Maybe use icons like Consumer Reports does - incl. arrows next to colors to distinguish
Grahame Grieve (Sep 07 2021 at 19:01):
it's up and been up continually, so I'm not sure what the issue is. If you want to watch it, #tx.fhir.org/notification
David Pyke (Sep 07 2021 at 19:20):
That channel is private...
Grahame Grieve (Sep 07 2021 at 19:22):
oh. I'll change that
David Simons (Sep 08 2021 at 11:26):
Thank you @Grahame Grieve - certainly want to help to find the root cause - and we will double-check if it is anything between our hl7validator and the tx.fhir.org endpoint.
It seems to be intermittent - and more time-out related than unreachable.
Will gather some more data for us to share.
That said, https://validator.fhir.org/ is showing pink/red as I type this:
image.png
and getting this error right now:
image.png
We _really_ appreciate all the work you guys do - and our feedback is intended to be constructive!
===
from other thread - for completeness
image.png
Grahame Grieve (Sep 08 2021 at 11:31):
that's running the test cases?
David Simons (Sep 08 2021 at 14:56):
Grahame Grieve said:
that's running the test cases?
Hi @Grahame Grieve - no actually - that's running the hl7validator against our company's Innersource (internal) set of FHIR profiles - to validate them against the FHIR standard and any referenced (internal+external) profiles and terminologies.
And yes, this particular snippet is an example.xml Resource instance to test a structuredefinition.xml profile class.
David Simons (Sep 08 2021 at 15:03):
We're testing to see if it is related to network connectivity from our build pipeline on GitHub to tx.fhir.org - by running also from other local machines - and see how that compares...
I can access http://tx.fhir.org/ from my local browser, so that's good.
David Simons (Sep 08 2021 at 15:32):
This is what I mean with intermittently - only with some example Resources, the below are all very similar actually, so not all calls to tx fail, and not always the same ones, and seemingly only with stu3, not r4 - continuing to debug here...
2021-09-08T15:06:34.9123045Z Load FHIR v3.0 from hl7.fhir.r3.core#3.0.2 - 4017 resources (00:04.0228)
2021-09-08T15:06:36.0568771Z Load hl7.terminology#2.1.0 - 3767 resources (00:01.0144)
2021-09-08T15:06:38.0131260Z Terminology server http://tx.fhir.org - Version 1.9.382 (00:01.0956)
2021-09-08T15:06:38.5476364Z Load /usr/data/com.philips.fhir.stu3.common - 296 resources (00:00.0534)
2021-09-08T15:06:42.1180117Z Get set... go (00:03.0570)
2021-09-08T15:06:42.1191164Z Validating
2021-09-08T15:06:46.1211188Z Validate ACTBld.example.xml 00:03.0999
2021-09-08T15:06:48.0684432Z Validate ACTBldWithMultipleCodes.example.xml 00:01.0945
2021-09-08T15:06:51.2082604Z Validate APTTPPP.example.xml 00:03.0139
2021-09-08T15:06:56.3539933Z Validate BMIVitalSign.example.xml 00:05.0145
2021-09-08T15:06:57.7532603Z Validate BMIVitalSignWithMultipleCodes.example.xml 00:01.0394
2021-09-08T15:07:02.8003955Z Validate BNPSerPlsCnc.example.xml 00:05.0050
2021-09-08T15:07:04.0186035Z Validate BUNSerPlsCnc.example.xmltx.fhir.org:80 failed to respond (231ms / 2Kb for POST org.hl7.fhir.dstu3.model.ValueSet/$validate-code)
2021-09-08T15:07:05.2268610Z 00:02.0426
2021-09-08T15:07:14.2196766Z Validate BloodPressureVitalSign.example.xml 00:08.0992
2021-09-08T15:07:18.5999249Z Validate BodyTemperatureVitalSign.example.xml 00:04.0379
2021-09-08T15:07:22.3745163Z Validate CKMBSerPlmCnc.example.xml 00:03.0775
2021-09-08T15:07:26.4069934Z Validate CRPSerPlsCnc.example.xml 00:04.0032
2021-09-08T15:07:28.1204818Z Validate CalciumSerPlsCnc.example.xmltx.fhir.org:80 failed to respond (232ms / 563b for POST org.hl7.fhir.dstu3.model.CodeSystem/$validate-code)
2021-09-08T15:07:29.1415234Z 00:02.0734
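To pull just the failed calls out of a long log like this, a short sketch along these lines may help. The regular expression is written against the line shape shown above (the error text glued directly onto the example file name) and is an assumption about how other failures are formatted; failed_calls is a made-up helper name.

```python
import re

# Matches lines where "<host>:<port> failed to respond (...)" follows the file name.
FAILURE = re.compile(
    r"Validate\s+(?P<file>.+?\.xml)(?P<host>[\w.-]+:\d+) failed to respond "
    r"\((?P<ms>\d+)ms / (?P<size>\S+) for (?P<verb>[A-Z]+) (?P<target>[^)]+)\)"
)


def failed_calls(log_text):
    """Return (file, elapsed_ms, target) for each terminology call that got no response."""
    found = []
    for line in log_text.splitlines():
        match = FAILURE.search(line)
        if match:
            found.append((match["file"], int(match["ms"]), match["target"]))
    return found


sample = """\
2021-09-08T15:07:02.8003955Z Validate BNPSerPlsCnc.example.xml 00:05.0050
2021-09-08T15:07:04.0186035Z Validate BUNSerPlsCnc.example.xmltx.fhir.org:80 failed to respond (231ms / 2Kb for POST org.hl7.fhir.dstu3.model.ValueSet/$validate-code)
2021-09-08T15:07:28.1204818Z Validate CalciumSerPlsCnc.example.xmltx.fhir.org:80 failed to respond (232ms / 563b for POST org.hl7.fhir.dstu3.model.CodeSystem/$validate-code)
"""
for row in failed_calls(sample):
    print(row)
```

Collecting these tuples over several builds would show whether the same files or the same operations keep failing, or whether it really is random.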
David Simons (Sep 08 2021 at 16:26):
Going back in our build-logs - the first occurrence I find is 23-AUG-2021, but there has been no change in tx.fhir.org version 1.9.378 and hl7validator 5.4.6 around that time... The errors have changed from "Error sending HTTP Post/Put Payload: Read timed out for 'http://loinc.org#8287-5'" then, to "Error sending HTTP Post/Put Payload: tx.fhir.org:80 failed to respond" lately, with tx.fhir.org version 1.9.382 and hl7validator 5.4.6 today
David Simons (Sep 08 2021 at 16:44):
Last observation for now is that I am now also seeing errors for r4-based runs, which, surprisingly, I had not seen before...
*FAILURE*: 2 errors, 0 warnings, 0 notes
Error @ Observation.code.coding[0] (line 20, col13) : Error performing operation 'validate-code: unexpected end of stream on http://tx.fhir.org/...' (parameters = "") for 'http://loinc.org#14682-9'
Error @ Observation.value.ofType(Quantity) (line 30, col18) : Error performing operation 'validate-code: unexpected end of stream on http://tx.fhir.org/...' (parameters = "") for 'http://unitsofmeasure.org#umol/L'
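One client-side mitigation for intermittent failures like these is to retry a call a few times with growing pauses. The sketch below is a generic wrapper for calls made from one's own scripts; it is not something the hl7validator is known to do, and call_with_retries with its parameters is purely illustrative.

```python
import time


def call_with_retries(call, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run call(); on OSError (timeouts, dropped connections) retry with doubling delays."""
    for attempt in range(attempts):
        try:
            return call()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

The pause matters as much as the retry: repeating a request immediately just adds load to a server that is already struggling.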
Grahame Grieve (Sep 08 2021 at 19:16):
failed to respond in 231ms? That seems... ambitious.
Grahame Grieve (Sep 08 2021 at 19:17):
how long do you give it?
David Simons (Sep 08 2021 at 19:35):
The key question is indeed what entity is giving up so quickly - we only run the hl7validator that is making these calls... :) (which has a 15s-30s timeout, right?)
Does it inherit some settings or get overruled from our GitHub Actions/Runners environment perhaps? Will look into that tomorrow - maybe that environment was changed by an admin, without us knowing... Indeed looks like something keeps cutting the connection off - just too soon at times.
Grahame Grieve (Sep 08 2021 at 20:34):
No. looking at the server... the server is having problems with phantom connections, and it's refusing to make more than 50 connections at once. That may be what the problem is here. I'm not sure how to deal with systems making connections that are never going to do anything
David Simons (Sep 09 2021 at 08:00):
OK - we are already reducing the concurrent builds here - throttling back as much as possible - to avoid overload.
Maybe it is caused by Java HTTPClient bugs in the hl7validator leaving stale/phantom connections - but that's guessing.
Will also reach out to our GitHub self-hosted runners admin to see if they recently put in a network constraint that is hitting us.
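For throttling on the client side, a minimal sketch of capping how many terminology calls run at once, assuming the calls are issued from our own threads. The Throttle class and the fake_request stand-in are hypothetical, and the limit of 3 is arbitrary.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Throttle:
    """Allow at most `limit` calls to run at the same time."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def run(self, call, *args):
        with self._slots:  # blocks while `limit` calls are already in flight
            return call(*args)


# Demo with a stand-in for a terminology request (no network involved).
lock = threading.Lock()
stats = {"in_flight": 0, "peak": 0}


def fake_request(i):
    with lock:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
    time.sleep(0.01)  # pretend to wait for the server
    with lock:
        stats["in_flight"] -= 1
    return i * 2


throttle = Throttle(limit=3)
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(lambda i: throttle.run(fake_request, i), range(20)))

print("peak concurrency:", stats["peak"])
```

Even with ten worker threads, no more than three requests are ever in flight, which keeps one build pipeline from taking a large share of the server's connection slots.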
Grahame Grieve (Sep 09 2021 at 09:20):
no I think it's other clients. It'll go for a day and suddenly it's getting hammered for a few minutes. It might survive or it might not.
Grahame Grieve (Sep 09 2021 at 09:20):
I'm still investigating
Grahame Grieve (Sep 09 2021 at 09:21):
but you know, investigating occasional failed connections is not the same as investigating 'down'
David Simons (Sep 09 2021 at 12:46):
do you need my IP addresses to confirm it is other users? or did you mean other client applications than hl7validator
Rob Hausam (Sep 09 2021 at 12:51):
I've been seeing very similar (seemingly rather random) patterns of query timeouts and intermittent failures. I think it's unlikely unique to @David Simons or your environment. @Grahame Grieve
David Simons (Sep 09 2021 at 14:04):
Interestingly enough - when it happens - randomly/intermittently - in 99% of the cases for us the timeouts occur during validation runs with -version 3.0. Hardly happens with -version 4.0, fwiw...
image.png
Grahame Grieve (Sep 09 2021 at 19:16):
code is exactly the same for both end-points
David Simons (Sep 09 2021 at 19:43):
Grahame Grieve said:
code is exactly the same for both end-points
on the tx.fhir.org server you mean - but I was referring to the hl7validator client - where afaik the TXClient code is quite different between 1/dstu2/stu3 and r4/r5, unfortunately. Thanks for your help!
Grahame Grieve (Sep 09 2021 at 20:13):
well, it is, but that's about to change. see https://github.com/hapifhir/org.hl7.fhir.core/pull/599
David Simons (Sep 10 2021 at 12:43):
ok, we're upgrading to https://github.com/hapifhir/org.hl7.fhir.core/releases/tag/5.5.3 :)
David Simons (Sep 10 2021 at 15:15):
David Simons said:
ok, we're upgrading to https://github.com/hapifhir/org.hl7.fhir.core/releases/tag/5.5.3 :)
As an update @Grahame Grieve - upgrading to the hl7validator 5.5.3 seems to have improved things a lot - maybe even resolved.
Haven't seen time-outs yet - will keep monitoring.
Upgrading the HTTP Client code for STU3 may have been the key.
David Simons (Sep 13 2021 at 08:43):
Thanks again @Grahame Grieve and team - looking much more stable for us now.
David Simons (Sep 13 2021 at 08:45):
PS: You may also want to update https://validator.fhir.org to 5.5.3
image.png
image.png
Grahame Grieve (Sep 13 2021 at 18:21):
@Mark Iantorno
Mark Iantorno (Sep 14 2021 at 14:34):
I will look when I get back. Just moved yesterday and my house is all in boxes. I'm back on Friday and will check it out.
Mark Iantorno (Sep 14 2021 at 14:35):
Thanks for using the online validator btw
I have a bit of work going in soon to add line numbers to the editor online when entering JSON and xml
Rick Geimer (Oct 12 2021 at 16:53):
Down again for me
Grahame Grieve (Oct 12 2021 at 18:29):
seems to be back now
Sarah Gaunt (Oct 14 2021 at 23:46):
Think it is down again: Publishing Content Failed: Unable to connect to terminology server. Error = Error fetching the server's capability statement: connect timed out (00:30.0682)
Sarah Gaunt (Oct 15 2021 at 02:20):
Looks like it's back - thanks to whoever fixed it!
Rob Hausam (Oct 15 2021 at 02:39):
I think Grahame may have been working on it.
Barbro Vessman (Oct 15 2021 at 09:50):
Term server seems to be down...
Rob Hausam (Oct 15 2021 at 10:56):
It should be back up now. I think Grahame has been working on it. It's had a lot of restarts so far this morning (my time). Let's see how it does now.
Barbro Vessman (Oct 15 2021 at 11:43):
Thank you, now it works! :sun_face:
Rick Geimer (Oct 18 2021 at 15:54):
Down again
Lloyd McKenzie (Oct 18 2021 at 16:03):
@Rob Hausam @Mark Iantorno
Rob Hausam (Oct 18 2021 at 16:29):
It should be back up now (automatically).
Rick Geimer (Oct 20 2021 at 16:47):
(deleted)
Grahame Grieve (Oct 20 2021 at 16:52):
working for me
Rob Hausam (Oct 20 2021 at 16:53):
I just restarted it.
Last updated: Apr 12 2022 at 19:14 UTC