Stream: IG creation
Topic: tx server down?
David Pyke (Jun 10 2021 at 14:07):
I'm getting timeouts on connect
David Pyke (Jun 10 2021 at 14:08):
http://build.fhir.org/ig/Carequality/CEQSubscription/branches/master/failure/build.log
David deRoode (Jun 10 2021 at 14:23):
Me too
Rob Hausam (Jun 10 2021 at 14:25):
I'll check
Rob Hausam (Jun 10 2021 at 14:35):
Restarted - service is back up now.
Josh Mandel (Jun 10 2021 at 14:36):
When the server crashes, is the entire thing locked? Would it be possible to have a watchdog process that restarts it automatically when some conditions are met?
Rob Hausam (Jun 10 2021 at 14:41):
@Josh Mandel It's the service that hangs (I suspect it could be due to memory leak issues?). The server itself is (usually) fine. In lieu of having the underlying problem resolved, I agree that a watchdog process that can follow the service activity and detect when it has stopped and then automatically restart the service, as you suggest, would probably be an immense help. If the watchdog could notify me, Mark and Grahame and then restart the service, that would probably be ideal.
David Pyke (Jun 10 2021 at 14:42):
So, if it can't fetch the capabilitystatement, it restarts the service.
Rob Hausam (Jun 10 2021 at 14:42):
Yes, sure - that should work.
Rob Hausam (Jun 10 2021 at 14:45):
The watchdog could be set to ping periodically for the CapabilityStatement (I don't think the ping would need to be terminology specific).
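A minimal sketch of the check discussed above, in Python for illustration only (it is not the monitor that was later deployed). It assumes the standard FHIR `[base]/metadata` endpoint serves the CapabilityStatement; `is_alive`, `check_once`, and the `restart`/`notify` callbacks are hypothetical names.

```python
import urllib.error
import urllib.request


def is_alive(base_url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True if GET [base]/metadata answers with HTTP 200 in time."""
    url = base_url.rstrip("/") + "/metadata"
    request = urllib.request.Request(
        url, headers={"Accept": "application/fhir+json"}
    )
    try:
        with opener(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, DNS failures, timeouts and HTTP errors.
        return False


def check_once(probe, restart, notify):
    """Run one watchdog cycle; return True if the service was healthy."""
    if probe():
        return True
    # Notify first, then restart, in the order suggested in the thread.
    notify("terminology service did not return its CapabilityStatement")
    restart()
    return False
```

A caller would run something like `check_once(lambda: is_alive("http://tx.fhir.org/r4"), restart, notify)` on a timer, with `restart` and `notify` supplied by whatever service control and messaging is available on the host.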
Josh Mandel (Jun 10 2021 at 16:20):
Cool. Is this a linux server or windows? How are deployments/updates managed?
Grahame Grieve (Jun 10 2021 at 16:26):
windows. I install a windows service
David Pyke (Jun 10 2021 at 16:35):
Well, there's your problem... Windows
Gino Canessa (Jun 10 2021 at 16:45):
What kind of notification would you want when it goes down?
David Pyke (Jun 10 2021 at 16:54):
A fax. Need to be backwards compatible.
Gino Canessa (Jun 10 2021 at 16:57):
Hmm.. I'd suggest some e-Fax solution then; the international call charges are going to be nuts otherwise.
David Pyke (Jun 10 2021 at 16:58):
Maybe an email to fax gateway
Rob Hausam (Jun 10 2021 at 17:53):
@Gino Canessa Maybe we could take it a step up and go with email? :) Or a Zulip notification. So we can be aware of the circumstances and monitor how often this problem actually occurs. Eventually we should see about fixing the underlying issue.
David Pyke (Jun 10 2021 at 17:54):
I can meet you half way and use ICQ
Gino Canessa (Jun 10 2021 at 18:06):
@Rob Hausam sure - I'm just doing something basic, but email is straightforward. I don't want to do too much, because it quickly becomes better to bite the bullet and use a proper monitoring package.
Rob Hausam (Jun 10 2021 at 18:25):
Yes, that makes sense. Best to keep it simple.
Gino Canessa (Jun 10 2021 at 21:57):
@Rob Hausam @Grahame Grieve I have a basic service that monitors a url, restarts a service (and kills a process if it can't stop it, etc.). I'm starting the notification code, but realized the time and I won't be able to get back to it until next week. Do you have a preference on what it has now vs. waiting until next week?
Rob Hausam (Jun 10 2021 at 22:20):
@Gino Canessa I think the notification is definitely a very nice to have. But I, at least, think that going ahead and deploying what you have now makes sense. It would definitely help us to provide an improved level of service. It seems that the issue has been showing up more frequently as the demands on the server have apparently been increasing (I haven't actually had a look at the stats).
Peter Jordan (Jun 10 2021 at 23:09):
A good, lightweight availability test is [base]/$versions which is called every 5 mins on my Server (hosted on Azure).
Gino Canessa (Jun 15 2021 at 18:59):
@Rob Hausam Sorry for the delay in getting back, I missed it last week and am just getting back to this. Josh set up a GitHub repo for this, so it's at: https://github.com/FHIR/terminology-service-liveness-monitor . You can clone the repo and build, or I did a release for v0.0.1 which has the binaries.
The documentation explains, but the TLDR:
- A console app that can also be installed as a service
- Configuration can be done either via appsettings.json (in the executable directory) or via environment variables (standard translation for the property names - Microsoft.Extensions.Configuration)
  - WindowsServiceName: the name of the service that we are monitoring (e.g., the one to start/stop)
  - ProcessName: if a value is present, the monitor will kill that process if it takes too long to stop
  - ServiceTestUrl: the URL to GET, service will be restarted if the request fails
  - ServiceStopDelaySeconds: number of seconds to wait before killing the process (if set in ProcessName)
  - PollIntervalSeconds: number of seconds between checks to the ServiceTestUrl
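To make the five settings concrete, here is a rough Python sketch of how they might be read. It is not the monitor's actual loader (that is a .NET app using Microsoft.Extensions.Configuration). The flat JSON layout, the rule that environment variables override file values, and the names `MonitorSettings` and `load_settings` are all assumptions made for illustration.

```python
import json
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Property names as listed in the message above.
KEYS = (
    "WindowsServiceName",
    "ProcessName",
    "ServiceTestUrl",
    "ServiceStopDelaySeconds",
    "PollIntervalSeconds",
)


@dataclass
class MonitorSettings:
    windows_service_name: str
    service_test_url: str
    process_name: Optional[str]  # None means "never kill a process"
    service_stop_delay_seconds: int
    poll_interval_seconds: int


def load_settings(json_text: str,
                  env: Mapping[str, str] = os.environ) -> MonitorSettings:
    """Parse a flat JSON object, letting same-named env vars override it."""
    values = dict(json.loads(json_text))
    for key in KEYS:
        if key in env:
            values[key] = env[key]  # assumption: environment wins over file
    return MonitorSettings(
        windows_service_name=values["WindowsServiceName"],
        service_test_url=values["ServiceTestUrl"],
        process_name=values.get("ProcessName") or None,
        service_stop_delay_seconds=int(values["ServiceStopDelaySeconds"]),
        poll_interval_seconds=int(values["PollIntervalSeconds"]),
    )
```

Check the repository's documentation for the real file layout before relying on any of these assumptions.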
The Email configuration can be ignored for now, since I haven't implemented it yet.
edit: It can be run from the command line (requires admin) or installed as a service via something like sc create terminology-service-liveness-monitor "binPath=<path to executable>\terminology-service-liveness-monitor.exe" start=demand
I'll add the email as soon as I have a chance, but this should at least keep the service running until someone sets up a proper monitoring system.
Cheers!
Rob Hausam (Jun 15 2021 at 19:40):
Thanks, @Gino Canessa. That is fantastic! I should be able to work on getting it set up on the server later today.
Grahame Grieve (Jun 16 2021 at 22:22):
@Gino Canessa the server takes about 5 minutes to start up. is there a start delay?
Gino Canessa (Jun 16 2021 at 22:24):
Hmm.. I don’t think I put that in. It should wait until it has a success after starting to do anything, but I don’t remember if I put that in or just thought about it. I’ll check next chance I get
Rob Hausam (Jun 16 2021 at 23:29):
@Gino Canessa @Grahame Grieve Actually, I'm thinking from what I've seen that the start time is probably closer to 10 min (at least to be safe about it). So I wouldn't do any monitoring until at least that amount of time after restart has passed.
Gino Canessa (Jun 16 2021 at 23:48):
My thought is that once a start is issued, failures are ignored (can add a max timeout here if we want). Once there is a success, monitoring resumes as normal
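The rule described above can be sketched as a small state machine. This is an illustrative Python version, not the monitor's code: `RestartGate` and its method names are hypothetical, and the optional `max_startup_seconds` stands in for the "max timeout" mentioned as a possible addition.

```python
import time


class RestartGate:
    """Decide whether a failed probe should trigger a restart."""

    def __init__(self, max_startup_seconds=None, clock=time.monotonic):
        self._clock = clock
        self._max = max_startup_seconds
        self._started_at = None  # None means normal monitoring is active

    def restart_issued(self):
        """Record that a start was just issued; failures are now ignored."""
        self._started_at = self._clock()

    def should_restart(self, probe_ok):
        if probe_ok:
            # First success after a start resumes normal monitoring.
            self._started_at = None
            return False
        if self._started_at is None:
            return True  # failure during normal monitoring
        if self._max is not None:
            # Still starting up: only act once the startup window is used up.
            return self._clock() - self._started_at >= self._max
        return False  # no maximum configured: keep waiting for a success
```

With `RestartGate(max_startup_seconds=600)`, for example, failures in the first ten minutes after a start are ignored, which matches the startup times reported in the thread. The 600 is only an example value.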
Rob Hausam (Jun 17 2021 at 01:10):
That seems like probably the most sensible and simplest approach.
Christian Nau (Jun 23 2021 at 10:40):
Hi, during IG build I'm getting this error:
Error performing operation 'validate-code: Failed to connect to tx.fhir.org/104.196.166.17:80' (parameters = "") for 'http://unitsofmeasure.org#cm[H2O]'
Is this related to the server being down? If I go to http://tx.fhir.org/r4/CodeSystem/$lookup?system=http://unitsofmeasure.org&code=cm[H2O] I get a response. What could be the reason for the error in my IG build?
Grahame Grieve (Jun 23 2021 at 11:11):
have you cleared out your txCache
Christian Nau (Jun 23 2021 at 11:17):
yes, I did.
Christian Nau (Jun 23 2021 at 11:18):
...if you are talking about the directory ig-root/input-cache/txcache
Grahame Grieve (Jun 23 2021 at 11:22):
and that particular operation is consistently failing?
Jose Costa Teixeira (Feb 02 2022 at 08:53):
@Grahame Grieve IG Publisher is complaining
Grahame Grieve (Feb 02 2022 at 09:24):
what about?
Jose Costa Teixeira (Feb 02 2022 at 09:47):
Timeouts
Jose Costa Teixeira (Feb 02 2022 at 09:47):
It was timing out for a while, then worked again, now it's hesitating
Oliver Egger (Feb 02 2022 at 11:43):
it's failing now completely with the ci-build: org.hl7.fhir.exceptions.FHIRException: Unable to connect to terminology server. Error = Error fetching the server's capability statement: Error parsing response message: This does not appear to be a FHIR resource (wrong namespace '') (@ /)
Oliver Egger (Feb 02 2022 at 11:44):
e.g. http://build.fhir.org/ig/ahdis/ch-crl/branches/master/failure/build.log
Jose Costa Teixeira (Feb 02 2022 at 12:43):
@Rob Hausam can you help?
Rob Hausam (Feb 02 2022 at 12:54):
yes - let me check
Rob Hausam (Feb 02 2022 at 13:20):
I'm working on restarting it. It's not coming up very quickly, but it may just need a little longer.
Rob Hausam (Feb 02 2022 at 13:36):
The startup had a glitch, but it should be coming back.
Rob Hausam (Feb 02 2022 at 13:50):
The FHIR server may be having an SSL issue. I'll restart the VM and see how that works.
Rob Hausam (Feb 02 2022 at 14:47):
We should finally be back up. Sorry it took so long!
Grahame Grieve (Feb 02 2022 at 20:00):
what happened?
Rob Hausam (Feb 02 2022 at 20:38):
Not clear. The monitor didn't detect that it was down, but the console was frozen and the server wasn't responding to requests. I was preparing to lead and then was leading the IPS call at the same time, and I didn't end up checking if memory was maxed out (or something like that). Restarting the service didn't restore it. But after rebooting the VM and then restarting the service a couple of additional times, it resumed normal operation and seems to have been functioning normally since then.
Yan Heras (Feb 23 2022 at 22:15):
Is the terminology server currently down? ... Caused by: java.net.ConnectException: Failed to connect to tx.fhir.org/104.196.166.17:80
Grahame Grieve (Feb 23 2022 at 22:17):
yes
Last updated: Apr 12 2022 at 19:14 UTC