FHIR Chat · tx server down? · IG creation

Stream: IG creation

Topic: tx server down?


view this post on Zulip David Pyke (Jun 10 2021 at 14:07):

I'm getting timeouts on connect

view this post on Zulip David Pyke (Jun 10 2021 at 14:08):

http://build.fhir.org/ig/Carequality/CEQSubscription/branches/master/failure/build.log

view this post on Zulip David deRoode (Jun 10 2021 at 14:23):

Me too

view this post on Zulip Rob Hausam (Jun 10 2021 at 14:25):

I'll check

view this post on Zulip Rob Hausam (Jun 10 2021 at 14:35):

Restarted - service is back up now.

view this post on Zulip Josh Mandel (Jun 10 2021 at 14:36):

When the server crashes, is the entire thing locked? Would it be possible to have a watchdog process that restarts it automatically when some conditions are met?

view this post on Zulip Rob Hausam (Jun 10 2021 at 14:41):

@Josh Mandel It's the service that hangs (I suspect it could be due to memory leak issues?). The server itself is (usually) fine. In lieu of having the underlying problem resolved, I agree that a watchdog process that can follow the service activity and detect when it has stopped and then automatically restart the service, as you suggest, would probably be an immense help. If the watchdog could notify me, Mark and Grahame and then restart the service, that would probably be ideal.

view this post on Zulip David Pyke (Jun 10 2021 at 14:42):

So, if it can't fetch the capabilitystatement, it restarts the service.

view this post on Zulip Rob Hausam (Jun 10 2021 at 14:42):

Yes, sure - that should work.

view this post on Zulip Rob Hausam (Jun 10 2021 at 14:45):

The watchdog could be set to ping periodically for the CapabilityStatement (I don't think the ping would need to be terminology specific).
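
A minimal sketch of the check discussed above, not the monitor that was eventually deployed: fetch the CapabilityStatement and treat anything other than a parseable CapabilityStatement as "down". The `/metadata` path, the JSON `Accept` header, and the names `fetch_body` and `is_alive` are assumptions for illustration.

```python
import json
import urllib.request
from typing import Callable


def fetch_body(url: str, timeout: float = 30.0) -> str:
    """GET the URL and return the body as text; raises on any network failure."""
    request = urllib.request.Request(url, headers={"Accept": "application/fhir+json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def is_alive(url: str, fetch: Callable[[str], str] = fetch_body) -> bool:
    """Return True only if the URL yields a JSON CapabilityStatement."""
    try:
        resource = json.loads(fetch(url))
    except Exception:
        # Timeouts, refused connections, HTTP errors and non-JSON bodies
        # all count as "not alive" for watchdog purposes.
        return False
    return isinstance(resource, dict) and resource.get("resourceType") == "CapabilityStatement"
```

Something like `is_alive("http://tx.fhir.org/r4/metadata")` would then be the periodic ping; the `fetch` parameter exists only so the check can be exercised without a network.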

view this post on Zulip Josh Mandel (Jun 10 2021 at 16:20):

Cool. Is this a linux server or windows? How are deployments/updates managed?

view this post on Zulip Grahame Grieve (Jun 10 2021 at 16:26):

windows. I install a windows service

view this post on Zulip David Pyke (Jun 10 2021 at 16:35):

Well, there's your problem... Windows

view this post on Zulip Gino Canessa (Jun 10 2021 at 16:45):

What kind of notification would you want when it goes down?

view this post on Zulip David Pyke (Jun 10 2021 at 16:54):

A fax. Need to be backwards compatible.

view this post on Zulip Gino Canessa (Jun 10 2021 at 16:57):

Hmm.. I'd suggest some e-Fax solution then; the international call charges are going to be nuts otherwise.

view this post on Zulip David Pyke (Jun 10 2021 at 16:58):

Maybe an email to fax gateway

view this post on Zulip Rob Hausam (Jun 10 2021 at 17:53):

@Gino Canessa Maybe we could take it a step up and go with email? :) Or a Zulip notification. So we can be aware of the circumstances and monitor how often this problem actually occurs. Eventually we should see about fixing the underlying issue.

view this post on Zulip David Pyke (Jun 10 2021 at 17:54):

I can meet you half way and use ICQ

view this post on Zulip Gino Canessa (Jun 10 2021 at 18:06):

@Rob Hausam sure - I'm just doing something basic, but email is straightforward. I don't want to do too much, because it quickly becomes better to bite the bullet and use a proper monitoring package.

view this post on Zulip Rob Hausam (Jun 10 2021 at 18:25):

Yes, that makes sense. Best to keep it simple.

view this post on Zulip Gino Canessa (Jun 10 2021 at 21:57):

@Rob Hausam @Grahame Grieve I have a basic service that monitors a url, restarts a service (and kills a process if it can't stop it, etc.). I'm starting the notification code, but realized the time and I won't be able to get back to it until next week. Do you have a preference on what it has now vs. waiting until next week?

view this post on Zulip Rob Hausam (Jun 10 2021 at 22:20):

@Gino Canessa I think the notification is definitely a very nice to have. But I, at least, think that going ahead and deploying what you have now makes sense. It would definitely help us to provide an improved level of service. It definitely seems that as the demands on the server have apparently been increasing (I haven't actually had a look at the stats) that the issue has been showing up more frequently.

view this post on Zulip Peter Jordan (Jun 10 2021 at 23:09):

A good, lightweight availability test is [base]/$versions which is called every 5 mins on my Server (hosted on Azure).

view this post on Zulip Gino Canessa (Jun 15 2021 at 18:59):

@Rob Hausam Sorry for the delay in getting back, I missed it last week and am just getting back to this. Josh set up a GitHub repo for this, so it's at: https://github.com/FHIR/terminology-service-liveness-monitor. You can clone the repo and build, or I did a release for v0.0.1 which has the binaries.

The documentation explains, but the TLDR:

  • A console app that can also be installed as a service
  • Configuration can be done either via appsettings.json (in the executable directory) or via environment variables (standard translation for the property names - Microsoft.Extensions.Configuration)
    • WindowsServiceName: the name of the service that we are monitoring (e.g., the one to start/stop)
    • ProcessName: if a value is present, the monitor will kill that process if it takes too long to stop
    • ServiceTestUrl: the URL to GET, service will be restarted if the request fails
    • ServiceStopDelaySeconds: number of seconds to wait before killing the process (if set in ProcessName)
    • PollIntervalSeconds: number of seconds between checks to the ServiceTestUrl

The Email configuration can be ignored for now, since I haven't implemented it yet.

edit: It can be run from the command line (requires admin) or installed as a service via something like sc create terminology-service-liveness-monitor "binPath=<path to executable>\terminology-service-liveness-monitor.exe" start=demand

I'll add the email as soon as I have a chance, but this should at least keep the service running until someone sets up a proper monitoring system.

Cheers!
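
The restart behaviour described by those settings could look roughly like the sketch below. This is a Python illustration under stated assumptions, not the code in the repository: it assumes the Windows `sc` and `taskkill` commands are used, that `sc query` output contains the word `STOPPED` once the service has stopped, and that the process name is passed as an image name such as `example.exe`. The names `run_command` and `restart_service` are hypothetical.

```python
import subprocess
import time
from typing import Callable, List, Optional


def run_command(args: List[str]) -> str:
    """Run a command and return its standard output (needs admin rights on Windows)."""
    return subprocess.run(args, capture_output=True, text=True).stdout


def restart_service(
    service_name: str,                # plays the role of WindowsServiceName
    process_name: Optional[str],      # plays the role of ProcessName (may be unset)
    stop_delay_seconds: float,        # plays the role of ServiceStopDelaySeconds
    run: Callable[[List[str]], str] = run_command,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Stop the service, kill the process if it will not stop, then start it again."""
    run(["sc", "stop", service_name])
    sleep(stop_delay_seconds)
    still_running = "STOPPED" not in run(["sc", "query", service_name])
    if still_running and process_name:
        # Assumption: force-kill by image name when the service hangs on stop.
        run(["taskkill", "/F", "/IM", process_name])
    run(["sc", "start", service_name])
```

The `run` and `sleep` parameters are there so the sequence can be checked without touching a real service.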

view this post on Zulip Rob Hausam (Jun 15 2021 at 19:40):

Thanks, @Gino Canessa. That is fantastic! I should be able to work on getting it set up on the server later today.

view this post on Zulip Grahame Grieve (Jun 16 2021 at 22:22):

@Gino Canessa the server takes about 5 minutes to start up. is there a start delay?

view this post on Zulip Gino Canessa (Jun 16 2021 at 22:24):

Hmm.. I don’t think I put that in. It should wait until it has a success after starting to do anything, but I don’t remember if I put that in or just thought about it. I’ll check next chance I get

view this post on Zulip Rob Hausam (Jun 16 2021 at 23:29):

@Gino Canessa @Grahame Grieve Actually, I'm thinking from what I've seen that the start time is probably closer to 10 min (at least to be safe about it). So I wouldn't do any monitoring until at least that amount of time after restart has passed.

view this post on Zulip Gino Canessa (Jun 16 2021 at 23:48):

My thought is that once a start is issued, failures are ignored (can add a max timeout here if we want). Once there is a success, monitoring resumes as normal
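
One way to read that rule, sketched in Python as an illustration rather than the monitor's actual code: after a restart is issued, failed polls are ignored until either a poll succeeds or an optional maximum startup time elapses. The class name `Monitor`, the `observe` method, and the 600-second value in the usage note are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Monitor:
    max_startup_seconds: Optional[float] = None  # None means wait indefinitely
    starting: bool = False                       # True between a restart and the first success
    started_at: float = 0.0                      # time of the most recent restart

    def observe(self, healthy: bool, now: float) -> bool:
        """Record one poll result; return True if a restart should be issued."""
        if healthy:
            # First success after a start: resume normal monitoring.
            self.starting = False
            return False
        if self.starting:
            timed_out = (
                self.max_startup_seconds is not None
                and now - self.started_at >= self.max_startup_seconds
            )
            if not timed_out:
                return False  # still starting up, ignore this failure
        self.starting = True
        self.started_at = now
        return True
```

With `Monitor(max_startup_seconds=600)`, a failure at 300 seconds triggers a restart, a failure at 600 seconds is ignored because only 300 seconds have passed since that restart, and a failure at 900 seconds triggers another restart.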

view this post on Zulip Rob Hausam (Jun 17 2021 at 01:10):

That seems like probably the most sensible and simplest approach.

view this post on Zulip Christian Nau (Jun 23 2021 at 10:40):

Hi, during IG build I'm getting this error:

Error performing operation 'validate-code: Failed to connect to tx.fhir.org/104.196.166.17:80' (parameters = "") for 'http://unitsofmeasure.org#cm[H2O]'

Is this related to the server being down? If I go to http://tx.fhir.org/r4/CodeSystem/$lookup?system=http://unitsofmeasure.org&code=cm[H2O] I get a response. What could be the reason for the error in my IG build?
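
For repeating that manual check from a script, a small sketch that builds the same `$lookup` URL with the parameters percent-encoded (the square brackets in `cm[H2O]` are not guaranteed to survive every HTTP client unencoded). The helper name `lookup_url` is hypothetical, and this reproduces only the browser check, not the `validate-code` call that the IG build makes.

```python
from urllib.parse import urlencode


def lookup_url(base: str, system: str, code: str) -> str:
    """Build a CodeSystem/$lookup GET URL with percent-encoded query parameters."""
    return f"{base}/CodeSystem/$lookup?" + urlencode({"system": system, "code": code})


url = lookup_url("http://tx.fhir.org/r4", "http://unitsofmeasure.org", "cm[H2O]")
```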

view this post on Zulip Grahame Grieve (Jun 23 2021 at 11:11):

have you cleared out your txCache

view this post on Zulip Christian Nau (Jun 23 2021 at 11:17):

yes, I did.

view this post on Zulip Christian Nau (Jun 23 2021 at 11:18):

...if you are talking about the directory ig-root/input-cache/txcache
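
Clearing that directory can be scripted; a minimal sketch, assuming the cache lives at `input-cache/txcache` under the IG root as stated above. The function name `clear_tx_cache` is hypothetical.

```python
import shutil
from pathlib import Path


def clear_tx_cache(ig_root: str) -> bool:
    """Delete <ig_root>/input-cache/txcache; return True if something was removed."""
    cache_dir = Path(ig_root) / "input-cache" / "txcache"
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)
    return True
```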

view this post on Zulip Grahame Grieve (Jun 23 2021 at 11:22):

and that particular operation is consistently failing?

view this post on Zulip Jose Costa Teixeira (Feb 02 2022 at 08:53):

@Grahame Grieve IG Publisher is complaining

view this post on Zulip Grahame Grieve (Feb 02 2022 at 09:24):

what about?

view this post on Zulip Jose Costa Teixeira (Feb 02 2022 at 09:47):

Timeouts

view this post on Zulip Jose Costa Teixeira (Feb 02 2022 at 09:47):

It was timing out for a while, then worked again, now it's hesitating

view this post on Zulip Oliver Egger (Feb 02 2022 at 11:43):

it's failing now completely with the ci-build: org.hl7.fhir.exceptions.FHIRException: Unable to connect to terminology server. Error = Error fetching the server's capability statement: Error parsing response message: This does not appear to be a FHIR resource (wrong namespace '') (@ /)

view this post on Zulip Oliver Egger (Feb 02 2022 at 11:44):

e.g. http://build.fhir.org/ig/ahdis/ch-crl/branches/master/failure/build.log

view this post on Zulip Jose Costa Teixeira (Feb 02 2022 at 12:43):

@Rob Hausam can you help?

view this post on Zulip Rob Hausam (Feb 02 2022 at 12:54):

yes - let me check

view this post on Zulip Rob Hausam (Feb 02 2022 at 13:20):

I'm working on restarting it. It's not coming up very quickly, but it may just need a little longer.

view this post on Zulip Rob Hausam (Feb 02 2022 at 13:36):

The startup had a glitch, but it should be coming back.

view this post on Zulip Rob Hausam (Feb 02 2022 at 13:50):

The FHIR server may be having an SSL issue. I'll restart the VM and see how that works.

view this post on Zulip Rob Hausam (Feb 02 2022 at 14:47):

We should finally be back up. Sorry it took so long!

view this post on Zulip Grahame Grieve (Feb 02 2022 at 20:00):

what happened?

view this post on Zulip Rob Hausam (Feb 02 2022 at 20:38):

Not clear. The monitor didn't detect that it was down, but the console was frozen and the server wasn't responding to requests. I was preparing to lead and then was leading the IPS call at the same time, and I didn't end up checking if memory was maxed out (or something like that). Restarting the service didn't restore it. But after rebooting the VM and then restarting the service a couple of additional times, it resumed normal operation and seems to have been functioning normally since then.

view this post on Zulip Yan Heras (Feb 23 2022 at 22:15):

Is the terminology server currently down? ... Caused by: java.net.ConnectException: Failed to connect to tx.fhir.org/104.196.166.17:80

view this post on Zulip Grahame Grieve (Feb 23 2022 at 22:17):

yes


Last updated: Apr 12 2022 at 19:14 UTC