Stream: crucible
Topic: crucible hanging
Ben Spencer (Mar 05 2019 at 13:23):
Hello again
We've deployed a public server and are attempting to run Crucible against it. It seems to consistently get stuck after about half an hour, not always in exactly the same place. Once it gets stuck, we can see from our logs that we're no longer receiving HTTP requests from it. I've also run a local Crucible in docker-compose against the same server, and that runs the tests to completion.
https://projectcrucible.org/servers/5c791f4404ebd07fa1000000
Last request from that run was at 13:07:42
Robert Scanlon (Mar 05 2019 at 14:06):
Hmmm, thanks Ben, we'll take a look. Looks like it is hanging on 'Resource Test Supply Delivery' right now. Screen-Shot-2019-03-05-at-9.05.00-AM.png
Robert Scanlon (Mar 05 2019 at 14:09):
In the past we have had some problems when running the full barrage of tests, though it wasn't consistent enough to replicate and track down the issue. I believe our running hypothesis was either the JSON parsing or XML parsing libraries caused some kind of fatal error to occur.
Robert Scanlon (Mar 05 2019 at 14:09):
From what you can tell, is it always failing on the same test?
Robert Scanlon (Mar 05 2019 at 14:10):
We should also have a cleanup job that identifies stalled tests, kills them, and resumes the run when this type of thing happens, but if it has been stuck for a while then perhaps that isn't working either.
Ben Spencer (Mar 05 2019 at 14:17):
From what I can tell, it's not always failing on the same test, but it does seem to be roughly in the same place, somewhere in the Base Resources / Clinical Resources / Financial Resources sections
Let me know if I can provide any more information from our end.
Robert Scanlon (Mar 05 2019 at 17:40):
Turns out our process that identifies stalled jobs and restarts them had been disabled due to a cron issue. I re-enabled it, and now the stalled test will get marked as a fatal 'Crucible Error', and Crucible will pick up on the next test in the run. Screen-Shot-2019-03-05-at-12.37.05-PM.png
Robert Scanlon (Mar 05 2019 at 17:40):
Not ideal, but at least Crucible gracefully recovers now.
Robert Scanlon (Mar 05 2019 at 17:41):
... and the rest of the tests will be run.
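(A minimal sketch of the recovery behaviour described above, not Crucible's actual implementation. It assumes a periodic job, such as one run from cron, that marks any running test with no recent activity as errored so the run can move on to the next pending test. The names `TestJob`, `reap_stalled`, and `next_pending`, and the 30-minute timeout, are hypothetical.)

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class TestJob:
    # Hypothetical record for one test in a run.
    name: str
    status: str = "pending"  # one of: pending, running, passed, error
    last_activity: Optional[datetime] = None


def reap_stalled(jobs: List[TestJob], now: datetime, timeout: timedelta) -> List[str]:
    """Mark running jobs that have been idle longer than `timeout` as errored.

    Returns the names of the jobs that were marked.
    """
    reaped = []
    for job in jobs:
        if (
            job.status == "running"
            and job.last_activity is not None
            and now - job.last_activity > timeout
        ):
            job.status = "error"  # surfaced to the user as a fatal error
            reaped.append(job.name)
    return reaped


def next_pending(jobs: List[TestJob]) -> Optional[TestJob]:
    """Return the first job still waiting to run, or None if the run is done."""
    for job in jobs:
        if job.status == "pending":
            return job
    return None


if __name__ == "__main__":
    # Illustrative data only: one test idle since 13:07:42, one not yet started.
    jobs = [
        TestJob("stalled test", "running", datetime(2019, 3, 5, 13, 7, 42)),
        TestJob("following test"),
    ]
    stalled = reap_stalled(jobs, datetime(2019, 3, 5, 14, 6), timedelta(minutes=30))
    print("marked as error:", stalled)
    print("resuming with:", next_pending(jobs))
```

(If the scheduler that triggers such a job is disabled, nothing ever calls `reap_stalled`, and a stalled test stays in the running state indefinitely, which matches the hang reported here.)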
Ben Spencer (Mar 06 2019 at 08:31):
thanks Robert!
Ben Spencer (Mar 06 2019 at 08:32):
is it likely that the error was caused by something on our end?
Robert Scanlon (Mar 06 2019 at 16:57):
We've seen this happen on other servers, so it is unlikely that you are doing anything wrong. Let me know if things get stuck again -- I expect you'll still see the occasional 'Unrecoverable Crucible Error', but the tests shouldn't hang indefinitely any more
Robert Scanlon (Mar 06 2019 at 16:58):
Thanks for reporting that issue!
Last updated: Apr 12 2022 at 19:14 UTC