Stream: committers
Topic: Auto Build not working for fhir-mCODE-ig
Chris Moesel (May 29 2019 at 13:53):
We've just set up a new IG at https://github.com/HL7/fhir-mCODE-ig
It's configured for the auto build, but it doesn't seem that builds are going through. I've also tried triggering builds manually and get this response: {"org":"HL7","repo":"fhir-mCODE-ig","branch":"master"}
. Still, however, I see no evidence of the build actually running.
Is there a way to debug this further? I can build the IG locally. @Josh Mandel?
Chris Moesel (May 29 2019 at 13:54):
BTW -- I've seen _other_ builds go through after triggering mine, so I don't believe it is a systemic issue with the auto build.
Josh Mandel (May 29 2019 at 20:44):
We don't have a good transparent way to enable you to debug, but I'll see what's happening in the logs. (BTW @Chris Moesel I'm always happy to get you onboarded into the k8s cluster where we manage this infra if you ever want/need to poke around.)
Josh Mandel (May 29 2019 at 20:47):
I re-triggered the github webhook to kick off another build (equivalent to what you already did by manual triggering). Things crank along pretty well for the first couple of minutes, with build.log showing lines like
No value set at Extension.valueCodeableConcept (url = 'http://hl7.org/fhir/ValueSet/service-category')
No value set at Extension.valueCodeableConcept (url = 'http://hl7.org/fhir/ValueSet/service-type')
No value set at Extension.valueCodeableConcept (url = 'http://hl7.org/fhir/ValueSet/service-type')
No value set at Extension.extension-type.valueCodeableConcept (url = 'http://hl7.org/fhir/ValueSet/research-study-objective-type')
No value set at Extension.extension-type.valueCodeableConcept (url = 'http://hl7.org/fhir/ValueSet/research-study-objective-type')
No value set at Extension.extension-type.valueCodeableConcept (url = 'http://hl7.org/fhir/ValueSet/udi-entry-type')
No value set at Extension.extension-type.valueCodeableConcept (url = 'http://hl7.org/fhir/ValueSet/udi-entry-type')
No value set at Extension.valueCodeableConcept (url = 'https://www.ncbi.nlm.nih.gov/clinvar')
No value set at Extension.valueCodeableConcept (url = 'https://www.ncbi.nlm.nih.gov/clinvar')
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 49755003
Josh Mandel (May 29 2019 at 20:47):
Will see where this blows up.
Josh Mandel (May 29 2019 at 20:52):
I see, simple as:
Start Time: Wed, 29 May 2019 12:59:11 +0000
Active Deadline Seconds: 1800s
Pods Statuses: 0 Running / 0 Succeeded / 1 Failed
Josh Mandel (May 29 2019 at 20:52):
We set a 30-minute deadline for all jobs at:
https://github.com/FHIR/auto-ig-builder/blob/master/triggers/ig-commit-trigger/job.json#L11
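(For reference: the field enforcing this is the standard Kubernetes Job spec field `activeDeadlineSeconds`; the "Active Deadline Seconds: 1800s" in the failed-pod status above comes from it. Trimmed down to just the relevant field, with everything else elided, the job template looks roughly like:)

```json
{
  "apiVersion": "batch/v1",
  "kind": "Job",
  "spec": {
    "activeDeadlineSeconds": 1800
  }
}
```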
Grahame Grieve (May 29 2019 at 20:53):
we can't get an error message when that limit is hit?
Josh Mandel (May 29 2019 at 20:53):
We should definitely look into this; the challenge right now is that jobs are killed from outside the container where they run (it's an external process doing the watching/killing, which has no context about what the job is doing; it just enforces a timer using k8s internals).
Josh Mandel (May 29 2019 at 20:53):
So the best way to get this behavior would be to wrap our job runner explicitly...
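A minimal sketch of what such an explicit in-process wrapper could look like (the class name, command, and limit here are illustrative, not the actual auto-builder code): the wrapper runs the build as a child process and, on timeout, can log why the job is being killed before killing it, which the external k8s watcher cannot do.

```java
import java.util.concurrent.TimeUnit;

public class JobRunner {
    // Run a command with a deadline; on timeout, report the cause ourselves
    // before killing the child, instead of relying on an external watcher
    // that has no context about what the job was doing.
    static boolean runWithDeadline(long timeout, TimeUnit unit, String... cmd)
            throws Exception {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (!p.waitFor(timeout, unit)) {
            System.err.println("Build exceeded deadline -- killing and flushing logs");
            p.destroyForcibly().waitFor();
            return false;
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        // Illustrative: a fast command finishes well inside the deadline.
        boolean ok = runWithDeadline(30, TimeUnit.MINUTES, "echo", "done");
        System.out.println(ok ? "completed" : "timed out");
    }
}
```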
Josh Mandel (May 29 2019 at 20:54):
Do you know how long this build is expected to take @Chris Moesel? I don't think we've come close to the current deadline with other IGs, so this is new territory.
Grahame Grieve (May 29 2019 at 20:56):
mCode is that big? why?
Josh Mandel (May 29 2019 at 20:58):
It looks like the slow thing is stuff like...
-tx cache miss: $expand on Include 25 codes from http://snomed.info/sct
-tx cache miss: $expand on Include 3 codes from http://unitsofmeasure.org
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 272485000
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 42343007
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 698247007
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 368009
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 39785005
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 400047006
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 38341003
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 44695005
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 118940003
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 413839001
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 73211009
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 40930008
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 42399005
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 235856003
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 13200003
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 86406008
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 3723001
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 64779008
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 414916001
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 89362005
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 237840007
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 271737000
Grahame Grieve (May 29 2019 at 20:59):
build took 1:45 for me
Grahame Grieve (May 29 2019 at 21:00):
whoops - ran the wrong guide :-(
Grahame Grieve (May 29 2019 at 21:06):
5:37 for me
Josh Mandel (May 29 2019 at 21:06):
Is that with a bunch of cached terminology files present in your local filesystem?
Josh Mandel (May 29 2019 at 21:06):
Because the slow part looks like all these "tx cache miss" steps (it's making very slow progress).
Chris Moesel (May 29 2019 at 21:08):
Sorry -- replied when filtered to only my mentions, so missed most of the conversation... Let me try again...
Chris Moesel (May 29 2019 at 21:12):
The first run without a cache takes a while, but I've still never had it take 30+ minutes. I've checked in the txCache, so not sure about all those misses -- unless there is a dependency between the cache and the version of the publisher? I think I built the cache using a publisher that is a week or two old.
Grahame Grieve (May 29 2019 at 21:14):
no dependency. I don't understand why it's a problem
Chris Moesel (May 29 2019 at 21:15):
I have to run to a 7th grade chorus concert (wish me luck), but I'll kick off a local build later to see how long it takes for me. Just to have another data point. Thanks for looking at this @Josh Mandel and @Grahame Grieve -- we really appreciate it!
Chris Moesel (May 29 2019 at 21:15):
@Grahame Grieve -- when you run it are you sitting right across the room from the terminology server? ;-)
Grahame Grieve (May 29 2019 at 21:16):
@Josh Mandel where do this cache miss notifications appear in the log? I don't get them...
Grahame Grieve (May 29 2019 at 21:17):
I'm sitting right across the cloud from the server, just like everyone else. I don't get those cache misses
Josh Mandel (May 29 2019 at 21:35):
They appear in my build.log file, which includes all the stdout and stderr of the java process.
Grahame Grieve (May 29 2019 at 21:43):
I mean, what other log entries precede them - I'm trying to place them in the build sequence
Josh Mandel (May 29 2019 at 22:17):
I'm updating the build system to push logs before a build is killed -- stand by :)
Josh Mandel (May 29 2019 at 22:17):
It'll take a bit...
Josh Mandel (May 29 2019 at 22:18):
In the meantime: I'm seeing stuff like: https://gist.github.com/jmandel/653a2463153f0b3892eb25e6c73f8247
Josh Mandel (May 29 2019 at 22:19):
(Somehow, making the in-build timeouts work meant upgrading jdk8->jdk12 and python2->3 :p)
Josh Mandel (May 29 2019 at 22:49):
okay, building now and should die + send a notification in ~25 min, with full logs. :fingers_crossed:
Chris Moesel (May 30 2019 at 01:29):
FYI -- just pulled master for fhir-mCODE-ig and built w/ the latest IG publisher. No cache misses and finished in 00:04:14.0187sec on my 2016 Macbook Pro.
Chris Moesel (May 30 2019 at 01:40):
Realized that was with old output still in the path, so I did a git clean -xfd to clear all uncommitted and ignored files and ran it again to be sure. This time was even better: 00:03:59.0590sec.
Josh Mandel (May 30 2019 at 01:54):
Hmm. I was able to reproduce the long build times from my laptop -- though I didn't let it go to completion, so I'll take some detailed timing tomorrow.
Josh Mandel (May 30 2019 at 01:56):
When you say "no cache misses", what does that mean in terms of *where* cache hits are occurring? Is this something local, or is this a hit in the server? (I'm assuming it can't be local, because you should not have any cache files locally after a clean.)
Grahame Grieve (May 30 2019 at 02:07):
cache files are not deleted by clean - they are committed to github to keep the build times down
Grahame Grieve (May 30 2019 at 02:08):
the files in /txCache.. for some reason, it seems as though the caching algorithm is not working for mCode...
Chris Moesel (May 30 2019 at 02:38):
Right, since the txCache is committed, I was definitely using the cache.

> for some reason, it seems as though the caching algorithm is not working for mCode...

It seems that there aren't a ton of IGs out there still being built off DSTU2, and even fewer built off Argonaut. It seemed like we were the guinea pigs for several DSTU2/Argonaut issues in the IG Publisher. Is it possible that mCODE is having some one-off problems due to its DSTU2-ishness?
Grahame Grieve (May 30 2019 at 02:48):
no. you and I are having no problem, and the caching algorithm is all R5 infrastructure
Grahame Grieve (May 30 2019 at 02:49):
it smells like a linux thing right now. You building on windows or OSX?
Chris Moesel (May 30 2019 at 02:58):
OSX
Chris Moesel (May 30 2019 at 03:10):
I have an idea why the cache might not be working on Linux. Windows and Mac use case-insensitive file systems by default -- so if you ask for file abc.txt, but it's really called ABC.txt, both Windows and Mac will happily give it to you. Linux, on the other hand, uses a case-sensitive file system, so it will report that abc.txt does not exist. My guess is that the cache manager might be looking up cache files using a different case than they're actually stored on disk. Just a hunch that would explain the cache misses.
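To illustrate the hunch (this demo is mine, not publisher code -- the class and helper names are made up): an exact-name `File` lookup is only guaranteed to hit when the case matches the on-disk name, whereas scanning the directory listing and comparing case-insensitively works on any file system.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class CaseSensitivityDemo {
    // On Linux, new File(dir, "abc.txt").exists() is false when the file is
    // actually named "ABC.txt". A case-insensitive lookup has to scan the
    // directory listing and compare names ignoring case.
    static File findIgnoreCase(File dir, String name) {
        String[] entries = dir.list();
        if (entries != null) {
            for (String fn : entries) {
                if (fn.equalsIgnoreCase(name)) {
                    return new File(dir, fn);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("txCache").toFile();
        Files.writeString(new File(dir, "ABC.txt").toPath(), "cached");

        // Exact-case lookup: hits on every OS because the case matches.
        System.out.println("exact hit: " + new File(dir, "ABC.txt").exists());
        // Scan-based lookup: hits regardless of case, even on Linux.
        System.out.println("scan hit: " + (findIgnoreCase(dir, "abc.txt") != null));
    }
}
```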
Chris Moesel (May 30 2019 at 03:11):
@Grahame Grieve -- is it possible that could happen?
Grahame Grieve (May 30 2019 at 03:12):
of course, that was the first thing I thought of. I cannot see any grounds for that in the code
Grahame Grieve (May 30 2019 at 03:12):
you have the name of the folder correct in the ig.json, the file names are hard coded
Chris Moesel (May 30 2019 at 03:16):
I should have known -- one step ahead of me! Oh well...
Chris Moesel (May 30 2019 at 03:16):
Actually, probably 2 or 3 steps ahead of me. :stuck_out_tongue_wink:
Grahame Grieve (May 30 2019 at 03:19):
it's possible it's a character set issue? I'm going to add some debugging...
Grahame Grieve (May 30 2019 at 04:29):
ok @Chris Moesel and @Josh Mandel please run the new IG publisher (0.9.21) locally, and paste this bit of the log here:
Grahame Grieve (May 30 2019 at 04:29):
Load Terminology Cache from C:\work\org.hl7.fhir.igs\fhir-mCODE-ig\txCache (09.0082sec)
- load all-systems.cache......done
- load condition-category.cache..done
- load https___ncit.nci.nih.gov.cache....done
- load icd-10-cm.cache.....done
- load lang.cache..done
- load loinc.cache....................done
- load mcode-basic-resource-type.cache..done
- load snomed.cache........................................................................................done
- load ucum.cache...........done
- load usmcodeshrcorecsResearchSubjectTerminationReasonCS.cache..done
- load usmcodeshrcorecsSettingCS.cache..done
Josh Mandel (May 30 2019 at 14:55):
$ grep -ri "load term" -A 100 build.log
Load Terminology Cache from /scratch/ig-build-temp-KKFBIZ/repo/txCache (23.0354sec)
Connect to Terminology Server at http://tx.fhir.org (23.0355sec)
-tx cache miss: Connect to http://tx.fhir.org/r2
Initialization complete (24.0867sec)
Load Content (24.0867sec)
create package file at /scratch/ig-build-temp-KKFBIZ/repo/output/package.tgz
Processing Conformance Resources (32.0134sec)
process type: NamingSystem (32.0134sec)
process type: CodeSystem (32.0135sec)
process type: ValueSet (32.0135sec)
process type: ConceptMap (32.0136sec)
process type: DataElement (32.0136sec)
process type: StructureDefinition (32.0136sec)
process type: OperationDefinition (32.0137sec)
process type: SearchParameter (32.0137sec)
process type: CapabilityStatement (32.0138sec)
Josh Mandel (May 30 2019 at 14:57):
(Also, the cache files are... interesting. Why not json?)
Chris Moesel (May 30 2019 at 17:15):
Mine looks more like Grahame's:
Load Terminology Cache from /Users/cmoesel/dev/HL7/fhir-mCODE-ig/txCache (07.0828sec)
- load lang.cache..done
- load all-systems.cache......done
- load usmcodeshrcorecsResearchSubjectTerminationReasonCS.cache..done
- load icd-10-cm.cache.....done
- load ucum.cache...........done
- load loinc.cache....................done
- load condition-category.cache..done
- load usmcodeshrcorecsSettingCS.cache..done
- load snomed.cache........................................................................................done
- load https___ncit.nci.nih.gov.cache....done
- load mcode-basic-resource-type.cache..done
Connect to Terminology Server at http://tx.fhir.org (08.0213sec)
-tx cache miss: Connect to http://tx.fhir.org/r2
Initialization complete (09.0413sec)
Grahame Grieve (May 30 2019 at 23:01):
@Josh Mandel so weird because very standard java File.list() is not seeing the files that exist:
for (String fn : new File(folder).list()) {
  if (fn.endsWith(".cache") && !fn.equals("validation.cache")) {
Grahame Grieve (May 30 2019 at 23:02):
in your case, folder is /scratch/ig-build-temp-KKFBIZ/repo/txCache - that exists, yes?
Josh Mandel (May 30 2019 at 23:02):
Yes it does.
Grahame Grieve (May 30 2019 at 23:03):
so how could that code not find any files?
Josh Mandel (May 30 2019 at 23:04):
Hmm.
Josh Mandel (May 30 2019 at 23:06):
well, txCache starts out empty.
Grahame Grieve (May 30 2019 at 23:06):
... because?
Josh Mandel (May 30 2019 at 23:06):
I'm wondering that.
Grahame Grieve (May 30 2019 at 23:07):
but if it's empty, that will be why that code doesn't see any files..
Josh Mandel (May 30 2019 at 23:07):
yes...
Josh Mandel (May 30 2019 at 23:07):
so it gets emptied.
Josh Mandel (May 30 2019 at 23:08):
About 30 seconds in.
Josh Mandel (May 30 2019 at 23:08):
So immediately post-clone:
bash-4.2# find . | grep -i txCache
./ig-build-temp-ARCYSH/repo/txCache
./ig-build-temp-ARCYSH/repo/txCache/snomed.cache
./ig-build-temp-ARCYSH/repo/txCache/lang.cache
./ig-build-temp-ARCYSH/repo/txCache/usmcodeshrcorecsSettingCS.cache
./ig-build-temp-ARCYSH/repo/txCache/icd-10-cm.cache
./ig-build-temp-ARCYSH/repo/txCache/https___ncit.nci.nih.gov.cache
./ig-build-temp-ARCYSH/repo/txCache/version.ctl
./ig-build-temp-ARCYSH/repo/txCache/usmcodeshrcorecsResearchSubjectTerminationReasonCS.cache
./ig-build-temp-ARCYSH/repo/txCache/loinc.cache
./ig-build-temp-ARCYSH/repo/txCache/condition-category.cache
./ig-build-temp-ARCYSH/repo/txCache/ucum.cache
./ig-build-temp-ARCYSH/repo/txCache/all-systems.cache
./ig-build-temp-ARCYSH/repo/txCache/mcode-basic-resource-type.cache
Josh Mandel (May 30 2019 at 23:08):
Then:
bash-4.2# find . | grep -i txCache
Josh Mandel (May 30 2019 at 23:08):
Then:
bash-4.2# find . | grep -i txCache
./ig-build-temp-ARCYSH/repo/txCache
./ig-build-temp-ARCYSH/repo/txCache/version.ctl
Grahame Grieve (May 30 2019 at 23:09):
hmm. O
Josh Mandel (May 30 2019 at 23:09):
So that's a lead for you :)
Grahame Grieve (May 30 2019 at 23:09):
I'll investigate
Grahame Grieve (May 30 2019 at 23:59):
ok so this turns out to be my fault. The publisher does clear the cache on the ci-build. I've pared that back to leave snomed/loinc cache in place
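The pared-back behavior Grahame describes (clear the ci-build cache, but keep the expensive SNOMED/LOINC expansions) might look roughly like the sketch below. The file names come from the txCache listings earlier in the thread; the class and method names are illustrative, not the actual IG publisher code.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.Set;

public class TxCachePruner {
    // Clear the terminology cache for a ci-build, but keep the expensive
    // SNOMED and LOINC expansions -- the slow "$expand" calls in the logs.
    static final Set<String> KEEP = Set.of("snomed.cache", "loinc.cache");

    static void pruneCache(File folder) {
        File[] files = folder.listFiles();
        if (files == null) return;
        for (File f : files) {
            if (f.getName().endsWith(".cache") && !KEEP.contains(f.getName())) {
                f.delete();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a checked-in txCache folder, then prune it.
        File dir = Files.createTempDirectory("txCache").toFile();
        for (String n : new String[] {"snomed.cache", "loinc.cache", "ucum.cache", "version.ctl"}) {
            new File(dir, n).createNewFile();
        }
        pruneCache(dir);
        File[] remaining = dir.listFiles();
        Arrays.sort(remaining);
        for (File f : remaining) {
            System.out.println(f.getName());
        }
    }
}
```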
Grahame Grieve (May 31 2019 at 01:43):
ok well, let's try again with the new version of the ig publisher
Chris Moesel (May 31 2019 at 03:28):
I just triggered a new build since I saw that Grahame announced a new publisher. Let's see what happens!
Josh Mandel (May 31 2019 at 03:31):
Haha, I did the same, right about when you did. Indeed let's see.
Grahame Grieve (May 31 2019 at 03:39):
so did I. I don't see any success or failure
Chris Moesel (May 31 2019 at 03:50):
Ha. I saw no comments about this in the past 2 hours, so I figured I should just go ahead and do it. I guess we're all pretty synchronized at this point. That said, 20 minutes later and no build notification yet... That 30 minute clock is ticking...
Grahame Grieve (May 31 2019 at 03:51):
I submitted 2 hours ago
Josh Mandel (May 31 2019 at 13:28):
(Well, the whole system appears to be down -- looking now.)
Josh Mandel (May 31 2019 at 13:30):
Yeah, my attempt to pretty up the notification output yesterday resulted in an error.
Josh Mandel (May 31 2019 at 13:32):
OK, rebuilding now after that fix
Chris Moesel (May 31 2019 at 13:34):
Hey @Josh Mandel ... are you having fun yet? ;-)
Josh Mandel (May 31 2019 at 13:39):
Always :)
Josh Mandel (May 31 2019 at 13:48):
Well that wasn't exactly fast but it was way faster :)
Josh Mandel (May 31 2019 at 13:48):
(And it finished.)
Grahame Grieve (May 31 2019 at 13:57):
it's about as fast as I expected
Grahame Grieve (May 31 2019 at 13:57):
leaving the snomed & loinc expansions in place but forcing the rest to be redone
Grahame Grieve (May 31 2019 at 13:57):
thanks
Chris Moesel (May 31 2019 at 14:04):
Yes, thank you both. I really appreciate your time and dedication in getting this resolved (and in everything else you do!).
Chris Moesel (May 31 2019 at 14:05):
If I were going to be at dev days (but I'm not), I'd buy you both a beer or seven. But hey, it's the thought that matters, right?
Mark Kramer (Jun 01 2019 at 21:16):
Thank you, @Grahame Grieve and @Josh Mandel for fixing this. I will be at DevDays, so the beer is on me for this fix!
Last updated: Apr 12 2022 at 19:14 UTC