FHIR Chat · Auto Build not working for fhir-mCODE-ig · committers

Stream: committers

Topic: Auto Build not working for fhir-mCODE-ig


view this post on Zulip Chris Moesel (May 29 2019 at 13:53):

We've just setup a new IG at https://github.com/HL7/fhir-mCODE-ig

It's configured for the auto build, but it doesn't seem that builds are going through. I've also tried triggering builds manually and get this response: {"org":"HL7","repo":"fhir-mCODE-ig","branch":"master"}. Still, however, I see no evidence of the build actually running.

Is there a way to debug this further? I can build the IG locally. @Josh Mandel?

view this post on Zulip Chris Moesel (May 29 2019 at 13:54):

BTW -- I've seen _other_ builds go through after triggering mine, so I don't believe it is a systemic issue with the auto build.

view this post on Zulip Josh Mandel (May 29 2019 at 20:44):

We don't have a good transparent way to enable you to debug, but I'll see what's happening in the logs. (BTW @Chris Moesel I'm always happy to get you onboarded into the k8s cluster where we manage this infra if you ever want/need to poke around.)

view this post on Zulip Josh Mandel (May 29 2019 at 20:47):

I re-triggered the github webhook to kick off another build (equivalent to what you already did by manual triggering). Things crank along pretty well for the first couple of minutes, with build.log showing lines like

No value set at Extension.valueCodeableConcept (url = 'http://hl7.org/fhir/ValueSet/service-category')
No value set at Extension.valueCodeableConcept (url = 'http://hl7.org/fhir/ValueSet/service-type')
No value set at Extension.valueCodeableConcept (url = 'http://hl7.org/fhir/ValueSet/service-type')
No value set at Extension.extension-type.valueCodeableConcept (url = 'http://hl7.org/fhir/ValueSet/research-study-objective-type')
No value set at Extension.extension-type.valueCodeableConcept (url = 'http://hl7.org/fhir/ValueSet/research-study-objective-type')
No value set at Extension.extension-type.valueCodeableConcept (url = 'http://hl7.org/fhir/ValueSet/udi-entry-type')
No value set at Extension.extension-type.valueCodeableConcept (url = 'http://hl7.org/fhir/ValueSet/udi-entry-type')
No value set at Extension.valueCodeableConcept (url = 'https://www.ncbi.nlm.nih.gov/clinvar')
No value set at Extension.valueCodeableConcept (url = 'https://www.ncbi.nlm.nih.gov/clinvar')
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 49755003

view this post on Zulip Josh Mandel (May 29 2019 at 20:47):

Will see where this blows up.

view this post on Zulip Josh Mandel (May 29 2019 at 20:52):

I see, simple as:

Start Time:               Wed, 29 May 2019 12:59:11 +0000
Active Deadline Seconds:  1800s
Pods Statuses:            0 Running / 0 Succeeded / 1 Failed

view this post on Zulip Josh Mandel (May 29 2019 at 20:52):

We set a 30-minute deadline for all jobs at:

https://github.com/FHIR/auto-ig-builder/blob/master/triggers/ig-commit-trigger/job.json#L11

view this post on Zulip Grahame Grieve (May 29 2019 at 20:53):

we can't get an error message when that limit is hit?

view this post on Zulip Josh Mandel (May 29 2019 at 20:53):

We should definitely look into this; the challenge right now is that jobs are killed from outside the container where they run (it's an external process doing the watching/killing, which has no context about what the job is doing; it just enforces a timer using k8s internals).

view this post on Zulip Josh Mandel (May 29 2019 at 20:53):

So the best way to get this behavior would be to wrap our job runner explicitly...

view this post on Zulip Josh Mandel (May 29 2019 at 20:54):

Do you know how long this build is expected to take @Chris Moesel? I don't think we've come close to the current deadline with other IGs, so this is new territory.

view this post on Zulip Grahame Grieve (May 29 2019 at 20:56):

mCode is that big? why?

view this post on Zulip Josh Mandel (May 29 2019 at 20:58):

It looks like the slow thing is stuff like...

-tx cache miss: $expand on Include 25 codes from http://snomed.info/sct
-tx cache miss: $expand on Include 3 codes from http://unitsofmeasure.org
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 272485000
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 42343007
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 698247007
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 368009
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 39785005
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 400047006
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 38341003
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 44695005
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 118940003
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 413839001
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 73211009
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 40930008
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 42399005
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 235856003
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 13200003
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 86406008
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 3723001
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 64779008
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 414916001
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 89362005
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 237840007
-tx cache miss: $expand on Include from http://snomed.info/sct where concept is-a 271737000

view this post on Zulip Grahame Grieve (May 29 2019 at 20:59):

build took 1:45 for me

view this post on Zulip Grahame Grieve (May 29 2019 at 21:00):

whoops - ran the wrong guide :-(

view this post on Zulip Grahame Grieve (May 29 2019 at 21:06):

5:37 for me

view this post on Zulip Josh Mandel (May 29 2019 at 21:06):

Is that with a bunch of cached terminology files present in your local filesystem?

view this post on Zulip Josh Mandel (May 29 2019 at 21:06):

Because the slow part looks like all these "tx cache miss" steps (it's making very slow progress).

view this post on Zulip Chris Moesel (May 29 2019 at 21:08):

Sorry -- replied when filtered to only my mentions, so missed most of the conversation... Let me try again...

view this post on Zulip Chris Moesel (May 29 2019 at 21:12):

The first run without a cache takes a while, but I've still never had it take 30+ minutes. I've checked in the txCache, so not sure about all those misses -- unless there is a dependency between the cache and the version of the publisher? I think I built the cache using a publisher that is a week or two old.

view this post on Zulip Grahame Grieve (May 29 2019 at 21:14):

no dependency. I don't understand why it's a problem

view this post on Zulip Chris Moesel (May 29 2019 at 21:15):

I have to run to a 7th grade chorus concert (wish me luck), but I'll kick off a local build later to see how long it takes for me. Just to have another data point. Thanks for looking at this @Josh Mandel and @Grahame Grieve -- we really appreciate it!

view this post on Zulip Chris Moesel (May 29 2019 at 21:15):

@Grahame Grieve -- when you run it are you sitting right across the room from the terminology server? ;-)

view this post on Zulip Grahame Grieve (May 29 2019 at 21:16):

@Josh Mandel where do this cache miss notifications appear in the log? I don't get them...

view this post on Zulip Grahame Grieve (May 29 2019 at 21:17):

I'm sitting right across the cloud from the server, just like everyone else. I don't get those cache misses

view this post on Zulip Josh Mandel (May 29 2019 at 21:35):

They appear in my build.log file, which includes all the stdout and stderr of the java process.

view this post on Zulip Grahame Grieve (May 29 2019 at 21:43):

I mean, what other log entries preceed them - I'm trying to place them in the build sequence

view this post on Zulip Josh Mandel (May 29 2019 at 22:17):

I'm updating the build system to push logs before a build is killed -- stand by :)

view this post on Zulip Josh Mandel (May 29 2019 at 22:17):

It'll take a bit...

view this post on Zulip Josh Mandel (May 29 2019 at 22:18):

In the meantime: I'm seeing stuff like: https://gist.github.com/jmandel/653a2463153f0b3892eb25e6c73f8247

view this post on Zulip Josh Mandel (May 29 2019 at 22:19):

(Somehow, making the in-build timeouts work meant upgrading jdk8->jdk12 and python2->3 :p)

view this post on Zulip Josh Mandel (May 29 2019 at 22:49):

okay, building now and should die + send a notification in ~25 min, with full logs. :fingers_crossed:

view this post on Zulip Chris Moesel (May 30 2019 at 01:29):

FYI -- just pulled master for fhir-mCODE-ig and built w/ the latest IG publisher. No cache misses and finished in 00:04:14.0187sec on my 2016 Macbook Pro.

view this post on Zulip Chris Moesel (May 30 2019 at 01:40):

Realized that was with old output still in the path, so git clean -xfd to clear all uncommitted and ignored files, and ran it again to be sure. This time was even better: 00:03:59.0590sec.

view this post on Zulip Josh Mandel (May 30 2019 at 01:54):

Hmm. I was ble to reproduce the long build times from my laptop-- though I didn't let it go to completion so I'll take some detailed timing tomorrow.

view this post on Zulip Josh Mandel (May 30 2019 at 01:56):

When you say "no cache misses", what does that mean in terms of *where" cache hits are occurring? Is this something local or is this a hit in the server? (I'm assuming it can't be local because you should not have any cache files locally after a clean )

view this post on Zulip Grahame Grieve (May 30 2019 at 02:07):

cache files are not deleted by clean - they are committed it github to keep the build times down

view this post on Zulip Grahame Grieve (May 30 2019 at 02:08):

the files in /txCache.. for some reason, it seems as though the caching algorithm is not working for mCode...

view this post on Zulip Chris Moesel (May 30 2019 at 02:38):

Right, since the txCache is committed I was definitely using the cache.

for some reason, it seems as though the caching algorithm is not working for mCode...

It seems that there aren't a ton of IGs out there still being built off DSTU2, and even fewer built off Argonaut. It seemed like we were the guinea pigs for several DSTU2/Argonaut issues in the IG Publisher. Is it possible that mCODE is having some one-off problems due to it's DSTU2-ishness?

view this post on Zulip Grahame Grieve (May 30 2019 at 02:48):

no. you and I are having no problem, and the caching algorithm is all R5 infrastructure

view this post on Zulip Grahame Grieve (May 30 2019 at 02:49):

it smells like a linux thing right now. You building on windows or OSX?

view this post on Zulip Chris Moesel (May 30 2019 at 02:58):

OSX

view this post on Zulip Chris Moesel (May 30 2019 at 03:10):

I have an idea why the cache might not be working on Linux. Windows and Mac use case-insensitive file systems by default -- so if you ask for file abc.txt, but it's really called ABC.txt, both Windows and Mac will happily give it to you. Linux, on the other hand, uses a case-sensitive file system so will report that abc.txt does not exist. My guess is that the cache manager might be looking up cache files using a different case than they're actually stored on disk. Just a hunch that would explain the cache misses.

view this post on Zulip Chris Moesel (May 30 2019 at 03:11):

@Grahame Grieve -- is it possible that could happen?

view this post on Zulip Grahame Grieve (May 30 2019 at 03:12):

of course, that was the first thing I thought of. I cannot see any grounds for that in the code

view this post on Zulip Grahame Grieve (May 30 2019 at 03:12):

you have the name of the folder correct in the ig.json, the file names are hard coded

view this post on Zulip Chris Moesel (May 30 2019 at 03:16):

I should have known -- one step ahead of me! Oh well...

view this post on Zulip Chris Moesel (May 30 2019 at 03:16):

Actually, probably 2 or 3 steps ahead of me. :stuck_out_tongue_wink:

view this post on Zulip Grahame Grieve (May 30 2019 at 03:19):

it's possible it's a character set issue? I'm going to add some debugging...

view this post on Zulip Grahame Grieve (May 30 2019 at 04:29):

ok @Chris Moesel and @Josh Mandel please run the new IG publisher (0.9.21) locally, and paste this bit of the log here:

view this post on Zulip Grahame Grieve (May 30 2019 at 04:29):

Load Terminology Cache from C:\work\org.hl7.fhir.igs\fhir-mCODE-ig\txCache       (09.0082sec)
 - load all-systems.cache......done
 - load condition-category.cache..done
 - load https___ncit.nci.nih.gov.cache....done
 - load icd-10-cm.cache.....done
 - load lang.cache..done
 - load loinc.cache....................done
 - load mcode-basic-resource-type.cache..done
 - load snomed.cache........................................................................................done
 - load ucum.cache...........done
 - load usmcodeshrcorecsResearchSubjectTerminationReasonCS.cache..done
 - load usmcodeshrcorecsSettingCS.cache..done

view this post on Zulip Josh Mandel (May 30 2019 at 14:55):

$ grep -ri "load term" -A 100  build.log

Load Terminology Cache from /scratch/ig-build-temp-KKFBIZ/repo/txCache           (23.0354sec)
Connect to Terminology Server at http://tx.fhir.org                              (23.0355sec)
-tx cache miss: Connect to http://tx.fhir.org/r2
Initialization complete                                                          (24.0867sec)
Load Content                                                                     (24.0867sec)
create package file at /scratch/ig-build-temp-KKFBIZ/repo/output/package.tgz
Processing Conformance Resources                                                 (32.0134sec)
process type: NamingSystem                                                       (32.0134sec)
process type: CodeSystem                                                         (32.0135sec)
process type: ValueSet                                                           (32.0135sec)
process type: ConceptMap                                                         (32.0136sec)
process type: DataElement                                                        (32.0136sec)
process type: StructureDefinition                                                (32.0136sec)
process type: OperationDefinition                                                (32.0137sec)
process type: SearchParameter                                                    (32.0137sec)
process type: CapabilityStatement                                                (32.0138sec)

view this post on Zulip Josh Mandel (May 30 2019 at 14:57):

(Also, the cache files are... interesting. Why not json?)

view this post on Zulip Chris Moesel (May 30 2019 at 17:15):

Mine looks more like Grahame's:

Load Terminology Cache from /Users/cmoesel/dev/HL7/fhir-mCODE-ig/txCache         (07.0828sec)
 - load lang.cache..done
 - load all-systems.cache......done
 - load usmcodeshrcorecsResearchSubjectTerminationReasonCS.cache..done
 - load icd-10-cm.cache.....done
 - load ucum.cache...........done
 - load loinc.cache....................done
 - load condition-category.cache..done
 - load usmcodeshrcorecsSettingCS.cache..done
 - load snomed.cache........................................................................................done
 - load https___ncit.nci.nih.gov.cache....done
 - load mcode-basic-resource-type.cache..done
Connect to Terminology Server at http://tx.fhir.org                              (08.0213sec)
-tx cache miss: Connect to http://tx.fhir.org/r2
Initialization complete                                                          (09.0413sec)

view this post on Zulip Grahame Grieve (May 30 2019 at 23:01):

@Josh Mandel so weird because very standard java File.list() is not seeing the files that exist:

    for (String fn : new File(folder).list()) {
      if (fn.endsWith(".cache") && !fn.equals("validation.cache")) {

view this post on Zulip Grahame Grieve (May 30 2019 at 23:02):

in your case, folder is /scratch/ig-build-temp-KKFBIZ/repo/txCache - that exists, yes?

view this post on Zulip Josh Mandel (May 30 2019 at 23:02):

Yes it does.

view this post on Zulip Grahame Grieve (May 30 2019 at 23:03):

so how could that code not find any files?

view this post on Zulip Josh Mandel (May 30 2019 at 23:04):

Hmm.

view this post on Zulip Josh Mandel (May 30 2019 at 23:06):

well txCache starts out empty.

view this post on Zulip Grahame Grieve (May 30 2019 at 23:06):

... because?

view this post on Zulip Josh Mandel (May 30 2019 at 23:06):

I'm wondering that.

view this post on Zulip Grahame Grieve (May 30 2019 at 23:07):

but if it's empty, that will be why that code doesn't see any files..

view this post on Zulip Josh Mandel (May 30 2019 at 23:07):

yes...

view this post on Zulip Josh Mandel (May 30 2019 at 23:07):

so it gets emptied.

view this post on Zulip Josh Mandel (May 30 2019 at 23:08):

About 30 seconds in.

view this post on Zulip Josh Mandel (May 30 2019 at 23:08):

So immediately post-clone:

bash-4.2# find . | grep -i txCache
./ig-build-temp-ARCYSH/repo/txCache
./ig-build-temp-ARCYSH/repo/txCache/snomed.cache
./ig-build-temp-ARCYSH/repo/txCache/lang.cache
./ig-build-temp-ARCYSH/repo/txCache/usmcodeshrcorecsSettingCS.cache
./ig-build-temp-ARCYSH/repo/txCache/icd-10-cm.cache
./ig-build-temp-ARCYSH/repo/txCache/https___ncit.nci.nih.gov.cache
./ig-build-temp-ARCYSH/repo/txCache/version.ctl
./ig-build-temp-ARCYSH/repo/txCache/usmcodeshrcorecsResearchSubjectTerminationReasonCS.cache
./ig-build-temp-ARCYSH/repo/txCache/loinc.cache
./ig-build-temp-ARCYSH/repo/txCache/condition-category.cache
./ig-build-temp-ARCYSH/repo/txCache/ucum.cache
./ig-build-temp-ARCYSH/repo/txCache/all-systems.cache
./ig-build-temp-ARCYSH/repo/txCache/mcode-basic-resource-type.cache

view this post on Zulip Josh Mandel (May 30 2019 at 23:08):

Then:

bash-4.2# find . | grep -i txCache

view this post on Zulip Josh Mandel (May 30 2019 at 23:08):

Then:

bash-4.2# find . | grep -i txCache
./ig-build-temp-ARCYSH/repo/txCache
./ig-build-temp-ARCYSH/repo/txCache/version.ctl

view this post on Zulip Grahame Grieve (May 30 2019 at 23:09):

hmm. O

view this post on Zulip Josh Mandel (May 30 2019 at 23:09):

So that's a lead for you :)

view this post on Zulip Grahame Grieve (May 30 2019 at 23:09):

I'll investigate

view this post on Zulip Grahame Grieve (May 30 2019 at 23:59):

ok so this turns out to be my fault. The publisher does clear the cache on the ci-build. I've pared that back to leave snomed/loinc cache in place

view this post on Zulip Grahame Grieve (May 31 2019 at 01:43):

ok wel, let's try again iwht the new version of the ig publisher

view this post on Zulip Chris Moesel (May 31 2019 at 03:28):

I just triggered a new build since I saw that Grahame announced a new publisher. Let's see what happens!

view this post on Zulip Josh Mandel (May 31 2019 at 03:31):

Haha, I did the same, right about when you did. Indeed let's see.

view this post on Zulip Grahame Grieve (May 31 2019 at 03:39):

so did I. I don't see any success or failure

view this post on Zulip Chris Moesel (May 31 2019 at 03:50):

Ha. I saw no comments about this in the past 2 hours, so I figured I should just go ahead and do it. I guess we're all pretty synchronized at this point. That said, 20 minutes later and no build notification yet... That 30 minute clock is ticking...

view this post on Zulip Grahame Grieve (May 31 2019 at 03:51):

I submitted 2 hours ago

view this post on Zulip Josh Mandel (May 31 2019 at 13:28):

(Well, the whole systems appears to be down -- looking now.)

view this post on Zulip Josh Mandel (May 31 2019 at 13:30):

Yeah, my attempt to pretty up the notification output yesterday resulted in an error.

view this post on Zulip Josh Mandel (May 31 2019 at 13:32):

OK, rebuilding now after that fix

view this post on Zulip Chris Moesel (May 31 2019 at 13:34):

Hey @Josh Mandel ... are you having fun yet? ;-)

view this post on Zulip Josh Mandel (May 31 2019 at 13:39):

Always :)

view this post on Zulip Josh Mandel (May 31 2019 at 13:48):

Well that wasn't exactly fast but it was way faster :)

view this post on Zulip Josh Mandel (May 31 2019 at 13:48):

(And it finished.)

view this post on Zulip Grahame Grieve (May 31 2019 at 13:57):

it's about as fast as I expected

view this post on Zulip Grahame Grieve (May 31 2019 at 13:57):

leaving the snomed & loinc expansions in place but forcing the rest to be redone

view this post on Zulip Grahame Grieve (May 31 2019 at 13:57):

thanks

view this post on Zulip Chris Moesel (May 31 2019 at 14:04):

Yes, thank you both. I really appreciate your time and dedication in getting this resolved (and in everything else you do!).

view this post on Zulip Chris Moesel (May 31 2019 at 14:05):

If I were going to be at dev days (but I'm not), I'd buy you both a beer or seven. But hey, it's the thought that matters, right?

view this post on Zulip Mark Kramer (Jun 01 2019 at 21:16):

Thank you, @Grahame Grieve and @Josh Mandel for fixing this. I will be at DevDays so the beer is on my for this fix!


Last updated: Apr 12 2022 at 19:14 UTC