FHIR Chat · Minimize diff in output files of subsequent IG Publisher ... · IG creation

Stream: IG creation

Topic: Minimize diff in output files of subsequent IG Publisher ...


view this post on Zulip David Simons (Sep 01 2021 at 17:14):

Currently our IG Publisher output contains the actual dateTime:

  • This profile was published on Fri Aug 20 07:36:59 UTC 2021
  • Generated 2021-08-20

What options exist to set this datetime to a value of our choice (e.g. fixed to 2021-01-01 00:00:00)?

Rationale: do not want every output file to change upon every run

view this post on Zulip Lloyd McKenzie (Sep 01 2021 at 17:19):

The whole point is for the file to change every run - so that someone looking at a given page can distinguish what they're looking at from what someone else is looking at.

view this post on Zulip Lloyd McKenzie (Sep 01 2021 at 17:20):

What's your reason for not wanting the files to change?

view this post on Zulip David Simons (Sep 01 2021 at 17:44):

We use GitPages (from version control) and it explodes (many GBytes over time) because the timestamp changes in _every_ file, every run (commit/nightly dev builds)

Even if I make a small change in one resource, almost all files in the IG Publisher output change - due to the timestamp.

If only the date showed up, and changed, in just a few files, like the cover and footer (by reference, rather than by copy).

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 17:59):

(why) do you want to version control the outcome of each build?

view this post on Zulip David Simons (Sep 01 2021 at 18:02):

GitPages published from /docs branch in GitHub... - which is quite convenient for publishing developer docs (here: IG) related to the code (here: FHIR profiles) - via the build pipeline.

Do not want to operate a separate webserver.
And if I did, I would like the sync/update/transfer to it to be optimized, only upload concrete changes, instead of each and every file, each and every time.

Also think of the use case where the output ends up on a shared drive (OneDrive, others) - it takes _ages_ every run.

btw, next to date/genDate, there's also a changing bookmark <a name="fd08ffbe-485a-4f47-8983-0d9351539d28"> tag injected somewhere into every file - why not keep it fixed?

I think there are many benefits (and ways) to minimizing the diff between IG runs - while keeping the current functionality. That optimization is what I am looking for :)

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 18:15):

/docs is a branch, or a folder?

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 18:15):

(do you have this on a repo I can see to understand)?

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 18:18):

Several people are using github pages (I presume that is the long name for gitpages), without a need to commit the output folder (which indeed changes every build, but that's good)

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 18:19):

our github workflow is (or was, not sure) prepared to upload e.g. to an FTP server. It worked (although it is not used because FTP uploads should not be done on every build, but on every release)

view this post on Zulip Grahame Grieve (Sep 01 2021 at 19:23):

the injected tag is only for errors in qa.html, and there's no memory of what they are. but you shouldn't have any, or many. So I don't see that there's value in figuring out how to remember what they are

view this post on Zulip Grahame Grieve (Sep 01 2021 at 19:23):

otherwise, the time of creation is under your control in the template

view this post on Zulip Lloyd McKenzie (Sep 01 2021 at 19:47):

Grahame is correct, you can override the template I guess. But if you're going to do that, I'd strip out the timestamp entirely - don't set it to a fixed value that isn't accurate.

view this post on Zulip Lloyd McKenzie (Sep 01 2021 at 19:47):

(And you can't override the template if you're publishing within HL7)

view this post on Zulip Max Masnick (Sep 01 2021 at 19:47):

I've avoided this kind of git bloat issue in the past by using a separate publish branch specifically for the .html files to be served, and then using git commit --amend to avoid creating a new commit (and thus repo bloat) every time the files change. This will require force pushing the publish branch, but that seems ok if it's _only_ being used for seeding content to a web server.

(I use git worktree to check both master and publish out at once in separate folders to make it easy to copy content from one to the other)

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 19:56):

I still don't understand the need to commit the output folder.

view this post on Zulip Grahame Grieve (Sep 01 2021 at 19:57):

because you're not using build.fhir.org / the ci-build.

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 19:57):

(I fixed my question).

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 19:58):

Not using build.fhir.org does not mean one should commit the output folder

view this post on Zulip Grahame Grieve (Sep 01 2021 at 19:59):

because the most common alternative is to want to use GitHub pages, and so you do

view this post on Zulip Grahame Grieve (Sep 01 2021 at 19:59):

though there's no technical or policy reason not to use build.fhir.org

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 20:02):

Grahame Grieve said:

because the most common alternative is to want to use GitHub pages, and so you do

we have a few things on github pages and I don't commit the output folder, which is why I'm missing the point

view this post on Zulip John Moehrke (Sep 01 2021 at 20:14):

my personal reason for using github pages... I don't personally have access to a web server, and the IG I do this with is not to be seen by the general public (so I can't let ci-build build it).

view this post on Zulip Grahame Grieve (Sep 01 2021 at 20:18):

yes that's a valid reason

view this post on Zulip Jean Duteau (Sep 01 2021 at 20:22):

i use github pages for a guide with similar reasons as John and I check in the output folder because I haven't set up the hooks to build it. so it builds locally and I check in the output into a publish folder and then "bob's your uncle" and the guide is served to my closed community.

view this post on Zulip Josh Mandel (Sep 01 2021 at 20:50):

my personal reason for using github pages... I don't personally have access to a web server, and the IG I do this with is not to be seen by the general public (so I can't let ci-build build it).

Aren't GitHub pages public?

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 20:50):

they are

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 20:51):

but they are not in that-place-where-everyone-sees-what's-happening (I guess that would be the reason)

view this post on Zulip Grahame Grieve (Sep 01 2021 at 20:52):

search engines will find them, they will be public

view this post on Zulip Josh Mandel (Sep 01 2021 at 20:55):

(To be clear, I think there are plenty of good reasons for putting built IG content in version control and agree that the proposal in this thread is worthwhile -- just want to be clear about goals here.)

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 20:57):

I wonder what those reasons are. There may be some; my question above is because using github pages doesn't necessarily require committing the built content

view this post on Zulip Jean Duteau (Sep 01 2021 at 21:02):

Josh Mandel said:

my personal reason for using github pages... I don't personally have access to a web server, and the IG I do this with is not to be seen by the general public (so I can't let ci-build build it).

Aren't GitHub pages public?

The Git repository and github pages that my client are using don't seem to be public. They aren't searchable and have a specialized URL, but I could be completely wrong about that. :)

view this post on Zulip John Moehrke (Sep 01 2021 at 21:02):

Jean Duteau said:

Josh Mandel said:

my personal reason for using github pages... I don't personally have access to a web server, and the IG I do this with is not to be seen by the general public (so I can't let ci-build build it).

Aren't GitHub pages public?

The Git repository and github pages that my client are using don't seem to be public. They aren't searchable and have a specialized URL, but I could be completely wrong about that. :)

shhh... also my github is too.. but

view this post on Zulip Josh Mandel (Sep 01 2021 at 21:02):

(Using GitHub pages does require committing built content, though it may be on an orphan branch.)

view this post on Zulip Jean Duteau (Sep 01 2021 at 21:03):

Jose Costa Teixeira said:

I wonder what those reasons are. There may be some; my question above is because using github pages doesn't necessarily require committing the built content

I don't think anyone was saying that it required committing the output directory, but that if you didn't want to go through the setup of a Git hook to generate the content served by GitHub pages, this was an easy way of making it available.

view this post on Zulip John Moehrke (Sep 01 2021 at 21:03):

Jose Costa Teixeira said:

I wonder what those reasons are. There may be some; my question above is because using github pages doesn't necessarily require committing the built content

releases. Each release that IHE publishes is put into a github repo that is staging for their web site. but the incremental (ci-builds) do NOT go there.

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 21:04):

Right. My point was to check what @David Simons wanted to do. Perhaps a simple github workflow would solve that

view this post on Zulip John Moehrke (Sep 01 2021 at 21:06):

my experience with github is that it ignores changes that do not result in a hash change. so just date change does not take up space.

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 21:06):

John Moehrke said:

Jose Costa Teixeira said:

I wonder what those reasons are. There may be some; my question above is because using github pages doesn't necessarily require committing the built content

releases. Each release that IHE publishes is put into a github repo that is staging for their web site. but the incremental (ci-builds) do NOT go there.

Possibly. I do think that can/should be handled by a workflow (I'm hoping to look into it soon). But David's point is about the daily/many builds, and I don't see why those are being committed

view this post on Zulip John Moehrke (Sep 01 2021 at 21:08):

well.. my private publication experience... I do put into github pages anytime I need to share with my private team... so, yes it results in many things I have decided "to publish".

view this post on Zulip John Moehrke (Sep 01 2021 at 21:09):

if you fall into the need for private publications, (can't use ci-build), then you are taking on burden. you need to deal with that burden somehow. the more painful, the more likely you will come back and use build.fhir.org/ig --- because it really is not that public.

view this post on Zulip Jean Duteau (Sep 01 2021 at 21:12):

i commit anytime that I have a build that I want my client to see.

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 21:14):

Right. I wanted to ask @David Simons what are his needs, because maybe we can solve that with a simple workflow like this one
https://github.com/IHE/empty-fhir-profile/blob/master/.github/workflows/main.yml

view this post on Zulip Lloyd McKenzie (Sep 01 2021 at 21:30):

Is there a way we could put the "publication date" in a single JS file or something and have it just render on every page? That way the only page that would change is that one file. That would seem to be the best of both worlds...

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 21:37):

I guess we could consider that, but if the purpose is to publish on github pages, aren't people better served by having a workflow that already exists?

view this post on Zulip Grahame Grieve (Sep 01 2021 at 21:38):

Is there a way we could put the "publication date" in a single JS file

not for static HTML publications, because it requires server side support, and that's not in the picture, but it doesn't matter there. For someone publishing through github pages, they have more options, and it could be done - a template thing

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 21:42):

from David's original question I thought he could be simply missing a github workflow.
I mean, we can minimize the impact of a) manually building the IG and b) manually uploading the output to gh pages, or perhaps we can just have gh pages without any of that manual work

view this post on Zulip Eric Haas (Sep 01 2021 at 21:47):

The point is you have to commit your entire output (typically to the /docs directory), and even after a small edit you must commit the entire IG output, since all pages are updated every time. (I don't know any git tricks to avoid this, since you don't necessarily know a priori which files have changed because of the edit and which have changed only because of the updated timestamp.) I agree with the original proposal to have an IG parameter to override the dynamic datetime with a static datetime.
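
One way to tell "changed because of the edit" apart from "changed only because of the timestamp" is to compare files after blanking out the volatile parts. The following is a minimal sketch, not part of the IG Publisher: the regular expressions are assumptions based on the footer text and bookmark quoted earlier in this thread, and `normalize`, `same_content`, `sync_changed` and the directory arguments are hypothetical names.

```python
import re
import shutil
from pathlib import Path

# Assumed patterns, copied from the examples quoted in this thread.
# Real IG Publisher output may word these differently.
VOLATILE = [
    # "Generated 2021-08-20"
    re.compile(r"Generated \d{4}-\d{2}-\d{2}"),
    # "published on Fri Aug 20 07:36:59 UTC 2021"
    re.compile(r"published on \w{3} \w{3} \d{1,2} \d{2}:\d{2}:\d{2} \w+ \d{4}"),
    # <a name="fd08ffbe-485a-4f47-8983-0d9351539d28">
    re.compile(r'<a name="[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}">'),
]

# Only these file types are compared as text; everything else byte for byte.
TEXT_SUFFIXES = {".html", ".xml", ".json", ".ttl"}


def normalize(text: str) -> str:
    """Remove the volatile fragments so two builds can be compared."""
    for pattern in VOLATILE:
        text = pattern.sub("", text)
    return text


def same_content(a: Path, b: Path) -> bool:
    """True if the files are equal, ignoring the volatile fragments."""
    raw_a, raw_b = a.read_bytes(), b.read_bytes()
    if raw_a == raw_b:
        return True
    if a.suffix.lower() not in TEXT_SUFFIXES:
        return False
    return normalize(raw_a.decode("utf-8", "replace")) == normalize(
        raw_b.decode("utf-8", "replace")
    )


def sync_changed(output_dir: Path, docs_dir: Path) -> list[str]:
    """Copy a file into docs_dir only if it is new or meaningfully changed."""
    copied = []
    for src in sorted(output_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(output_dir)
        dst = docs_dir / rel
        if dst.exists() and same_content(src, dst):
            continue  # only the timestamp or bookmark differs: keep the old copy
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dst)
        copied.append(rel.as_posix())
    return copied
```

Calling `sync_changed(Path("output"), Path("docs"))` before committing would leave untouched pages byte-identical, so git sees no change in them. The trade-off is that skipped pages keep the footer date of an earlier build, and the sketch does not delete files that have disappeared from the output.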

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 21:52):

I'm confused with why we need to commit the output. If we want to do manual work, yes. In the workflow I mentioned, the workflow builds and automatically copies the output folder to a separate branch (could be a separate folder), so there's no need to copy anything

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 21:54):

I don't see why we should have an ig parameter ("publication-date=..."?). Perhaps I'm missing something but this seems a solution to a problem that doesn't need to exist

view this post on Zulip Eric Haas (Sep 01 2021 at 21:56):

"Your GitHub Pages site is currently being built from the /docs folder in the master branch. Learn more."

image.png

view this post on Zulip Eric Haas (Sep 01 2021 at 21:59):

the docs folder contains all the static site content created by the ig-publisher. You can try it out, easy to set up.

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 22:01):

you could do that, but seems more complicated. Perhaps I should ask the other way around:
If we have a workflow that automatically publishes gh-pages, why would we do the manual work? What would be missing?

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 22:04):

(I don't know if there's a reason for putting the built content in the main/master branch)

view this post on Zulip Eric Haas (Sep 01 2021 at 22:08):

don't know if there are other workflow that avoid the bloat described above.

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 22:11):

I think this one does avoid that bloat (again, assuming that the bloat is unnecessary)

view this post on Zulip Jean Duteau (Sep 01 2021 at 22:16):

as I've said a couple of times, my client is not interested in setting up a workflow. They regularly have github pages set up with repositories and wanted the same thing for the FHIR IG. Since we have an output directory already, this fit into their desires perfectly. I don't have David's concern of changing every file on a commit, but saying "we've solved your problem by making you not check in your output folder" isn't the right answer.

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 22:23):

I hope David can set up a workflow and avoid the issues AND the manual labour.
I would consider that (and other alternatives) before adding an issue to the IGPublisher.

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 22:27):

Checking in the published folder is an expected source of such problems; I think most people can avoid that.

view this post on Zulip Lloyd McKenzie (Sep 01 2021 at 23:03):

I don't understand, Jose. The issue expressed is that people who use github pages want the number of committed files to be smaller. If, instead of changing every file, the date changed in only one place, we'd achieve that end with a change that would work for everyone, and everyone would still have a footer at the bottom of the page that said when the release was created.

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 23:10):

I don't read the problem as wanting the number of commits to be smaller. I read the problem as
"we do a lot of commits to keep things up to date and we use gh pages. This means our master branch is changing too many files that wouldn't need to be changed because the content is the same"

view this post on Zulip Lloyd McKenzie (Sep 01 2021 at 23:13):

The request was "let us fix the date so it doesn't change to avoid needing to commit so many files". If the date doesn't appear in the files, that would address the problem - but still allow the date to be rendered.

view this post on Zulip Lloyd McKenzie (Sep 01 2021 at 23:14):

It wouldn't fix the QA issue, but the solution there is "minimize the number of pages in your IG that have QA issues" :)

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 23:14):

and there's a few things that can be done to improve that.
We could change the template to static. Or we could make it dependent on the package.json

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 23:16):

Lloyd McKenzie said:

The request was "let us fix the date so it doesn't change to avoid needing to commit so many files". If the date doesn't appear in the files, that would address the problem - but still allow the date to be rendered.

And I was asking "why does the OP need that", hoping that he can make his life better :).

view this post on Zulip Jose Costa Teixeira (Sep 01 2021 at 23:19):

I did not get a confirmation from David, so I don't know if he can be better served with a workflow. If not, or if others can't, we can change the template.
Personally, I like the fact that a CI-build tells me when it has been built. I can see immediately if this is as expected.
I think static dates should be for a Release, not the CI.
I'm not saying "we should not address that", I'm saying "do you think you want to try this way of having ghpages before you need to change your template?"

view this post on Zulip Lloyd McKenzie (Sep 02 2021 at 00:44):

We want the CI build to contain the date on every file. But we also want most of the files to not change just because of the date. If the footer says "display value returned by Javascript function X", where there's a small Javascript file that spits out the publication date, then that should satisfy both sets of requirements.

view this post on Zulip Grahame Grieve (Sep 02 2021 at 02:01):

won't work when not hosted on a server. Sandbox reasons

view this post on Zulip Josh Mandel (Sep 02 2021 at 02:34):

The randomly changing bookmark value seems worth addressing, because it's pure line noise.

The dates in the footers of files, as well as the generated dates that sit inside of definitional artifacts ... these serve a function. (I could imagine a flag to the publisher that would override the build time with something user specified... like -overrideBuildTime 2020-01-01T00:00:00Z if we really wanted to give people a way to get stable outputs.)

view this post on Zulip Lloyd McKenzie (Sep 02 2021 at 03:08):

Is there any other way to get content to render on a static page without it living on that page? Would "Github pages" be considered a server? If they would and it would work on the CI-build or any other officially hosted server, do we care if the script wouldn't display the timestamp when you're just looking at a downloaded copy? For that, we'd have the dates inside the zip file or on your harddrive...

view this post on Zulip Grahame Grieve (Sep 02 2021 at 03:11):

the sandbox environment is tough. I could spend some time looking at that, but it's time I won't be spending on other things

view this post on Zulip Lloyd McKenzie (Sep 02 2021 at 03:15):

Ok. So seems like short-term solution is: override the footer in your template so it doesn't include the timestamp. (And fix as many of your QA issues as you can.)
Longer term solution: submit a Git issue against the base template and - at some point - we'll see if we can get the timestamp to show up without it changing every single page each time you build.

view this post on Zulip Lloyd McKenzie (Sep 02 2021 at 03:16):

(@David Simons)

view this post on Zulip Grahame Grieve (Sep 02 2021 at 03:53):

well, next version of the IG publisher will not generate uuids any more. It will use serial numbers unique in each file. This will be more stable (not deterministic though)
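
To illustrate the idea of per-file serial numbers (not how the IG Publisher implements it), here is a minimal post-processing sketch. It assumes the bookmarks look like the `<a name="<GUID>">` tag quoted earlier in this thread and that links to them use `href="#<GUID>"` within the same file; `stabilize_anchors` and the `anchor-` prefix are hypothetical.

```python
import re

# Assumed bookmark shape: <a name="fd08ffbe-485a-4f47-8983-0d9351539d28">
ANCHOR_RE = re.compile(
    r'(<a name=")([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})(")'
)


def stabilize_anchors(html: str, prefix: str = "anchor-") -> str:
    """Replace GUID bookmarks with serial numbers that are unique per file."""
    # Number the bookmarks in order of first appearance.
    mapping: dict[str, str] = {}
    for match in ANCHOR_RE.finditer(html):
        mapping.setdefault(match.group(2), f"{prefix}{len(mapping) + 1}")

    # Rewrite the bookmark definitions themselves.
    html = ANCHOR_RE.sub(
        lambda m: m.group(1) + mapping[m.group(2)] + m.group(3), html
    )
    # Rewrite same-file links that point at those bookmarks.
    for guid, serial in mapping.items():
        html = html.replace(f'href="#{guid}"', f'href="#{serial}"')
    return html
```

Only GUIDs that appear as bookmark names are touched, so a GUID used as a resource id elsewhere in the page is left alone. Links from other files (for example from qa.html) are not handled by this sketch, and the numbering shifts if a bookmark is added or removed earlier in the file.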

view this post on Zulip David Simons (Sep 02 2021 at 07:33):

Wow thank you all :) waking up to a lot of good responses

I changed the title to describe the problem we have - and not per se imply a solution.

We use GitHub Enterprise as _Innersource_ - and when folks contribute to the FHIR profiles in our Git repo, we also use IG Publisher to generate updated developer views on those profiles.
A GitHub Action runs the IG Publisher against the FHIR profiles on the /master branch, and commits the generated output to our /docs branch.
And indeed GitHub Pages is configured to take a specific branch (/docs for us) and deploy it automatically on the built-in "webserver". Our GitPages are private, company internal, not public.

Our problem is that if we run the GitHub Action twice in a row - without any change of the underlying FHIR profiles - almost all of the output files have changes - leading to an overhead (on version control, on file syncs, etc.).
So the /docs branch is exploding.

From the commit diff - and hence diff between subsequent IG Publisher runs - I noticed that this is due to (primarily) 1) changed date+genDate timestamps, and (sometimes) 2) <a name="<GUID>"> html bookmarks.
The timestamps are the primary root cause.
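
The same analysis can be repeated on any two builds with a small script. This is a rough sketch under stated assumptions: the date and bookmark patterns are guessed from the examples in this thread, any ISO-style date is treated as volatile (so a genuine date change in content would be misreported), and `classify` and `tally` are hypothetical names.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed shapes, taken from the examples quoted in this thread.
DATE_RE = re.compile(
    r"\w{3} \w{3} \d{1,2} \d{2}:\d{2}:\d{2} \w+ \d{4}"  # Fri Aug 20 07:36:59 UTC 2021
    r"|\d{4}-\d{2}-\d{2}"  # 2021-08-20
)
GUID_RE = re.compile(r'<a name="[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}">')


def classify(old: str, new: str) -> str:
    """Say why two versions of a page differ."""
    if old == new:
        return "identical"
    if DATE_RE.sub("", old) == DATE_RE.sub("", new):
        return "timestamp"
    if GUID_RE.sub("", old) == GUID_RE.sub("", new):
        return "bookmark"

    def strip(text: str) -> str:
        return GUID_RE.sub("", DATE_RE.sub("", text))

    if strip(old) == strip(new):
        return "timestamp+bookmark"
    return "content"


def tally(old_dir: Path, new_dir: Path) -> Counter:
    """Count the reasons for change across all HTML pages of two builds."""
    counts: Counter = Counter()
    for new_file in new_dir.rglob("*.html"):
        old_file = old_dir / new_file.relative_to(new_dir)
        if not old_file.is_file():
            counts["added"] += 1
            continue
        counts[
            classify(
                old_file.read_text(encoding="utf-8"),
                new_file.read_text(encoding="utf-8"),
            )
        ] += 1
    return counts
```

Running `tally` on the output of two consecutive runs over unchanged input should show whether anything besides "timestamp" and "bookmark" contributes to the diff.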

Hence my initial request to see if there is a quick turnaround to fix/strip these dates - but I'd prefer a more structural approach to optimizing the output to minimize the diff in output files of subsequent IG Publisher runs on the same input, since it is good to have dates included - see below.

I was able to remove the "Generated" date from the footer, but cannot readily remove the "date" attributes being added in the main page. Need to dig deeper into the templates.
Note that our FHIR profiles in the repo do NOT have a date attribute - since the repo does timestamping at development time. Yet, the IG output does have these .date attributes filled with the genDate it seems.

Fundamentally, I think there is a difference between
1) Last modified (for us the last change to a profile in version control) - which could be inputted via the FHIR profile .date attribute.
2) Last build/generated - the timestamp of running the IG Publisher, doing validation, snapshot generation, and output rendering (genDate?)
3) Last released/packaged (e.g. as part of a program increment/formal release cycle) - a special version of 1+2 - aligning all to a fixed, but customizable timestamp.

view this post on Zulip Lloyd McKenzie (Sep 02 2021 at 17:26):

The profile.date element is not reliable. Also, even if a profile is unchanged, the snapshot can change - either because of a tooling change (now uncommon but still potential) or because the underlying resource has changed. Same goes for value sets where the page will include a snapshot of the expansion. The 'publication' date is thus very important to know if you're looking at the same thing or not.

view this post on Zulip John Moehrke (Sep 02 2021 at 18:15):

I agree, the current model is the more likely correct model. I struggle with how to do the right thing most of the time and for the dominant need; while having a special case to make this specific usecase more well behaved.

view this post on Zulip David Simons (Sep 02 2021 at 19:04):

Thank you. Generating and sharing a regular dev build of an NPM+IG seems like a dominant need to me :)
I do agree that timestamping such a build is highly preferable.

The current implementation - timestamping by copy - has become troublesome with GitHub Pages, for us at least.

Example commit of one IG run on our repo:
image.png


Last updated: Apr 12 2022 at 19:14 UTC