Stream: committers/git-help
Topic: Git help needed
Grahame Grieve (Oct 07 2021 at 21:14):
We're getting a lot of feedback from HL7 committees: We really really hate git, and we're totally powerless to deal with it
Grahame Grieve (Oct 07 2021 at 21:15):
and when we try to ask for help, experts just tell us that we're using the wrong git client, but that doesn't actually help at all
Grahame Grieve (Oct 07 2021 at 21:15):
we need to do something about this. But what?
Grahame Grieve (Oct 07 2021 at 21:17):
I'm terrified of git conflicts - it's super stressful because I might merge wrong, and git's language is totally confusing. Perhaps I'm just stupid but in my head, there's other people's changes, my changes, and the product. Git gives me 'base', 'merge-head', and 'head'. Which of those is which, and could they choose less useful language?
Grahame Grieve (Oct 07 2021 at 21:19):
so you can imagine how committee members feel.
Grahame Grieve (Oct 07 2021 at 21:20):
and then, after a merge, git tells you you're committing hundreds of files. Did you change them? Is that a whoopsie?
Gino Canessa (Oct 07 2021 at 22:03):
I don't think the problem is git per se... We are building a product via many async distributed 'teams' that have varying levels of comfort with the tools. Version management is a hard problem, even more so for people not used to dealing with it. Short of something like a custom software platform for editing the FHIR spec (which I can't imagine being feasible), I don't know how much of a difference the tooling can make.
But, we should be able to reduce the burden.
First, I'd suggest identifying the 'repeat offenders' to see if they can be removed/not checked-in, e.g.:
- Adding resources often causes conflicts in the 'top level' files (e.g.,
oids.ini
,resourcelist.html
, etc.) - can we move these out of 'normal' editing, make them generated files, etc.? - I often have conflicts with things like cache files (e.g.,
null.cache
) - they are nice to have, but I would be willing to trade a longer 'first build' for not dealing with those conflicts. - There is a lot of 'build' stuff as well as the actual content source. If those files cause problems, they could be split into different repos (e.g., if people are making changes for local builds and accidentally causing problems).
Second, I'm guessing that some educational content would go a long way. Everything from 'setting up your environment correctly' to things like 'how to deal with long-lived branches' to 'how to add a new resource/operation/etc.'. I'm guessing a lot of frustration comes from not knowing.
After mulling for a bit (but not wanting to rewrite all the above =), another possibility would be to have someone (paid) responsible for managing the repo. Merging branches, warning people of long-lived ones, deleting stale ones, etc. ... localize the pain to a single person/team and compensate them for dealing with it (e.g., ops).
Jean Duteau (Oct 07 2021 at 22:14):
i was writing up a bunch of stuff and then Gino posted and that was a lot of what I was going to say. :)
can I step back and ask why you decided on Git as your version control platform? Git imposes a certain way of thinking about your source files and how they are managed and that has been imposed on us because of the decision to use Git.
the other question is what alternative would we use if we decided to move away from Git?
I do think that treating the FHIR specification as set of source files was a good decision. And if you do that, then you need a way to manage and control the source files. But it is a different way of thinking then many specification developers at HL7 are used to thinking. I agree with Gino's suggestions on reducing the burden - identify files that are pain points, provide some education to new committers, single out repository managers who can help and/or do some of the required tasks in managing a repository of this size.
Grahame Grieve (Oct 07 2021 at 22:22):
the files that are the pain points that I've heard about are the txCache files, but they don't really matter - fix them however way. It's source merge conflicts that are the real problem from what I've heard, since they matter
Grahame Grieve (Oct 07 2021 at 22:23):
note that the build is 3hours + longer when I rebuild the txCache. I don't think we want to have build times like that
Grahame Grieve (Oct 07 2021 at 23:56):
why choose git? because of the workflows around it , and in the hope that we could manage it. The second has turned out to not be true
Vassil Peytchev (Oct 08 2021 at 00:17):
There are quite a few things to unpack here.
-
Git != GitHub. Are the beneficial workflows those of git mainly, or the add-ons of GitHub?
-
Related to the previous - are our needs better served by git-flow or GitHub flow
-
The creator of Git has stated that GitHub "creates absolutely useless garbage merges, and you should never ever use the github interfaces to merge anything" - could that be one of the reasons merges are such an issue?
-
There are multiple ways to do the same things - e.g. pull requests from branches, vs. pull requests from forks. Is one better than the others?
-
Should all merges be done without fast-forward (--no-ff parameter)?
-
Do git clients provide sufficient added value compared to the command line interface?
Josh Mandel (Oct 08 2021 at 00:27):
(On 3, I don't think Linus's gripes are things we particularly worry about. They're to do with things like signatures which haven't been a pain point for us, not about the graph structure of the repository history.)
Vassil Peytchev (Oct 08 2021 at 01:29):
On the txCache files, would adding a step after the local build to add any new files generated due to the build, and then commit them and any modified existing ones with auto-add? I think the problems occur when these files are not updated after the build, is that the case?
Rob Hausam (Oct 08 2021 at 13:14):
I think that the txcache files really are the source of most of the merge issues that I have had and have heard of - probably even up to 80-90% of them? I think those cache files are still changing far too often (and often apparently unnecessarily), and any diffs on them to see what has changed and why it has changed are for the most part completely useless. I agree with Gino that it may be better to trade a longer 'first build' for avoiding having to deal with those conflicts - add the folder to .gitignore and manage it entirely locally. And Vassil I think is also right that many times these files are not being checked in consistently - and then when someone does check them in that probably means that the changes and the potential conflicts are greater. But the easiest way may be to not check them in at all.
Lloyd McKenzie (Oct 08 2021 at 16:42):
I think asking people to use command line is a non-starter. Some of the GUI tools aren't that intuitive, but the commandline is far worse.
Jean Duteau (Oct 08 2021 at 16:57):
i actually have no problem with the commandline. it's pretty much dead simple since there is quite a restricted set of commands you're going to use
Lloyd McKenzie (Oct 08 2021 at 17:32):
Yes, but you're not an example of someone who struggles with Git :)
Joel Schneider (Oct 08 2021 at 18:03):
I tend to stick with the git command line, rather than trying to second-guess what some GUI tool might actually be doing.
Rob Hausam (Oct 08 2021 at 18:41):
I'm also one who from my experience finds it often easier and more straightforward to work with the command line rather than the various tools that I've used so far (e.g., GitKraken, TortoiseGit).
Craig Newman (Oct 08 2021 at 18:43):
As someone without much knowledge of the process (and in your target audience), I actually prefer the command line to a tool provided that there is a clear set of steps I can follow (and better yet, cut and paste). Too many menu choices in most of the tools I've tried out.
Rob Hausam (Oct 08 2021 at 18:43):
I haven't wanted to or found a need to use any of the tools for quite some time (although I started by trying GitKraken).
Grahame Grieve (Oct 08 2021 at 19:10):
I can't stand the command line. so much typing of file names - what a waste of time
Grahame Grieve (Oct 08 2021 at 19:11):
and you might like saying to make the first build last longer - very much longer, but you all like that I can wear that pain once for all of you. otherwise I'll periodically wipe your cache and you'll all have three hour builds.
Josh Mandel (Oct 08 2021 at 19:12):
You very rarely need to type out full file names with tab completion :-)
But I'm not trying to say that there is a one size fits all solution.
Josh Mandel (Oct 08 2021 at 19:13):
I do think it would be helpful to have an in-depth technical discussion, maybe a live conversation, about the overall caching strategies so that we could explore whether other options might be able to meet the needs.
Jean Duteau (Oct 08 2021 at 19:24):
well, i deleted my entire vscache directory to see how long it would take and I just got a timeout so I couldn't check :)
Jean Duteau (Oct 08 2021 at 19:34):
okay i'm trying with just deleting the files in the vscache directory and not the entire directory :)
Gino Canessa (Oct 08 2021 at 20:31):
Grahame Grieve said:
and you might like saying to make the first build last longer - very much longer, but you all like that I can wear that pain once for all of you. otherwise I'll periodically wipe your cache and you'll all have three hour builds.
I'd argue it's not we 'all'... but yeah, that's usually what users/customers want. They are asking someone else to take on pain so they don't have to.
The tradeoff doesn't have to be paid in longer build times - for example, it can be paid in development time (though I worry that right now, it's still asking you to bear most of the pain - but that's another discussion). There are lots of options here: make txCache something that only updates from the main branch builds with another directory for 'local' changes; move txCache into another repo/project/package altogether; make it so that local builds don't modify txCache and just build the local things they need on the fly each time; or if it's only an issue in your environment, identify a process that keeps a 'local' copy of the txCache to swap in and out; etc..
I don't know enough about the caching mechanism of the build process to know what would be the simplest to implement. But, if the majority of conflicts are in the txCache, then that seems like the best place to start addressing it.
But to my knowledge, there is no tool that allows different people to work on the same files at the same time, has a 'branching' concept for building/testing/exploring, and does not have to deal with merge conflicts. I don't think it's a tool limitation, but rather a process one (or a physics one, if you want to blame it on linear time =).
Lloyd McKenzie (Oct 08 2021 at 20:54):
Could we have a dual-layer cache - one generated locally and not committed, and one generated by the CI build and auto-committed? Because there are two layers, we wouldn't have conflicts and everyone would get the benefit of the CI-build maintained cache.
Vassil Peytchev (Oct 09 2021 at 01:06):
Wouldn't that mean a 3-hour CI build?
Josh Mandel (Oct 09 2021 at 01:09):
The CI build can support caching strategies. I think the main thing (for me anyway) is to understand the problem better.
Lloyd McKenzie (Oct 09 2021 at 01:39):
The CI build wouldn't have to rebuild from scratch. It would retain its own cache (which it commits) and would simply update it. The complexity would be in managing a 2-level cache.
Jean Duteau (Oct 09 2021 at 01:40):
Gino Canessa said:
But to my knowledge, there is no tool that allows different people to work on the same files at the same time, has a 'branching' concept for building/testing/exploring, and does not have to deal with merge conflicts. I don't think it's a tool limitation, but rather a process one (or a physics one, if you want to blame it on linear time =).
I think that the merge conflicts are exacerbated by:
a) we have a lot of people making PRs
b) the length of time that our checks take when a PR is submitted is much longer than normal
In previous and current projects that I'm on, the checks run the tests and they take no more than 10 minutes, so the time from a PR request to merging it is short.
Lloyd McKenzie (Oct 09 2021 at 01:40):
Only issue would be the CI-build colliding with itself. If that happened, I think it'd just revert and not bother committing the cache updates and leave it to the next CI build that didn't collide to commit.
Josh Mandel (Oct 09 2021 at 02:01):
I'm still interested in understanding whether there's an approach to caching that would avoid version control conflicts altogether.
Grahame Grieve (Oct 09 2021 at 03:38):
the thing about the txCache files is that it's easy to resolve - just abandon one or the other and move on. It's the other merges that matter. And while it's true that non-locking version control programs lead to merge, I was never afraid of svn merges, and I'm always afraid of git merges
Jean Duteau (Oct 09 2021 at 07:48):
why would you be afraid of merges via git vs svn? i'm more afraid of svn merges because of the way that svn treats branches. as well, I don't think that the concept of a pull request where you can run pre-update checks before a merge occurs exists in SVN. It didn't back when I was using SVN and a quick web search seems to imply that it doesn't.
Grahame Grieve (Oct 11 2021 at 23:48):
well, our infrastructure is better. It's a lot quicker to build without the cache. But it doesn't work - I have some bugs to find and fix
Josh Mandel (Oct 11 2021 at 23:55):
A couple of notes reflecting on this thread:
-
Before migrating FHIR from svn to git (and this was a primary motivation for the migration), we had a broken build in svn a lot of the time. People would by necessity push their possibly-broken updates onto the already broken head of the repo, which compounded the problem. We don't have that anymore; our main branch in git is protected by pre-merge checks that are really pretty effective. (Yes, this has a cost in the effort of managing these merges, which I'm not trying to minimize.)
-
Grahame when you encounter a situation with a scary merge, maybe jot down the commit hashes of the branches you're merging, and share here; without trying to intervene midstream, we can review after the fact and see if there are any workflow improvements to suggest.
Grahame Grieve (Oct 11 2021 at 23:59):
well, most of them are java code, but sure, I can do that
Josh Mandel (Oct 12 2021 at 00:05):
Cool. (The actual java semantics involved in the merge process wouldn't be different between svn and git, right?)
Grahame Grieve (Oct 12 2021 at 00:08):
probably not.
Grahame Grieve (Oct 12 2021 at 00:09):
it may be the UI - I just can't tell what it's doing line wise. But sometimes, I have no idea what it's even showing me. I had one last week - the diff showed fundamantal change to the structure, but none of the recent commits had anything like that. I just reverted the repo and made my changes again.
Grahame Grieve (Oct 12 2021 at 00:22):
oh - and I'm doing this locally as well
Josh Mandel (Oct 12 2021 at 00:44):
Locally resolving is good (but cases where the parent branches exist in GitHub will make the best examples for us to review after the fact).
John Moehrke (Oct 12 2021 at 19:47):
besides the vscache and other "files that I didn't change, so I don't know what I should do with"..... The other issue I come up against is just understanding the "right" way to do something. These "somethings" might seem obvious to the GIT elite, but I am never quite sure what to do when I am told things like my branch being behind blah blah... these "somethings" should be identified and a set of visual tutorials (we started this github process with some nice visual tutorials).
Grahame Grieve (Oct 13 2021 at 23:04):
follow up on this. We could free up a few $$ to pay someone to document 'standard procedures' to resolve common challenges for editors. Would this be useful?
Grahame Grieve (Oct 13 2021 at 23:05):
(irrespective of what else we do about cache files)
Vassil Peytchev (Oct 14 2021 at 02:09):
I am writing up some expanded wording and diagrams on this, but I don't know exactly what are "common challenges". Would it be useful to investigate your current 202110 branch?
Grahame Grieve (Oct 14 2021 at 02:10):
don't know. possibly not - there's a lot of moving parts on that one and it's highly abnormal
Vassil Peytchev (Oct 14 2021 at 02:21):
Here are some of the points I am trying to address
- The slight difference between creating pull requests from a branch vs from a fork
- Regularly merge master/main into the local feature branch
- Suggest use of kdiff3 to resolve conflicts.
- How to commit large number of modified files without typing the file names
Vassil Peytchev (Oct 29 2021 at 21:04):
I think I have identified an issue that can explain at least some of the unexpected problems when merging things. I also think that it can be mitigated with some creative scripting.
The problem: All along we have been advising people to clone master
and then to run the publish process. This makes sense, as we want to make sure that the local setup is done correctly, and the build can succeed. The problem is that the build modifies tracked files - the famous vscache files, but on occasion others, for example tools/tx/snomed/snomed.xml. This results in a set of files that are modified relative to the master
branch but not part of it. If we now create a new branch locally, these modified files are mixing with the new branch. In most(?) cases that is OK, but I think it is a source of endless aggravation of the "death by a thousand cuts" type.
-
Possible mitigation 1
-
Change the steps for making changes to be:
- clone
master
locally - create the feature branch
- run the build in the feature branch
- make changes
- run the build
- ...etc.
- clone
-
change the build script to not allow the build to run in the
master
branch - change the build script to auto commit the changes to any uncommitted files after the build
-
-
Possible mitigation 2
- Disallow direct branching and pull requests from branches, forcing all PRs to be from a fork. I think this will do more damage than good.
- change the build script to auto commit the changes to any uncommitted files after the build
@Grahame Grieve @Josh Mandel @Lloyd McKenzie @Mark Iantorno
Grahame Grieve (Oct 29 2021 at 21:12):
This results in a set of files that are modified relative to the master branch but not part of it. If we now create a new branch locally, these modified files are mixing with the new branch
I don't know what those sentences mean
Vassil Peytchev (Oct 29 2021 at 21:36):
Let's take tools/tx/snomed/snomed.xml - somehow the single quote in a description (cow's milk) got mangled into a sequence of the characters '?' '€' '™' - before the build the euro and tm characters were represented by the corresponding entity codes, after the build they were converted to the actual characters. The file is now marked as modified. git status
will show it as modified, but not staged for commit.
If you create a branch: git checkout -b test_branch
all the modified files that are not staged for commit will be in the new branch. git status
will show them as part of the current branch. In most cases that is not a problem - if you now commit the modified files while in the test_branch
, they will not show up in the master
branch as modified.
I think the problems may arise when not everything is properly committed, or things get committed by mistake in the master
branch instead of the test_branch
.
Grahame Grieve (Oct 29 2021 at 21:37):
I don't understand why changing the branch makes any difference. You still have a modified file to do something about
Vassil Peytchev (Oct 29 2021 at 21:47):
True, but are the modifications from the first build run necessary after your run the build after you make changes? Does it matter if you commit them before you run your next build?
I think the more we limit the chances of confusion of which branch we are modifying, the less will people experience edge cases that are hard to resolve.
Grahame Grieve (Oct 29 2021 at 21:58):
it means someone else should have committed them
Grahame Grieve (Nov 01 2021 at 20:32):
are the modifications from the first build run necessary after your run the build after you make changes?
no. they'll just keep getting made
Does it matter if you commit them before you run your next build?
no
Grahame Grieve (Nov 01 2021 at 20:34):
so it presently takes my build 17 minutes longer to run without the vsCache. I can get rid of the files in vsCache, which means that
- everybody's first build of a new checkout will get ~20mins longer
- the ci-build will get 20min longer
Vassil Peytchev (Nov 01 2021 at 20:41):
I think it is fairly straight forward to force committing any changes in vscache, source, and tools as part of the build script. That will save the 20 minutes of build, and will avoid issues with missing generated files.
Vassil Peytchev (Nov 01 2021 at 20:42):
I will create a PR to modify publish.sh
Grahame Grieve (Nov 01 2021 at 20:45):
I don't understand how a build script can force people to commit something when it doesn't cover the commit?
Vassil Peytchev (Nov 01 2021 at 20:53):
The script will list all new files that have not been added into the "stage" (i.e. no git add
). Then it will loop through them, and if they are under one of tools/ vscache/ or source/ it will do a git add
to them. After that, a git commit -am "Auto-commit courtesy to the publish script"
will commit both those, and any modified but not committed files.
Grahame Grieve (Nov 01 2021 at 20:54):
wooah so git commits everything any time the build runs?
Vassil Peytchev (Nov 01 2021 at 21:13):
Under certain folders. It can be limited to vscache/ only.
What is the use case for not committing everything that goes into the build?
Grahame Grieve (Nov 01 2021 at 21:25):
I'm working on some stuff, and I run the build to see if it works.
Vassil Peytchev (Nov 01 2021 at 21:26):
What's the downside of having committed it?
Josh Mandel (Nov 02 2021 at 03:06):
Commits should generally reflect an intent, and the commit history should tell a story (not just the story "Then I built. Then I built. Then I built...").
Josh Mandel (Nov 02 2021 at 03:07):
But if there were really practical wins in abandoning this ideal... I think they'd be worth evaluating.
Grahame Grieve (Nov 04 2021 at 07:13):
looks likely that I'm going to replace vsCache altogether anyway, and they'll no longer be in version control. Josh and I worked out an alternative approach that will work better, actually, and it doesn't have merge clashes.
Grahame Grieve (Nov 04 2021 at 07:14):
I'm not going to describe it quite yet - still doing research whether it's feasible. But if it is... than those files will disappear (main build first, and then, when that's bedded down, the IGs as well)
Vassil Peytchev (Nov 04 2021 at 13:41):
Another possible source of merge clashes might be the use of rebase
instead of merge
. While we never mention rebase
in any of the instructions, there was at least one mention of it in previous answers to git questions.
Anyone seeing an issue with explicitly stating in the instructions not to use rebase
for PRs?
Grahame Grieve (Nov 04 2021 at 18:19):
I don't know what it is and would never do it
Rob Hausam (Nov 04 2021 at 18:32):
@Grahame Grieve Rebase definitely has its uses. I just used it to streamline our commits (primarily consolidating and removing all of the excess "update txcache" commits) and then moving only the needed updates from our IPS development (i.e. 'connectathon*') branches into the CI build. The interactive rebase worked quite well for that. But typically you do not want to do that (or at least do it very judiciously) with any commits that have been pushed to a central repository and others may have accessed and are using. I agree with @Vassil Peytchev that inappropriate use of rebase would certainly be a source of merge conflicts.
Josh Mandel (Nov 04 2021 at 18:36):
Yeah, I wouldn't "forbid" rebasing for users that want to take this upon themselves during branch development on their own machines, but we've (since the beginning) configured the GitHub web interface for the FHIR repo to only allow merges, not rebases:
Vassil Peytchev (Nov 04 2021 at 18:39):
How about the step before - locally merge master into feature branch?
Josh Mandel (Nov 04 2021 at 19:03):
This is something developers can control how they want to do on their own side. I think our recommendations should be just to do simple merges, but if a developer wants to rebase for some reason it's really up to them, or I would say it's below the abstraction barrier because we don't have direct insight into what happens there.
Rob Hausam (Nov 04 2021 at 19:08):
Agree.
Vassil Peytchev (Nov 04 2021 at 19:24):
I think using rebase to pull changes from master into the feature branch since the branch was created is dangerous. I want to put a warning on step 8 here not to use rebase
.
Grahame Grieve (Nov 10 2021 at 06:17):
see https://chat.fhir.org/#narrow/stream/179165-committers/topic/txCache.20is.20out about the tx-cache
Josh Mandel (Nov 10 2021 at 14:59):
Fine by me to add a warning; folks should not use rebase unless they know what they're doing and why
John Moehrke (Nov 10 2021 at 16:52):
who is using rebase without knowing what it is or why... I certainly don't have a clue
Oliver Egger (Nov 10 2021 at 20:39):
For individuals, rebasing makes a lot of sense :grinning: see https://medium.datadriveninvestor.com/git-rebase-vs-merge-cc5199edd77c. i often us it for ig development before merging it into the main to have the history streamlined.
Vassil Peytchev (Nov 10 2021 at 21:03):
Key paragraph from the link, and, AFAIK, reflecting generally accepted best practices (emphasis mine):
If the feature branch you are getting changes from is shared with other developers, rebasing is not recommended, because the rebasing process will create inconsistent repositories. For individuals, rebasing makes a lot of sense.
Since pull requests need to pass the build checks, I view these branches as shared with other developers. But mainly, the step when you merge master
into the feature branch in order to run the build locally and confirm that nothing is broken, should never be a rebase.
Last updated: Apr 12 2022 at 19:14 UTC