Improve installation time in opam by shrinking the Git-generated source archives #14152
dra27 merged 3 commits into ocaml:trunk
|
All comment and review as ever welcomed, but this PR must have @Octachron's approval before being merged. Assuming no fundamental objections to the concept, I am suggesting that we try this for 5.4.0~beta1 on the basis that it could be trivially rolled back for the release candidate if necessary. |
|
I am unconvinced by this proposal, but then maybe I don't understand the details.
I'm left with the following questions:
|
|
Regarding Windows, it seems that Windows Defender has special heuristics for certain archive tools. Maybe we're not hitting those and could change the extraction process to improve it somehow? |
What I'm proposing is two release archives. FWIW, opam releases already do this (https://github.com/ocaml/opam/releases/download/2.4.0/opam-full-2.4.0.tar.gz contains the vendored sources as well, where https://github.com/ocaml/opam/archive/refs/tags/2.4.0.tar.gz is opam only)
Um, I just used the 5.4.0~alpha1 tag as an example release to get figures - if I'd instead experimented with the 5.3.0 tag, would this observation have disappeared, because it seems a little incongruous??
These two points are probably best discussed synchronously - in passing, we are already doing this (i.e. we de facto ship an automatically generated tarball, because GitHub generates them whether we want them or not) and they are already different. My personal opinion is that the testsuite is not worth running as part of a user installation process (and it's not possible for a user to install the compiler via opam with the testsuite being run), so the files are not worth shipping.
No, my only ulterior motive is the installation of OCaml sucking a little less for (some) Windows users! 😀 |
Possibly, yes, although this is far from easy to do. However, as testified by talks on optimising both rustup and uv, the best optimisation is not to do something you don't need to do in the first place! Extracting 3000 files to do nothing other than delete them after the build is an unfortunate workflow. |
|
Another little data point, while acknowledging that speed isn’t everything - processing those extra files is taking 500ms on Linux. Cloning the compiler (when installing with Relocatable) is taking 100ms… |
|
I don't want to give the idea that I disagree with trimming down the release tarballs -- I don't have a strong opinion. But my impression from a distance is that your current proposal results in increasing the complexity of the whole system, with things done in certain cases but not others, and various new edge cases. If you are correct that some files are not worth having in our releases, then why don't we remove them unconditionally all the time, to keep the overall system simpler? |
That's handy to know, ta!
In terms of the code, GNU make and m4sh make these things look much worse than they are, but point taken. I had an idea that may get the bulk of the saving but without needing as many changes elsewhere which I'll try, and report back.
By system you mean "release system"? The configure/build system will always need to work with both release tarballs and Git checkouts. |
|
Personally, this looks like a sensible change after a quick look to split the archive into an archive for compiler developers and an archive for ocaml developers optimized for building the compiler |
|
As a comparison, at dra27#214 is a version which instead removes
|
|
I think I understand the issue you want to solve. But wouldn't it be nicer to solve in the extract phase, and provide tar there with the arguments to only extract those directories and files needed for the compilation? I suspect the network bandwidth is not the main issue here, but the number of files being extracted. |
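As a rough illustration of the extract-time filtering suggested above, here is a minimal sketch using `tar --exclude` on a small stand-in tarball. The `demo` layout and file names are hypothetical, and this is run by hand rather than being something opam does.

```shell
#!/bin/sh
set -eu
# Build a small stand-in tarball; all names here are hypothetical.
work=$(mktemp -d)
mkdir -p "$work/demo/src" "$work/demo/testsuite"
echo 'let x = 1' > "$work/demo/src/a.ml"
echo 'placeholder' > "$work/demo/testsuite/t.ml"
tar -czf "$work/demo.tar.gz" -C "$work" demo

# Extract into a fresh directory, skipping testsuite/ and everything below it.
mkdir "$work/out"
tar --exclude='demo/testsuite' -xzf "$work/demo.tar.gz" -C "$work/out"
ls -R "$work/out"
```

The excluded files are never written to disk, so a virus scanner has nothing to inspect for them.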
|
Absolutely, yes (that's what I'm alluding to doing with ultimately being able to control it in opam) - the size of the tarball is obviously a minor side-effect. The main reason I'm proposing changing the tarball as well is that it solves it now (and it doesn't mean that all future releases have to be bound by the same idea), but also because despite our listing running the testsuite as an "optional" step in INSTALL.adoc, in practice we really should not be recommending that users run the testsuite on releases (it has many extra dependencies, it's very compute intensive, it can unexpectedly indicate failure on a perfectly working release). The opam packaging is very highly unlikely ever to have even a |
Side note. I wish there was also a target for devs that doesn't run it all but still makes sure you got everything reasonably right. That is when I'm adding the one line (I know I can run a single test but since that's not my everyday bread it usually takes me two or three invocations, and a couple of minutes to figure out which of the too many READMEs it's described in, before getting it right :–). |
|
To keep the distributed archive uniform, I think a working solution would be to exclude the same directories from all archives of the form the Otherwise, I agree with the idea of not including directories that are only useful to compiler developers in all release archives. |
Are compiler developers actually using archives at all? I would expect them to directly clone the git repository. |
bcc3a92 to
b75fb19
Compare
b75fb19 to
30b0d4d
Compare
CI scripts and Git configuration aren't required on end-user machines.
All the programs and infrastructure remain, but the tests are removed.
592d817 to
512429f
Compare
|
Original PR rebased |
|
I've updated this PR to reflect the version of it that's included in #14247, which is simpler than my original proposal here (the diff can be seen with the Compare link on the last force push above) Although this still includes some files which aren't strictly needed, I think that moving all the "cute" parts to the testsuite is much safer in the long run - in particular, it means that From my perspective this seems good to go, if we can move towards a consensus? |
bcc3a92 to
592d817
Compare
Octachron
left a comment
The simplified version looks very reasonable to me, and it makes sense to trim the distribution archive to fit the need of the end users.
Improve installation time in opam by shrinking the Git-generated source archives (cherry picked from commit 172b5c5)
I propose that we use the features of git-archive to remove the testsuite, manual and other extraneous files from the tarballs automatically generated by GitHub. Doing so removes some 68% of the files and 86% of the directories from the tarball (and shaves some 25% off its size 💾).
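Figures like these can be estimated from a tarball listing alone. The following is a minimal sketch on a tiny stand-in archive; the `demo` layout is hypothetical and is not OCaml's real tree.

```shell
#!/bin/sh
set -eu
# Build a tiny stand-in archive; the layout is hypothetical.
work=$(mktemp -d)
mkdir -p "$work/demo/src" "$work/demo/testsuite/tests"
touch "$work/demo/src/a.ml" \
      "$work/demo/testsuite/tests/t1.ml" \
      "$work/demo/testsuite/tests/t2.ml" \
      "$work/demo/testsuite/tests/t3.ml"
tar -czf "$work/demo.tar.gz" -C "$work" demo

# Listing entries ending in "/" are directories; the rest count as files.
total=$(tar -tzf "$work/demo.tar.gz" | grep -vc '/$')
dropped=$(tar -tzf "$work/demo.tar.gz" | grep -v '/$' | grep -c '^demo/testsuite/')
echo "files: $total, under testsuite/: $dropped ($((100 * dropped / total))%)"
```

Pointing the two `tar -tzf` pipelines at a real release tarball, with its own top-level prefix, gives the corresponding share for that release.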
When installed through opam, this eases the pressure on the extraction and copying for files which are not used in the build. In my benchmarking on 5.4.0~alpha1, this results in a saving of 0.5 seconds on Linux (irrelevant, but not a regression), but some 100 seconds for typical first-time users on Windows (who will likely not have turned off Windows Defender). Even with Windows Defender disabled (which is something I strongly recommend not doing), this still saves several seconds.
The principal feature being exploited here is to add export-ignore to various entries in .gitattributes. Our winpthreads fork already does this for its tarball.
There are some very minor changes required to configure to support this and some similarly minor plumbing in Makefile to allow make clean and make alldepend to be run with such a tarball. Note that only the configure changes are actually necessary - when deploying OCaml, make clean isn't usually run.
I have, perhaps too cutely, added an error message for if someone were to attempt to run the testsuite with one of these tarballs:
Most importantly, this doesn't affect the official tarballs which are created by hand and uploaded to https://caml.inria.fr/pub/distrib/ and which are added manually to the GitHub Releases. Thus, at its release, for example, https://github.com/ocaml/ocaml/releases/download/5.4.0/ocaml-5.4.0.tar.gz (prepared and uploaded by @Octachron) will contain all files, leaving distributions (e.g. Alpine) unaffected but https://github.com/ocaml/ocaml/archive/refs/tags/5.4.0.tar.gz would not.
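To see how export-ignore changes what git-archive produces (and hence what GitHub's generated tarballs contain), here is a minimal sketch in a throwaway repository. The repository name, files and the single .gitattributes entry are all hypothetical; the actual entries in this PR are not reproduced here.

```shell
#!/bin/sh
set -eu
# Scratch repository; every name here is hypothetical.
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
mkdir -p src testsuite
echo 'let () = print_endline "hi"' > src/main.ml
echo 'placeholder test' > testsuite/test.ml
# Mark the directory as excluded from git-archive output.
echo 'testsuite export-ignore' > .gitattributes
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'demo'
# List the archive contents: testsuite/ is left out, src/ is kept.
git archive --format=tar --prefix=demo/ HEAD | tar -tf - > "$work/listing.txt"
cat "$work/listing.txt"
```

The working tree and the Git history still contain testsuite/, so a clone is unaffected; only the exported archive is trimmed.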
There is the unfortunate matter of the stability of those archives as presented by GitHub, but my thinking is that since git-archive is automating this, we could instead either make the artefact uploaded to the GitHub release be a copy of the git-archive instead of the full archive (although that's possibly confusing, given the same filename as the one published at caml.inria.fr) or upload one with a different name, i.e. for the release it would be a matter of doing:
$ curl -Lo ocaml-5.4.0-opam.tar.gz https://github.com/ocaml/ocaml/archive/refs/tags/5.4.0.tar.gz
and attaching that to the Release tag in addition to ocaml-5.4.0.tar.gz. For alpha, beta and rc releases, we'd just reference the GitHub-generated tarball directly, as we do today.
Ultimately, I'd like to be able to control this directly in opam as well (by being able to stipulate directories which should not be extracted), but given the potential reduction of several minutes for a Windows user exploring OCaml for the first time, my opinion is that this change made now is worth the relatively small amount of additional work on the releases.
(you had probably guessed, but this affects Relocatable OCaml, which at the moment extracts the entire tarball on Windows before discovering that it can use a cached copy of the compiler... in 2022 I committed an egregious sleight-of-hand to work around this in the demo)