Improve installation time in opam by shrinking the Git-generated source archives #14152

Merged: dra27 merged 3 commits into ocaml:trunk from dra27:export-ignore on Feb 17, 2026

@dra27
Member

@dra27 dra27 commented Jul 19, 2025

I propose that we use the features of git-archive to remove the testsuite, manual and other extraneous files from the tarballs automatically generated by GitHub. Doing so removes some 68% of the files and 86% of the directories from the tarball (and shaves some 25% off its size 💾).

When installed through opam, this eases the pressure on the extraction and copying for files which are not used in the build. In my benchmarking on 5.4.0~alpha1, this results in a saving of 0.5 seconds on Linux (irrelevant, but not a regression), but some 100 seconds for typical first-time users on Windows (who will likely not have turned off Windows Defender). Even with Windows Defender disabled (which is something I strongly recommend not doing), this still saves several seconds.

The principal feature being exploited here is the export-ignore attribute, which is added to various entries in .gitattributes. Our winpthreads fork already does this for its tarball.
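As an illustration of the mechanism only, here is a minimal sketch in a throwaway repository. The directory names and the two .gitattributes entries are assumptions made for the demo, not the entries this PR adds.

```shell
# Sketch only: a throwaway repository, not the OCaml tree.
set -eu
demo=$(mktemp -d)
cd "$demo"
git init -q .
mkdir -p runtime testsuite/tests manual
echo 'int main(void) { return 0; }' > runtime/main.c
echo 'let () = print_endline "test"' > testsuite/tests/basic.ml
echo 'manual sources' > manual/README

# Hypothetical entries: paths marked export-ignore are left out by git-archive.
cat > .gitattributes <<'EOF'
/testsuite export-ignore
/manual export-ignore
EOF

git add -A
git -c user.name=demo -c user.email=demo@example.invalid commit -q -m 'demo tree'

# List what git-archive would put in a tarball of HEAD.
listing=$(git archive --format=tar --prefix=demo/ HEAD | tar -tf -)
printf '%s\n' "$listing"
```

The listing should contain the runtime directory but nothing under testsuite or manual. The attributes are read from the committed tree, so .gitattributes has to be part of the commit being archived.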

There are some very minor changes required to configure to support this and some similarly minor plumbing in Makefile to allow make clean and make alldepend to be run with such a tarball. Note that only the configure changes are actually necessary - when deploying OCaml, make clean isn't usually run.

I have, perhaps too cutely, added an error message for anyone who attempts to run the testsuite from one of these tarballs:

$ cd /tmp
$ curl -L https://github.com/dra27/ocaml/archive/refs/heads/export-ignore.tar.gz | tar -zx
$ cd ocaml-export-ignore
$ ./configure && make -j && make tests
...
Sources not found in testsuite
This happens when the sources of OCaml are extracted from a tarball
generated by git-archive (which includes those generated by GitHub)
The required files are in commit c47dfba666cf09ad044997384ebc108fd41dd9ec, for example:
  git clone https://github.com/ocaml/ocaml --revision c47dfba666cf09ad044997384ebc108fd41dd9ec --depth 1 git-sources
  mv git-sources/testsuite .
make: *** [Makefile:3045: testsuite] Error 1
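The guard behind a message like this can be very small. The following is a hypothetical shell sketch of the idea; the actual check lives in the PR's Makefile changes and is not reproduced here. The function name and the path it probes are assumptions.

```shell
# Hypothetical guard: succeed only when the test sources were shipped.
check_testsuite_sources() {
  # $1: root of the source tree; $2: a path that exists only in full checkouts.
  if [ ! -d "$1/$2" ]; then
    echo "Sources not found in $2" >&2
    echo "This happens when the sources are extracted from a git-archive tarball" >&2
    return 1
  fi
}

# An empty directory stands in for a tree extracted from a trimmed tarball.
tree=$(mktemp -d)
if check_testsuite_sources "$tree" testsuite/tests 2>/dev/null; then
  status=present
else
  status=missing
fi
echo "testsuite sources: $status"
```

Probing for a directory rather than a marker file keeps the check independent of how the tarball was produced.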

Most importantly, this doesn't affect the official tarballs which are created by hand and uploaded to https://caml.inria.fr/pub/distrib/ and which are added manually to the GitHub Releases. Thus, at its release, for example, https://github.com/ocaml/ocaml/releases/download/5.4.0/ocaml-5.4.0.tar.gz (prepared and uploaded by @Octachron) will contain all files, leaving distributions (e.g. Alpine) unaffected but https://github.com/ocaml/ocaml/archive/refs/tags/5.4.0.tar.gz would not.

There is the unfortunate matter of the stability of those archives as presented by GitHub, but my thinking is that since git-archive is automating this, we could instead either make the artefact uploaded to the GitHub release be a copy of the git-archive instead of the full archive (although that's possibly confusing, given the same filename as the one published at caml.inria.fr) or upload one with a different name, i.e. for the release it would be a matter of doing:

$ curl -Lo ocaml-5.4.0-opam.tar.gz https://github.com/ocaml/ocaml/archive/refs/tags/5.4.0.tar.gz

and attaching that to the Release tag in addition to ocaml-5.4.0.tar.gz. For alpha, beta and rc releases, we'd just reference the GitHub-generated tarball directly, as we do today.

Ultimately, I'd like to be able to control this directly in opam as well (by being able to stipulate directories which should not be extracted), but given the potential reduction of several minutes for a Windows user exploring OCaml for the first time, my opinion is that this change made now is worth the relatively small amount of additional work on the releases.

(you had probably guessed, but this affects Relocatable OCaml, which at the moment extracts the entire tarball on Windows before discovering that it can use a cached copy of the compiler... in 2022 I committed an egregious sleight-of-hand to work around this in the demo)

@dra27
Member Author

dra27 commented Jul 19, 2025

All comment and review as ever welcomed, but this PR must have @Octachron's approval before being merged.

Assuming no fundamental objections to the concept, I am suggesting that we try this for 5.4.0~beta1 on the basis that it could be trivially rolled back for the release candidate if necessary.

@gasche
Member

gasche commented Jul 19, 2025

I am unconvinced by this proposal, but then maybe I don't understand the details.

  • I don't like the idea that release archives and opam-provided archives would have different content.
  • I think there is a contradiction between aiming to improve the switch "5.4.0~alpha1" and having a user story of "Windows user trying OCaml for the first time". Those users should probably not use a not-yet-released version.

I'm left with the following questions:

  • What is the best choice of content for release archives? (irrespective of whether they are GitHub-generated or Florian-generated, whether they are full releases or alpha releases, etc.) Your proposal does not make this clear.
  • Why would we support both a manual-archive process and an automated-archive-creation process? (If the manual process is better for objective reasons, shouldn't it be used for alpha, beta, rc releases as well? Are there clear costs to this?)
  • Is this proposal secretly related to CI and/or the weird +trunk switches in a way that I don't understand?

@smuenzel
Contributor

Regarding Windows, it seems like Windows Defender has special heuristics for certain archive tools. Maybe we're not hitting those and could make changes to the extraction process to improve it somehow?

microsoft/Windows-Dev-Performance#27 (comment)

@dra27
Member Author

dra27 commented Jul 20, 2025

> I am unconvinced by this proposal, but then maybe I don't understand the details.
>
>   • I don't like the idea that release archives and opam-provided archives would have different content.

What I'm proposing is two release archives. FWIW, opam releases already do this (https://github.com/ocaml/opam/releases/download/2.4.0/opam-full-2.4.0.tar.gz contains the vendored sources as well, where https://github.com/ocaml/opam/archive/refs/tags/2.4.0.tar.gz is opam only)

>   • I think there is a contradiction between aiming to improve the switch "5.4.0~alpha1" and having a user story of "Windows user trying OCaml for the first time". Those users should probably not use a not-yet-released version.

Um, I just used the 5.4.0~alpha1 tag as an example release to get figures - if I'd instead experimented with the 5.3.0 tag, would this observation have disappeared, because it seems a little incongruous??

> I'm left with the following questions:
>
>   • What is the best choice of content for release archives? (irrespective of whether they are GitHub-generated or Florian-generated, whether they are full releases or alpha releases, etc.) Your proposal does not make this clear.
>   • Why would we support both a manual-archive process and an automated-archive-creation process? (If the manual process is better for objective reasons, shouldn't it be used for alpha, beta, rc releases as well? Are there clear costs to this?)

These two points are probably best discussed synchronously - in passing, we are already doing this (i.e. we de facto ship an automatically generated tarball, because GitHub generates them whether we want them or not) and they are already different. My personal opinion is that the testsuite is not worth running as part of a user installation process (and it's not possible for a user to install the compiler via opam with the testsuite being run), so the files are not worth shipping.

>   • Is this proposal secretly related to CI and/or the weird +trunk switches in a way that I don't understand?

No, my only ulterior motive is the installation of OCaml sucking a little less for (some) Windows users! 😀

@dra27
Member Author

dra27 commented Jul 20, 2025

> Regarding Windows, it seems like Windows Defender has special heuristics for certain archive tools. Maybe we're not hitting those and could make changes to the extraction process to improve it somehow?

Possibly, yes, although this is far from easy to do. However, as talks on optimising both rustup and uv have testified, the best optimisation is not to do something you don't need to do in the first place! Extracting 3000 files to do nothing other than delete them after the build is an unfortunate workflow.

@dra27
Member Author

dra27 commented Jul 20, 2025

Another little data point, while acknowledging that speed isn’t everything - processing those extra files is taking 500ms on Linux. Cloning the compiler (when installing with Relocatable) is taking 100ms…

@gasche
Member

gasche commented Jul 20, 2025

I don't want to give the idea that I disagree with trimming down the release tarballs -- I don't have a strong opinion. But my impression from a distance is that your current proposal results in increasing the complexity of the whole system, with things done in certain cases but not others, and various new edge cases. If you are correct that some files are not worth having in our releases, then why don't we remove them unconditionally all the time, to keep the overall system simpler?

@dra27
Member Author

dra27 commented Jul 21, 2025

> I don't want to give the idea that I disagree with trimming down the release tarballs

That's handy to know, ta!

> But my impression from a distance is that your current proposal results in increasing the complexity of the whole system, with things done in certain cases but not others, and various new edge cases.

In terms of the code, GNU make and m4sh make these things look much worse than they are, but point taken. I had an idea that may get the bulk of the saving but without needing as many changes elsewhere which I'll try, and report back.

> If you are correct that some files are not worth having in our releases, then why don't we remove them unconditionally all the time, to keep the overall system simpler?

By system you mean "release system"? The configure/build system will always need to work with both release tarballs and Git checkouts.

@Octachron
Member

Personally, after a quick look, this seems like a sensible change: it splits the archive into one for compiler developers and one for OCaml developers, optimized for building the compiler.

@dra27
Member Author

dra27 commented Jul 21, 2025

As a comparison, at dra27#214 is a version which instead removes testsuite/tests, leaving ocamltest and the rest of the testsuite intact. The manual-pregen target is in fact a CI target, so it allows the "cute" message above to be moved to testsuite/Makefile, and configure and the root Makefile are much less altered. In terms of stats, it's about 7 seconds slower on Windows to extract that one (so not much difference really). In terms of file savings:

Branch      Files   Dirs   Tarball size
trunk       4622    393    6.3MiB
#14152      1472    53     4.5MiB
dra27#214   1603    58     4.7MiB
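For reference, the relative savings against trunk can be recomputed from the table. This is a small sketch using awk. The helper name saved is made up for the example, and the sizes are rounded to 0.1MiB in the table, so the size percentages are only approximate.

```shell
# Figures copied from the table above; trunk (4622 files, 393 dirs, 6.3MiB) is the baseline.
report=$(awk '
  # Percentage saved when going from "before" to "after".
  function saved(before, after) { return 100 * (before - after) / before }
  BEGIN {
    printf "%-10s files %.1f%% dirs %.1f%% size %.1f%%\n", "#14152", saved(4622, 1472), saved(393, 53), saved(6.3, 4.5)
    printf "%-10s files %.1f%% dirs %.1f%% size %.1f%%\n", "dra27#214", saved(4622, 1603), saved(393, 58), saved(6.3, 4.7)
  }')
printf '%s\n' "$report"
```

For #14152 this gives roughly 68% fewer files and 86% fewer directories, matching the figures in the PR description. The dra27#214 variant gives up about three percentage points of the file saving.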

@hannesm
Member

hannesm commented Jul 22, 2025

I think I understand the issue you want to solve. But wouldn't it be nicer to solve in the extract phase, and provide tar there with the arguments to only extract those directories and files needed for the compilation?

I suspect the network bandwidth is not the main issue here, but the number of files being extracted.
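A rough sketch of what filtering at extraction time could look like with tar's --exclude option. The archive layout here is a stand-in built for the demo, and nothing in this thread says opam invokes tar this way; it only shows that the unwanted files never need to reach the disk.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# Build a small stand-in tarball (hypothetical layout, not the real OCaml tree).
mkdir -p ocaml-demo/runtime ocaml-demo/testsuite/tests
echo 'runtime source' > ocaml-demo/runtime/main.c
echo 'test source' > ocaml-demo/testsuite/tests/basic.ml
tar -czf ocaml-demo.tar.gz ocaml-demo
rm -rf ocaml-demo

# Extract everything except the testsuite directory and its contents.
tar --exclude='ocaml-demo/testsuite' -xzf ocaml-demo.tar.gz
find ocaml-demo -type f
```

The archive is still downloaded and read in full, so this saves file creation (and any scanning that comes with it) rather than bandwidth.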

@dra27
Member Author

dra27 commented Jul 22, 2025

Absolutely, yes (that's what I'm alluding to doing with ultimately being able to control it in opam) - the size of the tarball is obviously a minor side-effect. The main reason I'm proposing changing the tarball as well is that it solves the problem now (and it doesn't mean that all future releases have to be bound by the same idea), but also because despite our listing running the testsuite as an "optional" step in INSTALL.adoc, in practice we really should not be recommending that users run the testsuite on releases (it has many extra dependencies, it's very compute intensive, it can unexpectedly indicate failure on a perfectly working release). The opam packaging is very highly unlikely ever to have even a {with-test} guarded way of running the testsuite in the way that many other opam (library) packages do.

@dbuenzli
Contributor

> not be recommending that users run the testsuite on releases (it has many extra dependencies, it's very compute intensive, it can unexpectedly indicate failure on a perfectly working release).

Side note. I wish there was also a target for devs that doesn't run it all but still makes sure you got everything reasonably right. That is when I'm adding the one line Result.retract I don't expect it to break the most elaborate and long backend and runtime system tests.

(I know I can run a single test but since that's not my everyday bread it usually takes me two or three invocations, and a couple of minutes to figure out which of the too many READMEs it's described in, before getting it right :–).

@Octachron
Member

To keep the distributed archive uniform, I think a working solution would be to exclude the same directories from all archives of the form ocaml-${VERSION} and add an ocaml-dev-${VERSION} (or ocaml-full-${VERSION}?) variant for the release archives on both GitHub and caml.inria.fr.

Otherwise, I agree with the idea of not including directories that are only useful to compiler developers in all release archives.

@alainfrisch
Contributor

> Personally, after a quick look, this seems like a sensible change: it splits the archive into one for compiler developers and one for OCaml developers, optimized for building the compiler.

Are compiler developers actually using archives at all? I would expect them to directly clone the git repository.

@dra27 dra27 force-pushed the export-ignore branch 2 times, most recently from bcc3a92 to b75fb19 Compare September 13, 2025 15:47
@dra27 dra27 removed this from the 5.4.0 bug fixes and documentation milestone Sep 15, 2025
@dra27 dra27 added the relocatable (towards a relocatable compiler) label Sep 15, 2025
CI scripts and Git configuration aren't required on end-user machines.
All the programs and infrastructure remain, but the tests are removed.
@dra27 dra27 force-pushed the export-ignore branch 2 times, most recently from 592d817 to 512429f Compare November 12, 2025 08:35
@dra27
Member Author

dra27 commented Nov 12, 2025

Original PR rebased

@dra27
Member Author

dra27 commented Nov 12, 2025

I've updated this PR to reflect the version of it that's included in #14247, which is simpler than my original proposal here (the diff can be seen with the Compare link on the last force push above)

Although this still includes some files which aren't strictly needed, I think that moving all the "cute" parts to the testsuite is much safer in the long run - in particular, it means that ./configure --enable-ocamltest always works (because the sources are guaranteed to be there).

From my perspective this seems good to go, if we can move towards a consensus?

@dra27 dra27 force-pushed the export-ignore branch 2 times, most recently from bcc3a92 to 592d817 Compare November 28, 2025 21:55
@dra27 dra27 added this to the 5.5 features milestone Dec 12, 2025
Member

@Octachron Octachron left a comment

The simplified version looks very reasonable to me, and it makes sense to trim the distribution archive to fit the needs of end users.

@dra27 dra27 merged commit 172b5c5 into ocaml:trunk Feb 17, 2026
57 of 68 checks passed
dra27 added a commit that referenced this pull request Feb 17, 2026
Improve installation time in opam by shrinking the Git-generated source archives

(cherry picked from commit 172b5c5)

Labels

no-change-entry-needed, relocatable (towards a relocatable compiler)

7 participants