chore(nix): Move nix integ jobs to ec2 fleets #5461
Merged
This reverts commit fae3385.
lrstewart
reviewed
Aug 14, 2025
Co-authored-by: Lindsay Stewart <stewart.r.lindsay@gmail.com>
jmayclin
approved these changes
Aug 20, 2025
Contributor
jmayclin
left a comment
This reduces the wall clock runtime to 17 min**
Can we add the old number too? Otherwise it's difficult to judge how much of an improvement it is.
Release Summary:
Resolved issues:
n/a
Description of changes:
Move the Nix integration CodeBuild jobs to CodeBuild EC2 fleets, with custom AMIs (not AL).
This reduces the wall clock runtime from 28 min to 17 min** (about 39% less). The breakdown per child job is shown in this screenshot for this job.
**Caveats: These are 6 large instances (3 x c7g.12xl and 3 x c7a.16xl) with fully populated /nix stores (warmed cache) that maintain state between runs.
The fleets are active and are running the IntegNix job for this PR.
Call-outs:
Nix Python
This PR removes the old Python 3.10 Nix packages used by the original Nix job. Python is now managed with uv.
Pytest
Pytest can maintain a state file between test runs. This allows it to only run failed tests on subsequent runs, and dramatically speeds up re-run attempts, without impact if everything passes. I've added a single retry to the nix uvinteg shell wrapper to do this with every full integration run.
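As a rough illustration of the retry described above, here is a minimal Python sketch. It is not the actual uvinteg shell wrapper: the function names and the injectable `run` callback are hypothetical, and it assumes the state is kept in a per-build directory under /tmp. The only pytest features it relies on are the `cache_dir` ini option (set with `-o`) and `--last-failed`.

```python
import subprocess
from typing import Callable, List, Sequence

# A runner takes an argv list and returns the process exit code.
Runner = Callable[[Sequence[str]], int]


def _run(argv: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def pytest_argv(cache_dir: str, only_failed: bool = False) -> List[str]:
    """Build a pytest command line that keeps its state in cache_dir."""
    argv = ["pytest", "-o", f"cache_dir={cache_dir}"]
    if only_failed:
        # --last-failed re-runs only the tests recorded as failing.
        argv.append("--last-failed")
    return argv


def run_with_single_retry(build_id: str, run: Runner = _run) -> int:
    """Run the suite once; on failure, retry only the failed tests."""
    # Hypothetical layout: one state directory per build ID.
    cache_dir = f"/tmp/{build_id}"
    if run(pytest_argv(cache_dir)) == 0:
        return 0  # everything passed, so the retry costs nothing
    return run(pytest_argv(cache_dir, only_failed=True))
```

Because the state directory is keyed by build ID, a retry can only see failures recorded by the same build.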
With this change, we can remove the retry of pytest and mark two specific tests as flaky, delivering an overall speedup. The state file lives in the /tmp/$CODEBUILD_BUILD_ID dir, so we're not at risk of failing to run tests from one run to the next.
Nix store
The cache jobs do a nix build for each platform and then save those files to S3. Future jobs download this store; however, with EC2 fleets the hosts are reused as-is, so this download step is often a no-op.
This store does need periodic cleaning with nix store gc to avoid filling up the disk, since we're not discarding these images as we do with Docker.
The weekly cleanup from #5430 has been added to the buildspec.
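How the cleanup from #5430 decides that a week has passed is not shown here, so the following is only a sketch of one way a long-lived host could gate the garbage collection. The stamp-file approach, the function names, and the `run` callback are assumptions made for illustration; the only real command involved is `nix store gc`.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable, Optional, Sequence

WEEK_SECONDS = 7 * 24 * 60 * 60


def gc_is_due(stamp: Path, now: Optional[float] = None) -> bool:
    """True when the stamp file is missing or at least a week old."""
    now = time.time() if now is None else now
    if not stamp.exists():
        return True
    return now - stamp.stat().st_mtime >= WEEK_SECONDS


def _run(argv: Sequence[str]) -> None:
    """Run a command and raise if it fails."""
    subprocess.run(list(argv), check=True)


def maybe_collect_garbage(stamp: Path,
                          run: Callable[[Sequence[str]], None] = _run) -> bool:
    """Run `nix store gc` at most once a week; return True if it ran."""
    if not gc_is_due(stamp):
        return False
    run(["nix", "store", "gc"])
    # Hypothetical stamp file recording when the store was last cleaned.
    stamp.parent.mkdir(parents=True, exist_ok=True)
    stamp.touch()
    return True
```

The stamp has to live on the host's persistent disk, since the point is that state survives between runs on the fleet.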
Custom AMIs
The AMI is the EC2 Marketplace Ubuntu 24 image with Nix installed. This could all be automated away in a future pipeline (punt). The script used is checked in here for future reference.
Merge queues and the head build
For our integration tests, we do the build twice, once with the PR and once with main, creating the additional binaries s2nd_head and s2nc_head. For some reason, merge queues and EC2 fleets end up with git clones that don't have the main branch (depth 1 and no other branches). This PR includes a workaround that fetches and checks out main to avoid this failure.
Testing:
How is this change tested (unit tests, fuzz tests, etc.)? CI, adhoc jobs
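For reviewers who want to reproduce the missing-main case from "Merge queues and the head build" in an ad hoc job, here is a minimal sketch of the fetch step. It is not the code in this PR: the function name, the `origin` remote name, and the `run` callback are assumptions. It checks whether the remote-tracking ref exists and, if not, fetches it with an explicit refspec so a depth-1, single-branch clone still ends up with main.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes an argv list and returns the process exit code.
Runner = Callable[[Sequence[str]], int]


def _run(argv: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def ensure_branch(branch: str = "main", remote: str = "origin",
                  run: Runner = _run) -> bool:
    """Fetch remote/branch if the clone lacks it; return True if fetched."""
    ref = f"refs/remotes/{remote}/{branch}"
    # rev-parse --verify --quiet exits non-zero when the ref is missing.
    if run(["git", "rev-parse", "--verify", "--quiet", ref]) == 0:
        return False
    # An explicit refspec is needed because a single-branch clone does
    # not track other branches by default.
    refspec = f"+refs/heads/{branch}:{ref}"
    if run(["git", "fetch", "--depth=1", remote, refspec]) != 0:
        raise RuntimeError(f"could not fetch {branch} from {remote}")
    return True
```

Once the ref exists, the head build can check it out (for example `git checkout origin/main`) to produce the _head binaries.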
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.