build(docker): layering and caching improvements #70

Merged
jandom merged 30 commits into main from
jandom/2025-12/build/docker-layering-improvements
Dec 15, 2025

Collaborator

@jandom jandom commented Dec 9, 2025

Summary

This PR makes a small change to the main Dockerfile. The Blackwell and test Docker images should probably get the same treatment for consistency.
The Blackwell Dockerfile is left untouched as its own thing for now.

We continue to use Docker Hub for stable/production Docker image distribution.
This PR also introduces GHCR (GitHub Container Registry), which is easy to use with GitHub runners (self-hosted or otherwise). We use GHCR to store special cache tags, e.g. cache-12.1.1-cudnn8-devel-ubuntu22.04. These images are not runnable; they only cache the intermediate builder layers. Combined with the improved Dockerfile structure, this cache makes builds much faster: when Docker finds that a layer is already in GHCR, it pulls that layer instead of rebuilding it.
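
The cache flow described above could be expressed with BuildKit's registry cache backend. This is only a minimal sketch and not necessarily how the workflow in this PR is configured: the GHCR path `ghcr.io/aqlaboratory/openfold-3` and the helper names `cache_ref` and `print_build_cmd` are assumptions, and the script prints the command (dry run) instead of invoking Docker.

```shell
#!/bin/sh
# Hypothetical GHCR repository path; the PR does not state the real one.
CACHE_REPO="ghcr.io/aqlaboratory/openfold-3"

# Compose the cache reference for a base-image tag, following the
# "cache-<base image tag>" naming used in the PR description.
cache_ref() {
    printf '%s:cache-%s\n' "$CACHE_REPO" "$1"
}

# Print (do not run) a buildx command that reads from and writes to the
# registry cache. mode=max also exports intermediate builder layers,
# not only the layers of the final image.
print_build_cmd() {
    ref=$(cache_ref "$1")
    printf 'docker buildx build -f docker/Dockerfile --cache-from type=registry,ref=%s --cache-to type=registry,ref=%s,mode=max .\n' "$ref" "$ref"
}

print_build_cmd "12.1.1-cudnn8-devel-ubuntu22.04"
```

Because the cache lives under its own tag, the runnable images on Docker Hub are unaffected by it.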

Results

No-op rebuild from cache: 1.2 s
Change a single source file: 2.4 ± 0.2 s (N=3)

Changes

Improve Docker image build speed by installing a bare-bones version of the repo in the builder stage and then copying the sources over in the runtime stage. Editing the sources now still gives a sub-10-second rebuild when the cache is warm.
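
The layering idea can be illustrated with a small two-stage Dockerfile. The sketch below is not the Dockerfile from this PR: the base image, the `pyproject.toml` and `openfold3/__init__.py` paths, and the stage layout are hypothetical. Only the final `pip install --force-reinstall --no-deps .` step mirrors a command that appears in this PR's test stage. The script just writes the sketch to a temporary file so it can be inspected.

```shell
#!/bin/sh
# Write an illustrative multi-stage Dockerfile into a temporary directory.
# Base image, paths, and stage layout are hypothetical, not the PR's actual file.
sketch_dir=$(mktemp -d)
sketch_file="$sketch_dir/Dockerfile.sketch"

cat > "$sketch_file" <<'EOF'
# Builder stage: copy only the packaging metadata and a stub package, so
# editing other source files does not invalidate the dependency layer.
FROM python:3.11-slim AS builder
WORKDIR /opt/openfold3
COPY pyproject.toml ./
COPY openfold3/__init__.py openfold3/__init__.py
RUN pip install .

# Later stage: copy the full sources last and reinstall the package only,
# reusing the dependencies already installed above.
FROM builder AS devel
COPY . /opt/openfold3
RUN pip install --force-reinstall --no-deps .
EOF

echo "Sketch written to $sketch_file"
```

With this ordering, a source edit only invalidates the last `COPY` and the cheap reinstall after it, which is consistent with the rebuild times of a few seconds reported above.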

See detailed notes in the PR comments.

Related Issues

  • disparate docker images
  • questions around how to package the code

Testing

  • Not sure how to test this. The container runs and I could import openfold; a training run would be the ideal test.

Tests fail with:

FAILED openfold3/tests/test_utils.py::TestUtils::test_chunk_layer_dict - AssertionError: tensor(False) is not true
FAILED openfold3/tests/test_utils.py::TestUtils::test_chunk_layer_tensor - AssertionError: tensor(False) is not true

These tests also fail on main, so the failures are not related to my changes.

Other Notes

Manually triggered the gh workflow

https://github.com/aqlaboratory/openfold-3/actions/runs/20098817240

Dockerhub cache

https://github.com/aqlaboratory/openfold-3/actions/runs/20101218016 ✅ but slow, 1 h 17 min
https://github.com/aqlaboratory/openfold-3/actions/runs/20133611865 ✅ 23 min, looks like some cache is working!

GHCR cache
https://github.com/aqlaboratory/openfold-3/actions/runs/20135233502 ✅ GHCR cache, 50 min
https://github.com/aqlaboratory/openfold-3/actions/runs/20136728861 ✅ GHCR cache, 15 min – of which 14 s (!!!) were building the image

GHCR cache + parallel re-usable workflows

https://github.com/aqlaboratory/openfold-3/actions/runs/20190734384/attempts/1 ✅ 32min cold, 12 min warm

Final version

We spin up N aws-runners, which can run multiple builds (different base images, etc.). Currently only CUDA 12.1.1 is built, but this can easily be extended.
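
As a rough illustration of how the fan-out over base images might look, the sketch below emits one job line per base-image tag. It is not the workflow from this PR: the second tag is a made-up placeholder, and the GHCR path and the `plan_builds` helper are assumptions.

```shell
#!/bin/sh
# Base-image tags to build. Only the first appears in the PR; the second
# is a made-up placeholder showing how the list could grow.
BASE_TAGS="12.1.1-cudnn8-devel-ubuntu22.04 example-other-base-tag"

# Hypothetical GHCR repository path used for the cache tags.
CACHE_REPO="ghcr.io/aqlaboratory/openfold-3"

# Emit one "<base tag> <cache ref>" line per base image, so every
# parallel job reads and writes its own cache tag.
plan_builds() {
    for tag in $BASE_TAGS; do
        printf '%s %s:cache-%s\n' "$tag" "$CACHE_REPO" "$tag"
    done
}

plan_builds
```

Keeping one cache tag per base image means builds for different bases do not overwrite each other's cached layers.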

[Screenshot taken 2025-12-13 at 12:11:30]

@jandom jandom requested a review from jnwei December 9, 2025 15:59
@jandom jandom self-assigned this Dec 9, 2025
Contributor

@jnwei jnwei left a comment

Overall this looks good. We have a tests Dockerfile that extends the docker image with the packages needed for running tests. Perhaps you could also have a look at that Dockerfile and see if any similar optimizations could be made there.

@jandom jandom marked this pull request as ready for review December 10, 2025 10:09
@jandom jandom requested a review from jnwei December 10, 2025 10:09

  - name: Build test layers on Docker image
-   run: docker build -t openfold3-test-runner -f openfold3/tests/Dockerfile .
+   run: docker build --target test -t openfold-docker:test -f docker/Dockerfile .
Collaborator Author

Harmonized to match what's in DOCKER.md

Comment on lines +96 to +103
# Test stage - build on devel layer with test dependencies
FROM devel AS test

COPY environments/requirements-test.txt /opt/openfold3/requirements-test.txt

WORKDIR /opt/openfold3
RUN pip install -r requirements-test.txt
RUN pip install --force-reinstall --no-deps .
Collaborator Author

I just added 'test' as another target within the same Dockerfile, one Dockerfile less!

Comment on lines +22 to +25
docker build \
-f docker/development/Dockerfile \
--target test \
-t openfold-docker:test .
Collaborator Author

The original Dockerfile now has two targets: devel and test

@jandom jandom requested a review from vinay-swamy December 10, 2025 10:18
Collaborator Author

jandom commented Dec 10, 2025

All the tests are passing on 2xlarge, just had to beef up the disk space – saving mooneeyy 💸

@jandom jandom changed the title from "build: docker layering improvements" to "build(docker) layering improvements" Dec 11, 2025
@jandom jandom removed the request for review from vinay-swamy December 11, 2025 15:23
@jandom jandom changed the title from "build(docker) layering improvements" to "build(docker): layering improvements" Dec 11, 2025
@jandom jandom changed the title from "build(docker): layering improvements" to "build(docker): layering and caching improvements" Dec 11, 2025
Contributor

@jnwei jnwei left a comment

Overall this looks great, thank you so much for cleaning up the docker workflows!

Some final comments / suggestions in thread / on the PR description.

PR description:

  • Could you add a sentence about how GHCR is used to cache previous docker builds?
  • Are there still issues with the 2 test_utils.py tests that are mentioned? Looking at the action logs, it looks like the tests have all been passing.
  • Possible typo: "We spin up N was-runners, which can run multiple builds...": was-runners should be aws-runners, right?

Collaborator Author

jandom commented Dec 15, 2025

Possible typo: "We spin up N was-runners, which can run multiple builds...": was-runners should be aws-runners, right?

Fixed, thanks

Are there still issues with the 2 test_utils.py tests that are mentioned? Looking at the action logs, it looks like the tests have all been passing.

Ah, you're right... ugh this gives me the usual fear that it's a flaky test... let's wait and see

Could you add a sentence about how GHCR is used to cache previous docker builds?

Updated.

Let's brainstorm some comms about this; the consortium members might be interested.

@jandom jandom merged commit 59175f0 into main Dec 15, 2025
2 checks passed
@jandom jandom deleted the jandom/2025-12/build/docker-layering-improvements branch December 15, 2025 09:48