build(docker): layering and caching improvements #70
jnwei
left a comment
Overall this looks good. We have a tests Dockerfile that extends the docker image with the packages needed for running tests. Perhaps you could also have a look at that Dockerfile and see if any similar optimizations could be made there.
.github/workflows/ci-test.yml
Outdated
      - name: Build test layers on Docker image
-       run: docker build -t openfold3-test-runner -f openfold3/tests/Dockerfile .
+       run: docker build --target test -t openfold-docker:test -f docker/Dockerfile .
Harmonized to match what's in DOCKER.md
docker/Dockerfile
Outdated
# Test stage - build on devel layer with test dependencies
FROM devel AS test

COPY environments/requirements-test.txt /opt/openfold3/requirements-test.txt

WORKDIR /opt/openfold3
RUN pip install -r requirements-test.txt
RUN pip install --force-reinstall --no-deps .
I just added 'test' as another target within the same Dockerfile, one Dockerfile less!
docker build \
    -f docker/development/Dockerfile \
    --target test \
    -t openfold-docker:test .
The original Dockerfile now has two targets: devel and test.
All the tests are passing on 2xlarge, just had to beef up the disk space, saving mooneeyy 💸
jnwei
left a comment
Overall this looks great, thank you so much for cleaning up the docker workflows!
Some final comments / suggestions in thread / on the PR description.
PR description:
- Could you add a sentence about how GHCR is used to cache previous docker builds?
- Are there still issues with the 2 test_utils.py tests that are mentioned? Looking at the action logs, it looks like the tests have all been passing.
- Possible typo: "We spin up N was-runners, which can run multiple builds...": was-runners should be aws-runners, right?
Fixed, thanks
Ah, you're right... ugh this gives me the usual fear that it's a flaky test... let's wait and see
Updated. Let's brainstorm some comms about this; the consortium members might be interested.
Summary
Slight change to the main Dockerfile; should probably do the Blackwell and test docker images too, for consistency. The Blackwell Dockerfile is left untouched as its own thing for now.
We continue to use dockerhub for stable/production docker image distribution.
This PR also introduces GHCR (GitHub Container Registry), which is easy to use with GH runners (self-hosted or otherwise). We use GHCR to store a special tag called cache (e.g. cache-12.1.1-cudnn8-devel-ubuntu22.04), which is not runnable but is used to cache all the intermediate builder layers. Combined with improvements to the Dockerfile structure, this cache results in much faster build times, because Docker can check that a layer already exists in GHCR and pull it instead of rebuilding.
Results
No-op, rebuild from cache: 1.2s
Change a single source: 2.4+/-0.2s (N=3)
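As a rough illustration of the registry cache described in the Summary, the sketch below assembles a `docker buildx build` command that reads from and writes to a `cache-<base tag>` ref. The GHCR image path, the helper names `cache_ref` and `build_cmd`, and the choice to print the command rather than run it are assumptions made for illustration; the actual workflow in this PR may wire the cache up differently.

```shell
# Assumed GHCR image path (not stated in the PR); override via the environment.
REGISTRY_IMAGE="${REGISTRY_IMAGE:-ghcr.io/aqlaboratory/openfold-3}"
# Dockerfile path as used in the CI step; adjust if the file lives elsewhere.
DOCKERFILE="${DOCKERFILE:-docker/Dockerfile}"

# Map a base-image tag to its cache ref, e.g.
# 12.1.1-cudnn8-devel-ubuntu22.04 -> <image>:cache-12.1.1-cudnn8-devel-ubuntu22.04
cache_ref() {
  printf '%s:cache-%s\n' "$REGISTRY_IMAGE" "$1"
}

# Print (not run) a buildx command that pulls layers from the cache ref and
# pushes the refreshed cache back; mode=max also exports intermediate stages.
build_cmd() {
  ref="$(cache_ref "$1")"
  echo "docker buildx build -f $DOCKERFILE --target $2" \
       "--cache-from type=registry,ref=$ref" \
       "--cache-to type=registry,ref=$ref,mode=max" \
       "-t openfold-docker:$2 ."
}

build_cmd 12.1.1-cudnn8-devel-ubuntu22.04 test
```

Printing the command keeps the sketch safe to run anywhere; piping the output to `sh` on a machine that is logged in to the registry would perform the actual build.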
Changes
Improve docker image build speeds by installing a bare-bones version of the repo in the builder stage and then copying over the sources in the runtime stage. With a warm cache, editing the sources still gives a sub-10-second rebuild.
See detailed notes in the PR comments.
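To make the layering idea above concrete, here is a minimal sketch that writes a hypothetical two-stage Dockerfile to a scratch file. The base image, the package name `app`, the use of `pyproject.toml`, and the stage layout are all assumptions for illustration and are not taken from the PR's actual Dockerfile; only the final `pip install --force-reinstall --no-deps .` step mirrors a line shown in the review thread.

```shell
# Write a hypothetical Dockerfile to a temporary file (not the PR's real file).
sketch="$(mktemp)"
cat > "$sketch" <<'EOF'
# builder: install dependencies against an empty stub of the package
FROM python:3.11-slim AS builder
WORKDIR /opt/app
# Only packaging metadata is copied, so source edits do not invalidate this layer.
COPY pyproject.toml ./
# Assumed package name "app"; the stub lets pip resolve and install dependencies.
RUN mkdir app && touch app/__init__.py && pip install .

# devel: reuse the installed environment, then add the real sources
FROM builder AS devel
COPY . .
# Dependencies are already present, so only the package itself is reinstalled.
RUN pip install --force-reinstall --no-deps .
EOF
echo "wrote sketch Dockerfile to $sketch"
```

The point of the split is that the expensive dependency layer depends only on the packaging metadata, so a source-only change reruns just the last two instructions.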
Related Issues
Testing
Tests are broken with
but it's also broken on main, so not related to my changes
Other Notes
Manually triggered the gh workflow
https://github.com/aqlaboratory/openfold-3/actions/runs/20098817240 ✅
Dockerhub cache
https://github.com/aqlaboratory/openfold-3/actions/runs/20101218016 ✅ but slow 1h 17min
https://github.com/aqlaboratory/openfold-3/actions/runs/20133611865 ✅ 23 min, looks like some cache is working!
GHCR cache
https://github.com/aqlaboratory/openfold-3/actions/runs/20135233502 ✅ GHCR cache, 50 min
https://github.com/aqlaboratory/openfold-3/actions/runs/20136728861 ✅ GHCR cache, 15 min – of which 14 s (!!!) were building the image
GHCR cache + parallel reusable workflows
https://github.com/aqlaboratory/openfold-3/actions/runs/20190734384/attempts/1 ✅ 32min cold, 12 min warm
Final version
We spin up N aws-runners, which can run multiple builds (different base images etc), here CUDA 12.1.1 only but can easily be extended.
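One way to picture how the set of builds could be extended is as a small matrix of base-image tags and Dockerfile targets. The sketch below only enumerates such jobs; the variable names and the `list_jobs` helper are hypothetical, and how jobs are actually assigned to runners in the workflow is not shown here.

```shell
# Hypothetical build matrix: base-image tags (space-separated) times targets.
# Only the first base tag appears in the PR; add more tags to extend it.
BASE_TAGS="${BASE_TAGS:-12.1.1-cudnn8-devel-ubuntu22.04}"
TARGETS="${TARGETS:-devel test}"

# Emit one "base_tag target cache_tag" line per build job.
list_jobs() {
  for base_tag in $BASE_TAGS; do
    for target in $TARGETS; do
      echo "$base_tag $target cache-$base_tag"
    done
  done
}

list_jobs
```

Each base tag keeps its own cache tag, so adding a second CUDA version would not evict the layers cached for the first.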