docker rehaul: remove repo2docker and add gpu support#3161
docker rehaul: remove repo2docker and add gpu support#3161
Conversation
vanpelt
left a comment
There was a problem hiding this comment.
This is looking cool, a couple things to consider:
- The base image is really important here. When "gpu" is enabled we should default to the latest version of cuda and allow the user to specify what cuda version they want. Of course a user can override this entirely with a custom baseimage or image. I wonder if we can derive the cuda version from wandb-metadata.json or elsewhere...
- We should consider standardizing around miniconda for our builds. The benefits will be faster build times for users that have conda configs and the ability to use any python version we want without doing an entire rebuild. We're currently using miniconda in our wandb/local build.
The cool thing about conda is that we can install any version of python we want. Potentially we make early steps of the build always install the miniconda py39 distribution, but then create a conda environment later with the appropriate python version, i.e.
conda env create --name=wandb python=3.7 pip
pip install -r requirements.txt
The trick is ensuring the appropriate conda environment is activated by default, happy to discuss more here if that's interesting.
We definitely have a lot of users using conda today and I think that's growing. It's especially popular for anyone in windows land and MLFlow has standardized on it.
|
@vanpelt yeah conda is a good idea, repo2docker basically built their pip stuff on top of conda infra I think — lemme see if I can sub that in easily cuda versioning is a good note, I haven't really tested it but eg cuda/torch compatibility is definitely a common issue |
Codecov Report
@@ Coverage Diff @@
## master #3161 +/- ##
==========================================
- Coverage 80.15% 79.95% -0.21%
==========================================
Files 213 213
Lines 27875 27882 +7
==========================================
- Hits 22343 22292 -51
- Misses 5532 5590 +58
Flags with carried forward coverage won't be shown. Click here to find out more.
|
| # make sure `python` points at the right version | ||
| RUN update-alternatives --install /usr/bin/python python /usr/bin/python{py_version} 1 \ | ||
| && update-alternatives --install /usr/local/bin/python python /usr/bin/python{py_version} 1 | ||
| """ |
There was a problem hiding this comment.
We should probably keep these in a different file, docker_templates or something
There was a problem hiding this comment.
sorta prefer keeping these here for now since they are only used in once place?
| def get_current_python_version(): | ||
| full_version = sys.version.split()[0].split(".") | ||
| major = full_version[0] | ||
| version = ".".join(full_version[:2]) if len(full_version) >= 2 else major + ".0" |
There was a problem hiding this comment.
Is this just dropping the + for the dev versions of python potentially?
There was a problem hiding this comment.
dropping the + would be correct since this is getting the version mostly to get the right docker base image, and i don't think they have them for dev versions
| "Docker BuildX is not installed, for faster builds upgrade docker: https://github.com/docker/buildx#installing" | ||
| ) | ||
| requirements_line = "RUN WANDB_DISABLE_CACHE=true " | ||
| requirements_line += "pip install -r requirements.txt" |
There was a problem hiding this comment.
Doesn't this drop the use of _wandb_bootstrap? What if the repo doesn't have a requirements.txt?
There was a problem hiding this comment.
fixed and readded this in
KyleGoyette
left a comment
There was a problem hiding this comment.
Couple of questions where I'm not sure if we're doing the right thing especially around installing requirements. Also still needs some tests.
I don't think we should block this on CUDA version. We should allow users to specify, defaulting to the newest. We should be able to add the Cuda version to RunInfo and pull it down with the rest.
| """Fill in the Dockerfile templates for stage 2 of build. CPU version is built on python:slim, GPU | ||
| version is built on nvidia:cuda""" | ||
|
|
||
| python_base_image = "python:{}-slim-buster".format(py_version) # slim for running |
There was a problem hiding this comment.
We could allow users to specify a base image using the docker key of the launch spec config. in case they want to directly specify another image as base
There was a problem hiding this comment.
We may also need a windows option for windows users.
| wandb.termwarn( | ||
| "Docker BuildX is not installed, for faster builds upgrade docker: https://github.com/docker/buildx#installing" | ||
| ) | ||
| prefix = "RUN WANDB_DISABLE_CACHE=true" |
There was a problem hiding this comment.
could factor out this warn and prefix to outside of the other if statements no?
| ) | ||
| image_uri = launch_project.docker_image | ||
| _logger.info("Getting docker command...") | ||
|
|
There was a problem hiding this comment.
We should be able to apply the final build step ontop of a user provided image, where we would set the environment variables and inject the launch-metadata.json file. Then we can get rid of the above warning
There was a problem hiding this comment.
i'm gonna push this to another pr, this one is complicated enough and doing this would require totally different handling
| launch_spec.get("overrides", {}), | ||
| launch_spec.get("resource", "local"), | ||
| launch_spec.get("resource_args", {}), | ||
| launch_spec.get("cuda", False), |
There was a problem hiding this comment.
When we pull the wandb-metadata.json file we can check that for a cuda version if this is false.
KyleGoyette
left a comment
There was a problem hiding this comment.
A few small changes and some more questions. This is getting pretty close.
vanpelt
left a comment
There was a problem hiding this comment.
So much HOTNESS 🔥 . Nice work!
I can imagine simplifying this build even further in the future. Micromamba looks really sick. Instead of installing the specific python version directly in Ubuntu or Debian, we could let micromamba do it. It can handle conda envs, virtual envs, pip installations and python installations all in a tiny package. Their official docker image is 35MB 😮
| }, | ||
| "StoppingCondition": { | ||
| "MaxRuntimeInSeconds": "60" | ||
| "MaxRuntimeInSeconds": 60 |
| mkdir -p test-results | ||
| py{35,36,37,38,39,310}: python -m pytest {env:CI_PYTEST_SPLIT_ARGS:} -n={env:CI_PYTEST_PARALLEL:4} --durations=20 --junitxml=test-results/junit.xml --cov-config=.coveragerc --cov --cov-report= --no-cov-on-fail {posargs:tests/} | ||
| pylaunch: jupyter-repo2docker --no-run ./tests/fixtures/ | ||
| ; pylaunch: jupyter-repo2docker --no-run ./tests/fixtures/ |
There was a problem hiding this comment.
The caching here is busted, this can be removed.
| entity: str, project: str, run_name: str, api: Api, dir: str | ||
| ) -> Optional[str]: | ||
| metadata = api.download_url( | ||
| reqs = api.download_url( # @@@ |
There was a problem hiding this comment.
nit: Remove this comment?
|
|
||
| # remove args that were passed in for launch but not passed to sagemaker | ||
| sagemaker_args.pop("region", None) | ||
| sagemaker_args.pop("profile", None) |
There was a problem hiding this comment.
nit, there's another arg popped on line 291 if we want to group them
Fixes WB-7840
Description
--gpuflag to CLI to enable building a gpu enabled image (TODO: if rerunning a wandb run previously on gpu, set gpu on as default)metrics:
Testing
Checklist