Pass worker resources to pod args#398

Merged
jacobtomlinson merged 2 commits into dask:main from jhamman:feature/worker-resources
Feb 3, 2022

Conversation

@jhamman (Member) commented Feb 2, 2022

This adds a new keyword argument to make_pod_spec that passes worker resources through to the dask-worker --resources option.

This avoids the pattern of manipulating the pod spec after creating it.

Before:

pod_spec = make_pod_spec(...)
pod_spec.spec.containers[0].args.extend(["--resources", "FOO=1"])

Now:

pod_spec = make_pod_spec(..., resources="FOO=1")
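To make the mapping concrete, here is a minimal sketch of how a resources kwarg can be folded into the worker's command-line args. `build_worker_args` is a hypothetical stand-in, not the actual dask-kubernetes internals:

```python
# Hypothetical sketch: fold a `resources` string into the worker container's
# args, the way make_pod_spec now does internally (illustrative only).
def build_worker_args(resources=None):
    args = ["dask-worker"]
    if resources is not None:
        # dask-worker exposes abstract resources via its --resources flag
        args.extend(["--resources", resources])
    return args

print(build_worker_args(resources="FOO=1"))
# → ['dask-worker', '--resources', 'FOO=1']
```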

@jacobtomlinson jacobtomlinson left a comment


Neat, thanks @jhamman.

It would be awesome to sync up some time and chat about what you're working on. Feel free to drop something into my calendar!

@jacobtomlinson jacobtomlinson merged commit 431c84d into dask:main Feb 3, 2022
@ddelange (Contributor) commented Mar 23, 2022

I was surprised to see this kwarg in the docs (btw, it says 2021.03 🤔) but find it not available on my machine (2022.1.0). Are there plans for a release this month?

[screenshot]

I think it's needed for Dask to understand our GPU setup. This is what we would use to schedule on a GPU machine managed by kOps:

dask_kubernetes.make_pod_spec(
    image="rapidsai/rapidsai:cuda11.5-runtime-ubuntu20.04-py3.8",
    # memory_limit="15Gi",  # not needed, dask-worker will auto-detect available RAM. using cpu requests to reserve a whole g4dn.xlarge
    cpu_request="3800m",  # 200m reservation by kube-system daemonsets
    resources="GPU=1",  # http://distributed.dask.org/en/stable/resources.html
    threads_per_worker=1,  # could try to increase with risk of OOM when multiple threads try to load different models into GPU
    extra_pod_config={
        "tolerations": [
            {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
        ],
        "nodeSelector": {"kops.k8s.io/gpu": "1"},  # kOps only https://github.com/kubernetes/kops/blob/v1.22.4/docs/gpu.md
    },
)

For non-kOps clusters, the nodeSelector is probably missing and you'd have to add nvidia.com/gpu: 1 to resources.limits instead:

dask_kubernetes.make_pod_spec(
    image="rapidsai/rapidsai:cuda11.5-runtime-ubuntu20.04-py3.8",
    # memory_limit="15Gi",  # not needed, dask-worker will auto-detect available RAM. using cpu requests to reserve a whole g4dn.xlarge
    # cpu_request="3800m",  # 200m reservation by kube-system daemonsets
    extra_container_config={
        "resources": {"limits": {"nvidia.com/gpu": "1"}, "requests": {"cpu": "3800m"}},
    },  # https://v1-22.docs.kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
    resources="GPU=1",  # http://distributed.dask.org/en/stable/resources.html
    threads_per_worker=1,  # could try to increase with risk of OOM when multiple threads try to load different models into GPU
    extra_pod_config={
        "tolerations": [
            {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
        ],
    },
)
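For intuition, an extra_container_config-style overlay amounts to a recursive dict merge onto the generated container spec. Here is a minimal sketch of that pattern (`deep_merge` is illustrative, not dask-kubernetes' own merge code):

```python
def deep_merge(base, extra):
    """Recursively overlay `extra` onto `base`, returning a new dict.
    Illustrative of how an extra_container_config-style overlay could be
    applied to a generated container spec; not the library's actual code."""
    merged = dict(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

container = {"resources": {"requests": {"cpu": "100m"}}}
extra = {"resources": {"limits": {"nvidia.com/gpu": "1"}, "requests": {"cpu": "3800m"}}}
print(deep_merge(container, extra))
# → {'resources': {'requests': {'cpu': '3800m'}, 'limits': {'nvidia.com/gpu': '1'}}}
```

Nested keys like resources.requests are merged rather than replaced wholesale, which is why the GPU limit and the cpu request can coexist in one overlay.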

@jacobtomlinson (Member)

Thanks @ddelange. Looks like the version number in the docs is broken; it's technically correct, because we are 38 commits ahead of 2021.03.0. We are also a few commits ahead of 2022.1.0. We tend to release ad hoc here whenever there is something of note to publish.

Thanks for spotting the docs issue. Please feel free to raise a PR to correct it.

@ddelange (Contributor)

Maybe versioneer doesn't find the commit of the latest tag on the branch the docs are built from, and falls back to the latest tag that is present on that branch?
[screenshot]

@ddelange (Contributor) commented Mar 23, 2022

Maybe it would be good to point Read the Docs by default to the latest stable release (instead of to HEAD, which contains unreleased features like this one)?
[screenshot]

Under the Versions tab you can then activate all previous tags:
[screenshot]

And if I'm not mistaken, all new tags will from then on automatically be marked as Active and become available in the docs version dropdown:
[screenshot]

@jacobtomlinson (Member)

Yeah, it must be a versioneer issue on RTD. We intentionally chose to point dask and distributed at stable recently, but decided to leave other projects on latest as they are released less frequently.
