In #6110 (comment), we found that workers were running themselves out of memory to the point where the machines became unresponsive. Because the memory limit in the Nanny is implemented at the application level, and in a periodic callback no less, there's nothing stopping workers from successfully allocating more memory than they're allowed to, as long as the Nanny doesn't catch them.
And as it turns out, if you allocate enough memory that you start heavily swapping (my guess, unconfirmed), but not so much that you get OOM-killed by the OS, you can effectively lock up the Nanny (and worker) Python processes, so the bad worker never gets caught and everything just hangs. A hard memory limit is an important failsafe for stability, precisely to un-stick this sort of situation.
A less brittle solution than this periodic callback might be to use the OS to enforce hard limits.
The logical approach would just be resource.setrlimit(resource.RLIMIT_RSS, memory_limit_in_bytes). However, it turns out that RLIMIT_RSS is not supported on newer Linux kernels. The solution nowadays appears to be cgroups.
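For reference, the setrlimit call would look something like this (the helper name is made up). The call succeeds on Linux, but since kernel 2.4.30 the RSS limit is not enforced, so it's effectively a no-op there, which is exactly why cgroups come up:

```python
import resource

def set_rss_limit(limit_bytes: int) -> None:
    """Set the soft RSS limit for this process (hypothetical helper).

    On Linux kernels since 2.4.30 RLIMIT_RSS is accepted but no longer
    enforced, so this silently does nothing useful there.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_RSS)
    resource.setrlimit(resource.RLIMIT_RSS, (limit_bytes, hard))

set_rss_limit(2 * 1024**3)  # "limit" RSS to 2 GiB
```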
Also relevant: https://jvns.ca/blog/2017/02/17/mystery-swap, https://unix.stackexchange.com/a/621576.
We could use memory.memsw.limit_in_bytes to limit total RAM+swap usage, or memory.limit_in_bytes to limit just RAM usage, or some smart combo of both. (Allowing a little swap might still be good for unmanaged memory.)
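As a rough sketch of what the Nanny could do at worker startup instead of polling, assuming cgroup v1 with the memory controller mounted at the usual path and sufficient privileges (the helper name and the `dask-worker-0` directory are hypothetical; the control-file names are the real cgroup-v1 ones):

```python
import os

def cap_worker_memory(cgroup_dir: str, pid: int, ram_bytes: int, swap_bytes: int = 0) -> None:
    # Hypothetical sketch: cap a worker's memory via a cgroup-v1 memory
    # controller directory, e.g. /sys/fs/cgroup/memory/dask-worker-0.
    # Requires root or a delegated cgroup.
    def write(name, value):
        with open(os.path.join(cgroup_dir, name), "w") as f:
            f.write("%d" % value)

    # RAM-only cap; going over it forces reclaim/swap, then the OOM killer.
    write("memory.limit_in_bytes", ram_bytes)
    if swap_bytes:
        # Total RAM+swap cap; must be >= memory.limit_in_bytes.
        write("memory.memsw.limit_in_bytes", ram_bytes + swap_bytes)
    # Move the worker process into the cgroup.
    write("cgroup.procs", pid)
```

Against a real cgroup directory the kernel validates and enforces these writes; the point is that the limit then holds even if the Nanny itself is wedged.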
Obviously, this whole discussion is Linux-specific. I haven't found (or tried that hard to find) macOS and Windows approaches; I think there might be something for Windows, but it sounds like probably not for macOS. We can always keep the current periodic-callback behavior around for them, though.
cc @fjetter