Make AMM memory measure configurable #6577

@crusaderky

The Active Memory Manager uses the optimistic memory (managed + unmanaged old) as a hardcoded measure to base all of its decisions upon.
This is generally a good choice in a production environment.
There are however two notable exceptions:

  1. When the process memory does not deflate on its own. This is probably fixable on Linux by resolving "distributed.nanny.environ.MALLOC_TRIM_THRESHOLD_ is ineffective" (#5971) and, to my knowledge, unfixable on macOS. It can cause the AMM to make poor decisions, e.g. move all data away from a worker because it sees huge amounts of managed memory, when in fact that memory is reusable.
  2. In unit tests. Most of the AMM tests currently run on nannies and require large amounts of data and lax constraints to be stable. The AMM stress tests are currently disabled on CI. This is not the AMM's fault (the same tests also fail with AMM disabled); rather, for the AMM to make correct decisions, the tests have to spawn 10 Nannies, which is too much for the measly GitHub CI hosts to handle. Those stress tests would be extremely valuable to run in CI, as they have detected state machine corruption and other deadlocks many times in the past. See "Remove @avoid_ci from stress tests" (#6271).

Design

Add a new setting to distributed.yaml, {distributed.scheduler.active-memory-manager.measure: optimistic}. This mirrors {distributed.worker.memory.rebalance.measure: optimistic}. Note that rebalance() is slated to be rewritten: #4906.
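
As a rough illustration of the proposal, the sketch below shows how a measure name read from configuration could select which memory figure the AMM compares across workers. It is plain Python and does not use distributed; `WorkerMemory`, `ALLOWED_MEASURES` and `read_measure` are hypothetical names, the numbers are made up, and optimistic memory is simplified to the definition given above (managed + unmanaged old).

```python
from dataclasses import dataclass


@dataclass
class WorkerMemory:
    """Hypothetical, simplified per-worker memory figures, in bytes."""

    managed: int
    unmanaged_old: int
    unmanaged_recent: int

    @property
    def optimistic(self) -> int:
        # Simplified: managed + unmanaged old, as described in this issue
        return self.managed + self.unmanaged_old

    @property
    def process(self) -> int:
        # Sum of all three figures tracked in this sketch
        return self.managed + self.unmanaged_old + self.unmanaged_recent


# Assumed set of valid values for the new setting
ALLOWED_MEASURES = {"process", "optimistic", "managed"}


def read_measure(mem: WorkerMemory, measure: str = "optimistic") -> int:
    """Return the figure named by `measure`, rejecting unknown names."""
    if measure not in ALLOWED_MEASURES:
        raise ValueError(f"unknown measure {measure!r}")
    return getattr(mem, measure)


MiB = 2**20
mem = WorkerMemory(
    managed=100 * MiB, unmanaged_old=400 * MiB, unmanaged_recent=50 * MiB
)

optimistic = read_measure(mem)  # default: matches the currently hardcoded measure
managed = read_measure(mem, "managed")  # counts managed memory only
```

With a switch like this, a test suite could select `managed` so that AMM decisions depend only on the size of the data the test creates. That usage is an assumption about how the setting might be applied, not something this issue specifies.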
