CC @fjetter @mrocklin @jrbourbeau
This issue is downstream of
- Worker state machine refactor #5046
- Active Memory Manager framework + discard excess replicas #5111
- crusaderky@cea3c4a, which merges the two
The sum of all three is currently available at https://github.com/crusaderky/distributed/tree/amm_wsmr
There is a race condition in the new remove-replicas action, highlighted by https://github.com/crusaderky/distributed/blob/5a8f4ffa3a5fcfde14d57b72a30dfeada1245332/distributed/tests/test_active_memory_manager.py#L128-L153:
- An AMM drop policy runs once to drop one of the two replicas of a key
- The same or another policy runs again, in a later AMM iteration, before the recommendation from the first run has had time to be either enacted or rejected. On this second run, it asks to drop the same key again, but for whatever reason the other worker is selected to drop its replica.
In this scenario, the AMM accidentally destroys the last replica of the key.
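The race can be illustrated with a toy model of the current acknowledge-then-remove flow. This is a hypothetical minimal sketch, not the real distributed codebase: `who_has`, `pending_drops`, and `amm_run` stand in for the scheduler-side replica bookkeeping, the in-flight drop requests, and one AMM iteration respectively.

```python
# Scheduler-side replica tracking: key "x" has two replicas.
who_has = {"x": {"w1", "w2"}}
# Drop requests sent to workers but not yet acknowledged.
pending_drops = []

def amm_run(pick):
    """One AMM iteration: suggest dropping one excess replica of "x".

    With the current algorithm, who_has is NOT touched here; it is only
    updated later, when the worker's acknowledgement arrives.
    """
    if len(who_has["x"]) > 1:
        pending_drops.append(pick(who_has["x"]))

# Run 1 picks w1; run 2 happens before the ack arrives, still sees two
# replicas in who_has, and picks the other worker.
amm_run(lambda workers: "w1")
amm_run(lambda workers: "w2")

# Both workers accept the suggestion; the acknowledgements arrive and
# the scheduler removes both workers from who_has.
for worker in pending_drops:
    who_has["x"].discard(worker)

print(who_has["x"])  # prints set() -> the last replica is gone
```

The key point is that between the two runs nothing in `who_has` records that a drop is already in flight, so the second run has no way to know it is targeting the only replica that will survive the first drop.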
This issue stems from how the current WSMR PR implements remove-replicas:
- The scheduler asks the worker to drop the key.
- If the worker accepts the suggestion, it sends back a message to the scheduler, which then removes the worker from `who_has` on the scheduler side.
The algorithm should be changed as follows:
- The scheduler asks the worker to drop the key and immediately removes the worker from `who_has` locally.
- If the worker rejects the suggestion, it sends back a message to the scheduler, which re-adds the worker to `who_has` on the scheduler side.
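The proposed remove-then-rollback flow can be sketched with the same toy model. Again this is a hypothetical illustration, with made-up names (`request_drop`, `in_flight`), assuming the removal happens optimistically at request time:

```python
# Scheduler-side replica tracking: key "x" has two replicas.
who_has = {"x": {"w1", "w2"}}
# (key, worker) pairs awaiting a possible rejection from the worker.
in_flight = []

def request_drop(pick):
    """One AMM iteration under the proposed algorithm: the worker is
    removed from who_has immediately, before any acknowledgement."""
    if len(who_has["x"]) > 1:
        worker = pick(who_has["x"])
        who_has["x"].discard(worker)  # optimistic removal
        in_flight.append(("x", worker))

# Run 1 drops the replica on w1 and updates who_has at once; run 2 now
# sees a single replica and correctly issues no second drop.
request_drop(lambda workers: "w1")
request_drop(lambda workers: "w2")
assert who_has["x"] == {"w2"}

# If w1 rejects the suggestion, its rejection message makes the
# scheduler restore the replica in who_has.
key, worker = in_flight.pop()
who_has[key].add(worker)
assert who_has["x"] == {"w1", "w2"}
```

Because `who_has` reflects in-flight drops immediately, a second AMM run can no longer select the sole surviving replica; the worst case of a rejection is a temporarily pessimistic view of where the key lives.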
In practice, this flaw will only manifest when the workers and/or network are so laggy that the full round-trip of the remove-replicas action takes longer than the interval at which the AMM runs (5s by default).