CC @fjetter @mrocklin @jrbourbeau
This issue is downstream of
- Worker state machine refactor #5046
- Active Memory Manager framework + discard excess replicas #5111
- crusaderky@cea3c4a, which merges the two
The sum of all three is currently available at https://github.com/crusaderky/distributed/tree/amm_wsmr
There is a race condition in the new remove-replicas action, highlighted by https://github.com/crusaderky/distributed/blob/5a8f4ffa3a5fcfde14d57b72a30dfeada1245332/distributed/tests/test_active_memory_manager.py#L128-L153:
- An AMM drop policy runs once to drop one of the two replicas of a key
- The same or another policy runs again, in a later AMM iteration, before the recommendation from the first run has had time to be either enacted or rejected. On this second run, it asks to drop the same key again, but for whatever reason the other worker is selected to drop its replica.
In this scenario, the AMM accidentally destroys the last replica of the key.
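The race can be illustrated with a toy model of the current acknowledge-then-remove flow. This is a hypothetical minimal sketch, not the real distributed codebase: `who_has`, `pending_drops`, and `amm_run` stand in for the scheduler-side replica bookkeeping, the in-flight drop requests, and one AMM iteration respectively.

```python
# Scheduler-side replica tracking: key "x" has two replicas.
who_has = {"x": {"w1", "w2"}}
# Drop requests sent to workers but not yet acknowledged.
pending_drops = []

def amm_run(pick):
    """One AMM iteration: suggest dropping one excess replica of "x".

    With the current algorithm, who_has is NOT touched here; it is only
    updated later, when the worker's acknowledgement arrives.
    """
    if len(who_has["x"]) > 1:
        pending_drops.append(pick(who_has["x"]))

# Run 1 picks w1; run 2 happens before the ack arrives, still sees two
# replicas in who_has, and picks the other worker.
amm_run(lambda workers: "w1")
amm_run(lambda workers: "w2")

# Both workers accept the suggestion; the acknowledgements arrive and
# the scheduler removes both workers from who_has.
for worker in pending_drops:
    who_has["x"].discard(worker)

print(who_has["x"])  # prints set() -> the last replica is gone
```

The key point is that between the two runs nothing in `who_has` records that a drop is already in flight, so the second run has no way to know it is targeting the only replica that will survive the first drop.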
This issue stems from how the current WSMR PR implements remove-replicas:
- The scheduler asks the worker to drop the key.
- If the worker accepts the suggestion, it sends back a message to the scheduler, which then removes the worker from `who_has` on the scheduler side.
The algorithm should be changed as follows:
- The scheduler asks the worker to drop the key and immediately removes the worker from `who_has` locally.
- If the worker rejects the suggestion, it sends back a message to the scheduler, which re-adds the worker to `who_has` on the scheduler side.
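The proposed remove-then-rollback flow can be sketched with the same toy model. Again this is a hypothetical illustration, with made-up names (`request_drop`, `in_flight`), assuming the removal happens optimistically at request time:

```python
# Scheduler-side replica tracking: key "x" has two replicas.
who_has = {"x": {"w1", "w2"}}
# (key, worker) pairs awaiting a possible rejection from the worker.
in_flight = []

def request_drop(pick):
    """One AMM iteration under the proposed algorithm: the worker is
    removed from who_has immediately, before any acknowledgement."""
    if len(who_has["x"]) > 1:
        worker = pick(who_has["x"])
        who_has["x"].discard(worker)  # optimistic removal
        in_flight.append(("x", worker))

# Run 1 drops the replica on w1 and updates who_has at once; run 2 now
# sees a single replica and correctly issues no second drop.
request_drop(lambda workers: "w1")
request_drop(lambda workers: "w2")
assert who_has["x"] == {"w2"}

# If w1 rejects the suggestion, its rejection message makes the
# scheduler restore the replica in who_has.
key, worker = in_flight.pop()
who_has[key].add(worker)
assert who_has["x"] == {"w1", "w2"}
```

Because `who_has` reflects in-flight drops immediately, a second AMM run can no longer select the sole surviving replica; the worst case of a rejection is a temporarily pessimistic view of where the key lives.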
In practice, this flaw will only manifest when the workers and/or network are so laggy that the full round-trip of the remove-replicas action takes longer than the interval at which the AMM runs (5s by default).