AMM: Remember replication requests that take longer than 2s #5759

# State of the art
The current AMM design is as follows:

1. The AMM runs all registered policies
2. Each policy issues ``drop`` or ``replicate`` suggestions
3. The AMM assigns each suggestion to specific workers.
Workers with the least allocated memory are the first to be chosen as recipients for replication.
Workers with the most allocated memory are the first to be chosen as targets for dropping redundant replicas.
4. The amount of allocated memory on each worker is estimated as actual reported memory + sum of the suggestions taken so far in the current AMM run
5. After all the policies have yielded their suggestions, the AMM sends them in bulk to the workers.
6. Drop suggestions are enacted immediately scheduler-side; workers may later inform the scheduler that they rejected the suggestion.
7. After 2 seconds (configurable), the AMM runs again. Any replicate commands that are still in transit are forgotten and will be issued anew if still applicable, potentially choosing a different target.
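
Steps 3 and 4 can be illustrated with a minimal sketch. This is not the actual scheduler code: the `Worker` class, `estimate`, `pick_recipient`, `pick_drop_target` and the `pending` dict are hypothetical names made up for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Worker:
    # Hypothetical stand-in for the scheduler's view of a worker
    name: str
    reported_memory: int  # bytes, as last reported to the scheduler
    keys: set[str] = field(default_factory=set)  # keys with a replica here


def estimate(worker: Worker, pending: dict[str, int]) -> int:
    # Step 4: reported memory + sum of suggestions taken so far in this run
    return worker.reported_memory + pending.get(worker.name, 0)


def pick_recipient(
    key: str, nbytes: int, workers: list[Worker], pending: dict[str, int]
) -> Worker | None:
    # Step 3: the worker with the least estimated memory that lacks the key
    candidates = [w for w in workers if key not in w.keys]
    if not candidates:
        return None
    best = min(candidates, key=lambda w: estimate(w, pending))
    pending[best.name] = pending.get(best.name, 0) + nbytes
    return best


def pick_drop_target(
    key: str, nbytes: int, workers: list[Worker], pending: dict[str, int]
) -> Worker | None:
    # Step 3: the holder with the most estimated memory; keep the last replica
    holders = [w for w in workers if key in w.keys]
    if len(holders) < 2:
        return None
    worst = max(holders, key=lambda w: estimate(w, pending))
    pending[worst.name] = pending.get(worst.name, 0) - nbytes
    return worst
```

Note that `pending` only lives for the duration of one run in this sketch, which is exactly why a replication still in transit is invisible to the next run.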

This design works well under the important assumption that either
- most replications take less than 2 seconds to complete, OR
- nothing *besides the AMM* transfers or generates memory that substantially tilts the balance of the cluster

On a busy cluster, where tasks produce and destroy large amounts of memory in a non-homogeneous fashion across the workers, the AMM will re-issue the replicate commands from the previous run that have not completed yet, potentially choosing a different target, thus
1. exacerbating a situation where network comms already constitute a bottleneck, and
2. eventually generating multiple, unnecessary extra replicas of the data. These extra replicas will later be cleaned up by the ``ReduceReplicas`` policy, *if it's enabled*.

Notably, if the AMM is disabled (which, today, is the default), then ``retire_workers`` will spawn a temporary instance of it with only the ``RetireWorker`` policy in it; there won't be a ``ReduceReplicas`` running to clear up accidental duplicates created as described above.

# Workaround
Users could increase the AMM interval to a duration that is longer than most of their network comms. This requires some tuning, and it degrades the usefulness of the AMM.

# Proposed design
Add a second, longer configurable timer to the AMM, e.g. ``replicate-timeout: 30s``.
At the end of every AMM run, record the issued ``replicate`` suggestions together with their timestamp.
At the beginning of the next AMM run,
- silently forget the suggestions that were completed successfully
- silently forget the suggestions for keys that are no longer in memory anywhere
- silently forget the suggestions where the recipient worker has left the cluster or has entered paused or retiring status
- loudly forget (log a warning) the suggestions older than the timeout that are not completed yet. They may still get completed in the future, as there's no way to cancel them.
- add the surviving suggestions to the memory estimate for the new run, as if they had been just issued
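
The bookkeeping at the beginning of each run could look roughly like the sketch below. All names here (`PendingReplica`, `prune_pending`, `who_has`, `running_workers`) are hypothetical and chosen for illustration; they are not existing scheduler APIs, and the real data structures may differ.

```python
from __future__ import annotations

import logging
from dataclasses import dataclass

logger = logging.getLogger("amm_sketch")


@dataclass(frozen=True)
class PendingReplica:
    # One replicate suggestion recorded at the end of a previous AMM run
    key: str
    recipient: str  # worker address
    nbytes: int
    issued_at: float  # timestamp in seconds


def prune_pending(
    pending: list[PendingReplica],
    now: float,
    timeout: float,  # e.g. 30.0 for ``replicate-timeout: 30s``
    who_has: dict[str, set[str]],  # key -> workers holding a replica
    running_workers: set[str],  # workers that are neither paused nor retiring
) -> tuple[list[PendingReplica], dict[str, int]]:
    """Return the surviving suggestions and the extra bytes to add per worker."""
    surviving: list[PendingReplica] = []
    extra: dict[str, int] = {}
    for p in pending:
        holders = who_has.get(p.key, set())
        if p.recipient in holders:
            continue  # completed successfully
        if not holders:
            continue  # key is no longer in memory anywhere
        if p.recipient not in running_workers:
            continue  # recipient is gone, paused or retiring
        if now - p.issued_at > timeout:
            # Cannot be cancelled; it may still complete later
            logger.warning(
                "Replication of %s to %s timed out", p.key, p.recipient
            )
            continue
        surviving.append(p)
        extra[p.recipient] = extra.get(p.recipient, 0) + p.nbytes
    return surviving, extra
```

The returned `extra` mapping would be added to each worker's memory estimate before any policy runs, so that workers with transfers in flight are not mistaken for empty ones.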

### Cost
The cost of the proposed solution is O(n) in the number of tasks being replicated, i.e. it scales linearly. The total number of tasks on the cluster is inconsequential.


CC @fjetter @gjoseph92 
