AMM: Remember replication requests that take longer than 2s #5759

@crusaderky

State of the art

The current AMM design is as follows:

  1. The AMM runs all registered policies
  2. Each policy issues drop or replicate suggestions
  3. The AMM assigns each suggestion to specific workers.
    Workers with the least allocated memory are the first to be chosen as recipients for replication.
    Workers with the most allocated memory are the first to be chosen as targets for dropping redundant replicas.
  4. The amount of allocated memory on each worker is estimated as actual reported memory + sum of the suggestions taken so far in the current AMM run
  5. After all the policies have yielded their suggestions, the AMM sends them in bulk to the workers.
  6. Drop suggestions are enacted immediately scheduler-side; workers may later inform the scheduler that they rejected the suggestion.
  7. After 2 seconds (configurable), the AMM runs again. Any replicate commands that are still in transit are forgotten and will be issued anew if still applicable, potentially to a different target.
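
Steps 3 and 4 above can be illustrated with a minimal sketch. This is not the actual AMM code: `pick_recipient`, `reported`, and `allocated` are hypothetical names, and memory is modelled as plain integers. It only shows how the per-run estimate (reported memory plus suggestions taken so far) steers successive replicas to different workers.

```python
from __future__ import annotations


def pick_recipient(
    reported: dict[str, int],   # hypothetical: worker -> memory reported by the worker
    allocated: dict[str, int],  # hypothetical: worker -> bytes assigned so far in this run
    holders: set[str],          # workers that already hold a replica of the key
    nbytes: int,                # size of the key to replicate
) -> str | None:
    """Pick the worker with the least projected memory that lacks the key."""
    candidates = [w for w in reported if w not in holders]
    if not candidates:
        return None
    # Projected memory = reported memory + suggestions taken so far in this run.
    # The worker name is used as a tie-breaker to keep the choice deterministic.
    best = min(candidates, key=lambda w: (reported[w] + allocated.get(w, 0), w))
    allocated[best] = allocated.get(best, 0) + nbytes
    return best


reported = {"a": 100, "b": 120, "c": 500}
allocated: dict[str, int] = {}
first = pick_recipient(reported, allocated, holders={"c"}, nbytes=50)
second = pick_recipient(reported, allocated, holders={"c"}, nbytes=50)
print(first, second)  # a b
```

In this sketch `allocated` starts empty on every run, which mirrors step 7: replications still in transit from the previous run are not part of the estimate.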

This design works well under the important assumption that either

  • most replications take less than 2 seconds to complete, OR
  • nothing besides the AMM transfers or generates enough memory to substantially tilt the balance of the cluster

On a busy cluster, where running tasks produce and destroy large amounts of memory unevenly across workers, the AMM will re-issue the replicate commands from the previous run that have not completed yet, potentially choosing a different target, thus

  1. exacerbating a situation where network comms already constitute a bottleneck, and
  2. eventually generating multiple, unnecessary extra replicas of the data. These extra replicas will later be cleaned up by the ReduceReplicas policy, if it's enabled.

Notably, if the AMM is disabled (which, today, is the default), then retire_workers will spawn a temporary instance of it with only the RetireWorker policy in it; there won't be a ReduceReplicas running to clear up accidental duplicates created as described above.

Workaround

Users could increase the AMM interval to a duration that is longer than most of their network comms. This requires some finesse and degrades the usefulness of the AMM.

Proposed design

Add a second, longer configurable timer to the AMM, e.g. replicate-timeout: 30s.
At the end of every AMM run, record the issued replicate suggestions together with their timestamp.
At the beginning of the next AMM run:

  • silently forget the suggestions that were completed successfully
  • silently forget the suggestions for keys that are no longer in memory anywhere
  • silently forget the suggestions where the recipient worker is no longer there or has entered paused or retiring status
  • loudly forget (log a warning) the suggestions older than the timeout that are not completed yet. They may still complete later, as there's no way to cancel them
  • add the surviving suggestions to the memory estimate for the new run, as if they had been just issued
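
The bookkeeping proposed above could look roughly like the following sketch. All names here (`PendingReplica`, `prune_pending`, `who_has`, `running_workers`) are hypothetical and chosen for illustration; none of this is existing distributed API, and worker status is reduced to a set of workers that are still eligible recipients.

```python
from __future__ import annotations

import logging
from dataclasses import dataclass

logger = logging.getLogger("amm_sketch")


@dataclass(frozen=True)
class PendingReplica:
    key: str
    recipient: str
    nbytes: int
    issued_at: float  # timestamp recorded at the end of the AMM run


def prune_pending(
    pending: list[PendingReplica],
    who_has: dict[str, set[str]],  # key -> workers currently holding it in memory
    running_workers: set[str],     # workers present and neither paused nor retiring
    now: float,
    timeout: float = 30.0,         # assumed replicate-timeout, in seconds
) -> tuple[list[PendingReplica], dict[str, int]]:
    """Return surviving suggestions and the extra bytes to add per recipient."""
    survivors: list[PendingReplica] = []
    extra: dict[str, int] = {}
    for p in pending:
        holders = who_has.get(p.key, set())
        if not holders:
            continue  # key is no longer in memory anywhere
        if p.recipient in holders:
            continue  # replication completed successfully
        if p.recipient not in running_workers:
            continue  # recipient is gone, paused, or retiring
        if now - p.issued_at > timeout:
            # Loudly forget: the transfer may still complete later
            logger.warning(
                "Replication of %r to %r not completed after %.0fs; forgetting it",
                p.key, p.recipient, timeout,
            )
            continue
        survivors.append(p)
        # Count the surviving suggestion as if it had just been issued
        extra[p.recipient] = extra.get(p.recipient, 0) + p.nbytes
    return survivors, extra


pending = [
    PendingReplica("x", "b", 10, 0.0),   # completed
    PendingReplica("y", "b", 20, 0.0),   # key vanished
    PendingReplica("z", "c", 30, 0.0),   # recipient not running
    PendingReplica("u", "b", 40, 0.0),   # timed out
    PendingReplica("v", "b", 50, 25.0),  # still in flight
]
who_has = {"x": {"a", "b"}, "z": {"a"}, "u": {"a"}, "v": {"a"}}
survivors, extra = prune_pending(pending, who_has, {"a", "b"}, now=31.0)
```

The returned `extra` mapping would seed the per-worker memory estimate of the new run. The loop visits each recorded suggestion once, so the work is proportional to the number of pending replications.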

Cost

The cost of the proposed solution is O(n), i.e. it scales linearly with the number of tasks being replicated. The total number of tasks on the cluster is inconsequential.

CC @fjetter @gjoseph92
