[Proposal] Resolving the IP address contention issue

## What is the Issue?
When tasks are removed from SwarmKit, they are immediately deleted from the object store, and all the resources allocated to them (including IP addresses) are released. However, the actual task containers can take much longer to shut down while they finish their shutdown procedures, so a container may still hold its IP address after the allocator considers it free. If new tasks are created in the meantime, they may be allocated the same IP address, leading to a conflict or an inconsistent network state until the old tasks finish shutting down.
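
To make the race concrete, here is a minimal sketch in Go. It is not SwarmKit code: `Allocator`, `simulateContention`, and the addresses are hypothetical names chosen for illustration, and the allocator is assumed to hand out the first free address.

```go
package main

import "fmt"

// Allocator is a hypothetical, simplified IP allocator. It hands out the
// first free address in the pool and knows nothing about running containers.
type Allocator struct {
	pool  []string
	inUse map[string]bool
}

func NewAllocator(pool []string) *Allocator {
	return &Allocator{pool: pool, inUse: make(map[string]bool)}
}

// Allocate returns the first free address in the pool, if any.
func (a *Allocator) Allocate() (string, bool) {
	for _, ip := range a.pool {
		if !a.inUse[ip] {
			a.inUse[ip] = true
			return ip, true
		}
	}
	return "", false
}

// Release marks an address as free again.
func (a *Allocator) Release(ip string) { delete(a.inUse, ip) }

// simulateContention removes a task while its container is still shutting
// down, then creates a new task and reports whether the addresses collide.
func simulateContention() (oldIP, newIP string, conflict bool) {
	alloc := NewAllocator([]string{"10.0.0.2", "10.0.0.3"})
	oldIP, _ = alloc.Allocate()

	// The old container is still shutting down and still holds its address.
	heldByContainer := map[string]bool{oldIP: true}

	// Task removal frees the address immediately, before shutdown completes.
	alloc.Release(oldIP)

	// A new task created in the meantime receives the same address.
	newIP, _ = alloc.Allocate()
	return oldIP, newIP, heldByContainer[newIP]
}

func main() {
	oldIP, newIP, conflict := simulateContention()
	fmt.Println(oldIP, newIP, conflict) // prints "10.0.0.2 10.0.0.2 true"
}
```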

## Existing Mitigations
Two fixes were recently added to alleviate this problem:
1. Putting released IP addresses at the end of the allocation queue, so that a recently used IP address is less likely to be allocated to a new task. For deployments with rapid task creation or small subnets, this doesn't help much.
2. On the overlay network, if the same IP address does end up in use twice, it is now handled better so that the kernel configuration stays consistent across nodes. This helps when the subnet is almost exhausted, but it hides the fact that the conflict is happening, which might make the network harder to debug if issues do arise.

While these mitigations are reasonable, they don't fix the root cause, and don't guarantee that the issue will not come up.
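
A rough sketch of why the first mitigation only delays reuse, assuming a plain FIFO free list (`FIFOPool` and `allocationsUntilReuse` are hypothetical names, not SwarmKit identifiers): a released address comes back after exactly as many allocations as there are free addresses, so a small or nearly exhausted subnet cycles back to it almost immediately.

```go
package main

import "fmt"

// FIFOPool is a hypothetical free list: addresses are allocated from the
// front and released addresses are appended to the back.
type FIFOPool struct{ free []string }

// Allocate takes the address at the front of the queue, if any.
func (p *FIFOPool) Allocate() (string, bool) {
	if len(p.free) == 0 {
		return "", false
	}
	ip := p.free[0]
	p.free = p.free[1:]
	return ip, true
}

// Release puts a used address at the end of the queue.
func (p *FIFOPool) Release(ip string) { p.free = append(p.free, ip) }

// allocationsUntilReuse releases one address into a pool of poolSize
// addresses, then counts how many new allocations it takes before that
// same address is handed out again.
func allocationsUntilReuse(poolSize int) int {
	p := &FIFOPool{}
	for i := 0; i < poolSize; i++ {
		p.free = append(p.free, fmt.Sprintf("10.0.0.%d", i+2))
	}
	released, _ := p.Allocate()
	p.Release(released)
	for n := 1; ; n++ {
		ip, _ := p.Allocate()
		if ip == released {
			return n
		}
	}
}

func main() {
	fmt.Println(allocationsUntilReuse(100)) // prints 100
	fmt.Println(allocationsUntilReuse(2))   // prints 2
}
```

With only two free addresses, the second new task already receives the address of a container that may still be shutting down.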

## Long Term Fix
The long term fix involves not freeing up resources immediately when the task gets removed. This is a rough draft of the fix, which might evolve as it becomes clearer:

When a task is removed:
1. Update the desired state for the task to `REMOVED` (this is a new task state), instead of removing it from the object store
2. The dispatcher tells the agent that the desired state has been updated to `REMOVED`, and the agent is responsible for going through task shutdown
3. The task reaper only removes the task when the desired state is `REMOVED` and the actual state > `RUNNING` (the task reaper is the only place where a task should be removed)

We may or may not revert the mitigations once the long term fix is in, but we should test without the mitigations. Additionally, we will need to decide how the API deals with `REMOVED` tasks (whether they are returned by default or not). But we can get to these questions later.

[Proposal] Resolving the IP address contention issue #2407
