Skip to content

Fix potential ip overlap with node LB (method 2)#41062

Closed
xinfengliu wants to merge 3 commits intomoby:masterfrom
xinfengliu:fix-ip-overlap-method2
Closed

Fix potential ip overlap with node LB (method 2)#41062
xinfengliu wants to merge 3 commits intomoby:masterfrom
xinfengliu:fix-ip-overlap-method2

Conversation

@xinfengliu
Copy link
Contributor

@xinfengliu xinfengliu commented Jun 3, 2020

Fixes #40989

Signed-off-by: Xinfeng Liu xinfeng.liu@gmail.com

- What I did
Call network removal for overlay network on container exit.

- How I did it

  • Add network removal code to tryDetachContainerFromClusterNetwork() in
    daemon/container_operations.go to make sure network removal on overlay
    nework is called when container exits. Also for swarmkit controller, we don't have to worry network removal in task shutdown.

  • Add network removal for task container startup failure.
    Update: when task container fails to startup, the CleanUp() in daemon/start.go will also be called, so the network removal will be called too.

  • Add task shutdown codes in a few other places where task container should not continue to run.

  • Add removeNetworks() in Prepare() of daemon/cluster/executor/container/controller.go.

  • Bring over PR 41011 to make an integrated fix.

- How to verify it
Added test cases in CI.

  • test NormalService
  • test ServiceStartFail
  • test ServiceCompletion
  • test ServiceExitFail
  • test ServiceHealthCheckFail
  • test ServiceHealthCheckExitEarly
  • test ServiceInvalidMount

- Description for the changelog

Fix potential IP overlapping with node LB when container exits itself or container creation/startup fails.

- A picture of a cute animal (not mandatory but encouraged)

add test cases for verifying node LB removal under some conditions.
Also add:
- `ServiceWithRestartAttempts` and `ServiceWithHealthCheck` for service creation in tests.
- Waiting conditions for `CompletedTasksCount` and `FailedTasksCount`

Signed-off-by: Xinfeng Liu <xinfeng.liu@gmail.com>
Add network removal code to `tryDetachContainerFromClusterNetwork()` in
`daemon/container_operations.go` to make sure network removal on overlay
nework is called when container exits.

Also for swarmkit controller, we don't have to worry network removal in
task shutdown.

Add task shutdown codes in a few other places where task container should not
continue to run.

Signed-off-by: Xinfeng Liu <xinfeng.liu@gmail.com>
@xinfengliu xinfengliu force-pushed the fix-ip-overlap-method2 branch from 51c4cb3 to be1b80b Compare June 4, 2020 03:13
@thaJeztah
Copy link
Member

@dperny @arkodg PTAL

@xinfengliu
Copy link
Contributor Author

xinfengliu commented Jun 5, 2020

To be more specific about the ideas of this PR:
I think the best place for network removal is in releaseNetwork() of CleanUp() in daemon/start.go. This is clean and safe. This way can avoid race conditions with swarmkit controllers inside docker engine and avoid the messy codes in swarmkit controllers for handling network removal .

Further optimize fix potential IP overlap with node LB (method 2).
- Refactor test codes
- Add testServiceInvalidMount
- No need to run removeNetworks() when task container startup failure.
- Add removeNetworks() in Prepare() of daemon/cluster/executor/container/controller.go.
- Bring over moby#41011 to make a
integrated fix.

Signed-off-by: Xinfeng Liu <xinfeng.liu@gmail.com>
@xinfengliu xinfengliu force-pushed the fix-ip-overlap-method2 branch from 94183d5 to 7100de3 Compare June 6, 2020 07:42
@thaJeztah thaJeztah added this to the 20.03.0 milestone Jun 18, 2020
@mightydok
Copy link

Will be this patch backported to 19.03?

@AkihiroSuda
Copy link
Member

@dperny PTAL?

@mightydok
Copy link

@xinfengliu @thaJeztah any updates?
We need this patch to 19.03 version.

@arkodg
Copy link
Contributor

arkodg commented Aug 5, 2020

@xinfengliu making Remove in the Cluster Executor a no-op seems like a bad idea and breaks away from the async nature of the requests performed today.
The async nature itself the reason we are seeing this issue today, but we need to find a different way to synchronize a task exit with clean up of the LB associated with it

@thaJeztah
Copy link
Member

Ah, sorry, wanted to comment as well; we went over this (and the other PR) in a maintainers meeting two weeks ago, and there definitely were concerns with the change (at least it was considered "risky" for a patch release). We discussed that, yes, there is something to solve, but unsure what the best approach was.

To add to that; it's a complicated flow of events (combination of "non-managed" (regular) containers, "managed" / swarmkit containers, which made it difficult to review and with certainty determine the impact (and possible risks that were overlooked).

Perhaps we need a separate call/meeting to go through the changes (and issues). Something like a (simple) "flow chart" of events that happen and impact the problem could help (at least from my side to "visualise" what's happening). Ideally, "domain experts" should be in that call (@dperny @xinfengliu @arkodg)

@mightydok
Copy link

@thaJeztah @xinfengliu any updates? Without this fix we have 2-3 failed deploys caused by overlaped IP addresses.

@xinfengliu
Copy link
Contributor Author

Hi @mightydok ,

As @thaJeztah mentioned at earlier time, the project maintainers had some concerns about this PR.

In fact, I'm not the original author of those docker components I made the modifications to, so I'm not sure if this PR is 100% safe.

I guess @thaJeztah is looking for the best and the safest approach to address this issue. IMHO, It's best that the original authors of related components can help on this.

@mightydok
Copy link

@xinfengliu thnx for info

@thaJeztah what the next steps for this issue? How we can help with it?

@mightydok
Copy link

@dperny @arkodg fyi

@thaJeztah thaJeztah modified the milestones: 20.10.0, 20.10.1, 20.10.2 Dec 9, 2020
@thaJeztah thaJeztah modified the milestones: 20.10.2, 20.10.3 Jan 5, 2021
@thaJeztah thaJeztah removed this from the 20.10.3 milestone Feb 2, 2021
@thaJeztah thaJeztah modified the milestones: 20.10.4, 21.xx Feb 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Node LB not deleted after container exits itself (cause IP overlapping)

5 participants