scheduler: Version-aware failure tracking #2253
Conversation
This addresses a longstanding TODO which I forgot about. Currently, the scheduler tracks potential failure loops by service ID. However, if the service is updated, it won't distinguish the new version of the service from the old one. This means even if the operator fixes the underlying problem, the new tasks from the updated service will still avoid the node where the problem was detected. Change the failure tracking to be sensitive to SpecVersion as well. If SpecVersion changes, we'll treat the task as a different kind of task for failure tracking purposes. Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
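The keying change described above can be sketched as follows. This is a minimal illustration with hypothetical names (`versionedService`, `recentFailures`); the real swarmkit types differ:

```go
package main

import "fmt"

// versionedService is a hypothetical key type: failures are tracked per
// (service ID, spec version) pair instead of per service ID alone, so
// tasks from an updated service start with a clean failure history.
type versionedService struct {
	serviceID   string
	specVersion uint64 // bumped on every service update
}

func main() {
	recentFailures := map[versionedService]int{}

	// The old version of the service fails repeatedly on some node.
	old := versionedService{serviceID: "svc1", specVersion: 3}
	recentFailures[old] = 5

	// After the operator updates the service, SpecVersion changes, so
	// its tasks map to a fresh key with no recorded failures.
	updated := versionedService{serviceID: "svc1", specVersion: 4}
	fmt.Println(recentFailures[old], recentFailures[updated]) // prints: 5 0
}
```

With service-ID-only keys, both versions would share the failure count of 5 and the updated service would keep avoiding the node.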
This map could technically grow without bound. When adding a new failure record, if the cleanup hasn't happened recently, loop over all the recentFailures entries and remove any that only consist of expired records. Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
This function previously could take an uninitialized NodeInfo structure and fill in whatever was missing. This is very error-prone, so remove this logic and change the only caller that relies on it to always pass in a properly initialized NodeInfo. Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
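The contract change can be illustrated with a toy version (all names here are hypothetical stand-ins, not the real `NodeInfo` API): the constructor initializes everything up front, and the mutating function assumes initialized state instead of backfilling it.

```go
package main

import "fmt"

// nodeInfo is a hypothetical stand-in for the scheduler's NodeInfo.
type nodeInfo struct {
	tasks map[string]string // task ID -> service ID
}

// newNodeInfo returns a fully initialized nodeInfo, mirroring the
// change above: the caller constructs complete state up front.
func newNodeInfo() nodeInfo {
	return nodeInfo{tasks: make(map[string]string)}
}

// addTask assumes ni is already initialized; it no longer repairs a
// nil map, which previously hid bugs in callers.
func addTask(ni nodeInfo, taskID, serviceID string) {
	ni.tasks[taskID] = serviceID
}

func main() {
	ni := newNodeInfo()
	addTask(ni, "task1", "svc1")
	fmt.Println(len(ni.tasks)) // prints: 1
}
```

The benefit is fail-fast behavior: a caller that forgets to initialize the struct panics immediately instead of silently operating on partially repaired state.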
Codecov Report
@@            Coverage Diff             @@
##           master    #2253      +/- ##
==========================================
- Coverage   66.63%   60.38%    -6.26%
==========================================
  Files          92      124      +32
  Lines       13480    20278    +6798
==========================================
+ Hits         8983    12245    +3262
- Misses       3550     6648    +3098
- Partials      947     1385     +438
cyli left a comment
LGTM with the caveat I'm not super familiar with the scheduler code.
Non-blocking: I'm not sure how feasible these are, especially given that we probably don't inject a clock into the scheduler, but would it be possible to add tests that check that:
- When a node starts up, failed tasks aren't immediately cleaned up
- Tasks are not cleared before the monitoring period is up?
I don't understand this one. Does this mean a node becoming READY? The scheduler itself starting up?
As you mentioned, we don't inject a clock. I could augment the existing tests to check that
Er sorry, again, not super familiar with the scheduler life cycle. Whenever
Oops, I realized I meant. But none of these may be possible without the clock injection, hence non-blocking. :)
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Added a commit that makes sure the values of
LGTM, thanks for adding the assertions.
- moby/swarmkit#2266 (support for templating Node.Hostname in docker executor)
- moby/swarmkit#2281 (change restore action on objects to be update, not delete/create)
- moby/swarmkit#2285 (extend watch queue with timeout and size limit)
- moby/swarmkit#2253 (version-aware failure tracking in the scheduler)
- moby/swarmkit#2275 (update containerd and port executor to container client library)
- moby/swarmkit#2292 (rename some generic resources)
- moby/swarmkit#2300 (limit the size of the external CA response)
- moby/swarmkit#2301 (delete global tasks when the node running them is deleted)

Minor cleanups, dependency bumps, and vendoring:

- moby/swarmkit#2271
- moby/swarmkit#2279
- moby/swarmkit#2283
- moby/swarmkit#2282
- moby/swarmkit#2274
- moby/swarmkit#2296 (dependency bump of etcd, go-winio)

Signed-off-by: Ying Li <ying.li@docker.com>
Upstream-commit: 4509a00
Component: engine
Change the failure tracking to be sensitive to `SpecVersion` as well. If `SpecVersion` changes, we'll treat the task as a different kind of task for failure tracking purposes. Also, add a periodic cleanup pass to make sure `recentFailures` doesn't grow without bound, and clean up `addOrUpdateNode` so it doesn't try to fix up uninitialized `NodeInfo` structs (which is fragile).

cc @aluzzardi