
scheduler: Monitor nodes for excessive failures#1552

Merged
aluzzardi merged 1 commit into moby:master from aaronlehmann:faulty-nodes
Oct 21, 2016

Conversation

@aaronlehmann
Collaborator

@aaronlehmann aaronlehmann commented Sep 20, 2016

There's a bad failure mode when a node ends up rejecting or failing all
tasks. The scheduler will continue to prefer this node, because it's the
least loaded. It ends up in a loop where it's assigning a task to this
node forever, and it always gets rejected.

To avoid this, have the scheduler track recent failures for each service
on each node. If a service has failed 5 times in the last 5 minutes, we
take that into account in the function that sorts nodes, and prefer
nodes that have fewer recent failures.

This is a rough heuristic, but keeps the scheduler simple, and should
fix the vast majority of cases where things grind to a halt due to a
faulty node.

cc @aluzzardi @dongluochen

@codecov-io

codecov-io commented Sep 20, 2016

Current coverage is 57.06% (diff: 77.14%)

Merging #1552 into master will increase coverage by 0.07%

@@             master      #1552   diff @@
==========================================
  Files            91         91          
  Lines         14745      14757    +12   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           8403       8421    +18   
+ Misses         5244       5239     -5   
+ Partials       1098       1097     -1   


Powered by Codecov. Last update 8be60e1...830bc62

@aaronlehmann
Collaborator Author

ping - any comments on this proof of concept?

@dongluochen
Contributor

I'm a little worried this might be premature optimization. We don't know the reasons for those failures in the swarm2k test. There are obvious failure causes where bouncing tasks doesn't help; for example, nodes cannot pull the image because the image doesn't exist.

The scheduler will continue to prefer this node, because it's the least loaded.

If we can add some randomness to select from a set of good nodes, rather than only the single best node, that may partially relieve the problem.

@aaronlehmann
Collaborator Author

I'm a little worried this might be premature optimization. We don't know the reasons for those failures in the swarm2k test. There are obvious failure causes where bouncing tasks doesn't help; for example, nodes cannot pull the image because the image doesn't exist.

When I looked at the files associated with that test, I saw that some nodes had containers fail with weird containerd errors. It could be a symptom of a bad Docker installation or a kernel problem. Even if that's not actually what was happening in that case, it's easy to imagine a broken setup that can't successfully run containers. If you scale to enough nodes, it seems inevitable you'll end up with at least one.

If we can add some randomness to select from a set of good nodes, rather than only the single best node, that may partially relieve the problem.

I agree that's another way to solve the problem. I discussed that with @aluzzardi a few months ago and he preferred to keep the scheduler deterministic. IIRC, he was concerned that there would be some outliers that cause problems and confuse people.

To successfully work around this issue with randomness, you have to sometimes select a node that's not the best one. For example, if only one node has 0 copies of the service, you can't select that node every time. I think this would work pretty well in a large cluster. There you care more about averages than the exact node choice. But in a cluster with two nodes, you want a replicated service with two replicas to have one copy on each node. If they sometimes randomly end up on the same node, that will be an annoying behavior which looks like a bug.

@dongluochen
Contributor

It could be a symptom of a bad Docker installation or a kernel problem.

I think we should surface this problem from a node management point of view. If some nodes have a much higher failure rate than average, they should be marked as less favorable for any task (we might need to exclude imageNotExist or other unrelated failure reasons). When the failure rate difference reaches some threshold, we can set the node to unhealthy and report it to the cluster administrator. Otherwise this bad node would be the favorite node for all services, damaging cluster performance.

@aluzzardi
Member

Design LGTM

}
}

if recentFailuresA > maxFailures || recentFailuresB > maxFailures {
Collaborator Author


Let's log if the recent failures count affected the scheduling decision.

@aaronlehmann aaronlehmann changed the title [RFC] scheduler: Monitor nodes for excessive failures scheduler: Monitor nodes for excessive failures Oct 20, 2016
@aaronlehmann
Collaborator Author

I've reorganized this a bit, added a log message when a node is treated as faulty, and added a unit test. PTAL.


// taskFailed records a task failure from a given service.
func (nodeInfo *NodeInfo) taskFailed(ctx context.Context, serviceID string) {
if nodeInfo.recentFailures == nil {
Contributor


recentFailures is initialized in the constructor and never gets reset. Is this check necessary?

Collaborator Author


It was paranoia. We've had some issues in the scheduler before with nil pointer dereferences, due to things bypassing the constructor. It looks like this is not an issue anymore. Removed the check, and also the one in countRecentFailures.

}

recentFailureCount := len(nodeInfo.recentFailures[serviceID])
for i := len(nodeInfo.recentFailures[serviceID]) - 1; i >= 0; i-- {
Contributor


len(nodeInfo.recentFailures[serviceID]) -> recentFailureCount.

Collaborator Author


Fixed.

expired++
}

if len(nodeInfo.recentFailures[serviceID]) == maxFailures-1 {
Contributor


I think it should fix recentFailures before checking.

    nodeInfo.recentFailures[serviceID] = append(nodeInfo.recentFailures[serviceID][expired:], now)

    if len(nodeInfo.recentFailures[serviceID]) == maxFailures {
        log.G(ctx).Warnf("underweighting node %s for service %s because it experienced %d failures or rejections within %s", nodeInfo.ID, serviceID, maxFailures, monitorFailures.String())
    }

Collaborator Author


I think it should fix recentFailures before checking.

I did it this way so it would only log when crossing the threshold of the number of failures that downranks the node. Otherwise, it would continue logging if more tasks fail on that node.

Contributor


If you have maxFailures-1 entries in the map from long ago, this would trigger the warning message where it shouldn't.

Collaborator Author


This was supposed to be len(nodeInfo.recentFailures[serviceID])-expired. Fixed.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
@dongluochen
Contributor

LGTM

@aluzzardi
Member

LGTM
