scheduler: Monitor nodes for excessive failures #1552
Conversation
Force-pushed 4d1e0e8 to 297a0ef
Current coverage is 57.06% (diff: 77.14%)

```
@@           master   #1552    diff @@
======================================
  Files          91      91
  Lines       14745   14757     +12
  Methods         0       0
  Messages        0       0
  Branches        0       0
======================================
+ Hits         8403    8421     +18
+ Misses       5244    5239      -5
+ Partials     1098    1097      -1
```
Force-pushed 297a0ef to 743c2b2
ping - any comments on this proof of concept?
I'm a little worried this might be premature optimization. We don't know the reason for those failures in the swarm2k test. There are obvious failure causes that bouncing tasks doesn't help with. For example, nodes cannot pull an image because the image doesn't exist.
If we could add some randomness to select from a set of good nodes, rather than only the single best node, that may partially relieve the problem.
When I looked at the files associated with that test, I saw that some nodes had containers fail with weird containerd errors. It could be a symptom of a bad Docker installation or a kernel problem. Even if that's not actually what was happening in that case, it's easy to imagine a broken setup that can't successfully run containers. If you scale to enough nodes, it seems inevitable you'll end up with at least one.
I agree that's another way to solve the problem. I discussed that with @aluzzardi a few months ago and he preferred to keep the scheduler deterministic. IIRC, he was concerned that there would be some outliers that cause problems and confuse people. To successfully work around this issue with randomness, you have to sometimes select a node that's not the best one. For example, if only one node has 0 copies of the service, you can't select that node every time. I think this would work pretty well in a large cluster. There you care more about averages than the exact node choice. But in a cluster with two nodes, you want a replicated service with two replicas to have one copy on each node. If they sometimes randomly end up on the same node, that will be an annoying behavior which looks like a bug.
I think we should surface this problem from a node management point of view. If some nodes have a much higher failure rate than average, they should be marked as less favorable for any task. (We might need to exclude imageNotExist or other unrelated failure reasons.) When the failure rate difference reaches some threshold, we can set the node to unhealthy and report it to the cluster administrator. Otherwise this bad node would be the favorite node for all services, damaging cluster performance.
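A minimal sketch of the failure-rate idea above, assuming failure rates are compared against the cluster-wide average. All names here (`failureMonitor`, `record`, `unhealthy`) and the margin value are invented for illustration; this is not swarmkit code.

```go
package main

import "fmt"

// failureMonitor tracks per-node task outcomes so that nodes failing
// much more often than the cluster average can be flagged.
type failureMonitor struct {
	failures map[string]int // nodeID -> failed tasks
	total    map[string]int // nodeID -> total tasks
}

func newFailureMonitor() *failureMonitor {
	return &failureMonitor{failures: map[string]int{}, total: map[string]int{}}
}

func (m *failureMonitor) record(nodeID string, failed bool) {
	m.total[nodeID]++
	if failed {
		m.failures[nodeID]++
	}
}

// unhealthy returns nodes whose failure rate exceeds the cluster-wide
// average failure rate by more than the given margin.
func (m *failureMonitor) unhealthy(margin float64) []string {
	var totalTasks, totalFailures int
	for id, t := range m.total {
		totalTasks += t
		totalFailures += m.failures[id]
	}
	if totalTasks == 0 {
		return nil
	}
	avg := float64(totalFailures) / float64(totalTasks)
	var bad []string
	for id, t := range m.total {
		rate := float64(m.failures[id]) / float64(t)
		if rate-avg > margin {
			bad = append(bad, id)
		}
	}
	return bad
}

func main() {
	m := newFailureMonitor()
	for i := 0; i < 10; i++ {
		m.record("node-1", false) // healthy node
		m.record("node-2", true)  // node that fails every task
	}
	fmt.Println(m.unhealthy(0.25))
}
```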
Design LGTM |
manager/scheduler/scheduler.go
Outdated

```go
if recentFailuresA > maxFailures || recentFailuresB > maxFailures {
```
Let's log if the recent failures count affected the scheduling decision.
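One way the requested logging could fit into a node comparison, shown as a hypothetical standalone sketch (the `nodeCandidate` type and `nodeLess` function are invented stand-ins, and plain `log.Printf` stands in for swarmkit's logger): fall back to least-loaded ordering, but when either node has crossed the failure threshold, prefer fewer recent failures and log that the count affected the decision.

```go
package main

import (
	"fmt"
	"log"
)

const maxFailures = 5

// nodeCandidate is a hypothetical stand-in for the scheduler's per-node info.
type nodeCandidate struct {
	id             string
	tasks          int // current task count (load)
	recentFailures int
}

// nodeLess prefers the node with fewer recent failures when either node
// has exceeded the threshold, logging that the failure count affected
// the scheduling decision; otherwise it prefers the less-loaded node.
func nodeLess(a, b nodeCandidate) bool {
	if (a.recentFailures > maxFailures || b.recentFailures > maxFailures) &&
		a.recentFailures != b.recentFailures {
		log.Printf("recent failures affected scheduling: %s=%d, %s=%d",
			a.id, a.recentFailures, b.id, b.recentFailures)
		return a.recentFailures < b.recentFailures
	}
	return a.tasks < b.tasks
}

func main() {
	healthy := nodeCandidate{id: "node-a", tasks: 3, recentFailures: 0}
	flaky := nodeCandidate{id: "node-b", tasks: 0, recentFailures: 7}
	// node-b is less loaded, but its failure count downranks it.
	fmt.Println(nodeLess(healthy, flaky))
}
```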
Force-pushed 743c2b2 to 58960bb
Force-pushed 58960bb to c93e870
I've reorganized this a bit, added a log message when a node is treated as faulty, and added a unit test. PTAL.
manager/scheduler/nodeinfo.go
Outdated

```go
// taskFailed records a task failure from a given service.
func (nodeInfo *NodeInfo) taskFailed(ctx context.Context, serviceID string) {
	if nodeInfo.recentFailures == nil {
```
recentFailures is initialized in the constructor and never gets reset. Is this check necessary?
It was paranoia. We've had some issues in the scheduler before with nil pointer dereferences, due to things bypassing the constructor. It looks like this is not an issue anymore. Removed the check, and also the one in countRecentFailures.
manager/scheduler/nodeinfo.go
Outdated

```go
recentFailureCount := len(nodeInfo.recentFailures[serviceID])
for i := len(nodeInfo.recentFailures[serviceID]) - 1; i >= 0; i-- {
```
len(nodeInfo.recentFailures[serviceID]) -> recentFailureCount.
manager/scheduler/nodeinfo.go
Outdated

```go
	expired++
}

if len(nodeInfo.recentFailures[serviceID]) == maxFailures-1 {
```
I think it should fix recentFailures before checking:

```go
nodeInfo.recentFailures[serviceID] = append(nodeInfo.recentFailures[serviceID][expired:], now)
if len(nodeInfo.recentFailures[serviceID]) == maxFailures {
	log.G(ctx).Warnf("underweighting node %s for service %s because it experienced %d failures or rejections within %s", nodeInfo.ID, serviceID, maxFailures, monitorFailures.String())
}
```
> I think it should fix recentFailures before checking.
I did it this way so it would only log when crossing the threshold of the number of failures that downranks the node. Otherwise, it would continue logging if more tasks fail on that node.
If you have maxFailures-1 entries in the map from long ago, this would trigger the warning message where it shouldn't.
This was supposed to be len(nodeInfo.recentFailures[serviceID])-expired. Fixed.
Force-pushed c93e870 to d54e9ae
There's a bad failure mode when a node ends up rejecting or failing all tasks. The scheduler will continue to prefer this node, because it's the least loaded. It ends up in a loop where it's assigning a task to this node forever, and it always gets rejected.

To avoid this, have the scheduler track recent failures for each service on each node. If a service has failed 5 times in the last 5 minutes, we take that into account in the function that sorts nodes, and prefer nodes that have fewer recent failures.

This is a rough heuristic, but keeps the scheduler simple, and should fix the vast majority of cases where things grind to a halt due to a faulty node.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Force-pushed d54e9ae to 830bc62
LGTM |
1 similar comment
LGTM |
cc @aluzzardi @dongluochen