
scheduler: Monitor nodes for excessive failures#1552

Merged
aluzzardi merged 1 commit into moby:master from aaronlehmann:faulty-nodes
Oct 21, 2016

Conversation

@aaronlehmann
Collaborator

@aaronlehmann aaronlehmann commented Sep 20, 2016

There's a bad failure mode when a node ends up rejecting or failing all
tasks. The scheduler will continue to prefer this node, because it's the
least loaded. It ends up in a loop where it's assigning a task to this
node forever, and it always gets rejected.

To avoid this, have the scheduler track recent failures for each service
on each node. If a service has failed 5 times in the last 5 minutes, we
take that into account in the function that sorts nodes, and prefer
nodes that have fewer recent failures.

This is a rough heuristic, but keeps the scheduler simple, and should
fix the vast majority of cases where things grind to a halt due to a
faulty node.

cc @aluzzardi @dongluochen

@codecov-io

codecov-io commented Sep 20, 2016

Current coverage is 57.06% (diff: 77.14%)

Merging #1552 into master will increase coverage by 0.07%

@@             master      #1552   diff @@
==========================================
  Files            91         91          
  Lines         14745      14757    +12   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           8403       8421    +18   
+ Misses         5244       5239     -5   
+ Partials       1098       1097     -1   


Powered by Codecov. Last update 8be60e1...830bc62

@aaronlehmann
Collaborator Author

ping - any comments on this proof of concept?

@dongluochen
Contributor

I'm a little worried this might be premature optimization. We don't know the reasons for those failures in the swarm2k test. There are obvious failure causes where bouncing tasks doesn't help; for example, nodes cannot pull the image because the image doesn't exist.

The scheduler will continue to prefer this node, because it's the least loaded.

If we can add some randomness to select from a set of good nodes, rather than only the single best node, that may partially relieve the problem.

@aaronlehmann
Collaborator Author

I'm a little worried this might be premature optimization. We don't know the reasons for those failures in the swarm2k test. There are obvious failure causes where bouncing tasks doesn't help; for example, nodes cannot pull the image because the image doesn't exist.

When I looked at the files associated with that test, I saw that some nodes had containers fail with weird containerd errors. It could be a symptom of a bad Docker installation or a kernel problem. Even if that's not actually what was happening in that case, it's easy to imagine a broken setup that can't successfully run containers. If you scale to enough nodes, it seems inevitable you'll end up with at least one.

If we can add some randomness to select from a set of good nodes, rather than only the single best node, that may partially relieve the problem.

I agree that's another way to solve the problem. I discussed that with @aluzzardi a few months ago and he preferred to keep the scheduler deterministic. IIRC, he was concerned that there would be some outliers that cause problems and confuse people.

To successfully work around this issue with randomness, you have to sometimes select a node that's not the best one. For example, if only one node has 0 copies of the service, you can't select that node every time. I think this would work pretty well in a large cluster. There you care more about averages than the exact node choice. But in a cluster with two nodes, you want a replicated service with two replicas to have one copy on each node. If they sometimes randomly end up on the same node, that will be an annoying behavior which looks like a bug.

@dongluochen
Contributor

It could be a symptom of a bad Docker installation or a kernel problem.

I think we should surface this problem from a node management point of view. If some nodes have a much higher failure rate than average, they should be marked as less favorable for any task (we might need to exclude imageNotExist or other unrelated failure reasons). When the failure rate difference reaches some threshold, we can set the node to unhealthy and report it to the cluster administrator. Otherwise this bad node would be the favorite node for all services, damaging cluster performance.

@aluzzardi
Member

Design LGTM

}
}

if recentFailuresA > maxFailures || recentFailuresB > maxFailures {
Collaborator Author


Let's log if the recent failures count affected the scheduling decision.

@aaronlehmann aaronlehmann changed the title [RFC] scheduler: Monitor nodes for excessive failures scheduler: Monitor nodes for excessive failures Oct 20, 2016
@aaronlehmann
Collaborator Author

I've reorganized this a bit, added a log message when a node is treated as faulty, and added a unit test. PTAL.


// taskFailed records a task failure from a given service.
func (nodeInfo *NodeInfo) taskFailed(ctx context.Context, serviceID string) {
if nodeInfo.recentFailures == nil {
Contributor


recentFailures is initialized in the constructor and never gets reset. Is this check necessary?

Collaborator Author


It was paranoia. We've had some issues in the scheduler before with nil pointer dereferences, due to things bypassing the constructor. It looks like this is not an issue anymore. Removed the check, and also the one in countRecentFailures.

}

recentFailureCount := len(nodeInfo.recentFailures[serviceID])
for i := len(nodeInfo.recentFailures[serviceID]) - 1; i >= 0; i-- {
Contributor


len(nodeInfo.recentFailures[serviceID]) -> recentFailureCount.

Collaborator Author


Fixed.

expired++
}

if len(nodeInfo.recentFailures[serviceID]) == maxFailures-1 {
Contributor


I think it should fix recentFailures before checking.

    nodeInfo.recentFailures[serviceID] = append(nodeInfo.recentFailures[serviceID][expired:], now)

    if len(nodeInfo.recentFailures[serviceID]) == maxFailures {
        log.G(ctx).Warnf("underweighting node %s for service %s because it experienced %d failures or rejections within %s", nodeInfo.ID, serviceID, maxFailures, monitorFailures.String())
    }

Collaborator Author


I think it should fix recentFailures before checking.

I did it this way so it would only log when crossing the threshold of the number of failures that downranks the node. Otherwise, it would continue logging if more tasks fail on that node.

Contributor


If you have maxFailures-1 entries in the map from long ago, this would trigger the warning message where it shouldn't.

Collaborator Author


This was supposed to be len(nodeInfo.recentFailures[serviceID])-expired. Fixed.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
@dongluochen
Contributor

LGTM

@aluzzardi
Member

LGTM
