design: add topology-aware scheduling proposal#1473

Merged
aaronlehmann merged 1 commit into moby:master from aaronlehmann:topology-proposal
Feb 7, 2017

Conversation

@aaronlehmann
Collaborator

@codecov-io

codecov-io commented Aug 30, 2016

Current coverage is 54.76% (diff: 100%)

Merging #1473 into master will decrease coverage by 0.05%

```
@@             master      #1473   diff @@
==========================================
  Files           107        107
  Lines         17607      17607
  Methods           0          0
  Messages          0          0
  Branches          0          0
==========================================
- Hits           9651       9642     -9
- Misses         6779       6791    +12
+ Partials       1177       1174     -3
```


Powered by Codecov. Last update f6a679d...e5701b8


## Behavior

A simple use of this feature would be to spread tasks evenly between two

two -> multiple

@dongluochen
Contributor

Design LGTM.

A slightly more complicated use case involves hierarchical topology. Say there
are two datacenters, which each have four rows, each row having 20 racks. To
spread tasks evenly at each of these levels, there could be three `SpreadOver`
messages in `Preferences`. The first would spread over datacenters, the second

What does the math look like here? Would it minimize items per bucket then fallback to a compounding factor? When would this carryover be triggered?

Collaborator Author

Internally, the scheduler will be sorting the nodes to rank them. Basically, these rules define a ranking function for the nodes.

Using the example above, the sort function first takes into account the datacenter label. If datacenter A has 19 tasks from this service and datacenter B has 20 tasks from this service, all nodes with the datacenter=A label will rank above the datacenter B nodes at that moment. If any rows in datacenter A are more lightly populated than other rows, the nodes in those rows will be the highest ranked. And if any racks in those rows are more lightly populated than the other racks in those rows, their nodes will be ranked at the very top.

Suppose we start in this situation and want to scale up by two tasks. The first task will go to a node in datacenter A, unless that's not possible because of constraints. After the first task is scheduled, our instantaneous rankings change. Now both datacenters have the same number of tasks from the service, so the rankings instead depend on what's going on at the row level. If there is one row in either datacenter that has fewer tasks than the others, the nodes in that row will be ranked highest. And so on.
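This ranking can be sketched as a multi-level sort keyed by per-branch task counts. The following is an illustrative Python sketch, not the swarmkit implementation; the label names and data shapes are hypothetical:

```python
from collections import Counter

def rank_nodes(nodes, tasks, levels):
    """Sort nodes so that the most lightly loaded branch at each
    topology level (e.g. datacenter, then row, then rack) ranks first.

    nodes:  list of dicts mapping label name -> value
    tasks:  list of dicts with the same labels, one per running task
    levels: label names in order of precedence
    """
    # Count tasks per label-value *path* at each level, so that two
    # racks with the same name in different rows stay distinct.
    counts = [Counter(tuple(t[l] for l in levels[:i + 1]) for t in tasks)
              for i in range(len(levels))]

    def key(node):
        return tuple(counts[i][tuple(node[l] for l in levels[:i + 1])]
                     for i in range(len(levels)))

    return sorted(nodes, key=key)
```

With one task in datacenter B and none in A, every A node sorts ahead of every B node; once the datacenter counts are level, the row-level (and then rack-level) counts break the tie.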

@aaronlehmann
Collaborator Author

I discussed this with @stevvooe and concluded that there are a number of things that should change in this proposal.

  • Making the notion of precedence absolute is too rigid. If one datacenter experiences a problem that causes all of its tasks to fail immediately, tasks will keep getting assigned to it indefinitely, and they will never get scheduled to the datacenter that's still working properly.
  • There should be a notion of weighting to give relative strength to the different preferences (i.e. "I want to balance tasks over racks, but it's 20x more important to me to keep tasks balanced between datacenters"). This is distinct from weights that we may eventually add on individual label values ("prefer datacenter A over datacenter B by factor X").
  • It is difficult to describe the current proposal mathematically, and to implement it efficiently.

I think all three problems can be addressed well by moving to a weighted random selection algorithm. Based on the initial state, the user's preferences, and the number of tasks being scheduled, we could compute relative weights for the nodes that would guide the final state towards the distribution we want. Then we would assign each task with a Monte Carlo approach. I'm going to think more about an approach along these lines and change the proposal if it works out.
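The selection step of such a scheme could look like the following. This is an illustrative Python sketch of weighted random selection, not a committed design; it assumes per-node weights have already been computed from the preferences:

```python
import random

def pick_node(weights, rng=random):
    """Weighted random selection: a node with twice the weight is
    twice as likely to receive the next task.

    weights: dict mapping node name -> non-negative weight
    """
    total = sum(weights.values())
    r = rng.uniform(0, total)
    for node, weight in weights.items():
        r -= weight
        if r <= 0:
            return node
    return node  # guard against floating-point rounding at the boundary
```

Each task would be assigned by one call to `pick_node`, with the weights recomputed as the desired distribution shifts.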

@aluzzardi
Member

To be honest, not a fan.

Last time we implemented random relative weight distribution it was for the agent weighted peers. Then we hit problems. Can't remember the exact issue, it was the agent taking forever to reconnect on manager failure. It took a lot of effort from @LK4D4 and @aaronlehmann (IIRC) to debug and fix. Long story short, it took longer to implement, was harder to debug, less predictable and more difficult to tweak to real life scenarios.

In hindsight, a simple round robin would have been the better choice.

Making the notion of precedence absolute is too rigid. If one datacenter experiences a problem that causes all of its tasks to fail immediately, tasks will keep getting assigned to it indefinitely, and they will never get scheduled to the datacenter that's still working properly.

I think we're mixing two different problems. This feature is not about federation and multi-datacenter; it's just about having a user-defined spreading factor.

Nodes failing (even if it's a set of nodes sharing a common label) should be handled by the node failure system already in place. Eventually those nodes would be marked as down. We also have another proposal in place to quarantine nodes (due to the swarm2k issue).

There should be a notion of weighting to give relative strength to the different preferences

The complexity (not just in terms of implementation but also of operability) far outweighs the advantages.

I want to balance tasks over racks, but it's 20x more important to me to keep tasks balanced between datacenters

Then list datacenter first and rack later. Ask the user to list the preferences by importance rather than specifying relative weight.

It is difficult to describe the current proposal mathematically, and to implement it efficiently.

Isn't this basically a multisort function?

@aaronlehmann
Collaborator Author

Nodes failing (even if it's a set of nodes sharing a common label) should be handled by the node failure system already in place. Eventually those nodes would be marked as down. We also have another proposal in place to quarantine nodes (due to the swarm2k issue).

I don't think I'm familiar with this. Is there an issue or design document about quarantining nodes?

Then list datacenter first and rack later. Ask the user to list the preferences by importance rather than specifying relative weight.

That was the plan in this proposal, but now I think it could be too limiting. If there's some problem that prevents one of the criteria you listed from getting balanced, the scheduler will try to do that forever, and never consider any of the lower-precedence criteria that were given by the user.

Isn't this basically a multisort function?

Yes, exactly. But the moment you assign a task, it could totally change the rankings (for example, if the nodes are evenly split between two datacenters, and I assign a new task to one of the datacenters, all the nodes in the other datacenter will move to the top half of the rankings). So far I haven't found a good way to avoid doing a full scan over the nodes for every task that's assigned. It seems possible, but it gets really complicated. With the naive approach, scheduling n tasks on m nodes is O(m*n).

Just curious, do you like the proposal as currently stated?

@stevvooe
Contributor

stevvooe commented Sep 1, 2016

Last time we implemented random relative weight distribution it was for the agent weighted peers. Then we hit problems.

These problems had nothing to do with the use of random weighting, and the time spent by @aaronlehmann and @LK4D4 was mostly due to ranging errors in observable failure reporting.

@dongluochen
Contributor

But the moment you assign a task, it could totally change the rankings (for example, if the nodes are evenly split between two datacenter, and I assign a new task to one of the datacenters, all the nodes in the other datacenter will move to the top half of the rankings).

This is similar to the issue of assigning k tasks to n sorted feasible nodes. Instead of repeatedly finding the best node, I think it's OK to just assign one task to each of the top k nodes. In a dynamic environment, trying to achieve the best option is not really useful.

@aaronlehmann
Collaborator Author

This is similar to the issue of assigning k tasks to n sorted feasible nodes. Instead of repeatedly finding the best node, I think it's OK to just assign one task to each of the top k nodes. In a dynamic environment, trying to achieve the best option is not really useful.

I don't think this approach really makes sense. Suppose I have 10 nodes in each of two AZs. I start with only one replica, in AZ 1. Now I want to scale up to 10 replicas. The nodes from AZ 2 will all rank highest when I sort. So I'll end up with 9 tasks in AZ 2 and only 1 in AZ 1, which is a really bad outcome.

@dongluochen
Contributor

dongluochen commented Sep 1, 2016

I don't think this approach really makes sense. Suppose I have 10 nodes in each of two AZs. I start with only one replica, in AZ 1. Now I want to scale up to 10 replicas. The nodes from AZ 2 will all rank highest when I sort. So I'll end up with 9 tasks in AZ 2 and only 1 in AZ 1, which is a really bad outcome.

I'm thinking of a different approach. In a topology-aware scenario, a total order of the nodes isn't that useful. Instead, we may put the nodes in buckets. If there are 2 AZs, they could split the tasks (weighted or not) first, then select nodes for tasks within each AZ.

@aaronlehmann
Collaborator Author

Reopening, this is still under active discussion.

@aaronlehmann aaronlehmann reopened this Sep 1, 2016
@dongluochen
Contributor

dongluochen commented Sep 1, 2016

With the node labels, the scheduler could construct the topology as a tree. The tasks from a service are split in a top-down approach. In this example, the root assigns tasks to rows, the rows assign tasks to racks, and the racks then select nodes for the tasks.

There are 2 approaches to the assignment. It could be equal at each level, or weighted (depending on how many nodes are eligible for the tasks).

[image: topology tree]
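The top-down splitting could be sketched like this. This is an illustrative Python sketch with hypothetical tree shape and node names; it splits equally at each level, with no weighting and no backtracking for resource limits:

```python
def split_tasks(n_tasks, tree):
    """Distribute n_tasks over a topology tree top-down, as evenly as
    possible at each level. Internal levels are dicts of child
    buckets; leaves are lists of node names. Returns {node: count}."""
    children = {n: None for n in tree} if isinstance(tree, list) else tree
    out = {}
    names = sorted(children)
    base, extra = divmod(n_tasks, len(names))
    for i, name in enumerate(names):
        # Each bucket gets an equal share; the remainder goes to the
        # first `extra` buckets.
        share = base + (1 if i < extra else 0)
        if children[name] is None:        # leaf: an individual node
            out[name] = share
        else:                             # internal bucket: recurse
            out.update(split_tasks(share, children[name]))
    return out
```

For example, splitting 4 tasks over two datacenters gives 2 tasks to each datacenter before racks and individual nodes are considered.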

@aaronlehmann
Collaborator Author

aaronlehmann commented Sep 2, 2016

For the sake of exploration, I implemented the deterministic approach described in the original proposal based on @dongluochen's bucket tree suggestion above: aaronlehmann@db2c93c

It's not tested or very clean or optimized yet, but it should completely handle all the cases (i.e. resource-limited cases where we have to backtrack and assign a task to a different place in the tree than we concluded was best).

I'm still not fully sure whether I prefer the deterministic or Monte Carlo approach, so I'm trying to experiment with designs and see how things turn out. The fact that I could put together this reasonably efficient implementation in a few hours, and that it's actually fairly comprehensible, makes me feel better about the deterministic approach than I felt this morning.

@aaronlehmann
Collaborator Author

Also added a stupidly simple test that spreads over two values of a label that have an uneven number of nodes associated with them: aaronlehmann@aa776e6

@aaronlehmann
Collaborator Author

BTW, while this proof of concept implementation is deterministic, I don't think it would be a very big change to add random jitter to the way tasks are distributed at each level of the tree. This could be desirable to avoid situations where we deterministically schedule tasks to the same bad node indefinitely. On the other hand, it could be undesirable in other cases ("Why did my 2-replica service get scheduled to the same node?"). In general, I think randomization would better suit large scale deployments that can average out the outliers, and determinism is better for tiny swarms. Perhaps there should be a user preference on whether to add randomness, or even a user-specified jitter factor. The downside is that providing this option locks us in to scheduling tasks in a particular way.

@aluzzardi
Member

self ping @aluzzardi

The CLI for adding placement preferences could look like this:

```bash
$ docker service update --placement-prefs spread:engine.labels.dc,spread:engine.labels.row,spread:engine.labels.rack servicename
```

Member

I noticed you're using colons here; for other flags (such as --security-opt apparmor:unconfined) we switched to using = instead. This also would be consistent with the --mount syntax, I believe.

(Just reading this now, I missed this proposal)

Collaborator Author

Changed to =

Member

Thanks!

@yank1

yank1 commented Dec 6, 2016

LGTM

@aaronlehmann
Collaborator Author

@aluzzardi: I've updated this to better match the code in #1512. I tweaked the protobuf definitions to exactly match the additions there, and updated the behavioral description.

I also renamed the document to design/topology.md, and reworded it to change it from a proposal to a specification (future tense into present, etc). I removed the alternative proposal and the CLI discussion (which is out of scope).

Are there any changes we should consider before merging this?

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
@aluzzardi
Member

Making the notion of precedence absolute is too rigid. If one datacenter experiences a problem that causes all of its tasks to fail immediately, tasks will keep getting assigned to it indefinitely, and they will never get scheduled to the datacenter that's still working properly.

Is this case handled?

Also, are we handling the "default" algorithm (HA) as a spread over node.id or is it special cased?

@aluzzardi
Member

Regarding the above, I think we could work around that by skipping empty sets. For instance, while spreading over the datacenter label, we should be able to see that there are no nodes under a given datacenter label value.

@aaronlehmann
Collaborator Author

Making the notion of precedence absolute is too rigid. If one datacenter experiences a problem that causes all of its tasks to fail immediately, tasks will keep getting assigned to it indefinitely, and they will never get scheduled to the datacenter that's still working properly.

Is this case handled?

Not explicitly. It was dropped from the design discussion in response to your earlier comments (#1473 (comment))

To the extent that individual nodes are being marked as down, the problem is handled. But the spread algorithm doesn't do any randomization or failure monitoring of its own.

Also, are we handling the "default" algorithm (HA) as a spread over node.id or is it special cased?

We should probably discuss the implementation over in #1512, but basically the way it works is we build a tree with a level for each spread directive, and spread evenly between the branches at each level. The leaves are individual nodes.

Regarding the above, I think we could work around that by skipping empty sets. For instance, while spreading over the datacenter label, we should be able to see that there are no nodes under a given datacenter label value.

I'm not sure this is necessary. We only consider the label values found among nodes that are up. So if the only place a certain label value appears is on a dead node, the scheduler won't even be aware of that value.

@aaronlehmann
Collaborator Author

Code was merged

@aaronlehmann aaronlehmann merged commit bf90af2 into moby:master Feb 7, 2017
@aaronlehmann aaronlehmann deleted the topology-proposal branch February 7, 2017 20:52
@stevvooe
Contributor

stevvooe commented Feb 7, 2017

👏

@kent-h

kent-h commented May 19, 2017

@aaronlehmann @aluzzardi @dongluochen,

Some input on the deterministic vs monte-carlo/how-to-do-scheduling discussion:

There seem to be two different use cases here, which can be summed up as:
--placement-pref 'spread='
and
--placement-req 'spread='

Placement preferences:

  • Scheduling should ignore failed/offline nodes.
  • Many tasks can be scheduled per node (by default).
  • Always yields to requirements, only serving to score requirement-satisfying nodes.

Placement requirements:

  • Should be very strictly enforced.
  • Only one task can be scheduled per node (by default).
  • This can be used for the HA scenario.
  • Always takes precedence over preferences, immediately failing nodes.

Both:

  • Choose randomly between nodes that have the same score. No unbalanced scheduling. (no 2v0 in a 2-node swarm)
  • Higher-importance constraints are ignored

There has also been a proposal for a --max-replicas-per-node moby/moby#26259 This could be strictly applied whether using preferences, requirements, or both.

I think the goal is to give users tools to decide how scheduling should work, instead of deciding for them.

Last thing: The all-powerful, all-capable, has-all-the-features scheduling algorithm can definitely do better than O(m*n). Off the top of my head:

  1. Calculate score for each node.
  2. Place all nodes in a sorted list. (Something here to randomize same-cost nodes)
  3. Pop the "best" node.
  4. Schedule task on node.
  5. Calculate new node score.
  6. Re-add node to the list (maintain randomness of same-cost nodes).
  7. Repeat 3-6 until all tasks are done, or a negative score is found.

Complexity: ~O(n*log(n) + t*log(n)) where n=nodes, t=tasks
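The pop/rescore/re-add loop maps naturally onto a binary min-heap. Here is an illustrative Python sketch (not swarmkit code) using the current task count as the score and a random tiebreaker so same-score nodes are chosen uniformly:

```python
import heapq
import random

def schedule(nodes, n_tasks, score):
    """Assign n_tasks one at a time, always to the lowest-scored node.

    Nodes sit in a min-heap; after each assignment the chosen node is
    rescored and pushed back, giving ~O(n log n + t log n) overall.
    `score(node, count)` returns the node's score given its current
    number of assigned tasks.
    """
    assignments = {n: 0 for n in nodes}
    # Random tiebreaker so equal-score nodes pop in uniform order.
    heap = [(score(n, 0), random.random(), n) for n in nodes]
    heapq.heapify(heap)
    for _ in range(n_tasks):
        _, _, node = heapq.heappop(heap)
        assignments[node] += 1
        heapq.heappush(
            heap, (score(node, assignments[node]), random.random(), node))
    return assignments
```

With `score = lambda n, c: c` this degenerates to balanced spreading: no node ends up more than one task ahead of any other.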

Am I missing anything?

@aaronlehmann
Collaborator Author

Last thing: The all-powerful, all-capable, has-all-the-features scheduling algorithm can definitely do better than O(m*n). Off the top of my head:

Calculate score for each node.
Place all nodes in a sorted list. (Something here to randomize same-cost nodes)
Pop the "best" node.
Schedule task on node.
Calculate new node score.
Re-add node to the list (maintain randomness of same-cost nodes).
Repeat 3-6 until done all tasks, or negative score found.

There are two different scenarios we can discuss: scheduling a set of unrelated tasks, or scheduling a group of tasks that can be treated as interchangeable (i.e. multiple replicas of the same service).

The former is difficult to do efficiently without constraining the algorithm. The scores of the nodes would depend on the particular task being scheduled.

If the tasks can be treated as identical, you can redefine the scheduling problem as "how many tasks should be scheduled on each node?", instead of iterated scheduling of individual tasks. This is what we do now (after a step that groups identical tasks), and it ends up looking pretty similar to what you suggested. We do a single sort, then assign tasks to nodes with the lowest "scores" until the scores all match, then do round-robin scheduling from that point if there are still more tasks to schedule.
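That fill-until-level-then-round-robin strategy could be sketched like this. An illustrative Python sketch of the idea, not the actual swarmkit code; node names and counts are hypothetical:

```python
def assign_replicas(current, n_new):
    """current maps node -> running task count. Returns how many of
    n_new identical tasks each node should get: top up the least
    loaded nodes until all counts are level, then round-robin."""
    counts = dict(current)
    nodes = sorted(counts, key=counts.get)   # single up-front sort
    add = {n: 0 for n in nodes}
    remaining = n_new
    # Level-off phase: raise the lowest-count nodes one task at a time
    # until every node has the same count (or tasks run out).
    while remaining > 0:
        low = min(counts.values())
        lows = [n for n in nodes if counts[n] == low]
        if len(lows) == len(nodes):
            break                            # already level
        for n in lows:
            if remaining == 0:
                break
            counts[n] += 1
            add[n] += 1
            remaining -= 1
    # Round-robin phase for whatever is left.
    for i in range(remaining):
        add[nodes[i % len(nodes)]] += 1
    return add
```

For example, starting from counts {a: 3, b: 1, c: 1} and adding 4 tasks, b and c each receive 2, leveling all three nodes at 3.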
