design: add topology-aware scheduling proposal#1473

Merged
aaronlehmann merged 1 commit into moby:master from aaronlehmann:topology-proposal
Feb 7, 2017

Conversation

@aaronlehmann
Collaborator

@codecov-io

codecov-io commented Aug 30, 2016

Current coverage is 54.76% (diff: 100%)

Merging #1473 into master will decrease coverage by 0.05%

```
@@             master      #1473   diff @@
==========================================
  Files           107        107
  Lines         17607      17607
  Methods           0          0
  Messages          0          0
  Branches          0          0
==========================================
- Hits           9651       9642     -9
- Misses         6779       6791    +12
+ Partials       1177       1174     -3
```


Powered by Codecov. Last update f6a679d...e5701b8


## Behavior

A simple use of this feature would be to spread tasks evenly between two

two -> multiple

@dongluochen
Contributor

Design LGTM.

A slightly more complicated use case involves hierarchical topology. Say there
are two datacenters, which each have four rows, each row having 20 racks. To
spread tasks evenly at each of these levels, there could be three `SpreadOver`
messages in `Preferences`. The first would spread over datacenters, the second

What does the math look like here? Would it minimize items per bucket then fallback to a compounding factor? When would this carryover be triggered?

Collaborator Author

Internally, the scheduler will be sorting the nodes to rank them. Basically, these rules define a ranking function for the nodes.

Using the example above, the sort function first takes into account the datacenter label. If datacenter A has 19 tasks from this service and datacenter B has 20 tasks from this service, all nodes with the datacenter=A label will rank above the datacenter B nodes at that moment. If any rows in datacenter A are more lightly populated than other rows, the nodes in those rows will be the highest ranked. And if any racks in those rows are more lightly populated than the other racks in those rows, their nodes will be ranked at the very top.

Suppose we start in this situation and want to scale up by two tasks. The first task will go to a node in datacenter A, unless that's not possible because of constraints. After the first task is scheduled, our instantaneous rankings change. Now both datacenters have the same number of tasks from the service, so the rankings instead depend on what's going on at the row level. If there is one row in either datacenter that has fewer tasks than the others, the nodes in that row will be ranked highest. And so on.
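This ranking can be sketched as a multi-level sort keyed by per-branch task counts. The following is an illustrative Python sketch, not the swarmkit implementation; the label names and data shapes are hypothetical:

```python
from collections import Counter

def rank_nodes(nodes, tasks, levels):
    """Sort nodes so that the most lightly loaded branch at each
    topology level (e.g. datacenter, then row, then rack) ranks first.

    nodes:  list of dicts mapping label name -> value
    tasks:  list of dicts with the same labels, one per running task
    levels: label names in order of precedence
    """
    # Count tasks per label-value *path* at each level, so that two
    # racks with the same name in different rows stay distinct.
    counts = [Counter(tuple(t[l] for l in levels[:i + 1]) for t in tasks)
              for i in range(len(levels))]

    def key(node):
        return tuple(counts[i][tuple(node[l] for l in levels[:i + 1])]
                     for i in range(len(levels)))

    return sorted(nodes, key=key)
```

With one task in datacenter B and none in A, every A node sorts ahead of every B node; once the datacenter counts are level, the row-level (and then rack-level) counts break the tie.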

@aaronlehmann
Collaborator Author

I discussed this with @stevvooe and concluded that there are a number of things that should change in this proposal.

  • Making the notion of precedence absolute is too rigid. If one datacenter experiences a problem that causes all of its tasks to fail immediately, tasks will keep getting assigned to it indefinitely, and they will never get scheduled to the datacenter that's still working properly.
  • There should be a notion of weighting to give relative strength to the different preferences (i.e. "I want to balance tasks over racks, but it's 20x more important to me to keep tasks balanced between datacenters"). This is distinct from weights that we may eventually add on individual label values ("prefer datacenter A over datacenter B by factor X").
  • It is difficult to describe the current proposal mathematically, and to implement it efficiently.

I think all three problems can be addressed well by moving to a weighted random selection algorithm. Based on the initial state, the user's preferences, and the number of tasks being scheduled, we could compute relative weights for the nodes that would guide the final state towards the distribution we want. Then we would assign each task with a Monte Carlo approach. I'm going to think more about an approach along these lines and change the proposal if it works out.
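The selection step of such a scheme could look like the following. This is an illustrative Python sketch of weighted random selection, not a committed design; it assumes per-node weights have already been computed from the preferences:

```python
import random

def pick_node(weights, rng=random):
    """Weighted random selection: a node with twice the weight is
    twice as likely to receive the next task.

    weights: dict mapping node name -> non-negative weight
    """
    total = sum(weights.values())
    r = rng.uniform(0, total)
    for node, weight in weights.items():
        r -= weight
        if r <= 0:
            return node
    return node  # guard against floating-point rounding at the boundary
```

Each task would be assigned by one call to `pick_node`, with the weights recomputed as the desired distribution shifts.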

@aluzzardi
Member

To be honest, not a fan.

Last time we implemented random relative weight distribution it was for the agent weighted peers. Then we hit problems. Can't remember the exact issue, it was the agent taking forever to reconnect on manager failure. It took a lot of effort from @LK4D4 and @aaronlehmann (IIRC) to debug and fix. Long story short, it took longer to implement, was harder to debug, less predictable and more difficult to tweak to real life scenarios.

In hindsight, a simple round robin would have been the better choice.

Making the notion of precedence absolute is too rigid. If one datacenter experiences a problem that causes all of its tasks to fail immediately, tasks will keep getting assigned to it indefinitely, and they will never get scheduled to the datacenter that's still working properly.

I think we're mixing two different problems. This feature is not about federation and multi-datacenter; it's just about having a user-defined spreading factor.

Nodes failing (even if it's a set of nodes sharing a common label) should be handled by the node failure system already in place. Eventually those nodes would be marked as down. We also have another proposal in place to quarantine nodes (due to the swarm2k issue).

There should be a notion of weighting to give relative strength to the different preferences

The complexity (not just in terms of implementation but also of operability) far outweighs the advantages.

I want to balance tasks over racks, but it's 20x more important to me to keep tasks balanced between datacenters

Then list datacenter first and rack later. Ask the user to list the preferences by importance rather than specifying relative weight.

It is difficult to describe the current proposal mathematically, and to implement it efficiently.

Isn't this basically a multisort function?

@aaronlehmann
Collaborator Author

Nodes failing (even if it's a set of nodes sharing a common label) should be handled by the node failure system already in place. Eventually those nodes would be marked as down. We also have another proposal in place to quarantine nodes (due to the swarm2k issue).

I don't think I'm familiar with this. Is there an issue or design document about quarantining nodes?

Then list datacenter first and rack later. Ask the user to list the preferences by importance rather than specifying relative weight.

That was the plan in this proposal, but now I think it could be too limiting. If there's some problem that prevents one of the criteria you listed from getting balanced, the scheduler will try to do that forever, and never consider any of the lower-precedence criteria that were given by the user.

Isn't this basically a multisort function?

Yes, exactly. But the moment you assign a task, it could totally change the rankings (for example, if the nodes are evenly split between two datacenters, and I assign a new task to one of the datacenters, all the nodes in the other datacenter will move to the top half of the rankings). So far I haven't found a good way to avoid doing a full scan over the nodes for every task that's assigned. It seems possible, but it gets really complicated. With the naive approach, scheduling n tasks on m nodes is O(m*n).

Just curious, do you like the proposal as currently stated?

@stevvooe
Contributor

stevvooe commented Sep 1, 2016

Last time we implemented random relative weight distribution it was for the agent weighted peers. Then we hit problems.

These problems had nothing to do with the use of random weighting, and the time spent by @aaronlehmann and @LK4D4 was mostly due to ranging errors in observable failure reporting.

@dongluochen
Contributor

But the moment you assign a task, it could totally change the rankings (for example, if the nodes are evenly split between two datacenter, and I assign a new task to one of the datacenters, all the nodes in the other datacenter will move to the top half of the rankings).

This is similar to the issue of assigning k tasks to n sorted feasible nodes. Instead of repeatedly finding the best node, I think it's OK to just assign one task to each of the top k nodes. In a dynamic environment, trying to achieve the best option is not really useful.

@aaronlehmann
Collaborator Author

This is similar to the issue of assigning k tasks to n sorted feasible nodes. Instead of repeatedly finding the best node, I think it's OK to just assign one task to each of the top k nodes. In a dynamic environment, trying to achieve the best option is not really useful.

I don't think this approach really makes sense. Suppose I have 10 nodes in each of two AZs. I start with only one replica, in AZ 1. Now I want to scale up to 10 replicas. The nodes from AZ 2 will all rank highest when I sort. So I'll end up with 9 tasks in AZ 2 and only 1 in AZ 1, which is a really bad outcome.

@dongluochen
Contributor

dongluochen commented Sep 1, 2016

I don't think this approach really makes sense. Suppose I have 10 nodes in each of two AZs. I start with only one replica, in AZ 1. Now I want to scale up to 10 replicas. The nodes from AZ 2 will all rank highest when I sort. So I'll end up with 9 tasks in AZ 2 and only 1 in AZ 1, which is a really bad outcome.

I'm thinking of a different approach. In a topology-aware scenario, a total order of the nodes isn't that useful. Instead, we may put the nodes in buckets. If there are 2 AZs, they could split the tasks (weighted or not) first, then select nodes for tasks within each AZ.

@aaronlehmann
Collaborator Author

Reopening, this is still under active discussion.

@aaronlehmann aaronlehmann reopened this Sep 1, 2016
@dongluochen
Contributor

dongluochen commented Sep 1, 2016

With the node labels, the scheduler could construct the topology as a tree. The tasks from a service are split in a top-down approach. In this example, the root assigns tasks to rows, the rows assign tasks to racks, and the racks then select nodes for the tasks.

There are 2 approaches to the assignment. It could be equal at each level, or weighted (depending on how many nodes are eligible for the tasks).

[image: topology tree]
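The top-down splitting could be sketched like this. This is an illustrative Python sketch with hypothetical tree shape and node names; it splits equally at each level, with no weighting and no backtracking for resource limits:

```python
def split_tasks(n_tasks, tree):
    """Distribute n_tasks over a topology tree top-down, as evenly as
    possible at each level. Internal levels are dicts of child
    buckets; leaves are lists of node names. Returns {node: count}."""
    children = {n: None for n in tree} if isinstance(tree, list) else tree
    out = {}
    names = sorted(children)
    base, extra = divmod(n_tasks, len(names))
    for i, name in enumerate(names):
        # Each bucket gets an equal share; the remainder goes to the
        # first `extra` buckets.
        share = base + (1 if i < extra else 0)
        if children[name] is None:        # leaf: an individual node
            out[name] = share
        else:                             # internal bucket: recurse
            out.update(split_tasks(share, children[name]))
    return out
```

For example, splitting 4 tasks over two datacenters gives 2 tasks to each datacenter before racks and individual nodes are considered.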

@aaronlehmann
Collaborator Author

aaronlehmann commented Sep 2, 2016

For the sake of exploration, I implemented the deterministic approach described in the original proposal based on @dongluochen's bucket tree suggestion above: aaronlehmann@db2c93c

It's not tested or very clean or optimized yet, but it should completely handle all the cases (i.e. resource-limited cases where we have to backtrack and assign a task to a different place in the tree than we concluded was best).

I'm still not fully sure whether I prefer the deterministic or Monte Carlo approach, so I'm trying to experiment with designs and see how things turn out. The fact that I could put together this reasonably efficient implementation in a few hours, and that it's actually fairly comprehensible, makes me feel better about the deterministic approach than I felt this morning.

@aaronlehmann
Collaborator Author

Also added a stupidly simple test that spreads over two values of a label that have an uneven number of nodes associated with them: aaronlehmann@aa776e6

@aaronlehmann
Collaborator Author

BTW, while this proof of concept implementation is deterministic, I don't think it would be a very big change to add random jitter to the way tasks are distributed at each level of the tree. This could be desirable to avoid situations where we deterministically schedule tasks to the same bad node indefinitely. On the other hand, it could be undesirable in other cases ("Why did my 2-replica service get scheduled to the same node?"). In general, I think randomization would better suit large scale deployments that can average out the outliers, and determinism is better for tiny swarms. Perhaps there should be a user preference on whether to add randomness, or even a user-specified jitter factor. The downside is that providing this option locks us in to scheduling tasks in a particular way.

@aluzzardi
Member

self ping @aluzzardi

The CLI for adding placement preferences could look like this:

```bash
$ docker service update --placement-prefs spread:engine.labels.dc,spread:engine.labels.row,spread:engine.labels.rack servicename
```

Member

I noticed you're using colons here; for other flags (such as --security-opt apparmor:unconfined) we switched to using = instead. This also would be consistent with the --mount syntax, I believe.

(Just reading this now, I missed this proposal)

Collaborator Author

Changed to =

Member

Thanks!

@yank1

yank1 commented Dec 6, 2016

LGTM

@aaronlehmann
Collaborator Author

@aluzzardi: I've updated this to better match the code in #1512. I tweaked the protobuf definitions to exactly match the additions there, and updated the behavioral description.

I also renamed the document to design/topology.md, and reworded it to change it from a proposal to a specification (future tense into present, etc). I removed the alternative proposal and the CLI discussion (which is out of scope).

Are there any changes we should consider before merging this?

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
@aluzzardi
Member

Making the notion of precedence absolute is too rigid. If one datacenter experiences a problem that causes all of its tasks to fail immediately, tasks will keep getting assigned to it indefinitely, and they will never get scheduled to the datacenter that's still working properly.

Is this case handled?

Also, are we handling the "default" algorithm (HA) as a spread over node.id or is it special cased?

@aluzzardi
Member

Regarding the above, I think we could work around that by skipping empty sets. For instance, while spreading over the datacenter label, we should be able to see that there are no nodes under a given datacenter label value.

@aaronlehmann
Collaborator Author

Making the notion of precedence absolute is too rigid. If one datacenter experiences a problem that causes all of its tasks to fail immediately, tasks will keep getting assigned to it indefinitely, and they will never get scheduled to the datacenter that's still working properly.

Is this case handled?

Not explicitly. It was dropped from the design discussion in response to your earlier comments (#1473 (comment))

To the extent that individual nodes are being marked as down, the problem is handled. But the spread algorithm doesn't do any randomization or failure monitoring of its own.

Also, are we handling the "default" algorithm (HA) as a spread over node.id or is it special cased?

We should probably discuss the implementation over in #1512, but basically the way it works is we build a tree with a level for each spread directive, and spread evenly between the branches at each level. The leaves are individual nodes.

Regarding the above, I think we could work around that by skipping empty sets. For instance, while spreading over the datacenter label, we should be able to see that there are no nodes under a given datacenter label value.

I'm not sure this is necessary. We only consider the label values found among nodes that are up. So if the only place a certain label value appears is on a dead node, the scheduler won't even be aware of that value.

@aaronlehmann
Collaborator Author

Code was merged

@aaronlehmann aaronlehmann merged commit bf90af2 into moby:master Feb 7, 2017
@aaronlehmann aaronlehmann deleted the topology-proposal branch February 7, 2017 20:52
@stevvooe
Contributor

stevvooe commented Feb 7, 2017

👏

@kent-h

kent-h commented May 19, 2017

@aaronlehmann @aluzzardi @dongluochen,

Some input on the deterministic vs monte-carlo/how-to-do-scheduling discussion:

There seem to be two different use cases here, which can be summed up as:
--placement-pref 'spread='
and
--placement-req 'spread='

Placement preferences:

  • Scheduling should ignore failed/offline nodes.
  • Many tasks can be scheduled per node (by default).
  • Always yields to requirements, only serving to score requirement-satisfying nodes.

Placement requirements:

  • Should be very strictly enforced.
  • Only one task can be scheduled per node (by default).
  • This can be used for the HA scenario.
  • Always takes precedence over preferences, immediately failing nodes.

Both:

  • Choose randomly between nodes that have the same score. No unbalanced scheduling. (no 2v0 in a 2-node swarm)
  • Higher-importance constraints are ignored

There has also been a proposal for a --max-replicas-per-node moby/moby#26259 This could be strictly applied whether using preferences, requirements, or both.

I think the goal is to give users tools to decide how scheduling should work, instead of deciding for them.

Last thing: The all-powerful, all-capable, has-all-the-features scheduling algorithm can definitely do better than O(m*n). Off the top of my head:

  1. Calculate score for each node.
  2. Place all nodes in a sorted list. (Something here to randomize same-cost nodes)
  3. Pop the "best" node.
  4. Schedule task on node.
  5. Calculate new node score.
  6. Re-add node to the list (maintain randomness of same-cost nodes).
  7. Repeat 3-6 until all tasks are done, or a negative score is found.

Complexity: ~O(n*log(n) + t*log(n)) where n=nodes, t=tasks
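The pop/rescore/re-add loop maps naturally onto a binary min-heap. Here is an illustrative Python sketch (not swarmkit code) using the current task count as the score and a random tiebreaker so same-score nodes are chosen uniformly:

```python
import heapq
import random

def schedule(nodes, n_tasks, score):
    """Assign n_tasks one at a time, always to the lowest-scored node.

    Nodes sit in a min-heap; after each assignment the chosen node is
    rescored and pushed back, giving ~O(n log n + t log n) overall.
    `score(node, count)` returns the node's score given its current
    number of assigned tasks.
    """
    assignments = {n: 0 for n in nodes}
    # Random tiebreaker so equal-score nodes pop in uniform order.
    heap = [(score(n, 0), random.random(), n) for n in nodes]
    heapq.heapify(heap)
    for _ in range(n_tasks):
        _, _, node = heapq.heappop(heap)
        assignments[node] += 1
        heapq.heappush(
            heap, (score(node, assignments[node]), random.random(), node))
    return assignments
```

With `score = lambda n, c: c` this degenerates to balanced spreading: no node ends up more than one task ahead of any other.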

Am I missing anything?

@aaronlehmann
Collaborator Author

Last thing: The all-powerful, all-capable, has-all-the-features scheduling algorithm can definitely do better than O(m*n). Off the top of my head:

Calculate score for each node.
Place all nodes in a sorted list. (Something here to randomize same-cost nodes)
Pop the "best" node.
Schedule task on node.
Calculate new node score.
Re-add node to the list (maintain randomness of same-cost nodes).
Repeat 3-6 until done all tasks, or negative score found.

There are two different scenarios we can discuss: scheduling a set of unrelated tasks, or scheduling a group of tasks that can be treated as interchangeable (i.e. multiple replicas of the same service).

The former is difficult to do efficiently without constraining the algorithm. The scores of the nodes would depend on the particular task being scheduled.

If the tasks can be treated as identical, you can redefine the scheduling problem as "how many tasks should be scheduled on each node?", instead of iterated scheduling of individual tasks. This is what we do now (after a step that groups identical tasks), and it ends up looking pretty similar to what you suggested. We do a single sort, then assign tasks to nodes with the lowest "scores" until the scores all match, then do round-robin scheduling from that point if there are still more tasks to schedule.
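That fill-until-level-then-round-robin strategy could be sketched like this. An illustrative Python sketch of the idea, not the actual swarmkit code; node names and counts are hypothetical:

```python
def assign_replicas(current, n_new):
    """current maps node -> running task count. Returns how many of
    n_new identical tasks each node should get: top up the least
    loaded nodes until all counts are level, then round-robin."""
    counts = dict(current)
    nodes = sorted(counts, key=counts.get)   # single up-front sort
    add = {n: 0 for n in nodes}
    remaining = n_new
    # Level-off phase: raise the lowest-count nodes one task at a time
    # until every node has the same count (or tasks run out).
    while remaining > 0:
        low = min(counts.values())
        lows = [n for n in nodes if counts[n] == low]
        if len(lows) == len(nodes):
            break                            # already level
        for n in lows:
            if remaining == 0:
                break
            counts[n] += 1
            add[n] += 1
            remaining -= 1
    # Round-robin phase for whatever is left.
    for i in range(remaining):
        add[nodes[i % len(nodes)]] += 1
    return add
```

For example, starting from counts {a: 3, b: 1, c: 1} and adding 4 tasks, b and c each receive 2, leveling all three nodes at 3.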
