
Topology-aware scheduling#1512

Merged
aaronlehmann merged 2 commits into moby:master from aaronlehmann:topology
Feb 7, 2017

Conversation

@aaronlehmann
Collaborator

@aaronlehmann aaronlehmann commented Sep 8, 2016

This implements the proposal in #1473. Currently it's a non-randomized approach, but could be adapted to add random noise to the decisions it makes if that's the direction we decide we want to go.

There are two unit tests that validate this.

cc @dongluochen @aluzzardi

@codecov-io

codecov-io commented Sep 8, 2016

Codecov Report

Merging #1512 into master will increase coverage by 0.1%.

@@            Coverage Diff            @@
##           master    #1512     +/-   ##
=========================================
+ Coverage   53.96%   54.06%   +0.1%     
=========================================
  Files         106      108      +2     
  Lines       18411    18489     +78     
=========================================
+ Hits         9936     9997     +61     
- Misses       7269     7274      +5     
- Partials     1206     1218     +12

Powered by Codecov. Last update af3f0e1...d3b1cd8. Read the comment docs.

switch {
case len(descriptor) > len(nodeLabelPrefix) && strings.EqualFold(descriptor[:len(nodeLabelPrefix)], nodeLabelPrefix):
	if node.Spec.Annotations.Labels != nil {
		value = node.Spec.Annotations.Labels[descriptor[len(nodeLabelPrefix):]]
Contributor

IIRC, Docker engine allows labels with empty string values. That wouldn't work in this case.

Collaborator Author

In this code, an empty value would be treated as one of the values of the label. The alternative is to handle these specially somehow, but it would make the code more complicated. Any thoughts?

Contributor

One way is to keep empty value as a branch from a node (no change needed). Another way is to reject empty value, and add a log message of invalid preference label.

Collaborator Author

I agree that both of those options could make sense. I disagree about logging in this case because it would cause an enormous amount of log output unless you make sure that every node has every label that you are selecting on in any service.

I guess the question is what behavior we want. In the current code, an empty value is weighted equally with any of the other values. One advantage of doing it this way is that tasks don't become unschedulable if there are no nodes available with the labels set (you could always use constraints to limit the tasks to those nodes if you really want to).

The other approach would be to only allow scheduling the tasks on nodes that have the labels set, when the service uses those labels in a scheduling preference. This might be closer to what the user had in mind, but it's easier to end up in a bad situation (nothing can be scheduled), and it's hard to understand why that happens.

Contributor

it would cause an enormous amount of log output unless you make sure that every node has every label that you are selecting on in any service.

I agree. If we take that approach, it should be done outside of the for loop, in a label validation procedure.

I think keeping the empty value makes it simpler for now. We can wait to see if it causes problems.
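The "keep empty value as a branch" option falls out naturally from Go map semantics: a lookup on a missing key returns the zero value, which for strings is "". The sketch below illustrates that behavior with a hypothetical grouping helper (the names are not swarmkit's API):

```go
package main

import "fmt"

// groupByLabel buckets node names by the value of a given label.
// A node that lacks the label lands in the empty-string bucket,
// because indexing a Go map with a missing key returns the zero
// value (""). Missing and explicitly-empty labels are therefore
// treated the same way, matching the behavior discussed above.
func groupByLabel(nodes map[string]map[string]string, label string) map[string][]string {
	branches := make(map[string][]string)
	for name, labels := range nodes {
		value := labels[label] // "" when the label is absent
		branches[value] = append(branches[value], name)
	}
	return branches
}

func main() {
	nodes := map[string]map[string]string{
		"node1": {"az": "az1", "rack": "rack1"},
		"node2": {"rack": "rack1"}, // no "az" label at all
	}
	branches := groupByLabel(nodes, "az")
	fmt.Println(branches["az1"], branches[""]) // node2 falls in the "" branch
}
```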

	if node.Spec.Annotations.Labels != nil {
		value = node.Spec.Annotations.Labels[descriptor[len(nodeLabelPrefix):]]
	}
case len(descriptor) > len(engineLabelPrefix) && strings.EqualFold(descriptor[:len(engineLabelPrefix)], engineLabelPrefix):
Contributor

@dongluochen dongluochen Sep 16, 2016

This allows node.labels.az=az1 and engine.labels.rack=rack1 to work together to form an az.rack topology. It might make errors harder to debug.

Collaborator Author

I think this configuration should be supported. Any ideas on making it easier to debug?

Contributor

In that case we would need to ask the user to provide both the node.labels. set and the engine.labels. set. I think it's OK.

// TODO(aaronl): Support other items from constraint
// syntax like node ID, hostname, os/arch, etc?
default:
	continue
Contributor

If some nodes do not have all the preference labels, for example node1 has node.labels.az=az1 and node.labels.rack=rack1 while node2 has only node.labels.rack=rack1, then the path from the root to node2 would have one less level. I think the algorithm handles it. It may be good to add a comment here.

Collaborator Author

The path from root to node2 would have one less level.

I don't think that's right. The levels of the tree are determined by the scheduling preferences, not the nodes themselves. If a node is missing a particular label, that level of the tree just has "" for the value, but there are the same number of levels for all nodes.

I'll add a comment to clarify this.

Contributor

You are right. It's an empty value on the node.

@dongluochen
Contributor

LGTM.

@aaronlehmann
Collaborator Author

Rebased on top of #1550, and made some changes to apply that optimization inside this code. Now the leaf nodes of the tree contain heaps that store up to k nodes, where k is the number of tasks we are trying to schedule. When the scheduler calls findBestNodes, it flattens this heap to return a sorted slice of nodes. The next call to findBestNodes re-heapifies the slice to sort it again with any changes to the nodes taken into account.
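The bounded-heap idea can be sketched with container/heap: keep a max-heap ordered by cost, and evict the worst element whenever the size exceeds k, so at most k candidates are ever retained. This is an illustrative sketch under assumed names, not the PR's actual nodeHeap type:

```go
package main

import (
	"container/heap"
	"fmt"
)

// kHeap keeps at most k costs. It is a max-heap on cost, so the root
// is the worst node currently retained and can be evicted in O(log k)
// when a better candidate arrives. (Hypothetical sketch.)
type kHeap struct {
	costs []int
	k     int
}

func (h kHeap) Len() int            { return len(h.costs) }
func (h kHeap) Less(i, j int) bool  { return h.costs[i] > h.costs[j] } // max-heap
func (h kHeap) Swap(i, j int)       { h.costs[i], h.costs[j] = h.costs[j], h.costs[i] }
func (h *kHeap) Push(x interface{}) { h.costs = append(h.costs, x.(int)) }
func (h *kHeap) Pop() interface{} {
	old := h.costs
	x := old[len(old)-1]
	h.costs = old[:len(old)-1]
	return x
}

// add inserts a candidate cost, keeping only the k lowest costs seen.
func (h *kHeap) add(cost int) {
	heap.Push(h, cost)
	if h.Len() > h.k {
		heap.Pop(h) // drop the current worst
	}
}

// best drains the heap into an ascending (best-first) slice,
// analogous to flattening the leaf heap for findBestNodes.
func (h *kHeap) best() []int {
	out := make([]int, h.Len())
	for i := h.Len() - 1; i >= 0; i-- {
		out[i] = heap.Pop(h).(int) // pops worst-first, fill from the back
	}
	return out
}

func main() {
	h := &kHeap{k: 2}
	for _, c := range []int{7, 3, 9, 1} {
		h.add(c)
	}
	fmt.Println(h.best()) // [1 3]
}
```

The advantage over a plain slice is that memory stays proportional to k rather than to the total number of nodes.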

@aaronlehmann aaronlehmann changed the title [WIP] Topology-aware scheduling Topology-aware scheduling Sep 20, 2016
@aaronlehmann
Collaborator Author

Rebased.

// so on.
for dt.nodeHeap.Len() > 0 {
	heap.Pop(&dt.nodeHeap)
}
Contributor

@dongluochen dongluochen Oct 1, 2016

I feel like the heap structure is irrelevant now, because the current state of the heap cannot be used directly. On every pass we will sort all the nodes in this leaf node. Heapsort is just another sorting implementation.

I haven't checked other places using heap. If it's similar, should we change the data structure to a slice? How to implement sorting is a separate issue and we can decouple them.

@aaronlehmann
Collaborator Author

I feel like the heap structure is irrelevant now, because the current state of the heap cannot be used directly. On every pass we will sort all the nodes in this leaf node. Heapsort is just another sorting implementation.

I haven't checked other places using heap. If it's similar, should we change the data structure to a slice? How to implement sorting is a separate issue and we can decouple them.

It's useful for generating the original list of nodes. Originally this PR made a slice with all nodes and used quickselect, but @LK4D4 suggested using a heap instead. It's a little more efficient because the heap only needs to contain up to n nodes at any one time, versus the slice, which needs to start with all the nodes.

Once we've built the tree, we could switch to a slice, but I had it continue to use the heap for sorting, for consistency.

@aaronlehmann
Collaborator Author

Rebased.

@dongluochen
Contributor

Need rebase.

@aaronlehmann
Collaborator Author

Rebased, thanks.

@dongluochen
Contributor

LGTM

@ncoolz

ncoolz commented Nov 29, 2016

LGTM. When will this PR be merged?

@allencloud
Contributor

I think this needs rebase though. @aaronlehmann

@aaronlehmann
Collaborator Author

Rebased.

This didn't make it into Docker 1.13. Now that 1.14 development is happening, we can try to move it forward. The main thing we need to do is finalize the design, including the protobuf types. I'll work with @aluzzardi on this.

@yank1

yank1 commented Dec 6, 2016

LGTM

@aaronlehmann
Collaborator Author

Rebased again.

@aluzzardi @thaJeztah: Any thoughts on what the CLI for this should look like? The draft proposal had:

--placement-prefs spread=engine.labels.dc,spread=engine.labels.row,spread=engine.labels.rack

This would allow some future expansion (e.g. pack=...). However, I see two potential problems:

  • It might be weird to allow repeated "keys", and to have order matter, in something that looks like a set of k/v specifications. Really (spread, engine.labels.dc) is a tuple rather than a key and value, but I'm not sure what syntax would make this clearer without being unwieldy.
  • There's no clear way to add additional settings to the spread=... pair in the future, for example a weight.

An alternative is to have two levels of nesting with different separators, for example:

--placement-prefs 'type=spread,on=engine.labels.dc;type=spread,on=engine.labels.row;type=spread,on=engine.labels.rack'

I don't like this much though. It feels very hard to read, and ; is an unfriendly character to use in something passed on the command line (we could use something else like /, but again, there's no obvious choice).

Aaron Lehmann added 2 commits February 2, 2017 14:41
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
@aaronlehmann
Collaborator Author

ping @aluzzardi @thaJeztah @dnephin

@dnephin
Member

dnephin commented Feb 2, 2017

I think the most explicit version would be to repeat the flag. The order of flags on the command line is always significant.

--placement-prefs 'spread,engine.labels.dc'
--placement-prefs 'spread,engine.labels.dc,weight=3'
--placement-prefs 'spread,engine.labels.rack'

It is quite verbose, but I don't see any way around that. I think most service create/update command lines are already long enough that users will either write these command lines in bash scripts or use the Compose format. So the verbosity is probably not a big problem.

@aaronlehmann
Collaborator Author

Thanks for the input @dnephin. I like the simplicity of this solution. I have two concerns though:

  • The significance of ordering on the command line may not be obvious. As a user, I would find it confusing that swapping the order of two flags leads to totally different results. Maybe calling it something like --append-placement-pref would help a little bit, since it would become a verb, which suggests the flag is taking some action rather than just providing information to the command?
  • The nice thing about not repeating the flag is that it makes service updates a bit easier to deal with. With a repeated flag, we'd need a way to delete or modify existing preferences instead of just letting the user specify an updated set. This seems reasonable though, especially if we went with a flag name like --append-placement-pref for adding them. I suppose we could have a counterpart called --remove-placement-pref. There's a little bit of ambiguity because theoretically the same preference could be repeated multiple times (I don't think it would be useful to do so, but the data structure allows it...). I think this is manageable because --remove-placement-pref could just delete all that match.

@dnephin
Member

dnephin commented Feb 2, 2017

The significance of ordering on the command line may not be obvious.

Possibly. There is definitely precedence for the order of flags being significant.

The nice thing about not repeating the flag is that it makes service updates a bit easier to deal with. With a repeated flag, we'd need a way to delete or modify existing preferences instead of just letting the user specify an updated set

True, this is something we already do for any flag that populates a list or mapping. In update the flags would be --placement-pref-add and --placement-pref-rm. You can see the existing flags in cli/command/service/opts.go.

@aaronlehmann
Collaborator Author

Possibly. There is definitely precedence for the order of flags being significant.

I think you meant precedent, but that's a nice pun ;).

True, this is something we already do for any flag that populates a list or mapping. In update the flags would be --placement-pref-add and --placement-pref-rm.

Do you have any thoughts on using append instead of add in this case to make it clearer that order is significant? I like that more than add here, but I can understand that we might not want the inconsistency. Also, do you have any thoughts on making it --placement-pref-add (or --placement-pref-append) in service create as well as service update, for the same reason?

@dnephin
Member

dnephin commented Feb 2, 2017

I would much rather stick with add. We already have at least a few flags where order is significant:

  • dns-add
  • dns-search-add
  • mount-add (I think order is significant when mounts are nested?)

@aaronlehmann
Collaborator Author

Hm, okay, but I still think this case is a bit different from DNS servers or mounts. For those, the order is usually not of primary importance, but with these "scheduling preferences", order is crucial to what they do ("first spread tasks over the datacenters, then over the racks within the datacenters").

And what do you think about making the flag for service create end with -add as well to hint that the flag is adding to an ordered list rather than just providing an unordered item of data?

Another thought: Using --...-add and --...-remove does make it a little awkward to insert something in the middle of the list without going through intermediate states you don't want. I suppose we could support command lines like --placement-pref-remove foo --placement-pref-remove bar --placement-pref-add foo --placement-pref-add new --placement-pref-add bar, but this is the kind of thing I was hoping to avoid with the concept of shoving the whole configuration into a single parameter (ugly as it is).

@dnephin
Member

dnephin commented Feb 2, 2017

Ya, I don't have any idea of how to deal with that. All of this is much easier with a Compose file.

@aaronlehmann
Collaborator Author

Opened moby/moby#30725 for the Docker parts.

@aluzzardi
Member

--placement-prefs spread=engine.labels.dc,spread=engine.labels.row,spread=engine.labels.rack

Regarding the flags:

  1. I wouldn't comma-separate them. Instead, call it placement-pref (or whatever, just no s) and have the user repeat the flag for every preference. This is consistent with --secret, -e, etc.

  2. I would either use type=spread,on=engine.labels.dc for consistency with our "new" flags, or perhaps something like --placement spread:engine.labels.dc --placement spread:engine.labels.rack,weight=15

  3. add vs remove debate: There is also the possibility of having neither of those and re-using the same flag that we have for create (in which case, if passed, it will wipe whatever is there)

Regarding the syntax, I think it would be useful to sketch out what it would look like in compose and work our way down to the flags. @dnephin could you help out on this? what's the most compose-y way of representing those placement preferences?

@aluzzardi
Member

@aaronlehmann is it fair to say that our default (and always implicitly appended) placement preference is spread over node.id? Do you think there would be room in the future to override that somehow (e.g. pack on node.id)?

@aluzzardi
Member

Design, Protos & Code LGTM

We need to work out the flags though.

@aaronlehmann Since the flags are swarmctl only at this point, do you want to merge this one in and open a follow up with just flags (once we figure out what those should be, with the help of the compose sketch and @dnephin)?

@aaronlehmann
Collaborator Author

is it fair to say that our default (and always implicitly appended) placement preference is spread over node.id? Do you think there would be room in the future to override that somehow (e.g. pack on node.id)?

Yes, and I don't see why it couldn't be overridden in the future. If we add support for "pack", then packing on node.id should do exactly what you're describing.

Since the flags are swarmctl only at this point, do you want to merge this one in and open a follow up with just flags (once we figure out what those should be, with the help of the compose sketch and @dnephin)?

Hm, not sure where swarmctl comes in. I haven't implemented the flags for swarmctl, but I did implement them in moby/moby#30725 today based on the thread with @dnephin.

But yes, I think we should go ahead and merge this, since the only area that still needs work is the flags, and this PR doesn't include flags.

@dnephin
Member

dnephin commented Feb 4, 2017

I think the Compose format would be something like this:

deploy:
  placement:
    preferences:
      - spread,engine.labels.dc
      - spread,engine.labels.rack

If at some point there are more types, and they take extra properties we could add an expanded form:

deploy:
  placement:
    preferences:
      - type: spread
        value: engine.labels.dc
        weight: 4

@aaronlehmann aaronlehmann merged commit fb654e9 into moby:master Feb 7, 2017
@aaronlehmann aaronlehmann deleted the topology branch February 7, 2017 20:51
@aluzzardi
Member

spread,engine.labels.dc

I'm not a huge fan of the comma. It's not just any field; we're spreading based on the right-hand side.

how about spread:engine.labels.dc or spread=engine.labels.dc?

Sorry for post-merge comment

@aaronlehmann
Collaborator Author

Sorry for post-merge comment

No problem at all about continuing to comment on this. No syntax is implemented in this PR and it's just a related discussion that's going on to inform the Docker PR.

how about spread:engine.labels.dc or spread=engine.labels.dc?

Previously, @thaJeztah mentioned a preference for =, because several other options are structured as k/v pairs and use =.

I think = is good, and if we add options in the future it will look like spread=engine.labels.dc,weight=40, which looks right.

Any thoughts? I can update the Docker PR (moby/moby#30725) to use = instead of ,.
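For illustration, the spread=engine.labels.dc,weight=40 syntax under discussion could be split with standard string handling. This is a hypothetical sketch of such a parser; the field names and default weight are assumptions, not the Docker CLI's actual implementation:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pref is a hypothetical parsed placement preference.
type pref struct {
	strategy string // e.g. "spread"
	target   string // e.g. "engine.labels.dc"
	weight   int
}

// parsePref splits "spread=engine.labels.dc,weight=40" into its parts.
// The first k=v pair names the strategy and its target; later pairs are
// options. A default weight of 1 is an assumption for this sketch.
func parsePref(s string) (pref, error) {
	p := pref{weight: 1}
	for i, part := range strings.Split(s, ",") {
		kv := strings.SplitN(part, "=", 2)
		if len(kv) != 2 {
			return pref{}, fmt.Errorf("invalid preference %q", part)
		}
		switch {
		case i == 0:
			p.strategy, p.target = kv[0], kv[1]
		case kv[0] == "weight":
			w, err := strconv.Atoi(kv[1])
			if err != nil {
				return pref{}, err
			}
			p.weight = w
		default:
			return pref{}, fmt.Errorf("unknown option %q", kv[0])
		}
	}
	return p, nil
}

func main() {
	p, err := parsePref("spread=engine.labels.dc,weight=40")
	fmt.Println(p, err)
}
```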

@aluzzardi
Member

Mapping to @dnephin's comment about short/long versions, I was thinking something like:

k/v options:
type=spread,value=engine.labels.dc,weight=4

porcelain:
spread:engine.labels.dc.weight

but maybe that's overkill, and we should go with the simpler approach you suggested: spread=engine.labels.dc,weight=40

In which case, @dnephin's compose snippets would look like:

deploy:
  placement:
    preferences:
      - spread: engine.labels.dc
      - spread: engine.labels.rack
deploy:
  placement:
    preferences:
      - spread: engine.labels.dc
        weight: 4

@denverdino
Contributor

@aluzzardi the "placement" preference is not supported in the Compose template v3.2 included in 17.04.0-ce-rc2. Is there any plan for the support? Thanks a lot
