
Topology-aware scheduling#1512

Merged
aaronlehmann merged 2 commits into moby:master from aaronlehmann:topology
Feb 7, 2017

Conversation

@aaronlehmann
Collaborator

@aaronlehmann aaronlehmann commented Sep 8, 2016

This implements the proposal in #1473. Currently it's a non-randomized approach, but could be adapted to add random noise to the decisions it makes if that's the direction we decide we want to go.

There are two unit tests that validate this.

cc @dongluochen @aluzzardi

@codecov-io

codecov-io commented Sep 8, 2016

Codecov Report

Merging #1512 into master will increase coverage by 0.1%.

@@            Coverage Diff            @@
##           master    #1512     +/-   ##
=========================================
+ Coverage   53.96%   54.06%   +0.1%     
=========================================
  Files         106      108      +2     
  Lines       18411    18489     +78     
=========================================
+ Hits         9936     9997     +61     
- Misses       7269     7274      +5     
- Partials     1206     1218     +12

Powered by Codecov. Last update af3f0e1...d3b1cd8. Read the comment docs.

switch {
case len(descriptor) > len(nodeLabelPrefix) && strings.EqualFold(descriptor[:len(nodeLabelPrefix)], nodeLabelPrefix):
	if node.Spec.Annotations.Labels != nil {
		value = node.Spec.Annotations.Labels[descriptor[len(nodeLabelPrefix):]]
Contributor

IIRC, Docker engine allows labels with empty string values. That wouldn't work in this case.

Collaborator Author

In this code, an empty value would be treated as one of the values of the label. The alternative is to handle these specially somehow, but it would make the code more complicated. Any thoughts?

Contributor

One way is to keep empty value as a branch from a node (no change needed). Another way is to reject empty value, and add a log message of invalid preference label.

Collaborator Author

I agree that both of those options could make sense. I disagree about logging in this case because it would cause an enormous amount of log output unless you make sure that every node has every label that you are selecting on in any service.

I guess the question is what behavior we want. In the current code, an empty value is weighted equally with any of the other values. One advantage of doing it this way is that tasks don't become unschedulable if there are no nodes available with the labels set (you could always use constraints to limit the tasks to those nodes if you really want to).

The other approach would be to only allow scheduling the tasks on nodes that have the labels set, when the service uses those labels in a scheduling preference. This might be closer to what the user had in mind, but it's easier to end up in a bad situation (nothing can be scheduled), and it's hard to understand why that happens.

Contributor

it would cause an enormous amount of log output unless you make sure that every node has every label that you are selecting on in any service.

I agree. If we take that approach, it should be done outside of the for loop, in a label validation procedure.

I think keeping the empty value makes it simpler for now. We can wait to see if it causes problems.
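The "keep empty value as a branch" option falls out naturally from Go map semantics: a lookup on a missing key returns the zero value, which for strings is "". The sketch below illustrates that behavior with a hypothetical grouping helper (the names are not swarmkit's API):

```go
package main

import "fmt"

// groupByLabel buckets node names by the value of a given label.
// A node that lacks the label lands in the empty-string bucket,
// because indexing a Go map with a missing key returns the zero
// value (""). Missing and explicitly-empty labels are therefore
// treated the same way, matching the behavior discussed above.
func groupByLabel(nodes map[string]map[string]string, label string) map[string][]string {
	branches := make(map[string][]string)
	for name, labels := range nodes {
		value := labels[label] // "" when the label is absent
		branches[value] = append(branches[value], name)
	}
	return branches
}

func main() {
	nodes := map[string]map[string]string{
		"node1": {"az": "az1", "rack": "rack1"},
		"node2": {"rack": "rack1"}, // no "az" label at all
	}
	branches := groupByLabel(nodes, "az")
	fmt.Println(branches["az1"], branches[""]) // node2 falls in the "" branch
}
```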

	if node.Spec.Annotations.Labels != nil {
		value = node.Spec.Annotations.Labels[descriptor[len(nodeLabelPrefix):]]
	}
case len(descriptor) > len(engineLabelPrefix) && strings.EqualFold(descriptor[:len(engineLabelPrefix)], engineLabelPrefix):
Contributor

@dongluochen dongluochen Sep 16, 2016

This allows node.labels.az=az1 and engine.labels.rack=rack1 to work together to form an az.rack topology. It might make errors harder to debug.

Collaborator Author

I think this configuration should be supported. Any ideas on making it easier to debug?

Contributor

In that case we would need to ask the user to provide both the node.labels. set and the engine.labels. set. I think it's OK.

// TODO(aaronl): Support other items from constraint
// syntax like node ID, hostname, os/arch, etc?
default:
	continue
Contributor

If some nodes do not have all the preference labels, for example node1 has node.labels.az=az1 and node.labels.rack=rack1 while node2 has only node.labels.rack=rack1, then the path from the root to node2 would have one less level. I think the algorithm handles it. It may be good to add a comment here.

Collaborator Author

The path from root to node2 would have one less level.

I don't think that's right. The levels of the tree are determined by the scheduling preferences, not the nodes themselves. If a node is missing a particular label, that level of the tree just has "" for the value, but there are the same number of levels for all nodes.

I'll add a comment to clarify this.

Contributor

You are right. It's an empty value on the node.

@dongluochen
Contributor

LGTM.

@aaronlehmann
Collaborator Author

Rebased on top of #1550, and made some changes to apply that optimization inside this code. Now the leaf nodes of the tree contain heaps that store up to k nodes, where k is the number of tasks we are trying to schedule. When the scheduler calls findBestNodes, it flattens this heap to return a sorted slice of nodes. The next call to findBestNodes re-heapifies the slice to sort it again with any changes to the nodes taken into account.
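The bounded-heap idea can be sketched with container/heap: keep a max-heap ordered by cost, and evict the worst element whenever the size exceeds k, so at most k candidates are ever retained. This is an illustrative sketch under assumed names, not the PR's actual nodeHeap type:

```go
package main

import (
	"container/heap"
	"fmt"
)

// kHeap keeps at most k costs. It is a max-heap on cost, so the root
// is the worst node currently retained and can be evicted in O(log k)
// when a better candidate arrives. (Hypothetical sketch.)
type kHeap struct {
	costs []int
	k     int
}

func (h kHeap) Len() int            { return len(h.costs) }
func (h kHeap) Less(i, j int) bool  { return h.costs[i] > h.costs[j] } // max-heap
func (h kHeap) Swap(i, j int)       { h.costs[i], h.costs[j] = h.costs[j], h.costs[i] }
func (h *kHeap) Push(x interface{}) { h.costs = append(h.costs, x.(int)) }
func (h *kHeap) Pop() interface{} {
	old := h.costs
	x := old[len(old)-1]
	h.costs = old[:len(old)-1]
	return x
}

// add inserts a candidate cost, keeping only the k lowest costs seen.
func (h *kHeap) add(cost int) {
	heap.Push(h, cost)
	if h.Len() > h.k {
		heap.Pop(h) // drop the current worst
	}
}

// best drains the heap into an ascending (best-first) slice,
// analogous to flattening the leaf heap for findBestNodes.
func (h *kHeap) best() []int {
	out := make([]int, h.Len())
	for i := h.Len() - 1; i >= 0; i-- {
		out[i] = heap.Pop(h).(int) // pops worst-first, fill from the back
	}
	return out
}

func main() {
	h := &kHeap{k: 2}
	for _, c := range []int{7, 3, 9, 1} {
		h.add(c)
	}
	fmt.Println(h.best()) // [1 3]
}
```

The advantage over a plain slice is that memory stays proportional to k rather than to the total number of nodes.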

@aaronlehmann aaronlehmann changed the title [WIP] Topology-aware scheduling Topology-aware scheduling Sep 20, 2016
@aaronlehmann
Collaborator Author

Rebased.

// so on.
for dt.nodeHeap.Len() > 0 {
	heap.Pop(&dt.nodeHeap)
}
Contributor

@dongluochen dongluochen Oct 1, 2016

I feel like the heap structure is irrelevant now, because the current state of the heap cannot be used directly. On every pass we will sort all the nodes in this leaf node. Heapsort is just another sorting implementation.

I haven't checked other places using heap. If it's similar, should we change the data structure to a slice? How to implement sorting is a separate issue and we can decouple them.

@aaronlehmann
Collaborator Author

I feel like the heap structure is irrelevant now, because the current state of the heap cannot be used directly. On every pass we will sort all the nodes in this leaf node. Heapsort is just another sorting implementation.

I haven't checked other places using heap. If it's similar, should we change the data structure to a slice? How to implement sorting is a separate issue and we can decouple them.

It's useful for generating the original list of nodes. Originally this PR made a slice with all nodes and used quickselect, but @LK4D4 suggested using a heap instead. It's a little more efficient because the heap only needs to contain up to n nodes at any one time, versus the slice, which needs to start with all the nodes.

Once we've built the tree, we could switch to a slice, but I had it continue to use the heap for sorting, for consistency.

@aaronlehmann
Collaborator Author

Rebased.

@dongluochen
Contributor

Need rebase.

@aaronlehmann
Collaborator Author

Rebased, thanks.

@dongluochen
Contributor

LGTM

@ncoolz

ncoolz commented Nov 29, 2016

LGTM. When will this PR be merged?

@allencloud
Contributor

I think this needs rebase though. @aaronlehmann

@aaronlehmann
Collaborator Author

Rebased.

This didn't make it into Docker 1.13. Now that 1.14 development is happening, we can try to move it forward. The main thing we need to do is finalize the design, including the protobuf types. I'll work with @aluzzardi on this.

@yank1

yank1 commented Dec 6, 2016

LGTM

@aaronlehmann
Collaborator Author

Rebased again.

@aluzzardi @thaJeztah: Any thoughts on what the CLI for this should look like? The draft proposal had:

--placement-prefs spread=engine.labels.dc,spread=engine.labels.row,spread=engine.labels.rack

This would allow some future expansion (e.g. pack=...). However, I see two potential problems:

  • It might be weird to allow repeated "keys", and to have order matter, in something that looks like a set of k/v specifications. Really (spread, engine.labels.dc) is a tuple rather than a key and value, but I'm not sure what syntax would make this clearer without being unwieldy.
  • There's no clear way to add additional settings to the spread=... pair in the future, for example a weight.

An alternative is to have two levels of nesting with different separators, for example:

--placement-prefs 'type=spread,on=engine.labels.dc;type=spread,on=engine.labels.row;type=spread,on=engine.labels.rack'

I don't like this much though. It feels very hard to read, and ; is an unfriendly character to use in something passed on the command line (we could use something else like /, but again, there's no obvious choice).

Aaron Lehmann added 2 commits February 2, 2017 14:41
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
@aaronlehmann
Collaborator Author

ping @aluzzardi @thaJeztah @dnephin

@dnephin
Member

dnephin commented Feb 2, 2017

I think the most explicit version would be to repeat the flag. The order of flags on the command line is always significant.

--placement-prefs 'spread,engine.labels.dc'
--placement-prefs 'spread,engine.labels.dc,weight=3'
--placement-prefs 'spread,engine.labels.rack'

It is quite verbose, but I don't see any way around that. I think most service create/update command lines are already long enough that users will either write these command lines in bash scripts or use the Compose format. So the verbosity is probably not a big problem.

@aaronlehmann
Collaborator Author

Thanks for the input @dnephin. I like the simplicity of this solution. I have two concerns though:

  • The significance of ordering on the command line may not be obvious. As a user, I would find it confusing that swapping the order of two flags leads to totally different results. Maybe calling it something like --append-placement-pref would help a little bit, since it would become a verb, which suggests the flag is taking some action rather than just providing information to the command?
  • The nice thing about not repeating the flag is that it makes service updates a bit easier to deal with. With a repeated flag, we'd need a way to delete or modify existing preferences instead of just letting the user specify an updated set. This seems reasonable though, especially if we went with a flag name like --append-placement-pref for adding them. I suppose we could have a counterpart called --remove-placement-pref. There's a little bit of ambiguity because theoretically the same preference could be repeated multiple times (I don't think it would be useful to do so, but the data structure allows it...). I think this is manageable because --remove-placement-pref could just delete all that match.

@dnephin
Member

dnephin commented Feb 2, 2017

The significance of ordering on the command line may not be obvious.

Possibly. There is definitely precedence for the order of flags being significant.

The nice thing about not repeating the flag is that it makes service updates a bit easier to deal with. With a repeated flag, we'd need a way to delete or modify existing preferences instead of just letting the user specify an updated set

True, this is something we already do for any flag that populates a list or mapping. In update the flags would be --placement-pref-add and --placement-pref-rm. You can see the existing flags in cli/command/service/opts.go.

@aaronlehmann
Collaborator Author

Possibly. There is definitely precedence for the order of flags being significant.

I think you meant precedent, but that's a nice pun ;).

True, this is something we already do for any flag that populates a list or mapping. In update the flags would be --placement-pref-add and --placement-pref-rm.

Do you have any thoughts on using append instead of add in this case to make it clearer that order is significant? I like that more than add here, but I can understand that we might not want the inconsistency. Also, do you have any thoughts on making it --placement-pref-add (or --placement-pref-append) in service create as well as service update, for the same reason?

@dnephin
Member

dnephin commented Feb 2, 2017

I would much rather stick with add. We already have at least a few flags where order is significant:

  • dns-add
  • dns-search-add
  • mount-add (I think order is significant when mounts are nested?)

@aaronlehmann
Collaborator Author

Hm, okay, but I still think this case is a bit different from DNS servers or mounts. For those, the order is usually not of primary importance, but with these "scheduling preferences", order is crucial to what they do ("first spread tasks over the datacenters, then over the racks within the datacenters").

And what do you think about making the flag for service create end with -add as well to hint that the flag is adding to an ordered list rather than just providing an unordered item of data?

Another thought: Using --...-add and --...-remove does make it a little awkward to insert something in the middle of the list without going through intermediate states you don't want. I suppose we could support command lines like --placement-pref-remove foo --placement-pref-remove bar --placement-pref-add foo --placement-pref-add new --placement-pref-add bar, but this is the kind of thing I was hoping to avoid with the concept of shoving the whole configuration into a single parameter (ugly as it is).

@dnephin
Member

dnephin commented Feb 2, 2017

Ya, I don't have any idea of how to deal with that. All of this is much easier with a Compose file.

@aaronlehmann
Collaborator Author

Opened moby/moby#30725 for the Docker parts.

@aluzzardi
Member

--placement-prefs spread=engine.labels.dc,spread=engine.labels.row,spread=engine.labels.rack

Regarding the flags:

  1. I wouldn't comma-separate them. Instead, call it placement-pref (or whatever, just no s) and have the user repeat the flag for every preference. This is consistent with --secret, -e, etc.

  2. I would either use type=spread,on=engine.labels.dc for consistency with our "new" flags, or perhaps something like --placement spread:engine.labels.dc --placement spread:engine.labels.rack,weight=15

  3. add vs remove debate: There is also the possibility of having neither of those and re-using the same flag that we have for create (in which case, if passed, it will wipe whatever is there)

Regarding the syntax, I think it would be useful to sketch out what it would look like in compose and work our way down to the flags. @dnephin could you help out on this? what's the most compose-y way of representing those placement preferences?

@aluzzardi
Member

@aaronlehmann is it fair to say that our default (and always implicitly appended) placement preference is spread over node.id? Do you think there would be room in the future to override that somehow (e.g. pack on node.id)?

@aluzzardi
Member

Design, Protos & Code LGTM

We need to work out the flags though.

@aaronlehmann Since the flags are swarmctl only at this point, do you want to merge this one in and open a follow up with just flags (once we figure out what those should be, with the help of the compose sketch and @dnephin)?

@aaronlehmann
Collaborator Author

is it fair to say that our default (and always implicitly appended) placement preference is spread over node.id? Do you think there would be room in the future to override that somehow (e.g. pack on node.id)?

Yes, and I don't see why it couldn't be overridden in the future. If we add support for "pack", then packing on node.id should do exactly what you're describing.

Since the flags are swarmctl only at this point, do you want to merge this one in and open a follow up with just flags (once we figure out what those should be, with the help of the compose sketch and @dnephin)?

Hm, not sure where swarmctl comes in. I haven't implemented the flags for swarmctl, but I did implement them in moby/moby#30725 today based on the thread with @dnephin.

But yes, I think we should go ahead and merge this, since the only area that still needs work is the flags, and this PR doesn't include flags.

@dnephin
Member

dnephin commented Feb 4, 2017

I think the Compose format would be something like this:

deploy:
  placement:
    preferences:
      - spread,engine.labels.dc
      - spread,engine.labels.rack

If at some point there are more types, and they take extra properties we could add an expanded form:

deploy:
  placement:
    preferences:
      - type: spread
        value: engine.labels.dc
        weight: 4

@aaronlehmann aaronlehmann merged commit fb654e9 into moby:master Feb 7, 2017
@aaronlehmann aaronlehmann deleted the topology branch February 7, 2017 20:51
@aluzzardi
Member

spread,engine.labels.dc

I'm not a huge fan of the comma. It's not just any field; we're spreading based on the right-hand side.

how about spread:engine.labels.dc or spread=engine.labels.dc?

Sorry for post-merge comment

@aaronlehmann
Collaborator Author

Sorry for post-merge comment

No problem at all about continuing to comment on this. No syntax is implemented in this PR and it's just a related discussion that's going on to inform the Docker PR.

how about spread:engine.labels.dc or spread=engine.labels.dc?

Previously, @thaJeztah mentioned a preference for =, because several other options are structured as k/v pairs and use =.

I think = is good, and if we add options in the future it will look like spread=engine.labels.dc,weight=40, which looks right.

Any thoughts? I can update the Docker PR (moby/moby#30725) to use = instead of ,.
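For illustration, the spread=engine.labels.dc,weight=40 syntax under discussion could be split with standard string handling. This is a hypothetical sketch of such a parser; the field names and default weight are assumptions, not the Docker CLI's actual implementation:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pref is a hypothetical parsed placement preference.
type pref struct {
	strategy string // e.g. "spread"
	target   string // e.g. "engine.labels.dc"
	weight   int
}

// parsePref splits "spread=engine.labels.dc,weight=40" into its parts.
// The first k=v pair names the strategy and its target; later pairs are
// options. A default weight of 1 is an assumption for this sketch.
func parsePref(s string) (pref, error) {
	p := pref{weight: 1}
	for i, part := range strings.Split(s, ",") {
		kv := strings.SplitN(part, "=", 2)
		if len(kv) != 2 {
			return pref{}, fmt.Errorf("invalid preference %q", part)
		}
		switch {
		case i == 0:
			p.strategy, p.target = kv[0], kv[1]
		case kv[0] == "weight":
			w, err := strconv.Atoi(kv[1])
			if err != nil {
				return pref{}, err
			}
			p.weight = w
		default:
			return pref{}, fmt.Errorf("unknown option %q", kv[0])
		}
	}
	return p, nil
}

func main() {
	p, err := parsePref("spread=engine.labels.dc,weight=40")
	fmt.Println(p, err)
}
```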

@aluzzardi
Member

Mapping to @dnephin's comment about short/long versions, I was thinking something like:

k/v options:
type=spread,value=engine.labels.dc,weight=4

porcelain:
spread:engine.labels.dc.weight

but maybe that's overkill, and we should go with the simpler approach you suggested: spread=engine.labels.dc,weight=40

In which case, @dnephin's compose snippets would look like:

deploy:
  placement:
    preferences:
      - spread: engine.labels.dc
      - spread: engine.labels.rack
deploy:
  placement:
    preferences:
      - spread: engine.labels.dc
        weight: 4

@denverdino
Contributor

@aluzzardi the "placement" preference is not supported in the Compose template v3.2 included in 17.04.0-ce-rc2. Is there any plan for the support? Thanks a lot
