Skip to content

Third Party Resources support#2090

Merged
aaronlehmann merged 2 commits intomoby:masterfrom
RenaudWasTaken:tpr
May 26, 2017
Merged

Third Party Resources support#2090
aaronlehmann merged 2 commits intomoby:masterfrom
RenaudWasTaken:tpr

Conversation

@RenaudWasTaken
Copy link
Copy Markdown
Contributor

Hi !

Following the different issues that were opened and closed as well as the different
pull request which suffered the same fate, I've implemented a way to support Third Party Resources in swarmkit.

I've described my implementation choices in the following design doc: Third Party Resources.

I've also listed the different issues and PRs that were opened and closed on this subject:

I believe the main users involved were:

This proposal would solve problems for different resources such as:

  • GPUs
  • FPGAs
  • MICs (Many-Integrated Core, such as Xeon Phi)
  • More here
  • ...

Would it be possible for you to review this proposal and express your concerns / insights on wether this would fit in swarmkit and if so if my code is the best place ?

The naming conventions are also something that would be up for discussion as I don't have a very strong opinion on the current one.

Thanks !

@aaronlehmann
Copy link
Copy Markdown
Collaborator

The protobuf issues may be related to your Go version. CI is running Go 1.7.5, and with a different version of Go the descriptors compress differently. We'll be upgrading to 1.8.x soon, when Docker does.

Thanks for opening this PR. It looks like a really useful feature.

@RenaudWasTaken
Copy link
Copy Markdown
Contributor Author

RenaudWasTaken commented Apr 4, 2017

@aaronlehmann I've been wrecking my brains on that for the last hour :)
Thanks!

api/types.proto Outdated
uint64 val = 1;
}

message ThirdPartyResource {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a feeling the name ThirdPartyResource may raise some concerns. We recently added a Resource type to our data store as a way to store generic objects inside the store (as opposed to swarm-native types like Node, Task, etc). ThirdPartyResource might be similar enough to this to be confusing.

Maybe GenericNodeResource? Not sure if anyone else has suggestions...

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discussed naming with some of the other maintainers, and we liked either of the names AbstractResource or GenericResource.

We already use the term "generic" in some places, for example GenericRuntimeSpec. I'm not entirely sure whether this is an argument for or against using Generic here. GenericRuntimeSpec is a message that contains an Any message, so what's "generic" about it is that the user can embed their own data. The resource concept is a bit different.

Do either of those names sound good to you? Do you have a preference?

Copy link
Copy Markdown
Contributor Author

@RenaudWasTaken RenaudWasTaken Apr 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll go with AbstractResource, In the end it's mostly preference based :)
On second thought, generic is one less character i'll go with that :D

flags.String("memory-limit", "", "memory limit (e.g. 512m)")
flags.String("cpu-reservation", "", "number of CPU cores reserved (e.g. 0.5)")
flags.String("cpu-limit", "", "CPU cores limit (e.g. 0.5)")
flags.String(ThirdPartyResourceFlag, "", "user defined resources request (e.g. gpu=3;fpga=1)")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that this flag has slightly different semantics than the one under the same name which defines a node's avaliable resources. Would it make sense to differentiate the flag names?

Copy link
Copy Markdown
Contributor Author

@RenaudWasTaken RenaudWasTaken Apr 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would make a lot of sense, thanks for pointing that out ! I was annoyed by the fact that there was a variable where all the others definitions used strings :)

Though we would have to agree on the name before changing that :)

  • --generic-node-resource
  • --generic-service-resource
    could be the way to go if we follow the GenericeResource naming convention.

Do you have other naming suggestions :) ?

@@ -0,0 +1,154 @@
# Third Party Resources
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the design doc!

* Integer TPR
* Set TPR

Integier resources are for use cases where only an unsigned is needed to account
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: integer


var availableMemoryBytes, availableNanoCPUs int64
available := &api.Resources{}
available.ThirdParty = make(map[string]*api.ThirdPartyResource)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving this to an else clause would avoid the map allocation if we are making a copy of the node's resources anyway.

}

// HasEnoughThirdPartyResources returns true if node can satisfy task TPR's request
func HasEnoughThirdPartyResources(nodeRes *api.ThirdPartyResource, taskRes *api.ThirdPartyResource) bool {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest moving this and ClaimThirdPartyResource to an api/thirdpartyresource package (or whatever package name is appropriate given the final naming decisions). Then we can avoid constraintenforcer importing the scheduler code.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would I move the proto definitions in there ?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the utility functions. The proto definitions are fine in api.


// HasEnoughThirdPartyResources returns true if node can satisfy task TPR's request
func HasEnoughThirdPartyResources(nodeRes *api.ThirdPartyResource, taskRes *api.ThirdPartyResource) bool {
t := taskRes.Resource.(*api.ThirdPartyResource_Integer).Integer.Val
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use the t, ok := ... form to confirm that the resource reservation actually uses the integer type. While that's the way this is intended to be used, these protobuf messages are created by users, so there they can deviate from correct usage sometimes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the filter case, if there is an error (function check in the PR) should there be some kind of error handling / returning when calling this function ?
If so, how do I send this up the call stack ? It doesn't seem right to change the filter interface :)

Copy link
Copy Markdown
Collaborator

@aaronlehmann aaronlehmann Apr 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too worried about bubbling the error up. If the resource reservation is not valid here, we can just not honor it. The goal of the check is avoiding a panic. The actual validation that will produce user-visible errors can go in controlapi's service validation functions.

}

func reclaimTPR(nodeRes *api.ThirdPartyResource, taskRes *api.ThirdPartyResource, taken *api.Set) {
t := taskRes.Resource.(*api.ThirdPartyResource_Integer).Integer.Val
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the t, ok := ... form as a sanity check.

// ClaimThirdPartyResource assign TPRs to the task by taking them from the node's TPR
// and storing them in the taken store
func ClaimThirdPartyResource(nodeRes *api.ThirdPartyResource, taskRes *api.ThirdPartyResource, taken *api.Set) {
t := taskRes.Resource.(*api.ThirdPartyResource_Integer).Integer.Val
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the t, ok := ... form as a sanity check.

@aaronlehmann
Copy link
Copy Markdown
Collaborator

It might be good to add something to the validation in manager/controlapi/service.go which checks that resource reservations are expressed as integers, not as sets.

"golang.org/x/net/context"
)

type tprTaskStore map[string]map[string]*api.Set
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be map[string]map[string]map[string]struct{}? That would avoid some of the awkwardness with empty &api.Set_Void{} values.

func setEnv(t *api.Task, env string) {
switch r := t.Spec.GetRuntime().(type) {
case *api.TaskSpec_Container:
r.Container.Env = append(r.Container.Env, env)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is a bit problematic. t.Spec is a template for the task which gets copied directly from the service definition. It shouldn't be changed in the scheduler. Doing so would break some things like rolling updates (t.Spec gets compared to the task spec in the service to check if the task is up to date).

If it's important to expose this information through environment variables, we'll have to introduce a mechanism for passing environment variables outside the Spec. Probably an ExtraEnv field at the top level of Task would work. The executors would have to be updated to honor this.

On second thought, it might be a better abstraction to copy the ThirdPartyResource allocations to the Task instead, and let the executor decide how to expose it. I think this would actually be much better because it means the scheduler doesn't have to know about specific runtimes.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I just realized that if there's a leader switchover, the new leader will need to know which specific resource from a set was allocated to a task, so it doesn't assign that resource to a second task. Putting the allocations on the Task object would make this possible, but if the item from the set it only put into an environment variable, the information is essentially lost.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, it might be a better abstraction to copy the ThirdPartyResource allocations to the Task instead, and let the executor decide how to expose it. I think this would actually be much better because it means the scheduler doesn't have to know about specific runtimes.

I understand the reasoning, however, if we want to pursue more advanced scheduling decisions based on resources down the line, it might be worth keeping the code in the scheduler.
Though I don't believe that question should be solved in this PR :)

I'm looking at this executor thing :)

@RenaudWasTaken
Copy link
Copy Markdown
Contributor Author

RenaudWasTaken commented Apr 7, 2017

So this update is more of a draft :) I've fixed most of the comments and tried to implement the executor.

I did not address naming as I'd like to have more feedback on this so that I don't have to change the name 3-5 times :)

@aaronlehmann I've implemented the executor but I'm a bit lost.
The way I did it works when the task is scheduled on remote nodes but if the task is scheduled on master that doesn't work.

From what I understand the solution would be to change the env in the update part but I'm confused by when the update is called and what that would mean (should I do an equivalent of a docker exec to set the env?
Do you have any insights on this ?

Side Note I switched to go 1.7.5 (and regenerated protobuff) and that doesn't seem to solve the cricleci...

$ go version
go version go1.7.5 linux/amd64

Thanks !

@aaronlehmann
Copy link
Copy Markdown
Collaborator

I did not address naming as I'd like to have more feedback on this so that I don't have to change the name 3-5 times :)

Smart.

@aaronlehmann I've implemented the executor but I'm a bit lost.
The way I did it works when the task is scheduled on remote nodes but if the task is scheduled on master that doesn't work.

From what I understand the solution would be to change the env in the update part but I'm confused by when the update is called and what that would mean (should I do an equivalent of a docker exec to set the env?
Do you have any insights on this ?

Sorry, I think I didn't do a very good job explaining this.

Basically, if I understand correctly, you want to communicate to the containers which resource was allocated for that container. For example, if there are multiple GPUs, and "GPU1" gets allocated for a task, there should be a way for the software running in that container to know which GPU it should use.

Right now, this PR sets an environment variable from the scheduler, but this doesn't really work. The main problem is that the task's Spec should come directly from the service, and if it's changed by the scheduler, it would look like this task is out-of-date because it doesn't match the current version of the service (among other issues). We'd also like to keep the scheduler as agnostic to the low-level details of the workload it's scheduling as possible (though we admittedly currently violate this a bit in some of the scheduling filters). For example, if we extended swarmkit to be able to schedule VMs instead of containers, it should ideally not take any significant changes to the scheduler to support that.

So what I proposed instead was adding a ThirdPartyResource field on the Task object, with the resources the scheduler has allocated to this task (I see you've already done that; excellent!).

Then when the node receives a task, it will be able to act on this information and set the appropriate environment variables inside the container. The interface between swarm and the containers is called the executor. We actually have two executor implementations. One, inside the swarmkit repo, is used mainly for testing, and uses the Docker REST API to create and monitor containers. The other one is inside the docker/docker repository and is the one that actually gets used if you use swarm mode in Docker.

In swarmkit's executor, here is where the environment variables for the container are set: https://github.com/docker/swarmkit/blob/59d7fb2cae146ca8a911f08d60773f3e14791671/agent/exec/dockerapi/container.go#L146

It should be pretty simple to change that so that it generates whatever environment variables are implied by the resource assignments (as the scheduler does now), and adds them to that list. The task is accessible as c.task in that method.

In the Docker executor, there's an identical config method that would need the same modifications once support for this added to Docker.

Does this help at all?

Side Note I switched to go 1.7.5 (and regenerated protobuff) and that doesn't seem to solve the cricleci...

Odd. Which version of protoc do you have installed?

@RenaudWasTaken
Copy link
Copy Markdown
Contributor Author

RenaudWasTaken commented Apr 8, 2017

Thanks for the insights !

I've updated the PR and addressed the following:

  • In NodeInfo moved to a map of ThirdPartyResources for consistency and easy copy
  • Implements the merging of Env variables and TPR variables in the executor
  • Refactors my crazy TPR reconciliation code
  • Another try at circle CI

We still need to address naming (and function naming) and the leader switch.
I'm trying to find this part of the code :)

Next step is also testing

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 8, 2017

Codecov Report

Merging #2090 into master will increase coverage by 0.05%.
The diff coverage is 65.98%.

@@            Coverage Diff             @@
##           master    #2090      +/-   ##
==========================================
+ Coverage    60.1%   60.16%   +0.05%     
==========================================
  Files         119      124       +5     
  Lines       19847    20122     +275     
==========================================
+ Hits        11930    12106     +176     
- Misses       6575     6652      +77     
- Partials     1342     1364      +22

@aaronlehmann
Copy link
Copy Markdown
Collaborator

This is shaping up nicely.

We still need to address naming (and function naming) and the leader switch.

I'll try to get back to you soon with a more definite recommendation on naming.

I'm not sure you need anything special to handle leader switches. When the scheduler starts up on a new leader, it calls addTask for every task that's already assigned to a node (see newNodeInfo). Your TPR reconciliation should take care of restoring the scheduler's local state, right?

var resources api.Resources
if n.Description != nil && n.Description.Resources != nil {
fmt.Println("reconcile")
resources = *n.Description.Resources
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be changed to resources = *n.Description.Resources.Copy(), now that the Resources structure contains pointers (changes to the local copy of resources shouldn't have side effects on the actual Node object).


i++

res.Set.Val[k] = voidVal
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about just using nil instead of voidVal?

import (
"fmt"
"github.com/docker/swarmkit/api"
"strings"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We follow the convention of putting standard library imports first, then a blank line, then other imports.

"fmt"
"strings"

"github.com/docker/swarmkit/api""

case *api.ThirdPartyResource_Integer:
nr.Integer.Val -= t.Integer.Val
taken.Resource = &api.ThirdPartyResource_Integer{
Integer: &api.Integer{Val: t.Integer.Val}}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cosmetic nit: I usually see this written as

		taken.Resource = &api.ThirdPartyResource_Integer{
			Integer: &api.Integer{Val: t.Integer.Val},
		}

It just seems a little more "balanced" that way.

i := uint64(0)
voidVal := &api.Set_Void{}
res := &api.ThirdPartyResource_Set{
Set: &api.Set{Val: make(map[string]*api.Set_Void)}}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about moving the last } onto its own line.

// Maps tasks to resource taken usage: `map[taskID][resourceName]`
// This is needed to keep track of the resources where the user specified
// an int for a node which advertises a set for the specified resource
ThirdPartyResourcesTaken tprTaskStore
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we still need this now that we have AssignedThirdPartyResources on Task. ThirdPartyResourcesTaken[taskID] seems like it could be equivalent to Tasks[taskID].AssignedThirdPartyResources.

Copy link
Copy Markdown
Contributor Author

@RenaudWasTaken RenaudWasTaken Apr 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sens :) Done

@RenaudWasTaken RenaudWasTaken force-pushed the tpr branch 2 times, most recently from fdc0d50 to c705424 Compare April 10, 2017 08:36
Copy link
Copy Markdown
Contributor

@dongluochen dongluochen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RenaudWasTaken Thanks a lot for your contribution!

@RenaudWasTaken
Copy link
Copy Markdown
Contributor Author

Thanks ! When can we expect it to be merged so that I know when to open a pull request on docker :D ?

@dongluochen
Copy link
Copy Markdown
Contributor

Ping @aluzzardi

@RenaudWasTaken
Copy link
Copy Markdown
Contributor Author

Hi !

@aaronlehmann, @dongluochen I'm currently playing with the docker integration (almost done) and I noticed two small things that might need some changes here.

The first one is about the node command line arguments, I believe it makes more sense to have them in docker node update NODEID --generic-resources-add ... instead of docker swarm [init|join] --generic-resources ... do you agree ? Should I update the PR to match this ?

The second one is a small change in the Parse function where it would probably make more sense to have the following prototype: Parse(cmd string) ([]*api.GenericResource, error) at the docker level.

Do you want me to update the PR (I'll have to rebase it anyway) ?

@aaronlehmann
Copy link
Copy Markdown
Collaborator

The first one is about the node command line arguments, I believe it makes more sense to have them in docker node update NODEID --generic-resources-add ... instead of docker swarm [init|join] --generic-resources ... do you agree ? Should I update the PR to match this ?

I think the right model is for nodes to self-report their generic resources. This is how we know how much memory each node has, for example - the nodes send it up in NodeDescription instead of needing an admin to docker node update each node to tell the manager how much memory it has.

Is it difficult to automatically detect GPUs?

The second one is a small change in the Parse function where it would probably make more sense to have the following prototype: Parse(cmd string) ([]*api.GenericResource, error) at the docker level.

I agree with this change.

@RenaudWasTaken
Copy link
Copy Markdown
Contributor Author

@aaronlehmann it's not difficult to detect GPUs if you have nvml + the Nvidia driver but you don't want CUDA as a dependency for docker :)

You have two ways to go for this:

  • Try using an unreliable way to detect GPUs (list /dev/nvidia*)
    • We really don't recommend that (kubernetes does that and has race conditions because of that).
    • Additionally you have no guarantee that our driver will continue advertising GPUs this way
    • It doesn't really solve the problem for other vendors
  • Have some kind of plugin mechanism which would be called by docker or a binary called as a hook when joining the cluster ?
    • That's another discussion :)
    • But could be a good idea

instead of needing an admin to docker node update each node to tell the manager how much memory it has.

Well you could have an explorer running as a daemon, it doesn't have to be your cluster admin

@aaronlehmann aaronlehmann changed the title [PROPOSAL] Third Party Resources support Third Party Resources support May 12, 2017
@aaronlehmann
Copy link
Copy Markdown
Collaborator

Okay. So we'll need a way to tell the local Docker engine what resources you have. But I think the resources should still be sent up in NodeDescription, instead of through docker node update. This is a better fit for the self-reporting model, and allows individual nodes to publish their information (docker node update can only be run on a manager).

As for how to tell the engine what resources it should report in NodeDescription, I'm not sure offhand what's best. A simple idea that comes to mind is providing flags for dockerd (sort of like what you added to swarmd), and corresponding entries in daemon.json. I'll preemptively ping @thaJeztah to see if he has any ideas.

@RenaudWasTaken
Copy link
Copy Markdown
Contributor Author

A simple idea that comes to mind is providing flags for dockerd (sort of like what you added to swarmd), and corresponding entries in daemon.json

That seems like a better idea than docker swarm init --generic-resources ...

Btw, when can we expect this PR to be merged :) ?

Thanks!

@aaronlehmann
Copy link
Copy Markdown
Collaborator

@RenaudWasTaken: It needs a rebase. Note that we updated to Go 1.8.1 in CI, so you'll want to generate the protobuf files with that version to make sure it matches in CI.

@RenaudWasTaken
Copy link
Copy Markdown
Contributor Author

@aaronlehmann done

@aaronlehmann
Copy link
Copy Markdown
Collaborator

Sorry, needs another rebase already!

@aluzzardi: Can you please review this when you get a chance?

@aluzzardi
Copy link
Copy Markdown
Member

@RenaudWasTaken @aaronlehmann Sorry for the slow response time!

I'll give this a review by EOW

Renaud Gaubert added 2 commits May 22, 2017 12:21
Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>
Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>
@RenaudWasTaken
Copy link
Copy Markdown
Contributor Author

Rebased !

@dongluochen
Copy link
Copy Markdown
Contributor

dongluochen commented May 22, 2017

I think docker node update to update generic resources is acceptable, or preferred in the case of hot-swappable resources. We don't need to remove/rejoin the node.

Taking that back. If the daemon can detect resource change by driver, or specification from file, and report the change to manager thru dispatcher, it'd work well with our current model.

@RenaudWasTaken
Copy link
Copy Markdown
Contributor Author

@aluzzardi any updates :) ?

@aaronlehmann aaronlehmann merged commit d4a01cc into moby:master May 26, 2017
@dongluochen
Copy link
Copy Markdown
Contributor

@RenaudWasTaken Thanks a lot for your contribution!

@chenglin-li
Copy link
Copy Markdown

hello
there are lots of discussion here.
have you commit or finish the supporting for Third Party/Generic Resource supporting?

  1. if yes, could you update which version support?
  2. do you have some user overall guide on how to supporting Third Party resource in specified case?

Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants