Third Party Resources support #2090
Conversation
The protobuf issues may be related to your Go version. CI is running Go 1.7.5, and with a different version of Go the descriptors compress differently. We'll be upgrading to 1.8.x soon, when Docker does. Thanks for opening this PR. It looks like a really useful feature.
@aaronlehmann I've been racking my brain on that for the last hour :)
api/types.proto (outdated):

    uint64 val = 1;
    }

    message ThirdPartyResource {
I have a feeling the name ThirdPartyResource may raise some concerns. We recently added a Resource type to our data store as a way to store generic objects inside the store (as opposed to swarm-native types like Node, Task, etc). ThirdPartyResource might be similar enough to this to be confusing.
Maybe GenericNodeResource? Not sure if anyone else has suggestions...
I discussed naming with some of the other maintainers, and we liked either of the names AbstractResource or GenericResource.
We already use the term "generic" in some places, for example GenericRuntimeSpec. I'm not entirely sure whether this is an argument for or against using Generic here. GenericRuntimeSpec is a message that contains an Any message, so what's "generic" about it is that the user can embed their own data. The resource concept is a bit different.
Do either of those names sound good to you? Do you have a preference?
I'll go with AbstractResource; in the end it's mostly preference-based :)
On second thought, Generic is one character shorter, I'll go with that :D
    flags.String("memory-limit", "", "memory limit (e.g. 512m)")
    flags.String("cpu-reservation", "", "number of CPU cores reserved (e.g. 0.5)")
    flags.String("cpu-limit", "", "CPU cores limit (e.g. 0.5)")
    flags.String(ThirdPartyResourceFlag, "", "user defined resources request (e.g. gpu=3;fpga=1)")
I noticed that this flag has slightly different semantics than the one under the same name which defines a node's available resources. Would it make sense to differentiate the flag names?
It would make a lot of sense, thanks for pointing that out! I was annoyed by the fact that there was a variable where all the other definitions used strings :)
Though we would have to agree on the name before changing that :)
- --generic-node-resource
- --generic-service-resource
could be the way to go if we follow the GenericResource naming convention.
Do you have other naming suggestions? :)
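As an aside, the flag value format discussed above ("gpu=3;fpga=1") is straightforward to parse. This is a minimal sketch, not the PR's actual code; the helper name `parseResourceFlag` is an assumption made for illustration:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseResourceFlag parses a flag value of the form "gpu=3;fpga=1"
// into a name -> count map. Hypothetical helper, not from the PR.
func parseResourceFlag(val string) (map[string]uint64, error) {
	out := make(map[string]uint64)
	for _, pair := range strings.Split(val, ";") {
		kv := strings.SplitN(pair, "=", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("invalid resource %q", pair)
		}
		n, err := strconv.ParseUint(kv[1], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("invalid count for %q: %v", kv[0], err)
		}
		out[kv[0]] = n
	}
	return out, nil
}

func main() {
	res, err := parseResourceFlag("gpu=3;fpga=1")
	fmt.Println(res, err)
}
```

Whatever the flags end up being called, both the node-side and service-side flags could share a parser like this.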
design/third_party_resources.md (outdated):

    @@ -0,0 +1,154 @@
    # Third Party Resources
Thanks for the design doc!
design/third_party_resources.md (outdated):

    * Integer TPR
    * Set TPR

    Integer resources are for use cases where only an unsigned integer is needed to account
    var availableMemoryBytes, availableNanoCPUs int64
    available := &api.Resources{}
    available.ThirdParty = make(map[string]*api.ThirdPartyResource)
Moving this to an else clause would avoid the map allocation if we are making a copy of the node's resources anyway.
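A sketch of the restructuring the reviewer suggests, using local stand-in types (the type and field names mirror the PR's api types but are assumptions, not swarmkit's real generated code):

```go
package main

import "fmt"

// Stand-in types; names are illustrative, not swarmkit's actual API.
type ThirdPartyResource struct{ Val uint64 }

type Resources struct {
	ThirdParty map[string]*ThirdPartyResource
}

// availableResources only allocates a fresh map in the else branch,
// since taking the node's resources already provides one.
func availableResources(node *Resources) *Resources {
	available := &Resources{}
	if node != nil && node.ThirdParty != nil {
		// the node's map comes along with the copy; no allocation needed
		available.ThirdParty = node.ThirdParty
	} else {
		available.ThirdParty = make(map[string]*ThirdPartyResource)
	}
	return available
}

func main() {
	fmt.Println(availableResources(nil).ThirdParty != nil)
}
```

The allocation saved is small, but it keeps the common path (node reports resources) free of redundant work.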
manager/scheduler/filter.go (outdated):

    }

    // HasEnoughThirdPartyResources returns true if node can satisfy task TPR's request
    func HasEnoughThirdPartyResources(nodeRes *api.ThirdPartyResource, taskRes *api.ThirdPartyResource) bool {
I'd suggest moving this and ClaimThirdPartyResource to an api/thirdpartyresource package (or whatever package name is appropriate given the final naming decisions). Then we can avoid constraintenforcer importing the scheduler code.
Would I move the proto definitions in there?
Just the utility functions. The proto definitions are fine in api.
manager/scheduler/filter.go (outdated):

    // HasEnoughThirdPartyResources returns true if node can satisfy task TPR's request
    func HasEnoughThirdPartyResources(nodeRes *api.ThirdPartyResource, taskRes *api.ThirdPartyResource) bool {
    	t := taskRes.Resource.(*api.ThirdPartyResource_Integer).Integer.Val
This should use the t, ok := ... form to confirm that the resource reservation actually uses the integer type. While that's the way this is intended to be used, these protobuf messages are created by users, so they can deviate from correct usage.
In the filter case, if there is an error (the check function in this PR), should there be some kind of error handling / returning when calling this function?
If so, how do I send this up the call stack? It doesn't seem right to change the filter interface :)
I'm not too worried about bubbling the error up. If the resource reservation is not valid here, we can just not honor it. The goal of the check is avoiding a panic. The actual validation that will produce user-visible errors can go in controlapi's service validation functions.
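The comma-ok form the reviewer asks for can be sketched like this; the stand-in types below only mirror the shape of protobuf oneof wrappers, and the name `hasEnough` is an assumption, not the PR's exported function:

```go
package main

import "fmt"

// Minimal stand-ins for the protobuf oneof wrapper types.
type isResource interface{ isResourceKind() }

type Integer struct{ Val uint64 }

type ResourceInteger struct{ Integer *Integer }

func (*ResourceInteger) isResourceKind() {}

type GenericResource struct{ Resource isResource }

// hasEnough uses t, ok := ... so a malformed (non-integer) reservation
// is simply not honored, instead of panicking on a bad type assertion.
func hasEnough(nodeRes, taskRes *GenericResource) bool {
	t, ok := taskRes.Resource.(*ResourceInteger)
	if !ok || t.Integer == nil {
		return false // invalid reservation: don't honor it
	}
	n, ok := nodeRes.Resource.(*ResourceInteger)
	if !ok || n.Integer == nil {
		return false
	}
	return n.Integer.Val >= t.Integer.Val
}

func main() {
	node := &GenericResource{Resource: &ResourceInteger{Integer: &Integer{Val: 4}}}
	task := &GenericResource{Resource: &ResourceInteger{Integer: &Integer{Val: 3}}}
	fmt.Println(hasEnough(node, task))
}
```

A bare `x.(*T)` assertion panics on the wrong type, while the two-value form returns `ok == false`, which is exactly the "don't honor it" behavior wanted here.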
manager/scheduler/nodeinfo.go (outdated):

    }

    func reclaimTPR(nodeRes *api.ThirdPartyResource, taskRes *api.ThirdPartyResource, taken *api.Set) {
    	t := taskRes.Resource.(*api.ThirdPartyResource_Integer).Integer.Val
Use the t, ok := ... form as a sanity check.
manager/scheduler/nodeinfo.go (outdated):

    // ClaimThirdPartyResource assign TPRs to the task by taking them from the node's TPR
    // and storing them in the taken store
    func ClaimThirdPartyResource(nodeRes *api.ThirdPartyResource, taskRes *api.ThirdPartyResource, taken *api.Set) {
    	t := taskRes.Resource.(*api.ThirdPartyResource_Integer).Integer.Val
Use the t, ok := ... form as a sanity check.
It might be good to add something to the validation in controlapi as well.
manager/scheduler/nodeinfo.go (outdated):

    "golang.org/x/net/context"
    )

    type tprTaskStore map[string]map[string]*api.Set
Could this be map[string]map[string]map[string]struct{}? That would avoid some of the awkwardness with empty &api.Set_Void{} values.
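The suggested shape can be sketched as below; the type name `taskStore` and the `claim` helper are illustrative, not the PR's code:

```go
package main

import "fmt"

// taskID -> resourceName -> set of claimed items. Using struct{} values
// avoids the awkward empty &api.Set_Void{} placeholders.
type taskStore map[string]map[string]map[string]struct{}

// claim records that a task has taken a specific item of a resource.
func (s taskStore) claim(taskID, resource, item string) {
	if s[taskID] == nil {
		s[taskID] = make(map[string]map[string]struct{})
	}
	if s[taskID][resource] == nil {
		s[taskID][resource] = make(map[string]struct{})
	}
	s[taskID][resource][item] = struct{}{}
}

func main() {
	s := taskStore{}
	s.claim("task1", "gpu", "GPU1")
	_, claimed := s["task1"]["gpu"]["GPU1"]
	fmt.Println(claimed)
}
```

`struct{}` occupies zero bytes, so the inner map acts as a pure set and membership checks are just the two-value index expression.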
manager/scheduler/nodeinfo.go (outdated):

    func setEnv(t *api.Task, env string) {
    	switch r := t.Spec.GetRuntime().(type) {
    	case *api.TaskSpec_Container:
    		r.Container.Env = append(r.Container.Env, env)
This part is a bit problematic. t.Spec is a template for the task which gets copied directly from the service definition. It shouldn't be changed in the scheduler. Doing so would break some things like rolling updates (t.Spec gets compared to the task spec in the service to check if the task is up to date).
If it's important to expose this information through environment variables, we'll have to introduce a mechanism for passing environment variables outside the Spec. Probably an ExtraEnv field at the top level of Task would work. The executors would have to be updated to honor this.
On second thought, it might be a better abstraction to copy the ThirdPartyResource allocations to the Task instead, and let the executor decide how to expose it. I think this would actually be much better because it means the scheduler doesn't have to know about specific runtimes.
Also, I just realized that if there's a leader switchover, the new leader will need to know which specific resource from a set was allocated to a task, so it doesn't assign that resource to a second task. Putting the allocations on the Task object would make this possible, but if the item from the set is only put into an environment variable, the information is essentially lost.
> On second thought, it might be a better abstraction to copy the ThirdPartyResource allocations to the Task instead, and let the executor decide how to expose it. I think this would actually be much better because it means the scheduler doesn't have to know about specific runtimes.
I understand the reasoning, however, if we want to pursue more advanced scheduling decisions based on resources down the line, it might be worth keeping the code in the scheduler.
Though I don't believe that question should be solved in this PR :)
I'm looking at this executor thing :)
So this update is more of a draft :) I've fixed most of the comments and tried to implement the executor. I did not address naming as I'd like to have more feedback on this, so that I don't have to change the name 3-5 times :)

@aaronlehmann I've implemented the executor but I'm a bit lost. From what I understand, the solution would be to change the env in the update part, but I'm confused by when the update is called and what that would mean.

Side note: I switched to Go 1.7.5 (and regenerated the protobufs) and that doesn't seem to fix CircleCI...

    $ go version
    go version go1.7.5 linux/amd64

Thanks!
Smart.
Sorry, I think I didn't do a very good job explaining this. Basically, if I understand correctly, you want to communicate to the containers which resource was allocated for that container. For example, if there are multiple GPUs, and "GPU1" gets allocated for a task, there should be a way for the software running in that container to know which GPU it should use.

Right now, this PR sets an environment variable from the scheduler, but this doesn't really work. The main problem is that the task's Spec shouldn't be modified by the scheduler.

So what I proposed instead was adding a field to Task where the scheduler records the specific resource assignments. Then when the node receives a task, it will be able to act on this information and set the appropriate environment variables inside the container.

The interface between swarm and the containers is called the executor. We actually have two executor implementations. One, inside the swarmkit repo, is used mainly for testing, and uses the Docker REST API to create and monitor containers. The other one is inside the docker repo.

In swarmkit's executor, here is where the environment variables for the container are set: https://github.com/docker/swarmkit/blob/59d7fb2cae146ca8a911f08d60773f3e14791671/agent/exec/dockerapi/container.go#L146

It should be pretty simple to change that so that it generates whatever environment variables are implied by the resource assignments (as the scheduler does now), and adds them to that list. The task is accessible there. In the Docker executor, there's an equivalent place to make the same change.

Does this help at all?
Odd. Which version of …?
Thanks for the insights! I've updated the PR and addressed the following:

We still need to address naming (and function naming) and the leader switch. Next step is also testing.
Codecov Report
@@ Coverage Diff @@
## master #2090 +/- ##
==========================================
+ Coverage 60.1% 60.16% +0.05%
==========================================
Files 119 124 +5
Lines 19847 20122 +275
==========================================
+ Hits 11930 12106 +176
- Misses 6575 6652 +77
- Partials 1342 1364 +22
This is shaping up nicely.

I'll try to get back to you soon with a more definite recommendation on naming. I'm not sure you need anything special to handle leader switches. When the scheduler starts up on a new leader, it rebuilds its state from the existing tasks.
manager/scheduler/scheduler.go (outdated):

    var resources api.Resources
    if n.Description != nil && n.Description.Resources != nil {
    	fmt.Println("reconcile")
    	resources = *n.Description.Resources
I think this should be changed to resources = *n.Description.Resources.Copy(), now that the Resources structure contains pointers (changes to the local copy of resources shouldn't have side effects on the actual Node object).
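The aliasing hazard behind this comment can be shown in a few lines. The `Copy` method below is a hand-written stand-in for the protobuf-generated deep copy, and the types are illustrative, not swarmkit's:

```go
package main

import "fmt"

// Stand-ins for api.Resources with a pointer-valued map inside.
type Resource struct{ Val uint64 }

type Resources struct{ Generic map[string]*Resource }

// Copy returns a deep copy; stand-in for the generated Copy() method.
func (r *Resources) Copy() *Resources {
	out := &Resources{Generic: make(map[string]*Resource, len(r.Generic))}
	for k, v := range r.Generic {
		c := *v
		out.Generic[k] = &c
	}
	return out
}

func main() {
	node := &Resources{Generic: map[string]*Resource{"gpu": {Val: 4}}}

	// A plain struct copy shares the map and the pointers inside it,
	// so mutating the "local" copy reaches back into the node.
	shallow := *node
	shallow.Generic["gpu"].Val = 0
	fmt.Println(node.Generic["gpu"].Val) // node was mutated

	node.Generic["gpu"].Val = 4
	deep := node.Copy() // changes to the deep copy stay local
	deep.Generic["gpu"].Val = 0
	fmt.Println(node.Generic["gpu"].Val)
}
```

This is why `resources = *n.Description.Resources` stopped being safe once the Resources struct grew pointer fields: the dereference copies the struct, not the map it points to.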
api/thirdpartyresource/helpers.go (outdated):

    i++

    res.Set.Val[k] = voidVal
How about just using nil instead of voidVal?
api/thirdpartyresource/helpers.go (outdated):

    import (
    	"fmt"
    	"github.com/docker/swarmkit/api"
    	"strings"

We follow the convention of putting standard library imports first, then a blank line, then other imports:

    "fmt"
    "strings"

    "github.com/docker/swarmkit/api"
api/thirdpartyresource/helpers.go (outdated):

    case *api.ThirdPartyResource_Integer:
    	nr.Integer.Val -= t.Integer.Val
    	taken.Resource = &api.ThirdPartyResource_Integer{
    		Integer: &api.Integer{Val: t.Integer.Val}}
Cosmetic nit: I usually see this written as

    taken.Resource = &api.ThirdPartyResource_Integer{
    	Integer: &api.Integer{Val: t.Integer.Val},
    }

It just seems a little more "balanced" that way.
api/thirdpartyresource/helpers.go (outdated):

    i := uint64(0)
    voidVal := &api.Set_Void{}
    res := &api.ThirdPartyResource_Set{
    	Set: &api.Set{Val: make(map[string]*api.Set_Void)}}
Same comment about moving the last } onto its own line.
manager/scheduler/nodeinfo.go (outdated):

    // Maps tasks to resource taken usage: `map[taskID][resourceName]`
    // This is needed to keep track of the resources where the user specified
    // an int for a node which advertises a set for the specified resource
    ThirdPartyResourcesTaken tprTaskStore
I'm wondering if we still need this now that we have AssignedThirdPartyResources on Task. ThirdPartyResourcesTaken[taskID] seems like it could be equivalent to Tasks[taskID].AssignedThirdPartyResources.
Makes sense :) Done
dongluochen left a comment:

@RenaudWasTaken Thanks a lot for your contribution!
Thanks! When can we expect it to be merged, so that I know when to open a pull request on docker :D?
Ping @aluzzardi
Hi! @aaronlehmann, @dongluochen, I'm currently playing with the docker integration (almost done) and I noticed two small things that might need some changes here.

The first one is about the node command line arguments: I believe it makes more sense to have them in …

The second one is a small change in the …

Do you want me to update the PR (I'll have to rebase it anyway)?
I think the right model is for nodes to self-report their generic resources. This is how we know how much memory each node has, for example: the nodes send it up in the node description.

Is it difficult to automatically detect GPUs?
I agree with this change.
@aaronlehmann it's not difficult to detect GPUs if you have NVML + the NVIDIA driver, but you don't want CUDA as a dependency for docker :) You have two ways to go for this:

Well, you could have an explorer running as a daemon; it doesn't have to be your cluster admin.
Okay. So we'll need a way to tell the local Docker engine what resources you have. But I think the resources should still be sent up in the node description. As for how to tell the engine what resources it should report, …
That seems like a better idea than …

Btw, when can we expect this PR to be merged? :) Thanks!
@RenaudWasTaken: It needs a rebase. Note that we updated to Go 1.8.1 in CI, so you'll want to generate the protobuf files with that version to make sure it matches in CI.
@aaronlehmann done

Sorry, needs another rebase already! @aluzzardi: Can you please review this when you get a chance?

@RenaudWasTaken @aaronlehmann Sorry for the slow response time! I'll give this a review by EOW
Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>
Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>
Rebased!
Taking that back. If the daemon can detect resource changes via the driver, or via specification from a file, and report the change to the manager through the dispatcher, it'd work well with our current model.
@aluzzardi any updates? :)
Thanks.
Hi!

Following the different issues that were opened and closed, as well as the different pull requests which suffered the same fate, I've implemented a way to support Third Party Resources in swarmkit.

I've described my implementation choices in the following design doc: Third Party Resources.

I've also listed the different issues and PRs that were opened and closed on this subject:

I believe the main users involved were:

This proposal would solve problems for different resources such as:

Would it be possible for you to review this proposal and express your concerns / insights on whether this would fit in swarmkit, and if so, whether my code is in the best place?

The naming conventions are also something that would be up for discussion, as I don't have a very strong opinion on the current one.

Thanks!