Allow Kubelet to watch for changes to pods scheduled on it #846
smarterclayton wants to merge 4 commits into kubernetes:master
Conversation
Yes, please. In another PR so it can go in quickly. This was on my list. I've actually implemented watch on pods for my scheduler change. Not ready to push yet, though. We don't need a separate API for this. Minions can just select on field CurrentState.Host=. Is there a reason we don't want minions to use etcd? |
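The selection @lavalamp describes can be sketched as a plain HTTP watch against the apiserver. This Go sketch is illustrative only - the endpoint path, the `fields` parameter name, and the line-delimited event framing are assumptions for the example, not the exact contract of the API:

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"net/url"
)

// buildWatchURL composes the kind of request described above: no dedicated
// per-minion API, just a watch on pods filtered by the field
// CurrentState.Host. Path and parameter names are illustrative.
func buildWatchURL(apiserver, hostname string) string {
	q := url.Values{}
	q.Set("fields", "CurrentState.Host="+hostname)
	return apiserver + "/api/v1beta1/watch/pods?" + q.Encode()
}

// watchMyPods long-polls the watch endpoint and prints each streamed event
// line until the server closes the connection.
func watchMyPods(apiserver, hostname string) error {
	resp, err := http.Get(buildWatchURL(apiserver, hostname))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		fmt.Println("event:", scanner.Text())
	}
	return scanner.Err()
}

func main() {
	// Print the URL only; actually watching requires a live apiserver.
	fmt.Println(buildWatchURL("http://127.0.0.1:8080", "minion-1"))
}
```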
|
@lavalamp isolating minion from etcd means that a compromised minion can't compromise the rest of the system. Even with read only access, being able to access container data on other hosts should be prohibited. Also, putting the API in front of etcd means that the underlying structure can be versioned independent of the kubelets and other backends can be used instead of etcd. |
|
@lavalamp your watch for pods, what schema are you depending on in etcd? Preserve existing one with /kubelet? One advantage of a separate resource is that it makes the access pattern explicit for restricting authz to the info about pods - using fields directly on pods is more complex because it requires interpreting the field selector at a higher level. I would prefer not to introduce a new resource unless it has substantial benefits though. Also, watches right now are terminating after 10s because of the writeTimeout on the default http server. I'll open a separate pull this afternoon for allowing watches to run longer (without having to allow all read/writes to continue). |
|
A few thoughts. First, I like the idea of making the master <-> slave connection be a versioned protocol instead of a blind export of data. This clarifies exactly what structures belong to the master vs kubelet (something we sorely need to do). But... This changes the scaling point of the cluster from etcd, which is designed to scale horizontally, to apiserver, which is not (yet). A principle of operation internally at Google is that nodes (almost) never reach out to other services. It's just too easy to crush an under-prepared service with accidentally synchronized traffic - the thundering herd effect is very real. I think this is one of those changes that might benefit from a short design doc - easier to discuss at that level than the code level. |
|
@thockin Happy to. I don't see the difference between etcd and the apiserver in practice here except one of them is an open security hole. Also, with the exception of operations apiserver should be horizontally scalable (or a subset of the resources therein), and I would expect us to correct that in the extreme near term. What form of design doc - just a pull of an md to docs that describes proper isolation and communication channel? |
|
Design doc could be an md file if you think it is worth committing, or a …
|
|
As I understand it, the entire reason we're using etcd at all is scalable replication to nodes. I'm about 25% convinced by Clayton's security argument that we should change that - I'm not convinced we need to stop trusting kubelet. If we do stop using etcd for this case, then we can probably find a database much better suited to this kind of info store, since we wouldn't be using etcd for its intended use case. Does etcd have the ability to set permissions on keys? Also, what Tim said: if we're going to use sharded and/or replicated apiservers to distribute data to kubelets, we need to actually do some math on how many nodes a single apiserver can support. Watch may not be feasible at scale, I don't know. |
|
So I'm surprised at what you guys are describing - that you think etcd is more scalable than a horizontally stateless apiserver. Etcd provides efficient watches - that alone probably justifies its use here with a pull / incremental model, and very few other data stores will offer anything close to that. However, treating any individual kubelet as trusted is extremely dangerous - a compromise of a single node (bad password security, someone ssh'd in accidentally and didn't log out) can then get directly to etcd and reduce the cluster to a heap of smoking ruin. I view the apiserver as providing authz and authn on top of caching, data format translation, and efficient watching semantics. The apiserver sitting in front of etcd actually reduces load on etcd (potentially by aggregating watches or caching watch results) and also allows behavior to be controlled. apiserver should be horizontal and stateless, and with efficient cache semantics and eventually consistent API patterns (which watch is a great example of) clients can get all of the value of etcd with none of the coupling. |
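The watch-aggregation claim can be made concrete: the apiserver holds one upstream etcd watch and multiplexes its events to any number of kubelet watchers, so etcd's watcher count stays constant as nodes are added. A minimal, hypothetical fan-out sketch (not the apiserver's actual implementation):

```go
package main

import (
	"fmt"
	"sync"
)

// fanOut multiplexes one upstream (etcd) event stream to many downstream
// subscribers, so the backing store sees a single watcher regardless of
// cluster size.
type fanOut struct {
	mu   sync.Mutex
	subs []chan string
}

// Subscribe registers a new downstream watcher (e.g. a kubelet connection).
func (f *fanOut) Subscribe() <-chan string {
	f.mu.Lock()
	defer f.mu.Unlock()
	ch := make(chan string, 16) // small buffer so Broadcast rarely blocks
	f.subs = append(f.subs, ch)
	return ch
}

// Broadcast relays one upstream event to every subscriber.
func (f *fanOut) Broadcast(event string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	for _, ch := range f.subs {
		ch <- event
	}
}

func main() {
	f := &fanOut{}
	a, b := f.Subscribe(), f.Subscribe()
	f.Broadcast("pod-added")
	fmt.Println(<-a, <-b)
}
```

A production version would also need per-subscriber backpressure handling and unsubscription; this only shows the load-reduction shape of the argument.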
|
@smarterclayton Watch can be horizontally scalable in a system like etcd, in theory. Operations like watch (the index guarantees consistency) can be easily distributed by having proxies. Trading off latency should be acceptable in this use case. |
|
@xiangli-cmu yeah in this case I view the apiserver as a semantically versioned etcd proxy - it's performing simple translation of a watch api call into an etcd get or watch primitive, but able to apply higher level reasoning about the semantics. The apiserver can introduce latency exactly like an etcd proxy - but the key difference is that the apiserver is the only server that knows the authz model of kubernetes - exposing the etcd store to the edge is not following the model of least exposure. |
|
@smarterclayton I guess I have been thinking of a kubelet security breach as root-equivalent anyway. Maybe we need to have an official position on this. It does seem prudent to give kubelets read only access to etcd if possible. |
|
It's root equivalent but if you have 500 hosts in a cluster it shouldn't escalate past root on a single box. Since there will be many more hosts than apiservers the odds of a whole system compromise increase as you add hosts, whereas I'd like it to remain constant (you can't stop a zero day kernel exploit, but you can defend against sloppy administration). Also, looking at this from a multi tenant perspective, anything in the container manifests that is secret to another tenant shouldn't be compromised by a tenant escaping their sandbox on a host unless they are collocated. So even read only access is too much since you'll see the environment variables those other hosts offer. |
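The scaling-of-risk argument is simple arithmetic: with n hosts each independently compromised with probability p over some window, the chance that at least one is compromised is 1-(1-p)^n, which climbs toward certainty as the cluster grows. A tiny illustration with made-up numbers:

```go
package main

import (
	"fmt"
	"math"
)

// probAnyCompromised returns the probability that at least one of n hosts
// is compromised, given independent per-host probability p. The numbers
// used below are purely illustrative.
func probAnyCompromised(p float64, n int) float64 {
	return 1 - math.Pow(1-p, float64(n))
}

func main() {
	// With 500 hosts, even a small per-host risk makes a compromise of
	// *some* host nearly certain - hence the goal of containing it there.
	fmt.Printf("p=0.01, n=10:  %.3f\n", probAnyCompromised(0.01, 10))
	fmt.Printf("p=0.01, n=500: %.3f\n", probAnyCompromised(0.01, 500))
}
```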
|
To be clear - I am saying that etcd TODAY is designed to scale horizontally. We also, internally, have gotten much value from the etcd-equivalent being …
|
|
This thread is a good example of why a design doc laying out a position and the pros and cons is a good idea :) |
|
I buy that the kubelet interface to the data store should be:
I'm not sure that it adds more than a little bit of security.
|
|
And those are known risks, but there are whole classes of compromise unrelated to running pods where the homogeneous pod environment doesn't place you at risk. Misconfiguration of host-level settings, or difficult-to-reproduce exploits combining multiple systems, or plain old "forgot to lock the screensaver in the datacenter" are valid attack surfaces. I'd argue most strongly for API versioning and not letting people couple to the data store, then for limiting write access, and then for limiting info leakage. The same arguments apply to services and any other distributed "pull" client - after all, why is the replication controller more of a risk than the kubelets when it comes to access to core data? |
|
I forgot a non-root attack vector. Currently the only isolation between containers and etcd is the network configuration of the container network. It is trivially easy to misconfigure a host network such that containers could reach the etcd servers - in most production configurations allowing network access to a central datastore from the edge (even when protected) would be considered poor security. The only defense there is to configure a proxy with network isolation to etcd - it would be better for that proxy to only offer the minimal surface area, and guarding a structured apiserver proxy vs a simple http proxy against all possible data retrieval attacks is much easier. |
|
To @erictune's point, it would be great to have feedback [and code :)] from people on what sort of access control system would be useful for users of etcd. Relevant bugs: The plan here seems fine to me, but clearly etcd should provide ways to protect the kubelet API server's data, and it would be good to have the etcd mechanism mesh with k8s. |
|
@philips absolutely - two layers of defense is always better than one. |
|
Open question here is - how should a subordinate parent-child relationship (minion -> pods) be modeled?
I would prefer 1 or 2, because 3 increases client complexity. 2 is a bit awkward, but at least maps cleanly to a resource "pods on a single minion" and the resource for it could be different than 1 or 3 (so expose information specific to that minion). 2 is also cleaner for a 3rd party to implement (you don't have to know how to parse query parameters). 2 also is easier to secure (you can disallow access to /pods, but allow /minionPods directly). 3 reuses an existing resource structure. |
|
I prefer: … which I think is the current syntax and will work as soon as I get my act together and send off a PR.
pkg/kubelet/config/api.go
Outdated
I guess this should say API server, not etcd?
|
Argh, sorry, I was so used to seeing this at the bottom of my queue-- I'll review today. |
Force-pushed eacb901 to fb99422
|
Sorry, did not get to this today. Monday, or maybe over the weekend. |
|
No rush
|
Force-pushed fb99422 to a12622a
Discusses the current security risks posed by the kubelet->etcd pattern and some options for addressing them. Triggered by kubernetes#846 and referenced in kubernetes#859
Will allow an API consumer to fetch a versioned list of bound pods.
|
Would really appreciate some review from the various folks, unless we want to totally nuke this and wait for Eric's wider reaching change. Rebasing tends to be a bit hairy. |
Disables etcd configuration loading if the --use_etcd_for_pods parameter is false. Depends on the Watch implementation of BoundPods and adds Client methods for that. Depends on Watch returning the full state when resourceVersion is empty.
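How such a flag might gate source selection, as a sketch - only the `--use_etcd_for_pods` name comes from the commit; the source names and wiring below are hypothetical:

```go
package main

import "fmt"

// podSources sketches how a flag like --use_etcd_for_pods could gate which
// config sources the kubelet wires up at startup.
func podSources(useEtcdForPods bool) []string {
	sources := []string{"file", "http"}
	if useEtcdForPods {
		// Legacy path: the kubelet reads its bound pods straight from etcd.
		sources = append(sources, "etcd")
	} else {
		// New path: the kubelet watches the apiserver instead.
		sources = append(sources, "apiserver-watch")
	}
	return sources
}

func main() {
	fmt.Println("etcd enabled: ", podSources(true))
	fmt.Println("etcd disabled:", podSources(false))
}
```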
server.go is currently source agnostic, so a method should be exposed on the Kubelet that makes it simpler for server.go to find the proper pod. Also, corrects an error in stats retrieval and better indicates that the code is broken.
Force-pushed a12622a to c9adb3e
|
I do apologize for sitting on this for so long :/ I think, since @erictune is back and is promising to work on making kubelet read pods directly, that it makes sense to just wait for that change to happen - it doesn't seem great to add a REST object that we plan to get rid of in a week or two. However, I would like it if "Watching on invalid label/field selectors should error" showed up in its own PR; that seems useful generally. |
|
Extracted other fixes that aren't BoundPods specific in #3270, will leave this here until Eric's changes land and then close it. |
|
@erictune's change has been merged. |
|
for posterity, the last one was: #3376 |
This replaces Kubelet config loading with a watch call on /watch/pods. It depends on Watch to inform of new changes. This is one step towards allowing the kubelet to be isolated from direct access to etcd, which provides better security, and allows us to version the API and schema from the apiserver using schema conversions. In the future we can add more restrictive minion authorization (minion X is scoped to retrieve only the information it needs).
Implements part of #860
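The resourceVersion contract this commit depends on - an empty version means "send the full current state first", and each event's version is the resume point after a disconnect - can be sketched as a list-then-watch loop. All types and the fake client here are hypothetical stand-ins, not the PR's actual code:

```go
package main

import "fmt"

// Event is a hypothetical stand-in for a watch event; only the
// resourceVersion semantics matter for this sketch.
type Event struct {
	Type            string // "ADDED", "MODIFIED", "DELETED"
	ResourceVersion string
	Pod             string
}

// watchLoop starts with an empty resourceVersion, so the first response
// carries the complete current pod set, then resumes from the last seen
// version on each reconnect. Returns the final version for inspection.
func watchLoop(watch func(resourceVersion string) []Event, rounds int) string {
	rv := "" // empty => server returns full state, not just deltas
	for i := 0; i < rounds; i++ { // a real kubelet would loop forever
		for _, ev := range watch(rv) {
			fmt.Println(ev.Type, ev.Pod)
			rv = ev.ResourceVersion // resume point if the connection drops
		}
	}
	return rv
}

func main() {
	// A fake server: full state on the first call, a delta afterwards.
	fake := func(rv string) []Event {
		if rv == "" {
			return []Event{{"ADDED", "5", "pod-a"}, {"ADDED", "5", "pod-b"}}
		}
		return []Event{{"MODIFIED", "6", "pod-a"}}
	}
	fmt.Println("last version:", watchLoop(fake, 2))
}
```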