
Allow Kubelet to watch for changes to pods scheduled on it#846

Closed
smarterclayton wants to merge 4 commits into kubernetes:master from smarterclayton:use_api_for_minion_pods

Conversation

@smarterclayton
Contributor

This replaces Kubelet config loading with a watch call on /watch/pods. It depends on Watch to report new changes.

This is one step towards allowing the kubelet to be isolated from direct access to etcd which provides better security, and allows us to version the API and schema from the apiserver using schema conversions. In the future we can add more restrictive minion authorization (minion X is scoped to retrieve only information it needs).

Implements part of #860
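The core of the change described above is that the kubelet folds a stream of watch events into its desired pod set instead of rereading etcd. A minimal sketch of that sync loop, with simplified stand-in types (Event, Pod, and applyEvents are illustrative, not the real API types):

```go
package main

import "fmt"

// EventType mirrors the watch event kinds delivered by the apiserver.
type EventType string

const (
	Added    EventType = "ADDED"
	Modified EventType = "MODIFIED"
	Deleted  EventType = "DELETED"
)

// Pod is a simplified stand-in for the real API object.
type Pod struct {
	Name string
	Host string
}

// Event pairs an event kind with the pod it concerns.
type Event struct {
	Type EventType
	Pod  Pod
}

// applyEvents folds a stream of watch events into the kubelet's desired
// pod set; this is the essence of a watch-driven sync loop.
func applyEvents(events []Event) map[string]Pod {
	desired := map[string]Pod{}
	for _, e := range events {
		switch e.Type {
		case Added, Modified:
			desired[e.Pod.Name] = e.Pod
		case Deleted:
			delete(desired, e.Pod.Name)
		}
	}
	return desired
}

func main() {
	events := []Event{
		{Added, Pod{"web", "127.0.0.1"}},
		{Added, Pod{"db", "127.0.0.1"}},
		{Deleted, Pod{"db", "127.0.0.1"}},
	}
	fmt.Println(len(applyEvents(events))) // only "web" remains
}
```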

@lavalamp
Contributor

changes to watch for resourceVersion=0 case

Yes, please. In another PR so it can go in quickly. This was on my list.

I've actually implemented watch on pods for my scheduler change. Not ready to push yet, though.

We don't need a separate API for this. Minions can just select on the field CurrentState.Host=&lt;hostname&gt;.

Is there a reason we don't want minions to use etcd?

@smarterclayton
Contributor Author

@lavalamp isolating minion from etcd means that a compromised minion can't compromise the rest of the system. Even with read only access, being able to access container data on other hosts should be prohibited. Also, putting the API in front of etcd means that the underlying structure can be versioned independent of the kubelets and other backends can be used instead of etcd.

@smarterclayton
Contributor Author

@lavalamp your watch for pods, what schema are you depending on in etcd? Preserve existing one with /kubelet? One advantage of a separate resource is that it makes the access pattern explicit for restricting authz to the info about pods - using fields directly on pods is more complex because it requires interpreting the field selector at a higher level. I would prefer not to introduce a new resource unless it has substantial benefits though.

Also, watches right now are terminating after 10s because of the writeTimeout on the default http server. I'll open a separate pull this afternoon for allowing watches to run longer (without having to allow all read/writes to continue).

@thockin
Member

thockin commented Aug 11, 2014

A few thoughts.

First, I like the idea of making the master <-> slave connection be a versioned protocol instead of a blind export of data. This clarifies exactly what structures belong to the master vs kubelet (something we sorely need to do).

But...

This changes the scaling point of the cluster from etcd, which is designed to scale horizontally, to apiserver, which is not (yet). A principle of operation internally at Google is that nodes (almost) never reach out to other services. It's just too easy to crush an under-prepared service with accidentally synchronized traffic - the thundering herd effect is very real.

I think this is one of those changes that might benefit from a short design doc - easier to discuss at that level than the code level.

@smarterclayton
Contributor Author

@thockin Happy to. I don't see the difference between etcd and the apiserver in practice here except one of them is an open security hole. Also, with the exception of operations apiserver should be horizontally scalable (or a subset of the resources therein), and I would expect us to correct that in the extreme near term.

What form of design doc - just a pull of an md to docs that describes proper isolation and communication channel?

@thockin
Member

thockin commented Aug 11, 2014

Design doc could be an md file if you think it is worth committing, or a Google doc or other if you want to discuss the design but not commit the doc.


@lavalamp
Contributor

As I understand it, the entire reason we're using etcd at all is because of scalable replication to nodes. I'm about 25% convinced by Clayton's security argument that we should change that; I'm not convinced we need to stop trusting the kubelet.

If we do stop using etcd for this case, then we can probably find a much better database type info store, since we wouldn't be using etcd for its intended use case.

Does etcd have the ability to set permissions on keys?

Also, what Tim said: if we're going to use sharded and/or replicated apiservers to distribute data to kubelets, we need to actually do some math on how many nodes a single apiserver can support. Watch may not be feasible at scale, I don't know.

@smarterclayton
Contributor Author

So I'm surprised at what you guys are describing - that you think etcd is more scalable than a horizontally stateless apiserver. Etcd provides efficient watches - that alone probably justifies its use here with a pull / incremental model, and very few other data stores will offer anything close to that. However, treating any individual kubelet as trusted is extremely dangerous - a compromise of a single node (bad password security, someone ssh'd in accidentally and didn't log out) can then get directly to etcd and reduce the cluster to a heap of smoking ruin.

I view the apiserver as providing authz and authn on top of caching, data format translation, and efficient watching semantics. The apiserver sitting in front of etcd actually reduces load on etcd (potentially by aggregating watches or caching watch results) and also allows behavior to be controlled. apiserver should be horizontal and stateless, and with efficient cache semantics and eventually consistent API patterns (which watch is a great example of) clients can get all of the value of etcd with none of the coupling.
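The watch-aggregation idea above (the apiserver holds one upstream etcd watch and fans events out to many kubelets) can be sketched as a simple broadcaster. All names here are hypothetical, and real code would need locking and slow-consumer handling:

```go
package main

import "fmt"

// Broadcaster fans a single upstream event stream (e.g. one etcd watch held
// by the apiserver) out to many downstream watchers, so N kubelets cost one
// etcd connection instead of N.
type Broadcaster struct {
	subs []chan string
}

// Subscribe registers a new downstream watcher. The buffer keeps a brief
// burst of events from blocking the sender in this sketch.
func (b *Broadcaster) Subscribe() <-chan string {
	ch := make(chan string, 16)
	b.subs = append(b.subs, ch)
	return ch
}

// Send delivers one upstream event to every subscriber.
func (b *Broadcaster) Send(event string) {
	for _, ch := range b.subs {
		ch <- event
	}
}

func main() {
	b := &Broadcaster{}
	k1, k2 := b.Subscribe(), b.Subscribe()
	b.Send("pod-added")
	fmt.Println(<-k1, <-k2) // both watchers see the same event
}
```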

@xiang90
Contributor

xiang90 commented Aug 11, 2014

@smarterclayton Watch can be horizontally scalable in a system like etcd, in theory. Operations like watch (the index guarantees consistency) can easily be distributed by having proxies. Trading off latency should be acceptable in this use case.

@smarterclayton
Contributor Author

@xiangli-cmu yeah in this case I view the apiserver as a semantically versioned etcd proxy - it's performing simple translation of a watch api call into an etcd get or watch primitive, but able to apply higher level reasoning about the semantics. The apiserver can introduce latency exactly like an etcd proxy - but the key difference is that the apiserver is the only server that knows the authz model of kubernetes - exposing the etcd store to the edge is not following the model of least exposure.

@lavalamp
Contributor

@smarterclayton I guess I have been thinking of a kubelet security breach as root-equivalent anyway. Maybe we need to have an official position on this.

It does seem prudent to give kubelets read only access to etcd if possible.

@smarterclayton
Contributor Author

It's root equivalent but if you have 500 hosts in a cluster it shouldn't escalate past root on a single box. Since there will be many more hosts than apiservers the odds of a whole system compromise increase as you add hosts, whereas I'd like it to remain constant (you can't stop a zero day kernel exploit, but you can defend against sloppy administration). Also, looking at this from a multi tenant perspective, anything in the container manifests that is secret to another tenant shouldn't be compromised by a tenant escaping their sandbox on a host unless they are collocated. So even read only access is too much since you'll see the environment variables those other hosts offer.

@thockin
Member

thockin commented Aug 11, 2014

To be clear - I am saying that etcd TODAY is designed to scale horizontally and apiserver TODAY is not. I generally agree with the idea, overall.

We also, internally, have gotten much value from the etcd-equivalent being decoupled from the apiserver-equivalent. We can hork the master, and minions keep running. That said, I think we already have that property in k8s, or we've at least discussed it.


@smarterclayton
Contributor Author

This thread is a good example of why a design doc laying out a position and the pros and cons is a good idea :)

@erictune
Contributor

I buy that the kubelet interface to the data store should be:

  • read-only where possible, to prevent accidental writes due to bugs.
  • limited to viewing only needed data to make reasoning about the system easier.

I'm not sure that it adds more than a little bit of security.
If an attacker has root on the kubelet, I can imagine a number of ways to:

  • start new pods on the same/different machine belonging to other users, that can be similarly exploited, thus widening the range of user data the attacker has access to.
  • trick the apiserver into sending a variety of new pods to the machine, thus widening the range of user data the attacker has access to.

@smarterclayton
Contributor Author

And those are known risks, but there are whole classes of compromise unrelated to running pods where the homogeneous pod environment doesn't place you at risk. Misconfiguration of host level settings, difficult to reproduce exploits combining multiple systems, or plain old "forgot to lock the screensaver in the datacenter" are valid attack surfaces.

I'd argue the strongest around API versioning and not letting people couple to the data store, and then the write access, and then info leakage. The same arguments apply to services and any other distributed "pull" client - after all, why is the replication controller more of a risk than the kubelets when it comes to access to core data?

@smarterclayton
Contributor Author

I forgot a non-root attack vector. Currently the only isolation between containers and etcd is the network configuration of the container network. It is trivially easy to misconfigure a host network such that containers could reach the etcd servers - in most production configurations allowing network access to a central datastore from the edge (even when protected) would be considered poor security. The only defense there is to configure a proxy with network isolation to etcd - it would be better for that proxy to only offer the minimal surface area, and guarding a structured apiserver proxy vs a simple http proxy against all possible data retrieval attacks is much easier.

@philips
Contributor

philips commented Aug 13, 2014

To @erictune's point, it would be great to have feedback [and code :)] from people on what sort of access control system would be useful for users of etcd. Relevant bugs:

The plan here seems fine to me but clearly etcd should provide ways to protect the kubelet API server's data and it would be good to have the etcd mechanism mesh with k8s.

@smarterclayton
Contributor Author

@philips absolutely - two layers of defense is always better than one.

@smarterclayton
Contributor Author

Open question here is: how should a subordinate parent-child relationship (minion pods) be modeled?

  1. GET /pods?field={minion=127.0.0.1} returns an api.PodList
  2. GET /minionPods/127.0.0.1 returns an api.PodList (but in the future could return something more specific)
  3. GET /minions/127.0.0.1/pods returns an api.PodList

I would prefer 1 or 2, because 3 increases client complexity. 2 is a bit awkward, but at least maps cleanly to a resource "pods on a single minion" and the resource for it could be different than 1 or 3 (so expose information specific to that minion). 2 is also cleaner for a 3rd party to implement (you don't have to know how to parse query parameters). 2 also is easier to secure (you can disallow access to /pods, but allow /minionPods directly). 3 reuses an existing resource structure.
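The "cleaner for a 3rd party to implement" point for option 2 comes down to the fact that the minion is named by the path, so no query-parameter parsing is needed. A rough sketch under that assumption (the path prefix and helper are illustrative, not the proposed API):

```go
package main

import (
	"fmt"
	"strings"
)

// minionFromPath extracts the minion ID from an option-2 style URL such as
// /minionPods/127.0.0.1. The resource is identified entirely by the path,
// so a third-party implementation never has to interpret field selectors.
func minionFromPath(path string) (string, bool) {
	const prefix = "/minionPods/"
	if !strings.HasPrefix(path, prefix) {
		return "", false
	}
	host := strings.TrimPrefix(path, prefix)
	return host, host != ""
}

func main() {
	host, ok := minionFromPath("/minionPods/127.0.0.1")
	fmt.Println(host, ok)
}
```

This shape also makes the authorization point concrete: an access rule can match the `/minionPods/` prefix without understanding pod queries at all.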

@lavalamp
Contributor

I prefer:

GET /pods?fields=DesiredState.Host%3D<minionhostname>

which I think is the current syntax and will work as soon as I get my act together and send off a PR.
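Building that query with net/url rather than by hand keeps the percent-encoding straight (the `=` inside the selector value must become `%3D`). A small sketch; podsOnMinionURL is a hypothetical helper, though the path and field name follow the comment above:

```go
package main

import (
	"fmt"
	"net/url"
)

// podsOnMinionURL builds the field-selector query with proper escaping, so
// the '=' inside the selector value is percent-encoded as %3D instead of
// being written by hand.
func podsOnMinionURL(host string) string {
	v := url.Values{}
	v.Set("fields", "DesiredState.Host="+host)
	return "/pods?" + v.Encode()
}

func main() {
	fmt.Println(podsOnMinionURL("minion-1"))
	// → /pods?fields=DesiredState.Host%3Dminion-1
}
```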

Contributor


I guess this should say API server, not etcd?

Contributor Author


fixed

@lavalamp
Contributor

Argh, sorry, I was so used to seeing this at the bottom of my queue-- I'll review today.

@smarterclayton smarterclayton force-pushed the use_api_for_minion_pods branch 2 times, most recently from eacb901 to fb99422 Compare December 19, 2014 18:34
@lavalamp
Contributor

Sorry, did not get to this today. Monday, or maybe over the weekend.

@smarterclayton
Contributor Author

No rush


@smarterclayton
Contributor Author

Rebased again, found some bugs with how the kubelet server is looking up pods by name. Stats lookup is still not using namespace #3204, and looks like container logs are being chosen randomly among recent attempts #3205.

smarterclayton added a commit to smarterclayton/kubernetes that referenced this pull request Jan 4, 2015
Discusses the current security risks posed by the kubelet->etcd pattern
and discusses some options.

Triggered by kubernetes#846 and referenced in kubernetes#859
@smarterclayton
Contributor Author

Would really appreciate some review from the various folks, unless we want to totally nuke this and wait for Eric's wider reaching change. Rebasing tends to be a bit hairy.

Disables etcd configuration loading if the --use_etcd_for_pods parameter is
false.  Depends on the Watch implementation of BoundPods and adds
Client methods for that.  Depends on Watch returning the full
state when resourceVersion is empty.
server.go is currently source agnostic, so a method should be
exposed on the Kubelet that makes it simpler for server.go to
find the proper pod.  Also, correct an error in stats retrieval
and better indicate that the code is broken.
@smarterclayton smarterclayton force-pushed the use_api_for_minion_pods branch from a12622a to c9adb3e Compare January 6, 2015 17:08
@lavalamp
Contributor

lavalamp commented Jan 6, 2015

I do apologize for sitting on this for so long :/

I think, since @erictune is back and is promising to work on making kubelet read pods directly, that it makes sense to just wait for that change to happen--it doesn't seem great to add a REST object that we plan to get rid of in a week or two.

However I would like it if "Watching on invalid label/field selectors should error" would show up in its own PR, that seems useful generally.

@smarterclayton
Contributor Author

Extracted other fixes that aren't BoundPods specific in #3270, will leave this here until Eric's changes land and then close it.

@lavalamp
Contributor

@erictune's change has been merged.

@lavalamp lavalamp closed this Jan 15, 2015
@erictune
Contributor

for posterity, the last one was: #3376

vishh pushed a commit to vishh/kubernetes that referenced this pull request Apr 6, 2016
Fix cgroup name parsing logic in ps output for centos6.
xingzhou pushed a commit to xingzhou/kubernetes that referenced this pull request Dec 15, 2016
Discusses the current security risks posed by the kubelet->etcd pattern
and discusses some options.

Triggered by kubernetes#846 and referenced in kubernetes#859
seans3 pushed a commit to seans3/kubernetes that referenced this pull request Apr 10, 2019
wgahnagl pushed a commit to wgahnagl/kubernetes that referenced this pull request Jul 7, 2021
…y-go

UPSTREAM: <drop>: bump(apiserver-library-go)

Labels

area/api Indicates an issue on api area. area/etcd area/security
