
Allow Kubelet to watch for changes to pods scheduled on it#846

Closed
smarterclayton wants to merge 4 commits into kubernetes:master from smarterclayton:use_api_for_minion_pods

Conversation

@smarterclayton
Contributor

This replaces Kubelet config loading with a watch call on /watch/pods. It depends on Watch to report new changes.

This is one step towards allowing the kubelet to be isolated from direct access to etcd which provides better security, and allows us to version the API and schema from the apiserver using schema conversions. In the future we can add more restrictive minion authorization (minion X is scoped to retrieve only information it needs).

Implements part of #860
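The core of the change described above is that the kubelet folds a stream of watch events into its desired pod set instead of rereading etcd. A minimal sketch of that sync loop, with simplified stand-in types (Event, Pod, and applyEvents are illustrative, not the real API types):

```go
package main

import "fmt"

// EventType mirrors the watch event kinds delivered by the apiserver.
type EventType string

const (
	Added    EventType = "ADDED"
	Modified EventType = "MODIFIED"
	Deleted  EventType = "DELETED"
)

// Pod is a simplified stand-in for the real API object.
type Pod struct {
	Name string
	Host string
}

// Event pairs an event kind with the pod it concerns.
type Event struct {
	Type EventType
	Pod  Pod
}

// applyEvents folds a stream of watch events into the kubelet's desired
// pod set; this is the essence of a watch-driven sync loop.
func applyEvents(events []Event) map[string]Pod {
	desired := map[string]Pod{}
	for _, e := range events {
		switch e.Type {
		case Added, Modified:
			desired[e.Pod.Name] = e.Pod
		case Deleted:
			delete(desired, e.Pod.Name)
		}
	}
	return desired
}

func main() {
	events := []Event{
		{Added, Pod{"web", "127.0.0.1"}},
		{Added, Pod{"db", "127.0.0.1"}},
		{Deleted, Pod{"db", "127.0.0.1"}},
	}
	fmt.Println(len(applyEvents(events))) // only "web" remains
}
```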

@lavalamp
Contributor

changes to watch for resourceVersion=0 case

Yes, please. In another PR so it can go in quickly. This was on my list.

I've actually implemented watch on pods for my scheduler change. Not ready to push yet, though.

We don't need a separate API for this. Minions can just select on the field CurrentState.Host=&lt;hostname&gt;.

Is there a reason we don't want minions to use etcd?

@smarterclayton
Contributor Author

@lavalamp isolating minion from etcd means that a compromised minion can't compromise the rest of the system. Even with read only access, being able to access container data on other hosts should be prohibited. Also, putting the API in front of etcd means that the underlying structure can be versioned independent of the kubelets and other backends can be used instead of etcd.

@smarterclayton
Contributor Author

@lavalamp your watch for pods, what schema are you depending on in etcd? Preserve existing one with /kubelet? One advantage of a separate resource is that it makes the access pattern explicit for restricting authz to the info about pods - using fields directly on pods is more complex because it requires interpreting the field selector at a higher level. I would prefer not to introduce a new resource unless it has substantial benefits though.

Also, watches right now are terminating after 10s because of the writeTimeout on the default http server. I'll open a separate pull this afternoon for allowing watches to run longer (without having to allow all read/writes to continue).

@thockin
Member

thockin commented Aug 11, 2014

A few thoughts.

First, I like the idea of making the master <-> slave connection be a versioned protocol instead of a blind export of data. This clarifies exactly what structures belong to the master vs kubelet (something we sorely need to do).

But...

This changes the scaling point of the cluster from etcd, which is designed to scale horizontally, to apiserver, which is not (yet). A principle of operation internally at Google is that nodes (almost) never reach out to other services. It's just too easy to crush an under-prepared service with accidentally synchronized traffic - the thundering herd effect is very real.

I think this is one of those changes that might benefit from a short design doc - easier to discuss at that level than the code level.

@smarterclayton
Contributor Author

@thockin Happy to. I don't see the difference between etcd and the apiserver in practice here except one of them is an open security hole. Also, with the exception of operations apiserver should be horizontally scalable (or a subset of the resources therein), and I would expect us to correct that in the extreme near term.

What form of design doc - just a pull of an md to docs that describes proper isolation and communication channel?

@thockin
Member

thockin commented Aug 11, 2014

Design doc could be an md file if you think it is worth committing, or a Google doc or other if you want to discuss the design but not commit the doc.


@lavalamp
Contributor

As I understand it, the entire reason we're using etcd at all is because of scalable replication to nodes. I'm about 25% convinced by Clayton's security argument that we should change that; I'm not convinced we need to stop trusting the kubelet.

If we do stop using etcd for this case, then we can probably find a much better database type info store, since we wouldn't be using etcd for its intended use case.

Does etcd have the ability to set permissions on keys?

Also, what Tim said: if we're going to use sharded and/or replicated apiservers to distribute data to kubelets, we need to actually do some math on how many nodes a single apiserver can support. Watch may not be feasible at scale, I don't know.

@smarterclayton
Contributor Author

So I'm surprised at what you guys are describing - that you think etcd is more scalable than a horizontally stateless apiserver. Etcd provides efficient watches - that alone probably justifies its use here with a pull / incremental model, and very few other data stores will offer anything close to that. However, treating any individual kubelet as trusted is extremely dangerous - a compromise of a single node (bad password security, someone ssh'd in accidentally and didn't log out) can then get directly to etcd and reduce the cluster to a heap of smoking ruin.

I view the apiserver as providing authz and authn on top of caching, data format translation, and efficient watching semantics. The apiserver sitting in front of etcd actually reduces load on etcd (potentially by aggregating watches or caching watch results) and also allows behavior to be controlled. apiserver should be horizontal and stateless, and with efficient cache semantics and eventually consistent API patterns (which watch is a great example of) clients can get all of the value of etcd with none of the coupling.
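The watch-aggregation idea above (the apiserver holds one upstream etcd watch and fans events out to many kubelets) can be sketched as a simple broadcaster. All names here are hypothetical, and real code would need locking and slow-consumer handling:

```go
package main

import "fmt"

// Broadcaster fans a single upstream event stream (e.g. one etcd watch held
// by the apiserver) out to many downstream watchers, so N kubelets cost one
// etcd connection instead of N.
type Broadcaster struct {
	subs []chan string
}

// Subscribe registers a new downstream watcher. The buffer keeps a brief
// burst of events from blocking the sender in this sketch.
func (b *Broadcaster) Subscribe() <-chan string {
	ch := make(chan string, 16)
	b.subs = append(b.subs, ch)
	return ch
}

// Send delivers one upstream event to every subscriber.
func (b *Broadcaster) Send(event string) {
	for _, ch := range b.subs {
		ch <- event
	}
}

func main() {
	b := &Broadcaster{}
	k1, k2 := b.Subscribe(), b.Subscribe()
	b.Send("pod-added")
	fmt.Println(<-k1, <-k2) // both watchers see the same event
}
```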

@xiang90
Contributor

xiang90 commented Aug 11, 2014

@smarterclayton Watch can be horizontally scalable in a system like etcd, in theory. Operations like watch (the index guarantees consistency) can easily be distributed by having proxies. Trading off latency should be acceptable in this use case.

@smarterclayton
Contributor Author

@xiangli-cmu yeah in this case I view the apiserver as a semantically versioned etcd proxy - it's performing simple translation of a watch api call into an etcd get or watch primitive, but able to apply higher level reasoning about the semantics. The apiserver can introduce latency exactly like an etcd proxy - but the key difference is that the apiserver is the only server that knows the authz model of kubernetes - exposing the etcd store to the edge is not following the model of least exposure.

@lavalamp
Contributor

@smarterclayton I guess I have been thinking of a kubelet security breach as root-equivalent anyway. Maybe we need to have an official position on this.

It does seem prudent to give kubelets read only access to etcd if possible.

@smarterclayton
Contributor Author

It's root equivalent but if you have 500 hosts in a cluster it shouldn't escalate past root on a single box. Since there will be many more hosts than apiservers the odds of a whole system compromise increase as you add hosts, whereas I'd like it to remain constant (you can't stop a zero day kernel exploit, but you can defend against sloppy administration). Also, looking at this from a multi tenant perspective, anything in the container manifests that is secret to another tenant shouldn't be compromised by a tenant escaping their sandbox on a host unless they are collocated. So even read only access is too much since you'll see the environment variables those other hosts offer.

@thockin
Member

thockin commented Aug 11, 2014

To be clear - I am saying that etcd TODAY is designed to scale horizontally and apiserver TODAY is not. I generally agree with the idea, overall.

We also, internally, have gotten much value from the etcd-equivalent being decoupled from the apiserver-equivalent. We can hork the master, and minions keep running. That said, I think we already have that property in k8s, or we've at least discussed it.


@smarterclayton
Contributor Author

This thread is a good example of why a design doc laying out a position and the pros and cons is a good idea :)

@erictune
Contributor

I buy that the kubelet interface to the data store should be:

  • read-only where possible, to prevent accidental writes due to bugs.
  • limited to viewing only needed data to make reasoning about the system easier.

I'm not sure that it adds more than a little bit of security.
If an attacker has root on the kubelet, I can imagine a number of ways to:

  • start new pods on the same/different machine belonging to other users, that can be similarly exploited, thus widening the range of user data the attacker has access to.
  • trick the apiserver into sending a variety of new pods to the machine, thus widening the range of user data the attacker has access to.

@smarterclayton
Contributor Author

And those are known risks, but there are whole classes of compromise unrelated to running pods where the homogeneous pod environment doesn't place you at risk. Misconfiguration of host level settings, difficult to reproduce exploits combining multiple systems, or plain old "forgot to lock the screensaver in the datacenter" are valid attack surfaces.

I'd argue the strongest around API versioning and not letting people couple to the data store, and then the write access, and then info leakage. The same arguments apply to services and any other distributed "pull" client - after all, why is the replication controller more of a risk than the kubelets when it comes to access to core data?

@smarterclayton
Contributor Author

I forgot a non-root attack vector. Currently the only isolation between containers and etcd is the network configuration of the container network. It is trivially easy to misconfigure a host network such that containers could reach the etcd servers - in most production configurations allowing network access to a central datastore from the edge (even when protected) would be considered poor security. The only defense there is to configure a proxy with network isolation to etcd - it would be better for that proxy to only offer the minimal surface area, and guarding a structured apiserver proxy vs a simple http proxy against all possible data retrieval attacks is much easier.

@philips
Contributor

philips commented Aug 13, 2014

To @erictune's point, it would be great to have feedback [and code :)] from people on what sort of access control system would be useful for users of etcd. Relevant bugs:

The plan here seems fine to me but clearly etcd should provide ways to protect the kubelet API server's data and it would be good to have the etcd mechanism mesh with k8s.

@smarterclayton
Contributor Author

@philips absolutely - two layers of defense is always better than one.

@smarterclayton
Contributor Author

Open question here is: how should a subordinate parent-child relationship (minion pods) be modeled?

  1. GET /pods?field={minion=127.0.0.1} returns an api.PodList
  2. GET /minionPods/127.0.0.1 returns an api.PodList (but in the future could return something more specific)
  3. GET /minions/127.0.0.1/pods returns an api.PodList

I would prefer 1 or 2, because 3 increases client complexity. 2 is a bit awkward, but at least maps cleanly to a resource "pods on a single minion" and the resource for it could be different than 1 or 3 (so expose information specific to that minion). 2 is also cleaner for a 3rd party to implement (you don't have to know how to parse query parameters). 2 also is easier to secure (you can disallow access to /pods, but allow /minionPods directly). 3 reuses an existing resource structure.
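The "cleaner for a 3rd party to implement" point for option 2 comes down to the fact that the minion is named by the path, so no query-parameter parsing is needed. A rough sketch under that assumption (the path prefix and helper are illustrative, not the proposed API):

```go
package main

import (
	"fmt"
	"strings"
)

// minionFromPath extracts the minion ID from an option-2 style URL such as
// /minionPods/127.0.0.1. The resource is identified entirely by the path,
// so a third-party implementation never has to interpret field selectors.
func minionFromPath(path string) (string, bool) {
	const prefix = "/minionPods/"
	if !strings.HasPrefix(path, prefix) {
		return "", false
	}
	host := strings.TrimPrefix(path, prefix)
	return host, host != ""
}

func main() {
	host, ok := minionFromPath("/minionPods/127.0.0.1")
	fmt.Println(host, ok)
}
```

This shape also makes the authorization point concrete: an access rule can match the `/minionPods/` prefix without understanding pod queries at all.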

@lavalamp
Contributor

I prefer:

GET /pods?fields=DesiredState.Host%3D<minionhostname>

which I think is the current syntax and will work as soon as I get my act together and send off a PR.
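Building that query with net/url rather than by hand keeps the percent-encoding straight (the `=` inside the selector value must become `%3D`). A small sketch; podsOnMinionURL is a hypothetical helper, though the path and field name follow the comment above:

```go
package main

import (
	"fmt"
	"net/url"
)

// podsOnMinionURL builds the field-selector query with proper escaping, so
// the '=' inside the selector value is percent-encoded as %3D instead of
// being written by hand.
func podsOnMinionURL(host string) string {
	v := url.Values{}
	v.Set("fields", "DesiredState.Host="+host)
	return "/pods?" + v.Encode()
}

func main() {
	fmt.Println(podsOnMinionURL("minion-1"))
	// → /pods?fields=DesiredState.Host%3Dminion-1
}
```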

Contributor


I guess this should say API server, not etcd?

Contributor Author


fixed

@lavalamp
Contributor

Argh, sorry, I was so used to seeing this at the bottom of my queue-- I'll review today.

@smarterclayton smarterclayton force-pushed the use_api_for_minion_pods branch 2 times, most recently from eacb901 to fb99422 Compare December 19, 2014 18:34
@lavalamp
Contributor

Sorry, did not get to this today. Monday, or maybe over the weekend.

@smarterclayton
Contributor Author

No rush


@smarterclayton
Contributor Author

Rebased again, found some bugs with how the kubelet server is looking up pods by name. Stats lookup is still not using namespace #3204, and looks like container logs are being chosen randomly among recent attempts #3205.

smarterclayton added a commit to smarterclayton/kubernetes that referenced this pull request Jan 4, 2015
Discusses the current security risks posed by the kubelet->etcd pattern
and discusses some options.

Triggered by kubernetes#846 and referenced in kubernetes#859
@smarterclayton
Contributor Author

Would really appreciate some review from the various folks, unless we want to totally nuke this and wait for Eric's wider reaching change. Rebasing tends to be a bit hairy.

Disables etcd configuration loading if the --use_etcd_for_pods parameter is
false.  Depends on the Watch implementation of BoundPods and adds
Client methods for that.  Depends on Watch returning the full
state when resourceVersion is empty.
server.go is currently source agnostic, so a method should be
exposed on the Kubelet that makes it simpler for server.go to
find the proper pod.  Also, correct an error in stats retrieval
and better indicate that the code is broken.
@smarterclayton smarterclayton force-pushed the use_api_for_minion_pods branch from a12622a to c9adb3e Compare January 6, 2015 17:08
@lavalamp
Contributor

lavalamp commented Jan 6, 2015

I do apologize for sitting on this for so long :/

I think, since @erictune is back and is promising to work on making kubelet read pods directly, that it makes sense to just wait for that change to happen--it doesn't seem great to add a REST object that we plan to get rid of in a week or two.

However I would like it if "Watching on invalid label/field selectors should error" would show up in its own PR, that seems useful generally.

@smarterclayton
Contributor Author

Extracted other fixes that aren't BoundPods specific in #3270, will leave this here until Eric's changes land and then close it.

@lavalamp
Contributor

@erictune's change has been merged.

@lavalamp lavalamp closed this Jan 15, 2015
@erictune
Contributor

for posterity, the last one was: #3376

vishh pushed a commit to vishh/kubernetes that referenced this pull request Apr 6, 2016
Fix cgroup name parsing logic in ps output for centos6.
xingzhou pushed a commit to xingzhou/kubernetes that referenced this pull request Dec 15, 2016
Discusses the current security risks posed by the kubelet->etcd pattern
and discusses some options.

Triggered by kubernetes#846 and referenced in kubernetes#859
seans3 pushed a commit to seans3/kubernetes that referenced this pull request Apr 10, 2019
wgahnagl pushed a commit to wgahnagl/kubernetes that referenced this pull request Jul 7, 2021
…y-go

UPSTREAM: <drop>: bump(apiserver-library-go)

Labels

area/api Indicates an issue on api area. area/etcd area/security
