add optional mutation checks for shared informer cache#27784
add optional mutation checks for shared informer cache#27784k8s-github-robot merged 1 commit intokubernetes:masterfrom
Conversation
|
I think you need to also edit hack/jenkins/e2e-runner.sh / cc @pwittrock |
There was a problem hiding this comment.
How many of these threads are running? How will the panic interact with GC? Is there a chance of running into an issue like #23210 ?
There was a problem hiding this comment.
How many of these threads are running? How will the panic interact with GC? Is there a chance of running into an issue like #23210 ?
I don't think that is the same as a timer.
I'm not sure what the panic does to GC, but you'll notice that we aren't attempting to recover from it. We want to die. We'll set this env var during testing to try to flush out naughty controllers.
There was a problem hiding this comment.
Would this code catch the bug in deployment we recently had?
There was a problem hiding this comment.
Would this code catch the bug in deployment we recently had?
There were two, both related to annotation map mutation on shallow copies. One was a consistent failure (and this code was actually how we found it). The other would have been a flake, but the window for catching it would be small since it change and changed back very quickly.
There was a problem hiding this comment.
I have a hard time convincing myself this is worth it when transient
changes will probably slip by and even constant changes that happen rarely
will merely result in flaky behavior.
OTOH I don't have better ideas.
On Thu, Aug 18, 2016 at 5:42 AM, David Eads notifications@github.com
wrote:
In pkg/controller/framework/mutation_detector.go
#27784 (comment)
:+// cacheObj holds the actual object and a copy
+type cacheObj struct {
- cached interface{}
- copied interface{}
+}
+func (d *defaultCacheMutationDetector) Run(stopCh <-chan struct{}) {
- // we DON'T want protection from panics. If we're running this code, we want to die
- go func() {
for {d.CompareObjects()
select {case <-stopCh:returncase <-time.After(d.period):Would this code catch the bug in deployment we recently had?
There were two, both related to annotation map mutation on shallow copies.
One was a consistent failure (and this code was actually how we found it).
The other would have been a flake, but the window for catching it would be
small since it change and changed back very quickly.—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/kubernetes/kubernetes/pull/27784/files/0cea212f3d60e5ceb403607af009ef1633d7bbfa#r75300013,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAnglrlqXS87dNKbNNRrZHDefT7B1W2lks5qhFMdgaJpZM4I69qi
.
There was a problem hiding this comment.
I have a hard time convincing myself this is worth it when transient
changes will probably slip by and even constant changes that happen rarely
will merely result in flaky behavior.OTOH I don't have better ideas.
This has caught problems, both in openshift and kube. The problems aren't imaginary and they are difficult for reviewers to catch.
There was a problem hiding this comment.
OK. I will stop objecting.
On Thu, Aug 18, 2016 at 1:08 PM, David Eads notifications@github.com
wrote:
In pkg/controller/framework/mutation_detector.go
#27784 (comment)
:+// cacheObj holds the actual object and a copy
+type cacheObj struct {
- cached interface{}
- copied interface{}
+}
+func (d *defaultCacheMutationDetector) Run(stopCh <-chan struct{}) {
- // we DON'T want protection from panics. If we're running this code, we want to die
- go func() {
for {d.CompareObjects()
select {case <-stopCh:returncase <-time.After(d.period):I have a hard time convincing myself this is worth it when transient
changes will probably slip by and even constant changes that happen rarely
will merely result in flaky behavior.OTOH I don't have better ideas.
This has caught problems, both in openshift and kube. The problems aren't
imaginary and they are difficult for reviewers to catch.—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/kubernetes/kubernetes/pull/27784/files/0cea212f3d60e5ceb403607af009ef1633d7bbfa#r75379572,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAngloIYVTdJqYfHbnO9ywwCNWu8BAE5ks5qhLvPgaJpZM4I69qi
.
|
For the time being jenkins calls |
| func (dummyMutationDetector) CompareObjects() { | ||
| } | ||
|
|
||
| // cacheMutationDetector gives a way to detect if a cached object has been mutated |
There was a problem hiding this comment.
s/cacheMutationDetector/defaultCacheMutationDetector/
|
Good idea. Two nits aside, lgtm |
|
@deads2k where was this added? Would it be part of the node e2e ginkgo tests? Does it need to be added to them? |
I've tried to add them it to all the e2e runs, but I'm not certain which of the numerous scripts needs to be modified. |
|
@vishh could you take a look to make sure this does what we need for node e2e? |
|
@deads2k You need to edit |
|
BTW is there an issue that warrants this PR? |
|
It's primarily defensive, but the implications of a cache mutation are very On Fri, Jun 24, 2016 at 5:15 PM, Vish Kannan notifications@github.com
|
|
Ack. This feature has to be made discover-able somehow. On Fri, Jun 24, 2016 at 4:40 PM, Clayton Coleman notifications@github.com
|
| listerWatcher: lw, | ||
| objectType: objType, | ||
| fullResyncPeriod: resyncPeriod, | ||
| cacheMutationDetector: NewCacheMutationDetector(fmt.Sprintf("%v", reflect.TypeOf(objType))), |
There was a problem hiding this comment.
use %t and don't import reflect
|
This is a good goal but can we brainstorm about other designs? @rsc If go had a "const" keyword we wouldn't need to do things like this... |
I'm open to other suggestions, but this is all I came up with. Since everything gets direct access to the struct, there's no spot to intercept a mutation. We caught one from the deployment controller with it. This still doesn't catch deletions, but those are a little harder to do accidentally. |
dc51257 to
dee1f2a
Compare
|
@jayunit100 Is there a way for me to make the CI e2e tests use this flag on their processes? I've added it to |
|
Jenkins verification failed for commit dee1f2a761b906acc7d1da86056bd8aa0630cebc. Full PR test history. The magic incantation to run this job again is |
dee1f2a to
5a70651
Compare
|
@Kargakis ping |
|
|
||
| lock sync.Mutex | ||
| cachedObjs []cacheObj | ||
| failureFunc func(message string) |
There was a problem hiding this comment.
add godoc that if this is not specified, we panic
There was a problem hiding this comment.
add godoc that if this is not specified, we panic
Sure though that's desired behavior. This func only exists for unit testing.
|
|
||
| d.lock.Lock() | ||
| defer d.lock.Unlock() | ||
| d.cachedObjs = append(d.cachedObjs, cacheObj{cached: obj, copied: copiedObj}) |
There was a problem hiding this comment.
Don't we want to dedup?
There was a problem hiding this comment.
Don't we want to dedup?
I think we want a reference to every reference that has existed since we don't know which reference any particular observer (and naughty mutator) may have gotten.
| for _, d := range obj.(Deltas) { | ||
| switch d.Type { | ||
| case Sync, Added, Updated: | ||
| s.cacheMutationDetector.AddObject(d.Object) |
There was a problem hiding this comment.
Don't we also want to remove on Deleted?
There was a problem hiding this comment.
Don't we also want to remove on Deleted?
Deletion doesn't guarantee references are gone. If we end up with memory pressure problems during runs, we could consider that.
|
Aside the comment for failureFunc, this LGTM |
5a70651 to
aee54ae
Compare
comment added. |
|
Jenkins GKE smoke e2e failed for commit aee54ae. Full PR test history. The magic incantation to run this job again is |
|
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge] |
|
Jenkins GCI GKE smoke e2e failed for commit aee54ae. Full PR test history. The magic incantation to run this job again is |
|
Automatic merge from submit-queue |
We need to make sure that no one is mutating caches if they're using a shared informer. It is important that whatever is tracking those changes gets the object before anyone else possibly could.
This adds the ability to track the original objects in the cache and their current values. Go doesn't have an exit hook or a way to say "wait for non-daemon go-funcs to complete before exit", so this runs a gofunc on a loop that can panic the entire process. It's gated behind an env var.
@derekwaynecarr did I get the right spots to make sure that e2e runs with this flag?
@smarterclayton @kubernetes/rh-cluster-infra
This change is