Fix bug: when pod occur later than sidecar connection, the sidecar in…#13228
Fix bug: when pod occur later than sidecar connection, the sidecar in…#13228hzxuzhonghu wants to merge 3 commits intoistio:release-1.1from
Conversation
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hzxuzhonghu The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/cc @StefanCenusa |
|
@hzxuzhonghu: GitHub didn't allow me to request PR reviews from the following users: StefanCenusa. Note that only istio members and repo collaborators can review this PR, and authors cannot review their own PRs. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
pilot/pkg/proxy/envoy/v2/eds.go
Outdated
There was a problem hiding this comment.
do we have to push to Everyone? is there a way to push to just this sidecar? Otherwise, every time a new pod connects, everyone will have a full config push.
There was a problem hiding this comment.
Yes, I was thinking how to push only to this sidecar. Will try.
|
I think this change is too risky for release-1.1 without additional testing |
Honestly i agree, I think only performance test can weigh this. |
Yes.. if we can get it to do a full push just for the sidecar, then great. |
…bound listener will not be pushed
7a7c75b to
6e33475
Compare
83af716 to
12486d0
Compare
…onnected with pilot
12486d0 to
7a5ad67
Compare
|
@hzxuzhonghu: The following tests failed, say
DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
fixes: #9504 |
|
where is the code that pushes to specific proxies? I only see code related to map updates.. |
| if !full { | ||
| s.proxyUpdatesMutex.Lock() | ||
| if _, ok := s.proxyUpdates[client.modelNode.IPAddresses[0]]; ok { | ||
| proxyFull = true |
There was a problem hiding this comment.
@rshriram take a look at here, if the proxyUpdates contains the proxy ip address, then trigger full push no matter the original push type(full/eds only) and remove it from the map.
| edsOnly := edsUpdates | ||
| if full { | ||
| if proxyFull { | ||
| edsOnly = nil |
There was a problem hiding this comment.
make this nil indicates full push
case client.pushChannel <- &XdsEvent{
push: push,
pending: &pendingPush,
version: version,
edsUpdatedServices: edsOnly,
}:
|
@howardjohn further thoughts? |
I am not convinced this will fix the issue in the bug without testing, or at least manual verification. It might, but I think we should be sure before we make a change to 1.1 that could have perf implications. Although I am also not familiar with the readiness stuff so maybe it is clearly the right fix. @nmittler was doing some similar work, do you have any thoughts? |
It is not always possible to test some extreme case. At least for e2e, it is hard to test. I have no idea now. @howardjohn Any idea is welcome.
Then i think it should be safe first let it go into master and watch the performance test from testgrid. After a long time performance test, I think it is safe for 1.1. These kind of explicitly bug should be resolved quickly, to make a good user experience. |
|
We need a unit test for this. |
|
What happens for a pod without service ( batch job, etc ) ? We need at least an integration test for this case, as well for the bug that's supposed to be fixed, and certainly a test in the stability cluster since this is likely to happen in corner cases ( and subject to monitoring ). |
|
I mean for master - in 1.1 branch it seems far too risky given that it doesn't have any tests in master, and has quite a bit of risks. We're in code mauve... |
|
@hzxuzhonghu Could you please elaborate this case? What's the meaning of "Pod occur later than sidecar connection"? |
|
@mailzyok Pod update event may occur later to pilot than Its sidecar connected with pilot. |
|
Endpoint event related to this pod can also occur later, |
|
I am still facing this issue in 1.1.7. Pods can take over 5 minutes to be ready in a very small workload. Is there anything I can do to help test this fix? |
|
Should this be in 1.1 or should i close it? |
Same issue observed in 1.1.7 as well. |
|
We're experiencing this issue on GKE Interesting is that restarting pilot seems to fix this issue for a while. Any idea why? And is there any known workaround how to avoid this? |
|
@mailzyok you can test this patch only updating the pilot image. |
Maybe the cause is what the pr fixing, but without more details I can hardly say. |
|
@hzxuzhonghu I think this PR is important enough that we could merge it into 1.1 if we were sure it doesn't introduce other issues. So at this point it's only a testing issue. Since it's a testing issue, I think it's actually a perfect candidate for 1.2. That will be released in about 3 weeks and will receive much more testing than any 1.1.x patch release. If we later discover that it's a very safe and much needed change, we can always reopen this/backport |
|
🤔 🐛 You appear to be fixing a bug in Go code, yet your PR doesn't include updates to any test files. Did you forget to add a test? Courtesy of your friendly test nag. |
|
Hi, are you sure that this fix the issue?
|
…bound listener will not be pushed
fixes : #13106
@waynz0r Would you apply this patch, and have a test.