
Fix bug: when pod occur later than sidecar connection, the sidecar in… #13228

Closed
hzxuzhonghu wants to merge 3 commits into istio:release-1.1 from hzxuzhonghu:fix-long-initial-ready

Conversation

@hzxuzhonghu
Member

…bound listener will not be pushed

fixes : #13106

@waynz0r Would you apply this patch and test it?

@istio-testing
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hzxuzhonghu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hzxuzhonghu
Member Author

/cc @StefanCenusa

@istio-testing
Collaborator

@hzxuzhonghu: GitHub didn't allow me to request PR reviews from the following users: StefanCenusa.

Note that only istio members and repo collaborators can review this PR, and authors cannot review their own PRs.


In response to this:

/cc @StefanCenusa

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Member


Do we have to push to everyone? Is there a way to push to just this sidecar? Otherwise, every time a new pod connects, everyone will get a full config push.

Member Author


Yes, I was thinking about how to push only to this sidecar. I will try.

@howardjohn
Member

I think this change is too risky for release-1.1 without additional testing

@hzxuzhonghu
Member Author

I think this change is too risky for release-1.1 without additional testing

Honestly, I agree; I think only a performance test can weigh this.
But this is definitely a potential bug. What do you think if we do a full push only to the associated sidecar? That way, I think the risk is small.

@rshriram
Member

But this is definitely a potential bug. What do you think if we do a full push only to the associated sidecar? That way, I think the risk is small.

Yes.. if we can get it to do a full push just for the sidecar, then great.

@hzxuzhonghu hzxuzhonghu force-pushed the fix-long-initial-ready branch from 7a7c75b to 6e33475 Compare April 12, 2019 04:08
@hzxuzhonghu hzxuzhonghu force-pushed the fix-long-initial-ready branch from 83af716 to 12486d0 Compare April 12, 2019 06:38
@hzxuzhonghu hzxuzhonghu force-pushed the fix-long-initial-ready branch from 12486d0 to 7a5ad67 Compare April 12, 2019 07:08
@istio-testing
Collaborator

istio-testing commented Apr 12, 2019

@hzxuzhonghu: The following tests failed, say /retest to rerun them all:

Test name | Commit | Details | Rerun command
prow/istio-integ-k8s-tests.sh | 7a5ad67 | link | /test istio-integ-k8s-tests
prow/istio-pilot-multicluster-e2e.sh | 7a5ad67 | link | /test istio-pilot-multicluster-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@hzxuzhonghu
Member Author

fixes: #9504

@rshriram
Member

Where is the code that pushes to specific proxies? I only see code related to map updates.

if !full {
	s.proxyUpdatesMutex.Lock()
	if _, ok := s.proxyUpdates[client.modelNode.IPAddresses[0]]; ok {
		proxyFull = true
Member Author


@rshriram Take a look here: if proxyUpdates contains the proxy's IP address, a full push is triggered regardless of the original push type (full or EDS-only), and the entry is removed from the map.
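The check-and-consume pattern described above can be sketched as follows. This is a simplified illustration, not Istio's actual code: the type name pushTracker and method needsFullPush are hypothetical stand-ins for the proxyUpdates map guarded by proxyUpdatesMutex.

```go
package main

import (
	"fmt"
	"sync"
)

// pushTracker is a hypothetical stand-in for pilot's pending-update state:
// a set of proxy IPs that still need a full (re)push, guarded by a mutex.
type pushTracker struct {
	mu           sync.Mutex
	proxyUpdates map[string]struct{}
}

// needsFullPush reports whether the proxy at ip was flagged for a full push,
// consuming the flag so the upgrade to a full push happens only once.
func (t *pushTracker) needsFullPush(ip string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, ok := t.proxyUpdates[ip]; ok {
		delete(t.proxyUpdates, ip)
		return true
	}
	return false
}

func main() {
	t := &pushTracker{proxyUpdates: map[string]struct{}{"10.0.0.5": {}}}
	fmt.Println(t.needsFullPush("10.0.0.5")) // flagged: upgrade this push to full
	fmt.Println(t.needsFullPush("10.0.0.5")) // flag consumed: normal push
	fmt.Println(t.needsFullPush("10.0.0.9")) // never flagged
}
```

The point of consuming the flag inside the lock is that only the one affected sidecar gets the extra full push, rather than every connected proxy.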

edsOnly := edsUpdates
if full {
	if proxyFull {
		edsOnly = nil
Member Author


Setting this to nil indicates a full push.

case client.pushChannel <- &XdsEvent{
	push:               push,
	pending:            &pendingPush,
	version:            version,
	edsUpdatedServices: edsOnly,
}:

@rshriram
Member

@howardjohn further thoughts?

@howardjohn
Member

@howardjohn further thoughts?

I am not convinced this will fix the issue in the bug without testing, or at least manual verification. It might, but I think we should be sure before we make a change to 1.1 that could have perf implications. Although I am also not familiar with the readiness stuff so maybe it is clearly the right fix. @nmittler was doing some similar work, do you have any thoughts?

@hzxuzhonghu
Member Author

I am not convinced this will fix the issue in the bug without testing, or at least manual verification.

It is not always possible to test such extreme cases; at least for e2e, it is hard to test. I have no idea right now. @howardjohn Any ideas are welcome.

before we make a change to 1.1 that could have perf implications.

Then I think the safe path is to first let it go into master and watch the performance tests on testgrid. After running in performance tests for a long time, I think it will be safe for 1.1. This kind of explicit bug should be resolved quickly for a good user experience.

@nmittler
Contributor

We need a unit test for this.

@costinm
Contributor

costinm commented Apr 29, 2019

What happens for a pod without a service (batch job, etc.)? We need at least an integration test for this case, as well as for the bug that's supposed to be fixed, and certainly a test in the stability cluster, since this is likely to happen in corner cases (and should be subject to monitoring).
@mandarjog

@costinm
Contributor

costinm commented Apr 29, 2019

I mean for master; for the 1.1 branch it seems far too risky, given that it doesn't have any tests in master and carries quite a bit of risk.

We're in code mauve...

@mailzyok

@hzxuzhonghu Could you please elaborate on this case? What does "pod occurs later than sidecar connection" mean?
Thanks in advance.

@hzxuzhonghu
Member Author

@mailzyok The Pod update event may reach pilot later than its sidecar connects to pilot.
In that case, the proxy will have no serviceInstances, so no inbound listener configuration is pushed to the sidecar.

@hzxuzhonghu
Member Author

The endpoint event related to this pod can also arrive later.

@hzxuzhonghu
Member Author

@duderino

@joberdick

I am still facing this issue in 1.1.7. Pods can take over 5 minutes to be ready in a very small workload.

Is there anything I can do to help test this fix?

@hzxuzhonghu
Member Author

Should this be in 1.1, or should I close it?

@mailzyok

I am still facing this issue in 1.1.7. Pods can take over 5 minutes to be ready in a very small workload.

Is there anything I can do to help test this fix?

Same issue observed in 1.1.7 as well.

@stepanstipl

We're experiencing this issue on GKE 1.12.7-gke.10 with Istio 1.0.6. Envoy will not get its inbound listener until the 5m timeout happens ([warning][upstream] external/envoy/source/common/config/grpc_mux_impl.cc:240] gRPC config stream closed: 13), and only then does the additional inbound listener get added.

Interestingly, restarting pilot seems to fix this issue for a while. Any idea why?

And is there any known workaround to avoid this?

@hzxuzhonghu
Member Author

@mailzyok You can test this patch by updating only the pilot image.

@hzxuzhonghu
Member Author

@joberdick

@hzxuzhonghu
Member Author

And is there any known workaround how to avoid this?

Maybe the cause is what this PR fixes, but without more details I can hardly say.

@duderino

duderino commented Jun 2, 2019

@hzxuzhonghu I think this PR is important enough that we could merge it into 1.1 if we were sure it doesn't introduce other issues. So at this point it's only a testing issue.

Since it's a testing issue, I think it's actually a perfect candidate for 1.2. That will be released in about 3 weeks and will receive much more testing than any 1.1.x patch release.

If we later discover that it's a very safe and much-needed change, we can always reopen this and backport it.

@duderino duderino closed this Jun 2, 2019
@istio-testing
Collaborator

🤔 🐛 You appear to be fixing a bug in Go code, yet your PR doesn't include updates to any test files. Did you forget to add a test?

Courtesy of your friendly test nag.

@hanymahajna

Hi, are you sure this fixes the issue?
I still see the following in the log, and the sidecar does not start.
This only happens when we have a readinessProbe:

  • failed checking application ports. listeners="0.0.0.0:15090","172.20.97.172:9090","172.20.60.163:15000","172.20.22.223:42422","172.20.209.251:15029","172.20.249.134:443","172.20.114.25:9092","172.20.142.227:15001","172.20.142.227:3306","172.20.195.138:2379","172.20.128.52:443","172.20.128.52:15030","172.20.235.245:16686","172.20.106.130:8080","172.20.192.225:44134","172.20.128.52:15443","172.20.67.75:2181","0.0.0.0:2379","172.20.44.203:443","0.0.0.0:5432","172.20.209.251:15443","172.20.81.45:3306","0.0.0.0:15002","172.20.209.251:15031","172.20.209.251:15032","172.20.11.228:2020","172.20.180.91:80","172.20.38.235:8080","172.20.128.52:15020","172.20.77.190:14267","0.0.0.0:9090","172.20.136.99:8082","172.20.128.52:15031","172.20.20.76:44134","0.0.0.0:2380","172.20.128.52:15029","172.20.146.108:443","0.0.0.0:10250","172.20.128.52:31400","172.20.218.101:443","172.20.209.251:15030","0.0.0.0:443","172.20.0.10:53","172.20.0.1:443","172.20.150.19:5432","172.20.209.251:443","172.20.74.231:44134","172.20.209.251:15020","172.20.77.190:14268","172.20.163.47:8081","172.20.56.220:2379","172.20.209.251:31400","172.20.128.52:15032","172.20.232.5:15011","0.0.0.0:8060","0.0.0.0:15991","0.0.0.0:15014","0.0.0.0:15010","0.0.0.0:3000","0.0.0.0:8080","0.0.0.0:8088","0.0.0.0:9901","0.0.0.0:9091","0.0.0.0:80","0.0.0.0:16002","0.0.0.0:9411","0.0.0.0:15999","0.0.0.0:8000","0.0.0.0:15004","10.25.1.64:15020","0.0.0.0:15001"
    * envoy missing listener for inbound application port: 3000
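The failing readiness check in the log above is, roughly, comparing the application's inbound ports against the listener addresses reported by Envoy and failing while any port is uncovered. A minimal sketch of that idea follows; the function name missingInboundPorts and the parsing are illustrative assumptions, not pilot-agent's actual implementation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// missingInboundPorts returns the application ports that have no matching
// Envoy listener. Listener addresses look like "0.0.0.0:15090" or
// "172.20.97.172:9090"; for this sketch, any listener on the right port
// counts as covering it.
func missingInboundPorts(appPorts []int, listeners []string) []int {
	have := map[int]bool{}
	for _, l := range listeners {
		idx := strings.LastIndex(l, ":")
		if idx < 0 {
			continue
		}
		if p, err := strconv.Atoi(l[idx+1:]); err == nil {
			have[p] = true
		}
	}
	var missing []int
	for _, p := range appPorts {
		if !have[p] {
			missing = append(missing, p)
		}
	}
	return missing
}

func main() {
	listeners := []string{"0.0.0.0:15090", "10.25.1.64:15020", "0.0.0.0:15001"}
	// The app listens on 3000, but Envoy has no listener for it, so the
	// readiness check keeps failing, exactly as in the log above.
	fmt.Println(missingInboundPorts([]int{3000}, listeners)) // [3000]
}
```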
