
Fix bug: when pod occur later than sidecar connection, the sidecar in… #13228

Closed
hzxuzhonghu wants to merge 3 commits into istio:release-1.1 from hzxuzhonghu:fix-long-initial-ready

Conversation

@hzxuzhonghu
Member

…bound listener will not be pushed

fixes : #13106

@waynz0r Would you apply this patch and test it?

@istio-testing
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hzxuzhonghu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hzxuzhonghu
Member Author

/cc @StefanCenusa

@istio-testing
Collaborator

@hzxuzhonghu: GitHub didn't allow me to request PR reviews from the following users: StefanCenusa.

Note that only istio members and repo collaborators can review this PR, and authors cannot review their own PRs.


In response to this:

/cc @StefanCenusa

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Member


Do we have to push to everyone? Is there a way to push to just this sidecar? Otherwise, every time a new pod connects, everyone will get a full config push.

Member Author


Yes, I was thinking about how to push only to this sidecar. I will try.

@howardjohn
Member

I think this change is too risky for release-1.1 without additional testing

@hzxuzhonghu
Member Author

I think this change is too risky for release-1.1 without additional testing

Honestly, I agree; I think only a performance test can weigh this.
But this is definitely a potential bug. What do you think if we do a full push only to the associated sidecar? That way, I think the risk is small.

@rshriram
Member

But this is definitely a potential bug. What do you think if we do a full push only to the associated sidecar? That way, I think the risk is small.

Yes.. if we can get it to do a full push just for the sidecar, then great.

@hzxuzhonghu hzxuzhonghu force-pushed the fix-long-initial-ready branch from 7a7c75b to 6e33475 Compare April 12, 2019 04:08
@hzxuzhonghu hzxuzhonghu force-pushed the fix-long-initial-ready branch from 83af716 to 12486d0 Compare April 12, 2019 06:38
@hzxuzhonghu hzxuzhonghu force-pushed the fix-long-initial-ready branch from 12486d0 to 7a5ad67 Compare April 12, 2019 07:08
@istio-testing
Collaborator

istio-testing commented Apr 12, 2019

@hzxuzhonghu: The following tests failed, say /retest to rerun them all:

Test name | Commit | Details | Rerun command
prow/istio-integ-k8s-tests.sh | 7a5ad67 | link | /test istio-integ-k8s-tests
prow/istio-pilot-multicluster-e2e.sh | 7a5ad67 | link | /test istio-pilot-multicluster-e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@hzxuzhonghu
Member Author

fixes: #9504

@rshriram
Member

Where is the code that pushes to specific proxies? I only see code related to map updates.

if !full {
	s.proxyUpdatesMutex.Lock()
	if _, ok := s.proxyUpdates[client.modelNode.IPAddresses[0]]; ok {
		proxyFull = true
Member Author


@rshriram Take a look here: if proxyUpdates contains the proxy's IP address, a full push is triggered regardless of the original push type (full or EDS-only), and the entry is removed from the map.
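The check-and-consume pattern described above can be sketched as follows. This is a simplified illustration, not Istio's actual code: the type name pushTracker and method needsFullPush are hypothetical stand-ins for the proxyUpdates map guarded by proxyUpdatesMutex.

```go
package main

import (
	"fmt"
	"sync"
)

// pushTracker is a hypothetical stand-in for pilot's pending-update state:
// a set of proxy IPs that still need a full (re)push, guarded by a mutex.
type pushTracker struct {
	mu           sync.Mutex
	proxyUpdates map[string]struct{}
}

// needsFullPush reports whether the proxy at ip was flagged for a full push,
// consuming the flag so the upgrade to a full push happens only once.
func (t *pushTracker) needsFullPush(ip string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, ok := t.proxyUpdates[ip]; ok {
		delete(t.proxyUpdates, ip)
		return true
	}
	return false
}

func main() {
	t := &pushTracker{proxyUpdates: map[string]struct{}{"10.0.0.5": {}}}
	fmt.Println(t.needsFullPush("10.0.0.5")) // flagged: upgrade this push to full
	fmt.Println(t.needsFullPush("10.0.0.5")) // flag consumed: normal push
	fmt.Println(t.needsFullPush("10.0.0.9")) // never flagged
}
```

The point of consuming the flag inside the lock is that only the one affected sidecar gets the extra full push, rather than every connected proxy.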

edsOnly := edsUpdates
if full {
	if proxyFull {
		edsOnly = nil
Member Author


Setting this to nil indicates a full push.

case client.pushChannel <- &XdsEvent{
	push:               push,
	pending:            &pendingPush,
	version:            version,
	edsUpdatedServices: edsOnly,
}:

@rshriram
Member

@howardjohn further thoughts?

@howardjohn
Member

@howardjohn further thoughts?

I am not convinced this will fix the issue in the bug without testing, or at least manual verification. It might, but I think we should be sure before we make a change to 1.1 that could have perf implications. Although I am also not familiar with the readiness stuff so maybe it is clearly the right fix. @nmittler was doing some similar work, do you have any thoughts?

@hzxuzhonghu
Member Author

I am not convinced this will fix the issue in the bug without testing, or at least manual verification.

It is not always possible to test such extreme cases; at least for e2e, it is hard to test. I have no idea right now. @howardjohn Any ideas are welcome.

before we make a change to 1.1 that could have perf implications.

Then I think the safe path is to first let it go into master and watch the performance tests on testgrid. After running in performance tests for a long time, I think it will be safe for 1.1. This kind of explicit bug should be resolved quickly for a good user experience.

@nmittler
Contributor

We need a unit test for this.

@costinm
Contributor

costinm commented Apr 29, 2019

What happens for a pod without a service (batch job, etc.)? We need at least an integration test for this case, as well as for the bug that's supposed to be fixed, and certainly a test in the stability cluster, since this is likely to happen in corner cases (and should be subject to monitoring).
@mandarjog

@costinm
Contributor

costinm commented Apr 29, 2019

I mean for master; for the 1.1 branch it seems far too risky, given that it doesn't have any tests in master and carries quite a bit of risk.

We're in code mauve...

@mailzyok

@hzxuzhonghu Could you please elaborate on this case? What does "pod occurs later than sidecar connection" mean?
Thanks in advance.

@hzxuzhonghu
Member Author

@mailzyok The Pod update event may reach pilot later than its sidecar connects to pilot.
In that case, the proxy will have no serviceInstances, so no inbound listener configuration is pushed to the sidecar.

@hzxuzhonghu
Member Author

The endpoint event related to this pod can also arrive later.

@hzxuzhonghu
Member Author

@duderino

@joberdick

I am still facing this issue in 1.1.7. Pods can take over 5 minutes to be ready in a very small workload.

Is there anything I can do to help test this fix?

@hzxuzhonghu
Member Author

Should this be in 1.1, or should I close it?

@mailzyok

I am still facing this issue in 1.1.7. Pods can take over 5 minutes to be ready in a very small workload.

Is there anything I can do to help test this fix?

Same issue observed in 1.1.7 as well.

@stepanstipl

We're experiencing this issue on GKE 1.12.7-gke.10 with Istio 1.0.6. Envoy will not get its inbound listener until the 5m timeout happens ([warning][upstream] external/envoy/source/common/config/grpc_mux_impl.cc:240] gRPC config stream closed: 13), and only then does the additional inbound listener get added.

Interestingly, restarting pilot seems to fix this issue for a while. Any idea why?

And is there any known workaround to avoid this?

@hzxuzhonghu
Member Author

@mailzyok You can test this patch by updating only the pilot image.

@hzxuzhonghu
Member Author

@joberdick

@hzxuzhonghu
Member Author

And is there any known workaround how to avoid this?

Maybe the cause is what this PR fixes, but without more details I can hardly say.

@duderino

duderino commented Jun 2, 2019

@hzxuzhonghu I think this PR is important enough that we could merge it into 1.1 if we were sure it doesn't introduce other issues. So at this point it's only a testing issue.

Since it's a testing issue, I think it's actually a perfect candidate for 1.2. That will be released in about 3 weeks and will receive much more testing than any 1.1.x patch release.

If we later discover that it's a very safe and much-needed change, we can always reopen this and backport it.

@duderino duderino closed this Jun 2, 2019
@istio-testing
Collaborator

🤔 🐛 You appear to be fixing a bug in Go code, yet your PR doesn't include updates to any test files. Did you forget to add a test?

Courtesy of your friendly test nag.

@hanymahajna

Hi, are you sure this fixes the issue?
I still see the following in the log, and the sidecar does not start.
This only happens when we have a readinessProbe:

  • failed checking application ports. listeners="0.0.0.0:15090","172.20.97.172:9090","172.20.60.163:15000","172.20.22.223:42422","172.20.209.251:15029","172.20.249.134:443","172.20.114.25:9092","172.20.142.227:15001","172.20.142.227:3306","172.20.195.138:2379","172.20.128.52:443","172.20.128.52:15030","172.20.235.245:16686","172.20.106.130:8080","172.20.192.225:44134","172.20.128.52:15443","172.20.67.75:2181","0.0.0.0:2379","172.20.44.203:443","0.0.0.0:5432","172.20.209.251:15443","172.20.81.45:3306","0.0.0.0:15002","172.20.209.251:15031","172.20.209.251:15032","172.20.11.228:2020","172.20.180.91:80","172.20.38.235:8080","172.20.128.52:15020","172.20.77.190:14267","0.0.0.0:9090","172.20.136.99:8082","172.20.128.52:15031","172.20.20.76:44134","0.0.0.0:2380","172.20.128.52:15029","172.20.146.108:443","0.0.0.0:10250","172.20.128.52:31400","172.20.218.101:443","172.20.209.251:15030","0.0.0.0:443","172.20.0.10:53","172.20.0.1:443","172.20.150.19:5432","172.20.209.251:443","172.20.74.231:44134","172.20.209.251:15020","172.20.77.190:14268","172.20.163.47:8081","172.20.56.220:2379","172.20.209.251:31400","172.20.128.52:15032","172.20.232.5:15011","0.0.0.0:8060","0.0.0.0:15991","0.0.0.0:15014","0.0.0.0:15010","0.0.0.0:3000","0.0.0.0:8080","0.0.0.0:8088","0.0.0.0:9901","0.0.0.0:9091","0.0.0.0:80","0.0.0.0:16002","0.0.0.0:9411","0.0.0.0:15999","0.0.0.0:8000","0.0.0.0:15004","10.25.1.64:15020","0.0.0.0:15001"
    * envoy missing listener for inbound application port: 3000
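The failing readiness check in the log above is, roughly, comparing the application's inbound ports against the listener addresses reported by Envoy and failing while any port is uncovered. A minimal sketch of that idea follows; the function name missingInboundPorts and the parsing are illustrative assumptions, not pilot-agent's actual implementation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// missingInboundPorts returns the application ports that have no matching
// Envoy listener. Listener addresses look like "0.0.0.0:15090" or
// "172.20.97.172:9090"; for this sketch, any listener on the right port
// counts as covering it.
func missingInboundPorts(appPorts []int, listeners []string) []int {
	have := map[int]bool{}
	for _, l := range listeners {
		idx := strings.LastIndex(l, ":")
		if idx < 0 {
			continue
		}
		if p, err := strconv.Atoi(l[idx+1:]); err == nil {
			have[p] = true
		}
	}
	var missing []int
	for _, p := range appPorts {
		if !have[p] {
			missing = append(missing, p)
		}
	}
	return missing
}

func main() {
	listeners := []string{"0.0.0.0:15090", "10.25.1.64:15020", "0.0.0.0:15001"}
	// The app listens on 3000, but Envoy has no listener for it, so the
	// readiness check keeps failing, exactly as in the log above.
	fmt.Println(missingInboundPorts([]int{3000}, listeners)) // [3000]
}
```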
