Skip to content

ads: ensure ads refuses connections when caches are not synced#20296

Merged
istio-testing merged 3 commits intoistio:masterfrom
ramaraochavali:fix/readiness
Jan 19, 2020
Merged

ads: ensure ads refuses connections when caches are not synced#20296
istio-testing merged 3 commits intoistio:masterfrom
ramaraochavali:fix/readiness

Conversation

@ramaraochavali
Copy link
Copy Markdown
Contributor

Fixes #20295
[ ] Configuration Infrastructure
[ ] Docs
[ ] Installation
[X ] Networking
[ ] Performance and Scalability
[ ] Policies and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
@ramaraochavali ramaraochavali requested review from a team as code owners January 18, 2020 12:45
@googlebot googlebot added the cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. label Jan 18, 2020
@istio-policy-bot
Copy link
Copy Markdown

🤔 🐛 You appear to be fixing a bug in Go code, yet your PR doesn't include updates to any test files. Did you forget to add a test?

Courtesy of your friendly test nag.

@istio-testing istio-testing added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 18, 2020
Copy link
Copy Markdown
Member

@howardjohn howardjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we mark ourselves as not ready rather than refuse connections?

@ramaraochavali
Copy link
Copy Markdown
Contributor Author

shouldn't we mark ourselves as not ready rather than refuse connections?

This is just to reject local Envoy connection. We are not marked ready till cache sync is done so external Envoys can not connect. But by allowing local pilot to connect we are initializing push context with empty config.

@ramaraochavali
Copy link
Copy Markdown
Contributor Author

And Pilot's envoy does not care about readiness because it connects via localhost.

@howardjohn
Copy link
Copy Markdown
Member

But no one connects to the local proxy until it's ready? As far as I know the local proxy is 100% static so it's not connecting to pilot for xds, is that not true?

@howardjohn
Copy link
Copy Markdown
Member

I'm fine with having a second mechanism to catch weird cases but I think we need to enhance the readiness probe for the normal case, unless I misunderstood

@ramaraochavali
Copy link
Copy Markdown
Contributor Author

I am phone to verify what Istio’s master template does, but in our case local proxy definitely connects to pilot and as I mentioned in the issue it definitely connected to Pilot before caches synced up.
We already have readiness probe that takes care of normal cases - we open the http listener that serves readiness probe only after cache is synced- so normal flow is taken care.
This is issue is caused now because the istiod PR changed order and opened gRPC listener before caches are synced.

@howardjohn
Copy link
Copy Markdown
Member

I see, I didn't realize you had a custom setup.

When another sidecar connects after this, since it is already intialized, it uses the same push context.

to me it seems like this is the part we should be fixing. shouldn't it be creating a new context? I'm on a phone so this is from memory though, maybe that doesn't make sense

@hzxuzhonghu
Copy link
Copy Markdown
Member

/cc i will have a look

Copy link
Copy Markdown
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

}

// grpcServer is shared by Galley, CA, XDS - must Serve at the end, but before 'wait'
log.Infof("starting gRPC discovery service at %s", s.GRPCListener.Addr())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move into the below go func?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. makes sense. Moved this and the below http log line in to go func

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
@istio-testing istio-testing merged commit 09441dd into istio:master Jan 19, 2020
@howardjohn
Copy link
Copy Markdown
Member

I don't think this was a good idea to merge. Should we fix the real problem instead of hacking around it?

@ramaraochavali ramaraochavali deleted the fix/readiness branch January 20, 2020 03:04
@ramaraochavali
Copy link
Copy Markdown
Contributor Author

@howardjohn

Should we fix the real problem instead of hacking around it?

I could not think of better solution immediately than what is here in this PR and regarding your question about

to me it seems like this is the part we should be fixing. shouldn't it be creating a new context?

This is a global context so I guess it is assumed to be created once. I do not know of the implications of creating a new context/updating it every time apart from obvious perf reason (which in it self may be good enough reason to not create a new context).

But if you have any better ideas, please let me know - I am happy to implement that and revert this change.

If you do not have any better ideas immediately, I am OK open a new issue that we can track to improve this if you think that is better to do. Please let me know.

// OnServerReady is called when caches have been synced so that it can process requests.
func (s *DiscoveryServer) OnServerReady() {
s.updateMutex.Lock()
s.serverReady = true
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just recreate the context here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that seems to be an option - We can add RefreshContext method in PushContext that just reloads every thing without checking if init is done or not, and we can call it once caches have been synced and before we declare our self ready.

Only downside of it is pilot's local proxy would get incomplete config in the first call but would immediately get refreshed because of config change and since it is any way starting afresh, it should be OK.

If this is an option, I can make that change.

@hzxuzhonghu What do you think about this?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, i could not catch up, what do you mean by context here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PushContext - recreate the PushContext once caches have been synced

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me think about

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#20323 - This is what he is suggesting. PTAL if it is OK or it causes any issues

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Then we can do this: trigger full push to all connected clients here.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But there is one concern: there may be race condition between the first xds request/push and the new push here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There wont be any new clients here except for local pilot proxy because Pilot is not ready it self here, it will only be ready after this line. Regarding pilot's proxy, I guess it would get updates because of regular configupdate push and since it is any way starting now, It should not be an issue.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/networking cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Proxies loose clusters some times when Pilot is restarted

6 participants