Skip to content

eds: do not trigger continuous full pushes when a pod is in crash loop#18574

Merged
istio-testing merged 13 commits intoistio:masterfrom
ramaraochavali:fix/eds_full_push
Nov 6, 2019
Merged

eds: do not trigger continuous full pushes when a pod is in crash loop#18574
istio-testing merged 13 commits intoistio:masterfrom
ramaraochavali:fix/eds_full_push

Conversation

@ramaraochavali
Copy link
Copy Markdown
Contributor

@ramaraochavali ramaraochavali commented Nov 2, 2019

Prevents full push being triggered when a pod is in crash loop and thus its endpoints flipflop between 0 and 1. Fixes #13822

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
@ramaraochavali ramaraochavali requested a review from a team as a code owner November 2, 2019 03:37
@istio-testing istio-testing added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 2, 2019
@googlebot googlebot added the cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. label Nov 2, 2019
@ramaraochavali
Copy link
Copy Markdown
Contributor Author

@hzxuzhonghu @rshriram @howardjohn PTAL

@ramaraochavali
Copy link
Copy Markdown
Contributor Author

/test integ-pilot-k8s-tests_istio

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
@ramaraochavali ramaraochavali requested a review from a team as a code owner November 2, 2019 06:37
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
@ramaraochavali
Copy link
Copy Markdown
Contributor Author

SyntheticServiceEntryController seems to be incorrectly dependent on removing the service key from EndpointShardsByService when endpoints are zero (and then subsequent updates triggering full push) - It also not triggering full push when service changes in end-end-pilot integ test. So I changed it to trigger full push on every configUpdate (Similar to how service entries work) with a TODO to fix later. If there are better ideas/simple fix - I can try but it seems to be problem with SyntheticServiceEntryController which should be handled separately. @hzxuzhonghu any idea?

Copy link
Copy Markdown
Member

@howardjohn howardjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm, thanks for working on this and adding good tests.

I only skimmed it right now so I'll give a chance for Nino to look at the SSE code or others then give a proper review during the week

t.Fatal("Expecting only EDS update as part of a partial push. But received CDS also +v", upd)
}

if len(upd) > 0 && !contains(upd, "eds") {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing % on the format

@howardjohn
Copy link
Copy Markdown
Member

@Nino-K

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
// deleteEndpointShards deletes matching endpoints from EndpointShardsByService map. This is called when
// endpoints are deleted or the service is deleted. If deleteKeys is true, this method will also delete the
// associated entries from map.
func (s *DiscoveryServer) deleteEndpointShards(cluster, serviceName, namespace string, deleteKeys bool) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Donot like deleteKeys param, this kind of param is confusing.

Can you only do delete in SvcUpdate? Thus we don't need to care about endpoints=0 or not?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I changed it to two functions and called delete only from SvcUpdate and deleting shards from edsupdate.

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
// functioning correctly. Currently it is working because on edsUpdate if we set endpoints to 0, we remove
// the service from EndpointShardsByService and subsequent eds updates trigger a full push. That is being
// fixed in https://github.com/istio/istio/pull/18574. Need to fix this issue and re-enable conditional
// full push. For now, any configupdate triggers a full push much like service entries.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ramaraochavali can you please put this in an issue, being able to do a EDSUpdate vs ConfigUpdate was one of the main goals of this controller.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Created this issue #18625

oldEpVersion := c.endpointVersion(conf.Namespace, conf.Name)
newEpVersion := version(conf.Annotations, endpointKey)
if oldEpVersion != newEpVersion {
if err := c.edsUpdate(conf); err != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remove this?

Copy link
Copy Markdown
Contributor Author

@ramaraochavali ramaraochavali Nov 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because it is ineffective now - any update needs a full push

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't we only do eds push when svc not changed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. That is what we have to do. But it is not working for SSE. See the TODO below and #18625

delete(s.EndpointShardsByService[serviceName][namespace].Shards, cluster)
s.EndpointShardsByService[serviceName][namespace].mutex.Unlock()
delete(s.EndpointShardsByService[serviceName], namespace)
delete(s.EndpointShardsByService, serviceName)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not right, we can delete EndpointShardsByService[hostname] only when len(EndpointShardsByService[hostname]) = 0

s.EndpointShardsByService[serviceName][namespace].mutex.Lock()
delete(s.EndpointShardsByService[serviceName][namespace].Shards, cluster)
s.EndpointShardsByService[serviceName][namespace].mutex.Unlock()
delete(s.EndpointShardsByService[serviceName], namespace)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same only can delete when s.EndpointShardsByService[serviceName][namesapce].Shards = 0

// updated to force a EDS and CDS recomputation and incremental push, as it doesn't affect
// LDS/RDS.
SvcUpdate(shard, hostname string, ports map[string]uint32, rports map[uint32]string)
SvcUpdate(shard, hostname string, namespace string, event Event, ports map[string]uint32, rports map[uint32]string)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We donot need the ports and rports param

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think may be better to leave them for future? Because they were there from the beginning.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or if you are ok to clean it up - I can do that. I do not have strong opinion on that.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's clean up it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Cleaned it up. PTAL

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
@ramaraochavali
Copy link
Copy Markdown
Contributor Author

/test unit-tests_istio

@ramaraochavali
Copy link
Copy Markdown
Contributor Author

@rshriram @howardjohn @hzxuzhonghu - Can you PTAL? Addressed all the comments

@istio-testing istio-testing merged commit c5d10e6 into istio:master Nov 6, 2019
@ramaraochavali ramaraochavali deleted the fix/eds_full_push branch November 6, 2019 04:56
sdake pushed a commit to sdake/istio that referenced this pull request Dec 1, 2019
istio#18574)

* wip

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* add test cases

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* revert controller change

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* add debug error

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* try deleting completely

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* trigger full push for ss3 config update

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* change sse to trigger full push

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* lint

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* review comments

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* split delete in to two functions

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* address comments

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>

* clean up ports

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/networking cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

STALE (Never Acknowledged) comes & ago every few seconds on all parameters on random portions of all pods

7 participants