Experiment for improving pilot handling of large services#3477

Merged
istio-merge-robot merged 5 commits into master from costin-pilot on Feb 14, 2018
Conversation

@costinm
Contributor

@costinm costinm commented Feb 14, 2018

The throttle on event processing is removed by default; it can be added back if needed, but so far my tests show it's better without. By processing events faster there is less cache recomputation for a given sidecar QPS. When the update was prolonged, more cache clears were happening, and more requests from Envoy hit an empty cache.

Also added logs to evaluate how many requests come from Envoy and what the latency is. The sidecar for Pilot doesn't report Mixer metrics, and Pilot doesn't log response times.
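To illustrate the trade-off the description makes, here is a minimal, hypothetical Go sketch of a throttled versus an unthrottled event loop. All names (`Event`, `processThrottled`, `clearCache`) are illustrative, not Pilot's actual API; the point is only that a throttle stretches a burst of N events over N*interval, leaving the cache invalid for longer.

```go
package main

import (
	"fmt"
	"time"
)

// Event is a stand-in for a platform (e.g. k8s) change notification.
type Event struct{ Name string }

// processThrottled drains the channel at most once per interval,
// so a burst of N queued events takes roughly N*interval to apply.
func processThrottled(events <-chan Event, interval time.Duration, clearCache func()) {
	tick := time.NewTicker(interval)
	defer tick.Stop()
	for range tick.C {
		ev, ok := <-events
		if !ok {
			return
		}
		_ = ev // content is not inspected; each event just invalidates the cache
		clearCache()
	}
}

// processUnthrottled applies events as fast as they arrive, so the
// cache spends far less total time invalid for the same burst.
func processUnthrottled(events <-chan Event, clearCache func()) {
	for ev := range events {
		_ = ev
		clearCache()
	}
}

func main() {
	events := make(chan Event, 100)
	for i := 0; i < 25; i++ {
		events <- Event{Name: fmt.Sprintf("endpoint-update-%d", i)}
	}
	close(events)

	cleared := 0
	processUnthrottled(events, func() { cleared++ })
	fmt.Printf("processed burst of %d events immediately\n", cleared)
}
```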

@costinm costinm requested a review from a team February 14, 2018 01:53
@ldemailly
Member

why u no lint ;-)

@mandarjog
Contributor

Costin, if Pilot is an upstream cluster in Envoy, the stats should be collected in Prometheus as long as you deploy the Mixer and Prometheus add-ons.

@costinm
Contributor Author

costinm commented Feb 14, 2018 via email

Member

@rshriram rshriram left a comment


The change looks okay. But what do you mean by "By processing events faster there is less cache recomputation"? Every event blows up the cache.

@costinm
Contributor Author

costinm commented Feb 14, 2018

Each event blows up the cache, but the re-computation happens when Envoy requests something. By processing the events faster (4 seconds instead of 100) we get fewer Envoy requests and re-computations.

Part 2 of the fix, in a separate PR (maybe Martin or someone else can help :-) ), is to not blow the cache immediately after an event if it was recently cleared, but instead set a timer so the cache clearing is delayed to a reasonable rate.

And Part 3, obviously (as discussed), is to have the most common events (workload/endpoint changes) use a more direct path and not go through full cache blowing or event-by-event loading.

@mandarjog
Contributor

@costinm
envoy_cluster_out_istio_pilot_istio_system_svc_cluster_local_http_discovery_upstream_rq_time

This is a metric that we already collect and you can look at it on prometheus.

@rshriram
Member

Let me clarify. Each event from AppendServiceHandler and AppendInstanceHandler in Pilot (not the mixer filter in Envoy) blows up the cache (out.ClearCache()). So, by processing events more frequently, you are merely flushing the cache just as many times, only more quickly, compared to the previous mode.

What we really need is a way to squash the events from k8s (not throttle them), i.e. a level-triggered interrupt mode that sets a flag saying there have been changes in k8s. When you unset the flag, you flush the cache. Since we don't process the content of these events, we just need to know whether there was a platform event or not.

@costinm
Contributor Author

costinm commented Feb 14, 2018 via email

@costinm
Contributor Author

costinm commented Feb 14, 2018

If I can get a quick approval/lgtm, maybe it can make it into 0.6...

@ldemailly
Member

/lgtm

@istio-merge-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ldemailly

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these OWNERS files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@istio-merge-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@istio-merge-robot

Automatic merge from submit-queue.

@ldemailly
Member

please put in the summary how this is configured/used
(and the expected change, if any, to default behavior)

@ldemailly ldemailly deleted the costin-pilot branch February 15, 2018 05:39
PetrMc pushed a commit to PetrMc/istio-petrmc-upstream-fork that referenced this pull request Jan 14, 2026