Experiment for improving pilot handling of large services#3477
istio-merge-robot merged 5 commits into master from
Conversation
|
whyunolint ;-) |
|
Costin, if Pilot is an upstream cluster in Envoy, the stats should be collected in Prometheus, as long as you deploy Mixer and the Prometheus add-on. |
|
Pilot uses a custom config - it will need to be changed to add Mixer calls (and a Mixer cluster).
Same for Mixer - it has an Envoy in front, but it doesn't report to itself.
Right now the Envoys are only used to secure the communication.
If someone familiar with Mixer can do this in a separate PR, we may even see nice graphs!
|
rshriram
left a comment
The change looks okay. But what do you mean by "by processing events faster there is less cache recomputation"? Every event blows up the cache.
|
Each event blows up the cache, but the re-computation happens when Envoy requests something. Part 2 of the fix, in a separate PR (maybe Martin or someone else can help :-)), is to not blow up the cache. And Part 3, obviously (as discussed), is to have the most common events - workload/endpoint changes - |
|
@costinm This is a metric that we already collect, and you can look at it in Prometheus. |
|
Let me clarify. Each event from AppendServiceHandler and AppendInstanceHandler in Pilot (not the Mixer filter in Envoy) blows up the cache (out.ClearCache()). So by increasing the event frequency, you are merely flushing the cache more often than in the previous mode. What we really need is a way to squash the events from k8s (not throttle them), i.e. a level-triggered interrupt mode that sets a flag saying there have been changes in k8s. When you unset the flag, you flush the cache. Since we don't process the content of these events, we just need to know whether there was a platform event or not. |
|
Yes, you are right - we will process the events faster and flush the cache faster (still once per event). The number of clearCache calls doesn't change with this PR, just their frequency. Where it helps is that a lot of memory allocation happens when Envoy makes a request and populates the cache. If you happen to finish with 100 events in one second, before any Envoy makes a request, you clear the cache 100 times but it's never filled.
Agreed, a separate PR is needed to also make sure we can squash (not throttle - I used the wrong term). But I would squash the 'clear cache' events; batching k8s events is also desirable, but I think it's a bit longer term.
|
If I can get a quick approval/lgtm, maybe it can make it into 0.6... |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ldemailly. The full list of commands accepted by this bot can be found here.
Details: Needs approval from an approver in each of these OWNERS files.
You can indicate your approval by writing |
|
/test all [submit-queue is verifying that this PR is safe to merge] |
|
Automatic merge from submit-queue. |
|
please put in summary how this is configured/used |
The throttle on event processing is removed by default; it can be added back if needed (but so far my tests show it's better without). By processing events faster, there is less cache recomputation for a given sidecar QPS. By prolonging the update, more cache clears were happening, and more requests from Envoy arrived with an empty cache.
Also added logs to evaluate how many requests come from Envoy and what the latency is. The sidecar for Pilot doesn't report Mixer metrics, and Pilot doesn't log response times.