Experiment for improving pilot handling of large services#3477
istio-merge-robot merged 5 commits into master from
Conversation
|
whyunolint ;-) |
|
Costin, if Pilot is an upstream cluster in Envoy, the stats should be collected in Prometheus, as long as you deploy Mixer and the Prometheus add-on. |
|
Pilot uses a custom config - it will need to be changed to add Mixer calls (and a Mixer cluster).
Same for Mixer - it has an Envoy in front, but it doesn't report to itself.
Right now the Envoys are only used to secure the communication.
If someone familiar with Mixer can do this in a separate PR, we may even see nice graphs!
|
rshriram
left a comment
The change looks okay. But what do you mean by "by processing events faster there is less cache recomputation"? Every event blows up the cache.
|
Each event blows up the cache, but the re-computation happens when Envoy requests something. Part 2 of the fix, in a separate PR (maybe Martin or someone else can help :-)), is to not blow up the cache. And Part 3, obviously (as discussed), is to have the most common events - workload/endpoint changes - |
|
@costinm This is a metric that we already collect, and you can look at it in Prometheus. |
|
Let me clarify. Each event from AppendServiceHandler and AppendInstanceHandler in Pilot (not the Mixer filter in Envoy) blows up the cache (out.ClearCache()). So by increasing the event frequency, you are merely flushing the cache more often than in the previous mode. What we really need is a way to squash the events from k8s (not throttle them), i.e. a level-triggered interrupt mode that sets a flag saying there have been changes in k8s. When you unset the flag, you flush the cache. Since we don't process the content of these events, we just need to know whether there was a platform event or not. |
|
Yes, you are right - we will process the events faster and flush the cache faster (still once per event). The number of clearCache calls doesn't change with this PR, just their frequency. Where it helps is that a lot of memory allocation happens when Envoy makes a request and populates the cache. If you happen to finish with 100 events in one second, before any Envoy makes a request, you clear the cache 100 times but it's never filled.
Agreed, a separate PR is needed to also make sure we can squash (not throttle - I used the wrong term). But I would squash the 'clear cache' events; batching k8s events is also desirable, but I think it's a bit longer term.
|
If I can get a quick approval/lgtm, maybe it can make it into 0.6... |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ldemailly. The full list of commands accepted by this bot can be found here.
Details: Needs approval from an approver in each of these OWNERS files.
You can indicate your approval by writing |
|
/test all [submit-queue is verifying that this PR is safe to merge] |
|
Automatic merge from submit-queue. |
|
please put in summary how this is configured/used |
The throttle on event processing is removed by default; it can be added back if needed (but so far my tests show it's better without). By processing events faster, there is less cache recomputation for a given sidecar QPS. By prolonging the update, more cache clears were happening, and more requests from Envoy arrived with an empty cache.
Also added logs to evaluate how many requests come from Envoy and what the latency is. The sidecar for Pilot doesn't report Mixer metrics, and Pilot doesn't log response times.