Cache compressed cluster state size #39827
DaveCTurner wants to merge 3 commits into elastic:master from
Conversation
Today we compute the size of the compressed cluster state for each `cluster:monitor/state` action. We do so by serializing the whole cluster state and compressing it, and this happens on the network thread. This calculation can be rather expensive if the cluster state is large, and these actions can be rather frequent particularly if there are sniffing transport clients in use. Also the calculation is a simple function of the cluster state, so there is a lot of duplicated work here. This change introduces a small cache for this size computation to avoid all this duplicated work, and to avoid blocking network threads. Fixes elastic#39806.
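The idea in the description can be sketched roughly as follows. This is a simplified, single-threaded illustration and not the actual change (the real cache resolves an `ActionListener` and forks the computation off the network thread); the class name `ClusterStateSizeCache` and the bound of 4 entries are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Illustrative sketch: cache the expensive size computation keyed by cluster
// state version, keeping only a handful of recent entries.
class ClusterStateSizeCache {
    private static final int MAX_ENTRIES = 4; // hypothetical bound

    // Access-ordered LinkedHashMap; removeEldestEntry evicts once the bound is hit.
    private final Map<Long, Long> sizeByVersion =
        new LinkedHashMap<Long, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Long> eldest) {
                return size() > MAX_ENTRIES;
            }
        };

    // Returns the cached size for this cluster state version, computing it at
    // most once per version while it remains in the cache.
    synchronized long getOrComputeCachedSize(long clusterStateVersion, LongSupplier compute) {
        return sizeByVersion.computeIfAbsent(clusterStateVersion, v -> compute.getAsLong());
    }
}
```

A second request for the same cluster state version then returns the cached value instead of re-serializing and re-compressing the whole state.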
Pinging @elastic/es-distributed
    clusterStateSizeByVersionCache.getOrComputeCachedSize(currentState.version(),
        () -> PublicationTransportHandler.serializeFullClusterState(currentState, Version.CURRENT).length(),
        ActionListener.wrap(size ->
NIT: you could use the new ActionListener#map here to make this a little nicer :)
Today is a day of learning :)
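The `ActionListener#map` suggestion above boils down to the listener-mapping pattern: wrap a listener of one response type so that a mapping function is applied before delegating, with failures (including ones thrown by the mapping function itself) routed to `onFailure`. Here is a minimal standalone sketch of that pattern; `SimpleListener` is an illustrative stand-in, not the actual `org.elasticsearch.action.ActionListener` API:

```java
import java.util.function.Function;

// Minimal sketch of the listener-mapping pattern (illustrative names only).
interface SimpleListener<T> {
    void onResponse(T response);
    void onFailure(Exception e);

    // Adapt a SimpleListener<R> into a SimpleListener<T> by mapping the
    // response, instead of writing wrap(r -> delegate.onResponse(fn.apply(r)), ...).
    static <T, R> SimpleListener<T> map(SimpleListener<R> delegate, Function<T, R> fn) {
        return new SimpleListener<T>() {
            @Override
            public void onResponse(T response) {
                R mapped;
                try {
                    mapped = fn.apply(response); // a throwing mapper fails the listener
                } catch (Exception e) {
                    delegate.onFailure(e);
                    return;
                }
                delegate.onResponse(mapped);
            }

            @Override
            public void onFailure(Exception e) {
                delegate.onFailure(e);
            }
        };
    }
}
```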
    synchronized void getOrComputeCachedSize(final long clusterStateVersion,
Maybe block a little less here by simply using a Collections.synchronizedMap() on the map (that'll be safe with computeIfAbsent), since the addListener call is thread-safe anyway. It would also be nice to have the listener resolve outside a synchronized block, which it currently won't for cached values.
Collections.synchronizedMap() doesn't let me bound the size of the map by overriding removeEldestEntry(). But you're right about calling addListener outside the mutex, I'll do that.
@DaveCTurner you can just create the same map you're already creating and wrap it with java.util.Collections#synchronizedMap? :) That said it doesn't really matter, it's just less code than having your own mutex handling.
Oh right I was thinking of newConcurrentMap(). Got it.
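The point resolved in this thread can be sketched as follows (illustrative names, not the actual PR code): the size bound still comes from overriding `removeEldestEntry` in a `LinkedHashMap` subclass, and that same instance is then wrapped with `java.util.Collections#synchronizedMap`, so no hand-rolled mutex is needed. Since Java 8 the synchronizedMap wrapper also synchronizes the default methods such as `computeIfAbsent`, which is what makes this safe:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: a bounded, synchronized cache without an explicit mutex.
class BoundedSizeCache {
    // The eviction policy lives in the LinkedHashMap subclass; the wrapper
    // provides the locking for get/put/computeIfAbsent.
    static Map<Long, Long> newBoundedCache(final int maxEntries) {
        return Collections.synchronizedMap(new LinkedHashMap<Long, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Long> eldest) {
                return size() > maxEntries; // evict the least-recently-used entry
            }
        });
    }
}
```

Note that compound iteration over such a map still needs external synchronization, but single operations like `computeIfAbsent` do not.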
original-brownbear left a comment
LGTM, thanks!
jasontedor left a comment
Why are we opting for a more intricate solution (forking to another thread pool, and a synchronized cache of futures) instead of merely forking the entirety of these requests over to the generic thread pool? The problem went from "we shouldn't do this on the network thread" to a solution that takes part of it off the network thread, and has an optimization built in. Are we convinced we need the latter part?
@jasontedor @DaveCTurner as far as I understand it there can be situations here where the request rate gets pretty high in large clusters (see the SDH attached to the issue), that's why I figured this may be worth it. I was thinking the same though: #39806 (comment) ... but not sure if we can do a reasonable benchmark to decide.
The hot threads output from the support case that triggered this shows that almost every node is using 30%+ of a CPU just calculating the size of the compressed cluster state, with some of them using 300%+. If we simply forked onto the generic thread pool we would avoid doing this calculation on the network thread, but we'd still be repeating all that work for each call.
@DaveCTurner (not wanting to put words in Jason's mouth) I think the contention isn't so much with the caching per se, but rather with the complexity of manually resolving the listeners on the transport thread when there's a cached value and forking to the generic thread pool when there isn't.
I also want to push back and question whether or not we really need to be reporting this. That is, I want to revisit #3415 and wonder if we can get away with removing this altogether?
@jasontedor I agree that we should simply stop reporting this in 8.0. However I think we shouldn't introduce that breaking change into 7.0, but I don't hold that opinion very tightly so if you think that's ok then I can do that instead.
Dismissing this to indicate that we're still discussing if this is the right approach.
I'm okay with deprecating in 6.7 and breaking in 7.0.
@DaveCTurner @jasontedor I was using
The proposal is that we remove this indeed, to me it has questionable value. Cluster states are tens to hundreds of megabytes, and typically anything beyond that is a sign that something is wrong. I don't think we need reporting within the API for the compressed size of a single file given that its size doesn't vary too wildly, and its uncompressed size can be read off disk. If we really need API reporting here, I would favor that we report the uncompressed size on disk, but I question the value of that too.
Closing in favour of #39951.