
Improving cache lookup to reduce recomputing / searches #77259

Merged
mjmbischoff merged 14 commits into elastic:master from mjmbischoff:reduce_cache_computing_values
Oct 4, 2021

Conversation

@mjmbischoff
Contributor

Addressing:
// intentionally non-locking for simplicity...it's OK if we re-put the same key/value in the cache during a race condition.

  • Avoids the race condition while keeping cache lookup fast / non-blocking, using a combination of computeIfAbsent(..) and CompletableFuture
  • Invalidation of the cache is made idempotent by comparing against the original value, so it can safely be called for each in-flight request without invalidating again
  • Cache entries are slightly bigger due to bookkeeping, but this seems like a good trade-off
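The approach above can be sketched in a few lines. This is a minimal, illustrative standalone version (class and method names are assumptions, not the actual EnrichCache code): the cache stores futures, so computeIfAbsent guarantees only one future is created per key, and invalidation compares against the original instance so it is idempotent.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Sketch of the described pattern (illustrative names, not EnrichCache itself).
class FutureCache<K, V> {
    private final ConcurrentMap<K, CompletableFuture<V>> cache = new ConcurrentHashMap<>();

    CompletableFuture<V> get(K key, Function<K, CompletableFuture<V>> loader) {
        // The mapping function only creates the future; the real work is
        // dispatched asynchronously, so this call stays cheap and non-blocking.
        return cache.computeIfAbsent(key, loader);
    }

    void invalidate(K key, CompletableFuture<V> original) {
        // Idempotent: removes the entry only if it is still the original
        // future instance, so every in-flight request may call this safely
        // without evicting a newer value.
        cache.remove(key, original);
    }
}
```

The two-argument `ConcurrentMap.remove(key, value)` is what makes the invalidation idempotent: a second call, or a call after a newer future replaced the entry, is a no-op.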

@elasticsearchmachine elasticsearchmachine added v8.0.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Sep 3, 2021
- Avoiding the race condition while keeping cache lookup fast / non-blocking using a combination of computeIfAbsent and CompletableFuture
- Invalidation of the cache is idempotent, avoiding repeated invalidation, but does require a lookup.
- Cache entries are slightly bigger due to bookkeeping, but this seems like a good trade-off
@mjmbischoff mjmbischoff force-pushed the reduce_cache_computing_values branch from 5ba57cf to b994286 on September 3, 2021 14:10
@martijnvg
Member

Hey @mjmbischoff, thanks for putting this PR up. However, I wonder whether the current approach in the PR can still lead to blocked threads? If I understand the Cache#computeIfAbsent(...) method correctly, then when there is no entry for a key and it needs to be loaded, the first thread attempts to load it, but other threads will wait during this process until that thread has completed loading/computing a value (see line 435 in Cache.java and the else statement at line 431).

@mjmbischoff
Contributor Author

Yes, the first thread will do the work. Technically all threads 'block', because as you know any call blocks, just for a very short time. The 'winning' thread will have a little more work to do while the others are blocked / parked at that point, because computeIfAbsent(..) guarantees that only one value is computed under contention. But all calls should return a value.

The computation, however, only creates a CompletableFuture object and dispatches an async call, which is short (in the sense that people would call it non-blocking). The computation finishes and the CompletableFuture is returned, on which we register a listener. The listener is called by the thread that completes the CompletableFuture, which is the thread the async call was dispatched on. That thread is long-running / blocking anyway, because an EnrichCoordinatorProxyAction is executed there.

TL;DR: the thread on the thread pool of the originClient is the one that does all the work; the rest is callbacks and cheap calls.

Feel free to make the EnrichCoordinatorProxyAction slow to assert this. Now that I think of it, I guess I could write a test for this.
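A toy demo of this claim (simplified and with assumed names, not the PR's actual code): repeated lookups race on computeIfAbsent, yet the expensive "search" is dispatched exactly once and runs on the dispatching pool's thread, not the caller's.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Single-flight demo: the expensive work runs once per key, on the pool thread.
class SingleFlightDemo {
    static final ConcurrentMap<String, CompletableFuture<String>> cache = new ConcurrentHashMap<>();
    static final AtomicInteger loads = new AtomicInteger();

    static CompletableFuture<String> lookup(String key, ExecutorService searchPool) {
        return cache.computeIfAbsent(key, k -> {
            CompletableFuture<String> f = new CompletableFuture<>();
            // Only the 'winning' thread reaches here, and it merely dispatches;
            // the expensive work runs later on the searchPool thread.
            searchPool.submit(() -> {
                loads.incrementAndGet(); // stands in for the long-running search
                f.complete("result-for-" + k);
            });
            return f;
        });
    }
}
```

All callers get the same future back, so making the simulated search slow only delays completion; it does not multiply the work.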

@mjmbischoff
Contributor Author

Ah, I also forgot to mention that this approach allows caching failures. Currently a failure is invalidated immediately, but perhaps it makes sense to keep it for some timeout to save resources.
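The failure handling described here could look roughly like this (a sketch with assumed names, not the PR code): a failed future is evicted immediately so the next lookup retries, while waiters already registered on it still observe the failure.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch: evict a future from the cache as soon as it completes exceptionally.
class FailureEvictingCache {
    static final ConcurrentMap<String, CompletableFuture<String>> cache = new ConcurrentHashMap<>();

    static CompletableFuture<String> get(String key) {
        CompletableFuture<String> f = cache.computeIfAbsent(key, k -> new CompletableFuture<>());
        // On failure, evict this exact instance so the next request retries;
        // remove(key, f) is a no-op if a newer future already replaced it.
        f.whenComplete((value, throwable) -> {
            if (throwable != null) {
                cache.remove(key, f);
            }
        });
        return f;
    }
}
```

Keeping the failed future around for a timeout, as suggested, would just mean delaying the `remove` instead of doing it in the callback.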

@martijnvg
Member

Yes, the first thread will do the work. Technically all threads 'block', because as you know any call blocks, just for a very short time.

Apologies, I somehow thought the remote call was executed while the other threads waited; that is wrong. It is just the creation of the Future, and I agree that blocks a thread for a very short time. This shouldn't be an issue.

Member

@martijnvg martijnvg left a comment


Thanks for explaining @mjmbischoff. I left a few comments.

@martijnvg martijnvg added the :Distributed/Ingest Node Execution or management of Ingest Pipelines label Sep 15, 2021
@elasticmachine elasticmachine added the Team:Data Management (obsolete) DO NOT USE. This team no longer exists. label Sep 15, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@mjmbischoff
Contributor Author

I think we're getting close. I added comments to the code to clarify (or to make my mistaken assumptions explicit ;-)).

Member

@martijnvg martijnvg left a comment


Thanks @mjmbischoff for the update. I left a few comments.

  assertThat(cacheStats.getCount(), equalTo(3L));
  assertThat(cacheStats.getHits(), equalTo(0L));
- assertThat(cacheStats.getMisses(), equalTo(0L));
+ assertThat(cacheStats.getMisses(), equalTo(3L));
Member


Why is the expected miss count increased here?

Contributor Author


Without introducing a method specifically for the tests, warming the cache requires a cache miss to trigger a resolve.

The class is also final, so we can't do the usual trickery to break encapsulation. A test-specific put, or a constructor where we can pass in a Cache<CacheKey, CompletableFuture<SearchResponse>>, are alternatives - I don't know what's preferred?

Member


I see. Maybe we should make EnrichCache non-final?

Contributor Author


Addressed in 14ef0a2

@martijnvg
Member

@elasticmachine update branch

Member

@martijnvg martijnvg left a comment


LGTM 👍

@mjmbischoff mjmbischoff merged commit 608ff36 into elastic:master Oct 4, 2021
@mjmbischoff mjmbischoff deleted the reduce_cache_computing_values branch October 4, 2021 13:05
@mjmbischoff
Contributor Author

@martijnvg As always, many thanks for reviewing! Looking at #76800 I think it's targeted for 7.16, but I'm unsure if that train has already left the station. Can I leave the backporting, if any, to you?

@martijnvg
Member

@mjmbischoff I will backport this to 7.16, there is still some time before feature freeze.

martijnvg pushed a commit to martijnvg/elasticsearch that referenced this pull request Oct 5, 2021
Backporting elastic#77259 to 7.x branch.

Improved the cache logic to avoid duplicate searches when multiple requests target the same, not-yet-cached, value.
elasticsearchmachine pushed a commit that referenced this pull request Oct 5, 2021
Backporting #77259 to 7.x branch.

Improved the cache logic to avoid duplicate searches when multiple requests target the same, not-yet-cached, value.

Co-authored-by: Michael Bischoff <michael.bischoff@elastic.co>
martijnvg added a commit that referenced this pull request Mar 16, 2022
This PR reverts the optimisation that was added via #77259.

This optimisation cleverly ensures no duplicate searches happen if multiple threads concurrently execute the same search.
However, there are issues with the implementation that cause problems like #84781. The optimisation makes use of CompletableFuture, and in this case we don't check whether the result has completed exceptionally. This causes the callback not to be invoked, which leads to bulk requests not completing and hanging around.

The ingest framework, due to its asynchronous nature, is already complex, and adding CompletableFuture into the mix makes debugging these issues very time consuming. This is the main reason why we would like to revert this commit.
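The failure mode described in this revert can be reproduced in a few lines (a simplified illustration, not the actual enrich code): a success-only callback such as thenAccept is never invoked when the future completes exceptionally, whereas whenComplete fires for both outcomes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrates why ignoring exceptional completion loses the callback.
class ExceptionalCompletionDemo {
    // Buggy pattern: only the success path is wired up.
    static boolean successOnlyCallbackFired(CompletableFuture<String> f) {
        AtomicBoolean called = new AtomicBoolean(false);
        f.thenAccept(v -> called.set(true)); // skipped on exceptional completion
        return called.get();
    }

    // Fixed pattern: whenComplete fires for success and failure alike.
    static boolean alwaysCallbackFired(CompletableFuture<String> f) {
        AtomicBoolean called = new AtomicBoolean(false);
        f.whenComplete((v, t) -> called.set(true));
        return called.get();
    }
}
```

In the bulk-request scenario the un-fired callback is what leaves the request hanging: nothing ever signals completion to the caller.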
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Mar 16, 2022
Forwardporting elastic#85000 to master branch.

martijnvg added a commit that referenced this pull request Mar 16, 2022
Forwardporting #85000 to master branch.

martijnvg added a commit that referenced this pull request Mar 16, 2022
Forward port #85000 (Revert enrich cache lookup optimisation ) and #84838 (CompoundProcessor should also catch exceptions when executing a processor) to 8.1 branch.

Revert enrich cache lookup optimisation (#85028)


* CompoundProcessor should also catch exceptions when executing a processor (#84838) (#85035)

Currently, CompoundProcessor does not catch Exception, and if a processor throws an error and no method higher in the call stack catches the exception, then pipeline execution stalls and bulk requests may not complete.

Usually these exceptions are caught by the IngestService#executePipelines(...) method, but when a processor executes async (for example, the enrich processor) and the thread that executes enrich is no longer the original write thread, then there is no logic that deals with failing pipeline execution and cleaning up resources. This then leads to memory leaks.

Closes #84781

This also changes how the 'pipeline doesn't exist' error is thrown in TrackingResultProcessor.

With the change to CompoundProcessor, thrown exceptions are caught and delegated to the handler. SimulateExecutionService in verbose mode ignores exceptions delegated to its handler, since it assumes that processorResultList contains the result (successful or not) of every processor in the pipeline.

In the case where the TrackingResultProcessor for a PipelineProcessor couldn't find the referenced pipeline, it previously just threw an error without updating the processorResultList. This commit addresses that.
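The CompoundProcessor fix described above amounts to catching the synchronous throw and routing it to the same handler the async path uses. A minimal sketch under assumed, illustrative interfaces (not the real ingest code):

```java
import java.util.function.BiConsumer;

// Sketch: a processor reports its outcome via a (document, exception) handler;
// the runner ensures a synchronous throw also reaches that handler.
class CompoundProcessorSketch {
    interface Processor {
        void execute(String doc, BiConsumer<String, Exception> handler);
    }

    static void run(Processor processor, String doc, BiConsumer<String, Exception> handler) {
        try {
            processor.execute(doc, handler);
        } catch (Exception e) {
            // Previously a synchronous throw could escape the async call chain
            // and stall the pipeline; now it is delegated to the handler.
            handler.accept(null, e);
        }
    }
}
```

With this shape, both a processor that calls the handler and one that throws outright end up reporting through the same path, so the pipeline can always complete or fail the bulk request.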

Labels

:Distributed/Ingest Node Execution or management of Ingest Pipelines >enhancement external-contributor Pull request authored by a developer outside the Elasticsearch team Team:Data Management (obsolete) DO NOT USE. This team no longer exists. v7.16.0 v8.0.0-beta1
