
Improving cache lookup to reduce recomputing / searches #77259

Merged
mjmbischoff merged 14 commits into elastic:master from mjmbischoff:reduce_cache_computing_values
Oct 4, 2021

Conversation

@mjmbischoff
Contributor

Addressing:
// intentionally non-locking for simplicity...it's OK if we re-put the same key/value in the cache during a race condition.

  • Avoids the race condition while keeping cache lookup fast / non-blocking, using a combination of computeIfAbsent(..) and CompletableFuture
  • Invalidation of the cache is made idempotent by comparing against the original value, so it can safely be called for each in-flight request without invalidating again
  • Cache entries are slightly bigger due to bookkeeping, but this seems like a good trade-off
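The approach above can be sketched in a few lines. This is a minimal, illustrative standalone version (class and method names are assumptions, not the actual EnrichCache code): the cache stores futures, so computeIfAbsent guarantees only one future is created per key, and invalidation compares against the original instance so it is idempotent.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Sketch of the described pattern (illustrative names, not EnrichCache itself).
class FutureCache<K, V> {
    private final ConcurrentMap<K, CompletableFuture<V>> cache = new ConcurrentHashMap<>();

    CompletableFuture<V> get(K key, Function<K, CompletableFuture<V>> loader) {
        // The mapping function only creates the future; the real work is
        // dispatched asynchronously, so this call stays cheap and non-blocking.
        return cache.computeIfAbsent(key, loader);
    }

    void invalidate(K key, CompletableFuture<V> original) {
        // Idempotent: removes the entry only if it is still the original
        // future instance, so every in-flight request may call this safely
        // without evicting a newer value.
        cache.remove(key, original);
    }
}
```

The two-argument `ConcurrentMap.remove(key, value)` is what makes the invalidation idempotent: a second call, or a call after a newer future replaced the entry, is a no-op.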

@elasticsearchmachine elasticsearchmachine added v8.0.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Sep 3, 2021
- Avoiding the race condition while keeping cache lookup fast / non-blocking using a combination of computeIfAbsent and CompletableFuture
- Invalidation of the cache is idempotent, avoiding repeated invalidation, but does require a lookup.
- Cache entries are slightly bigger due to bookkeeping, but this seems like a good trade-off
@mjmbischoff mjmbischoff force-pushed the reduce_cache_computing_values branch from 5ba57cf to b994286 on September 3, 2021 14:10
@martijnvg
Member

Hey @mjmbischoff, thanks for putting this PR up. However, I wonder whether the current approach in the PR can still lead to blocked threads? If I understand the Cache#computeIfAbsent(...) method correctly, then when there is no entry for a key and it needs to be loaded, the first thread attempts to load it, but other threads will wait during this process until that thread has completed loading/computing a value (see line 435 in Cache.java and the else statement at line 431).

@mjmbischoff
Contributor Author

Yes, the first thread will do the work. Technically all threads 'block', because as you know any call blocks, just for a very short time. The 'winning' thread will have a little more work to do while the others are blocked / parked at that point, because computeIfAbsent(..) guarantees that only one value is computed under contention. But all calls should return a value.

The computation, however, only creates a CompletableFuture object and dispatches an async call, which is short (in the sense that people would call it non-blocking). The computation finishes and the CompletableFuture is returned, on which we register a listener. The listener is called by the thread that completes the CompletableFuture, which is the thread the async call was dispatched on. That thread is long-running / blocking anyway, because an EnrichCoordinatorProxyAction is executed there.

TL;DR: the thread on the thread pool of the originClient is the one that does all the work; the rest is callbacks and cheap calls.

Feel free to make the EnrichCoordinatorProxyAction slow to assert this. Now that I think of it, I guess I could write a test for this.
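A toy demo of this claim (simplified and with assumed names, not the PR's actual code): repeated lookups race on computeIfAbsent, yet the expensive "search" is dispatched exactly once and runs on the dispatching pool's thread, not the caller's.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Single-flight demo: the expensive work runs once per key, on the pool thread.
class SingleFlightDemo {
    static final ConcurrentMap<String, CompletableFuture<String>> cache = new ConcurrentHashMap<>();
    static final AtomicInteger loads = new AtomicInteger();

    static CompletableFuture<String> lookup(String key, ExecutorService searchPool) {
        return cache.computeIfAbsent(key, k -> {
            CompletableFuture<String> f = new CompletableFuture<>();
            // Only the 'winning' thread reaches here, and it merely dispatches;
            // the expensive work runs later on the searchPool thread.
            searchPool.submit(() -> {
                loads.incrementAndGet(); // stands in for the long-running search
                f.complete("result-for-" + k);
            });
            return f;
        });
    }
}
```

All callers get the same future back, so making the simulated search slow only delays completion; it does not multiply the work.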

@mjmbischoff
Contributor Author

Ah, I also forgot to mention that this approach allows caching failures. Currently a failure is invalidated immediately, but perhaps it makes sense to keep it for some timeout to save resources.
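The failure handling described here could look roughly like this (a sketch with assumed names, not the PR code): a failed future is evicted immediately so the next lookup retries, while waiters already registered on it still observe the failure.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch: evict a future from the cache as soon as it completes exceptionally.
class FailureEvictingCache {
    static final ConcurrentMap<String, CompletableFuture<String>> cache = new ConcurrentHashMap<>();

    static CompletableFuture<String> get(String key) {
        CompletableFuture<String> f = cache.computeIfAbsent(key, k -> new CompletableFuture<>());
        // On failure, evict this exact instance so the next request retries;
        // remove(key, f) is a no-op if a newer future already replaced it.
        f.whenComplete((value, throwable) -> {
            if (throwable != null) {
                cache.remove(key, f);
            }
        });
        return f;
    }
}
```

Keeping the failed future around for a timeout, as suggested, would just mean delaying the `remove` instead of doing it in the callback.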

@martijnvg
Member

Yes, the first thread will do the work. Technically all threads 'block', because as you know any call blocks, just for a very short time.

Apologies, I somehow thought the remote call was executed while the other threads waited; that is wrong. It is just the creation of the Future, and I agree that blocks a thread for a very short time. This shouldn't be an issue.

Member

@martijnvg martijnvg left a comment


Thanks for explaining @mjmbischoff. I left a few comments.

@martijnvg martijnvg added the :Distributed/Ingest Node Execution or management of Ingest Pipelines label Sep 15, 2021
@elasticmachine elasticmachine added the Team:Data Management (obsolete) DO NOT USE. This team no longer exists. label Sep 15, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@mjmbischoff
Contributor Author

I think we're getting close. I added comments to the code to clarify (or to make my mistaken assumptions explicit ;-)).

Member

@martijnvg martijnvg left a comment


Thanks @mjmbischoff for the update. I left a few comments.

  assertThat(cacheStats.getCount(), equalTo(3L));
  assertThat(cacheStats.getHits(), equalTo(0L));
- assertThat(cacheStats.getMisses(), equalTo(0L));
+ assertThat(cacheStats.getMisses(), equalTo(3L));
Member


Why is the expected miss count increased here?

Contributor Author


Without introducing a method specifically for the tests, warming the cache requires a cache miss to trigger a resolve.

The class is also final, so we can't do the usual trickery to break encapsulation. A test-specific put, or a constructor where we can pass in a Cache<CacheKey, CompletableFuture<SearchResponse>>, are alternatives - I don't know what's preferred?

Member


I see. Maybe we should make EnrichCache non-final?

Contributor Author


Addressed in 14ef0a2

@martijnvg
Member

@elasticmachine update branch

Member

@martijnvg martijnvg left a comment


LGTM 👍

@mjmbischoff mjmbischoff merged commit 608ff36 into elastic:master Oct 4, 2021
@mjmbischoff mjmbischoff deleted the reduce_cache_computing_values branch October 4, 2021 13:05
@mjmbischoff
Contributor Author

@martijnvg As always, many thanks for reviewing! Looking at #76800 I think it's targeted for 7.16, but I'm unsure if that train has already left the station. Can I leave the backporting, if any, to you?

@martijnvg
Member

@mjmbischoff I will backport this to 7.16, there is still some time before feature freeze.

martijnvg pushed a commit to martijnvg/elasticsearch that referenced this pull request Oct 5, 2021
Backporting elastic#77259 to 7.x branch.

Improved the cache logic to avoid duplicate searches when multiple requests target the same, not-yet-cached, value.
elasticsearchmachine pushed a commit that referenced this pull request Oct 5, 2021
Backporting #77259 to 7.x branch.

Improved the cache logic to avoid duplicate searches when multiple requests target the same, not-yet-cached, value.

Co-authored-by: Michael Bischoff <michael.bischoff@elastic.co>
martijnvg added a commit that referenced this pull request Mar 16, 2022
This PR reverts the optimisation that was added via #77259.

This optimisation cleverly ensures no duplicate searches happen if multiple threads concurrently execute the same search.
However, there are issues with the implementation that cause problems like #84781. The optimisation makes use of CompletableFuture, and in this case we don't check whether the result has completed exceptionally. This causes the callback not to be invoked, which leads to bulk requests not completing and hanging around.

The ingest framework, due to its asynchronous nature, is already complex, and adding CompletableFuture into the mix makes debugging these issues very time consuming. This is the main reason why we would like to revert this commit.
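The failure mode described in this revert can be reproduced in a few lines (a simplified illustration, not the actual enrich code): a success-only callback such as thenAccept is never invoked when the future completes exceptionally, whereas whenComplete fires for both outcomes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrates why ignoring exceptional completion loses the callback.
class ExceptionalCompletionDemo {
    // Buggy pattern: only the success path is wired up.
    static boolean successOnlyCallbackFired(CompletableFuture<String> f) {
        AtomicBoolean called = new AtomicBoolean(false);
        f.thenAccept(v -> called.set(true)); // skipped on exceptional completion
        return called.get();
    }

    // Fixed pattern: whenComplete fires for success and failure alike.
    static boolean alwaysCallbackFired(CompletableFuture<String> f) {
        AtomicBoolean called = new AtomicBoolean(false);
        f.whenComplete((v, t) -> called.set(true));
        return called.get();
    }
}
```

In the bulk-request scenario the un-fired callback is what leaves the request hanging: nothing ever signals completion to the caller.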
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Mar 16, 2022
Forwardporting elastic#85000 to master branch.

martijnvg added a commit that referenced this pull request Mar 16, 2022
Forwardporting #85000 to master branch.

martijnvg added a commit that referenced this pull request Mar 16, 2022
Forward port #85000 (Revert enrich cache lookup optimisation ) and #84838 (CompoundProcessor should also catch exceptions when executing a processor) to 8.1 branch.

Revert enrich cache lookup optimisation (#85028)


* CompoundProcessor should also catch exceptions when executing a processor (#84838) (#85035)

Currently, CompoundProcessor does not catch Exception, and if a processor throws an error and no method higher in the call stack catches the exception, then pipeline execution stalls and bulk requests may not complete.

Usually these exceptions are caught by the IngestService#executePipelines(...) method, but when a processor executes async (for example, the enrich processor) and the thread that executes enrich is no longer the original write thread, then there is no logic that deals with failing pipeline execution and cleaning up resources. This then leads to memory leaks.

Closes #84781

This also changes how the 'pipeline doesn't exist' error is thrown in TrackingResultProcessor.

With the change to CompoundProcessor, thrown exceptions are caught and delegated to the handler. SimulateExecutionService in verbose mode ignores exceptions delegated to its handler, since it assumes that processorResultList contains the result (successful or not) of every processor in the pipeline.

In the case where the TrackingResultProcessor for a PipelineProcessor couldn't find the referenced pipeline, it previously just threw an error without updating the processorResultList. This commit addresses that.
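The CompoundProcessor fix described above amounts to catching the synchronous throw and routing it to the same handler the async path uses. A minimal sketch under assumed, illustrative interfaces (not the real ingest code):

```java
import java.util.function.BiConsumer;

// Sketch: a processor reports its outcome via a (document, exception) handler;
// the runner ensures a synchronous throw also reaches that handler.
class CompoundProcessorSketch {
    interface Processor {
        void execute(String doc, BiConsumer<String, Exception> handler);
    }

    static void run(Processor processor, String doc, BiConsumer<String, Exception> handler) {
        try {
            processor.execute(doc, handler);
        } catch (Exception e) {
            // Previously a synchronous throw could escape the async call chain
            // and stall the pipeline; now it is delegated to the handler.
            handler.accept(null, e);
        }
    }
}
```

With this shape, both a processor that calls the handler and one that throws outright end up reporting through the same path, so the pipeline can always complete or fail the bulk request.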

Labels

:Distributed/Ingest Node Execution or management of Ingest Pipelines >enhancement external-contributor Pull request authored by a developer outside the Elasticsearch team Team:Data Management (obsolete) DO NOT USE. This team no longer exists. v7.16.0 v8.0.0-beta1
