Abort pending blocking search tasks when we drop/invalidate the search #7530
Merged
ffuugoo
approved these changes
Nov 13, 2025
agourlay
approved these changes
Nov 13, 2025
Member
agourlay
left a comment
I trust that you have tested this manually 👍
Member
Author
Correct. We could add a test for it, but I'd consider that redundant, because we'd then be testing tokio fundamentals.
timvisee
added a commit
that referenced
this pull request
Nov 14, 2025
All shard read operations, such as retrieve, scroll, facets and more can be safely aborted prematurely. Related to: <#7530>
6 tasks
timvisee
added a commit
that referenced
this pull request
Nov 14, 2025
#7530)
* Cancel pending blocking search tasks when we drop/invalidate the search
* Mention PR with explanation in comment
timvisee
added a commit
that referenced
this pull request
Nov 14, 2025
* Prematurely abort blocking task in `spawn_cancel_on_drop` on drop
  These tasks are intended to be cancellable. Now we prematurely abort the task if the future was dropped before the task is executed.
* Prematurely abort blocking task in `spawn_cancel_on_token` on cancel
  These tasks are intended to be cancellable. Now we prematurely abort the task if the cancellation token is triggered before the task is executed.
* Prematurely abort blocking task for fetching telemetry
* Prematurely abort stoppable task on drop, all are safe to abort early
* Make `move_dir` either move everything, or nothing at all
  That is with the exception of file IO errors, in which case data may be partially moved. Before this PR it was possible for the new target directory to be created without moving all data into it. Now we either do all, or nothing.
* Prematurely abort task for creating full snapshot
  It is fine to either create it, or not at all.
* Prematurely abort blocking task for waiting on consensus leader
* Prematurely abort blocking cardinality estimation and shard info tasks
* Prematurely abort blocking point deduplication task
* Prematurely abort blocking task for checking available disk space
* Prematurely abort blocking shard read operations
  All shard read operations, such as retrieve, scroll, facets and more can be safely aborted prematurely. Related to: <#7530>
* Prematurely abort blocking task for waiting on replica state
* Prematurely abort blocking task for waiting on transfer replica states
* Prematurely abort blocking task for loading segment
  This can safely be aborted before the task is started.
* Prematurely abort blocking task waiting for replica states
* Prematurely abort blocking task for creating snapshot file
  Safe because it aborts before writing any snapshot files to disk.
Critical fix for large clusters with huge search load.
Searches spawn blocking search tasks on a dedicated thread pool. The pool size is limited, so these search tasks may be queued. If there is a very large number of incoming searches, this queue may grow without bound.
When a search is invalidated (timed out, or completed through fan out), we drop the async task. The idea is that we abort all work related to this search to release resources.
It turns out that these spawned tasks will always remain queued and will be run to completion, even if the owning future was already dropped and aborted. Instead, we must explicitly cancel these pending spawned blocking tasks.
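The behaviour described above can be modelled with the standard library alone. The sketch below is not tokio's or Qdrant's code: `Pool`, `AbortOnDrop` and `run_demo` are hypothetical names made up for illustration. It assumes a single worker with an unbounded queue, and shows that a queued job is only skipped if something explicitly marks it as aborted before it starts. Here that is done by a handle that sets a flag when it is dropped.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for an abort-on-drop handle: dropping it marks the
/// queued job as aborted. This is NOT tokio's type, only a model of the idea.
struct AbortOnDrop {
    aborted: Arc<AtomicBool>,
}

impl Drop for AbortOnDrop {
    fn drop(&mut self) {
        self.aborted.store(true, Ordering::SeqCst);
    }
}

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical "blocking pool": one worker thread and an unbounded queue.
struct Pool {
    sender: Option<mpsc::Sender<Job>>,
    worker: Option<thread::JoinHandle<()>>,
}

impl Pool {
    fn new() -> Self {
        let (sender, receiver) = mpsc::channel::<Job>();
        // The worker runs queued jobs one by one until the queue is closed.
        let worker = thread::spawn(move || {
            for job in receiver {
                job();
            }
        });
        Pool { sender: Some(sender), worker: Some(worker) }
    }

    /// Queue a job. It is skipped if its handle is dropped before it starts;
    /// a job that has already started always runs to completion.
    fn spawn<F: FnOnce() + Send + 'static>(&self, work: F) -> AbortOnDrop {
        let aborted = Arc::new(AtomicBool::new(false));
        let flag = Arc::clone(&aborted);
        let job: Job = Box::new(move || {
            if !flag.load(Ordering::SeqCst) {
                work();
            }
        });
        self.sender
            .as_ref()
            .expect("pool is open")
            .send(job)
            .expect("worker is alive");
        AbortOnDrop { aborted }
    }

    /// Close the queue and wait for the worker to drain it.
    fn shutdown(mut self) {
        drop(self.sender.take());
        if let Some(worker) = self.worker.take() {
            worker.join().expect("worker panicked");
        }
    }
}

/// Queue `n` searches behind a busy worker, drop the handles of the first
/// `dropped` searches (requires `dropped <= n`), then let the worker drain
/// the queue. Returns how many searches actually ran.
fn run_demo(n: usize, dropped: usize) -> usize {
    let pool = Pool::new();
    let executed = Arc::new(AtomicUsize::new(0));

    // Keep the single worker busy so every later job stays queued.
    let (release, gate) = mpsc::channel::<()>();
    let gate_handle = pool.spawn(move || {
        let _ = gate.recv();
    });

    let mut handles = Vec::new();
    for _ in 0..n {
        let executed = Arc::clone(&executed);
        handles.push(pool.spawn(move || {
            executed.fetch_add(1, Ordering::SeqCst);
        }));
    }

    // The owners of the first `dropped` searches give up (for example on timeout).
    let kept = handles.split_off(dropped);
    drop(handles);

    // Unblock the worker and wait until the queue is drained.
    release.send(()).expect("gate receiver is alive");
    pool.shutdown();
    drop(kept);
    drop(gate_handle);
    executed.load(Ordering::SeqCst)
}

fn main() {
    let ran = run_demo(5, 3);
    println!("executed {ran} of 5 queued searches"); // prints "executed 2 of 5 queued searches"
}
```

Without the flag check in `spawn`, all five jobs would run even though three of their owners had already gone away, which is the queue build-up described above.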
Luckily, tokio provides the `AbortOnDropHandle` utility (in the `tokio-util` crate), which is what I've used in this PR.
On a huge cluster that accumulates a massive number of pending searches, it was possible to keep segments busy for more than an hour, even though each search itself might only be a few seconds of work. This can break optimizations if they cannot release old segments for over an hour. This PR helps prevent this issue by cancelling all invalidated searches early. More specifically, it helps prevent this error from happening:
All Submissions:
`dev` branch. Did you create your branch from `dev`?
New Feature Submissions:
`cargo +nightly fmt --all` command prior to submission?
`cargo clippy --all --all-features` command?