
Abort pending blocking search tasks when we drop/invalidate the search #7530

Merged: timvisee merged 2 commits into dev from cancel-search-tasks-on-drop on Nov 14, 2025


timvisee (Member) commented Nov 13, 2025

Critical fix for large clusters with huge search load.

Searches spawn blocking search tasks on a dedicated thread pool. The pool size is limited, so these search tasks may be queued. With a very large number of incoming searches, this queue can grow without bound.

When a search is invalidated (timed out, or completed through fan out), we drop the async task. The idea is that we abort all work related to this search to release resources.

It turns out that these spawned tasks will always remain queued and will be run to completion, even if the owning future was already dropped and aborted. Instead, we must explicitly cancel these pending spawned blocking tasks.

Luckily, tokio-util provides the AbortOnDropHandle utility, which is what I've used in this PR.
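To make the difference concrete, here is a minimal std-only sketch of the behaviour described above. It is a toy model, not tokio's or Qdrant's implementation: `Queue`, `PlainHandle` and `AbortOnDrop` are hypothetical names invented for this illustration. A job whose plain handle is dropped still runs once a worker reaches it, while a job whose abort-on-drop handle is dropped is skipped before it starts.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

type Job = Box<dyn FnOnce() + Send>;

/// Toy stand-in for a bounded blocking pool: jobs wait here until a worker picks them up.
struct Queue {
    jobs: Mutex<VecDeque<(Arc<AtomicBool>, Job)>>,
}

/// Plain handle (hypothetical): dropping it leaves the queued job untouched.
struct PlainHandle;

/// Handle (hypothetical) that marks its job as aborted when it is dropped.
struct AbortOnDrop {
    aborted: Arc<AtomicBool>,
}

impl Drop for AbortOnDrop {
    fn drop(&mut self) {
        self.aborted.store(true, Ordering::SeqCst);
    }
}

impl Queue {
    fn new() -> Self {
        Queue { jobs: Mutex::new(VecDeque::new()) }
    }

    /// Enqueue a job and return the flag that can abort it before it starts.
    fn push(&self, job: Job) -> Arc<AtomicBool> {
        let flag = Arc::new(AtomicBool::new(false));
        self.jobs.lock().unwrap().push_back((Arc::clone(&flag), job));
        flag
    }

    fn spawn_detached(&self, job: Job) -> PlainHandle {
        self.push(job);
        PlainHandle
    }

    fn spawn_abort_on_drop(&self, job: Job) -> AbortOnDrop {
        AbortOnDrop { aborted: self.push(job) }
    }

    /// Drain the queue like a worker would; returns how many jobs actually ran.
    fn run_all(&self) -> usize {
        let mut ran = 0;
        loop {
            let next = self.jobs.lock().unwrap().pop_front();
            let Some((aborted, job)) = next else { break };
            if aborted.load(Ordering::SeqCst) {
                continue; // aborted while still pending: never started
            }
            job();
            ran += 1;
        }
        ran
    }
}

/// Queue three jobs per variant, drop every handle, then let the "worker" run.
fn demo() -> (usize, usize) {
    let detached = Queue::new();
    for _ in 0..3 {
        drop(detached.spawn_detached(Box::new(|| {})));
    }
    let aborting = Queue::new();
    for _ in 0..3 {
        drop(aborting.spawn_abort_on_drop(Box::new(|| {})));
    }
    (detached.run_all(), aborting.run_all())
}

fn main() {
    let (ran_detached, ran_aborting) = demo();
    println!("detached: {ran_detached} ran, abort-on-drop: {ran_aborting} ran");
}
```

As in the real change, the sketch only prevents work that has not started yet; a job that is already running is not interrupted.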

On a huge cluster that accumulates a massive number of pending searches, segments could be kept busy for more than an hour, even though each search itself might be only a few seconds of work. This can break optimizations, because they cannot release old segments for as long as those searches are pending. This PR helps prevent that by aborting all invalidated searches early. More specifically, it helps prevent this error:
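One way to picture why pending searches block segment removal is shared ownership. The sketch below is an assumption-laden illustration, not Qdrant's code: `Segment` and `Job` are hypothetical types, and it simply assumes a queued search holds a reference-counted handle to the segment it will read. While the job sits in the queue the segment still has another owner; dropping the pending job releases it.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a segment that an optimizer wants to retire.
struct Segment {
    id: u32,
}

/// Hypothetical queued search: a closure that reads from the segment when run.
type Job = Box<dyn FnOnce() -> u32 + Send>;

fn demo() -> (bool, bool) {
    let segment = Arc::new(Segment { id: 7 });

    // A search that has not started yet, but already holds the segment.
    let held = Arc::clone(&segment);
    let mut queue: Vec<Job> = Vec::new();
    queue.push(Box::new(move || held.id));

    // While the search is queued, the segment has more than one owner.
    let in_use_while_queued = Arc::strong_count(&segment) > 1;

    // Aborting the pending search drops its closure, and with it the reference.
    queue.clear();
    let removable_after_abort = Arc::try_unwrap(segment).is_ok();

    (in_use_while_queued, removable_after_abort)
}

fn main() {
    let (in_use, removable) = demo();
    println!("in use while queued: {in_use}, removable after abort: {removable}");
}
```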

Service internal error: Removing proxy segment which is still in use

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using the `cargo +nightly fmt --all` command prior to submission?
  3. Have you checked your code using the `cargo clippy --all --all-features` command?

timvisee added the bug and release:1.16.0 labels Nov 13, 2025
agourlay (Member) left a comment

I trust that you have tested this manually 👍

timvisee changed the title from "Cancel pending blocking search tasks when we drop/invalidate the search" to "Abort pending blocking search tasks when we drop/invalidate the search" Nov 14, 2025
timvisee (Member, Author) commented Nov 14, 2025

> I trust that you have tested this manually 👍

Correct.

We can add a test for it, but I'd consider that unnecessary, because then we'd be testing tokio fundamentals. The spawn_ functions are intended to keep the task running even if the caller is dropped. It was simply an oversight on our side when implementing this.
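For readers less familiar with this detach-on-drop behaviour, the standard library has the same semantics and makes for a dependency-free illustration. The sketch below uses `std::thread::spawn` as an analogy only; it is not the tokio code touched by this PR. Dropping the join handle detaches the thread, and the work still completes.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Spawn a thread, drop its handle right away, and check whether the work still finishes.
fn demo() -> Option<&'static str> {
    let (tx, rx) = mpsc::channel();

    let handle = thread::spawn(move || {
        thread::sleep(Duration::from_millis(20));
        // The receiver may be gone in other scenarios, so ignore send errors.
        let _ = tx.send("finished");
    });

    // Dropping the handle detaches the thread; it does not stop it.
    drop(handle);

    // The detached thread still reports back.
    rx.recv_timeout(Duration::from_secs(5)).ok()
}

fn main() {
    println!("{:?}", demo());
}
```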

timvisee merged commit c86aa1a into dev Nov 14, 2025 (16 checks passed)
timvisee deleted the cancel-search-tasks-on-drop branch November 14, 2025 07:49
timvisee added a commit that referenced this pull request Nov 14, 2025
All shard read operations, such as retrieve, scroll, facets and more can
be safely aborted prematurely.

Related to: <#7530>
timvisee added a commit that referenced this pull request Nov 14, 2025
#7530)

* Cancel pending blocking search tasks when we drop/invalidate the search

* Mention PR with explanation in comment
timvisee added a commit that referenced this pull request Nov 14, 2025
* Prematurely abort blocking task in `spawn_cancel_on_drop` on drop

These tasks are intended to be cancellable. Now we prematurely abort the
task if the future was dropped before the task is executed.

* Prematurely abort blocking task in `spawn_cancel_on_token` on cancel

These tasks are intended to be cancellable. Now we prematurely abort the
task if the cancellation token is triggered before the task is executed.

* Prematurely abort blocking task for fetching telemetry

* Prematurely abort stoppable task on drop, all are safe to abort early

* Make `move_dir` either move everything, or nothing at all

That is with the exception of file IO errors in which case data may be
partially moved.

Before this PR it was possible for the new target directory to be
created without moving all data into it. Now we either do all, or
nothing.

* Prematurely abort task for creating full snapshot

It is fine to either create it, or not at all.

* Prematurely abort blocking task for waiting on consensus leader

* Prematurely abort blocking cardinality estimation and shard info tasks

* Prematurely abort blocking point deduplication task

* Prematurely abort blocking task for checking available disk space

* Prematurely abort blocking shard read operations

All shard read operations, such as retrieve, scroll, facets and more can
be safely aborted prematurely.

Related to: <#7530>

* Prematurely abort blocking task for waiting on replica state

* Prematurely abort blocking task for waiting on transfer replica states

* Prematurely abort blocking task for loading segment

This can safely be aborted before the task is started

* Prematurely abort blocking task waiting for replica states

* Prematurely abort blocking task for creating snapshot file

Safe because it aborts before writing any snapshot files to disk
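
The `spawn_cancel_on_token` item in the commit message above says a task is now abandoned if its cancellation token fires before the task is executed. A minimal std-only sketch of that observable behaviour follows. `CancelToken` and `guard` are hypothetical names for this illustration and do not reflect how Qdrant's helper is implemented; the sketch only models "check before start, never interrupt mid-run".

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical cancellation token: a shared flag that can be set once.
#[derive(Clone, Default)]
struct CancelToken(Arc<AtomicBool>);

impl CancelToken {
    fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// Wrap a job so that it yields `None` if the token fired before the job started.
fn guard<T>(token: CancelToken, job: impl FnOnce() -> T) -> impl FnOnce() -> Option<T> {
    move || {
        if token.is_cancelled() {
            return None; // cancelled while pending: skip the work entirely
        }
        Some(job())
    }
}

fn demo() -> (Option<u32>, Option<u32>) {
    let token = CancelToken::default();

    // This job gets to run before anyone cancels.
    let first = guard(token.clone(), || 1)();

    // This job is still pending when the token is cancelled.
    let pending = guard(token.clone(), || 2);
    token.cancel();
    let second = pending();

    (first, second)
}

fn main() {
    println!("{:?}", demo());
}
```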
timvisee mentioned this pull request Nov 14, 2025
