Audit all spawn blocking calls, prematurely abort them by timvisee · Pull Request #7533 · qdrant/qdrant

timvisee · 2025-11-14T09:23:41Z

Similar to #7530, but for all other spawned blocking tasks.

Here I have audited all usages of spawning a blocking task. It turns out that we can prematurely cancel almost all of them. Some are in a similar vein as #7530, being for retrieve, scroll or facet operations.

Some of these tasks already use a cancellation token internally. I'd still argue that also aborting the task prematurely is better, because it is free.

Other cases include aborting blocking tasks for fetching state, waiting on state, and loading data.
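To make the "abort before it starts" idea concrete, here is a minimal std-only sketch. It is a model, not the code from this PR: `QueuedTask`, `AbortOnDrop` and `queue_task` are hypothetical names, and the real change wraps the blocking task's join handle in `AbortOnDropHandle` rather than using a hand-rolled flag.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a queued blocking task: a closure plus a shared abort flag.
pub struct QueuedTask {
    aborted: Arc<AtomicBool>,
    work: Box<dyn FnOnce() + Send>,
}

/// Hypothetical stand-in for an abort-on-drop handle: dropping it sets the abort flag.
pub struct AbortOnDrop {
    aborted: Arc<AtomicBool>,
}

impl Drop for AbortOnDrop {
    fn drop(&mut self) {
        self.aborted.store(true, Ordering::SeqCst);
    }
}

/// Queue `work` and return the task (for the worker) and the handle (for the caller).
pub fn queue_task(work: impl FnOnce() + Send + 'static) -> (QueuedTask, AbortOnDrop) {
    let aborted = Arc::new(AtomicBool::new(false));
    let task = QueuedTask {
        aborted: Arc::clone(&aborted),
        work: Box::new(work),
    };
    (task, AbortOnDrop { aborted })
}

impl QueuedTask {
    /// Called by a worker when the task is dequeued. Returns whether the work ran.
    /// The flag is checked only here, so work that has started always runs to completion.
    pub fn run(self) -> bool {
        if self.aborted.load(Ordering::SeqCst) {
            return false;
        }
        (self.work)();
        true
    }
}
```

In this model, a caller that gives up before a worker picks up the task costs the worker only one flag check, which is why a long queue of abandoned requests can drain quickly.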

I recommend reviewing this PR commit by commit. I've separated each logical change into a dedicated commit, and each needs careful review.

I consider these changes critical as they might have a significant effect on search performance under high load.

#7531 takes care of better stop flag handling inside blocking tasks.

All Submissions:

Contributions should target the dev branch. Did you create your branch from dev?
Have you followed the guidelines in our Contributing document?
Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

Does your submission pass tests?
Have you formatted your code locally using the `cargo +nightly fmt --all` command prior to submission?
Have you checked your code using the `cargo clippy --all --all-features` command?


agourlay · 2025-11-14T11:15:55Z

lib/collection/src/shards/local_shard/snapshot.rs

-        })
-        .await??;
+        });
+        AbortOnDropHandle::new(handle).await??;


Not sure about this one 🤔

I thought the snapshot should NOT be terminated even if the user drops the connection.

I agree it would be better to also terminate the snapshot operation halfway. But I suggest doing that in a separate PR.

This PR was only intended to abort pending blocking tasks before they start, so that a humongous queue of pending blocking tasks drains early.

This is a blocking task, which we cannot abort/cancel without implementing a cancellation token in there.

A critical part in this snapshot logic is the proxying/unproxying of segments. All segments must be unproxied gracefully to recover the original segment holder state, even if cancelled.
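One common way to guarantee that kind of restore on every exit path is a drop guard. The sketch below is purely illustrative and assumes nothing about Qdrant's actual types: `SegmentHolder`, `ProxyGuard` and `snapshot` are hypothetical, and real proxying involves far more than flipping a flag.

```rust
/// Hypothetical, heavily simplified segment holder: it only tracks
/// whether each segment is currently proxied.
pub struct SegmentHolder {
    pub proxied: Vec<bool>,
}

/// Guard that proxies all segments on creation and unproxies them when dropped.
pub struct ProxyGuard<'a> {
    holder: &'a mut SegmentHolder,
}

impl<'a> ProxyGuard<'a> {
    pub fn new(holder: &'a mut SegmentHolder) -> Self {
        holder.proxied.iter_mut().for_each(|p| *p = true);
        ProxyGuard { holder }
    }
}

impl Drop for ProxyGuard<'_> {
    fn drop(&mut self) {
        // Runs on normal return and on early return alike.
        self.holder.proxied.iter_mut().for_each(|p| *p = false);
    }
}

/// Snapshot routine that may bail out early; the guard restores the
/// holder state on both paths. Returns the number of segments covered.
pub fn snapshot(holder: &mut SegmentHolder, cancelled: bool) -> Result<usize, &'static str> {
    let guard = ProxyGuard::new(holder);
    if cancelled {
        return Err("cancelled");
    }
    Ok(guard.holder.proxied.len())
}
```

A guard like this only helps once the task is running. A task that is aborted before it starts never proxies anything, so there is nothing to restore.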

I think there should be one (or two??) spawn calls up the callstack from here, so it should be fine to cancel this task, but I can double-check. 😬

I don't see a spawn on all call stacks.

For example, this path does not spawn:

qdrant/src/actix/api/snapshot_api.rs

Line 378 in 7994816

async fn create_shard_snapshot(

helpers::time_or_accept does spawn

lib/common/cancel/src/blocking.rs

lib/common/cancel/src/future.rs

src/common/telemetry.rs

lib/collection/src/common/stoppable_task.rs

lib/collection/src/collection/shard_transfer.rs

agourlay · 2025-11-14T11:43:08Z

I am not sure about some of those changes TBH.

If the idea is to drain a humongous queue of pending blocking tasks, then we only need to apply the AbortOnDropHandle transformation to high frequency operations.

That is, covering the read requests should be enough. I feel like applying the pattern everywhere might be an overreaction and could introduce some subtle cancellation changes.

timvisee · 2025-11-14T11:45:09Z

> If the idea is to drain a humongous queue of pending blocking tasks, then we only need to apply the AbortOnDropHandle transformation to high frequency operations.

I see no point in distinguishing between the two as this change is practically free - except for a few extra lines of code.

I'd argue it's the other way around. In many places we 'just' use spawn_blocking to run some blocking task, while we'd actually like to have the cancellation behavior. So I don't see a point in throwing away this behavior for simple tasks. We never seem to have considered it properly.

agourlay

Got into a call with Tim and I finally got what I was missing!

The blocking tasks won't be aborted mid-way when the calling future drops.

It is only applied as a special case when trying to abort before starting the task, most likely after dequeuing it.

I am fine with this change although it does make things a bit more complicated to understand.

timvisee · 2025-11-14T12:27:51Z

For future readers: see the Tokio documentation for spawn_blocking and abort. It states that aborting only prevents a blocking task from running if it has not started yet; it does not abort a task midway.
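The semantics described here can be modelled as a small state machine. This is a hedged std-only sketch of the behavior, not how the runtime implements it; `TaskState` and its methods are hypothetical names.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const PENDING: u8 = 0;
const RUNNING: u8 = 1;
const FINISHED: u8 = 2;
const ABORTED: u8 = 3;

/// Hypothetical model of one blocking task's lifecycle.
pub struct TaskState(AtomicU8);

impl TaskState {
    pub fn new() -> Self {
        TaskState(AtomicU8::new(PENDING))
    }

    /// Request an abort. Takes effect only while the task is still pending;
    /// a running or finished task is left untouched.
    pub fn abort(&self) -> bool {
        self.0
            .compare_exchange(PENDING, ABORTED, Ordering::SeqCst, Ordering::SeqCst)
            .is_ok()
    }

    /// A worker claims the task. Fails if the task was aborted first.
    pub fn try_start(&self) -> bool {
        self.0
            .compare_exchange(PENDING, RUNNING, Ordering::SeqCst, Ordering::SeqCst)
            .is_ok()
    }

    /// Mark a running task as finished.
    pub fn finish(&self) -> bool {
        self.0
            .compare_exchange(RUNNING, FINISHED, Ordering::SeqCst, Ordering::SeqCst)
            .is_ok()
    }

    pub fn is_aborted(&self) -> bool {
        self.0.load(Ordering::SeqCst) == ABORTED
    }

    pub fn is_finished(&self) -> bool {
        self.0.load(Ordering::SeqCst) == FINISHED
    }
}
```

Because `abort` and `try_start` both compare against `PENDING`, exactly one of them wins: either the task never runs, or it runs to completion.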

* Prematurely abort blocking task in `spawn_cancel_on_drop` on drop

  These tasks are intended to be cancellable. Now we prematurely abort the task if the future was dropped before the task is executed.

* Prematurely abort blocking task in `spawn_cancel_on_token` on cancel

  These tasks are intended to be cancellable. Now we prematurely abort the task if the cancellation token is triggered before the task is executed.

* Prematurely abort blocking task for fetching telemetry

* Prematurely abort stoppable task on drop, all are safe to abort early

* Make `move_dir` either move everything, or nothing at all

  That is with the exception of file IO errors in which case data may be partially moved. Before this PR it was possible for the new target directory to be created without moving all data into it. Now we either do all, or nothing.

* Prematurely abort task for creating full snapshot

  It is fine to either create it, or not at all.

* Prematurely abort blocking task for waiting on consensus leader

* Prematurely abort blocking cardinality estimation and shard info tasks

* Prematurely abort blocking point deduplication task

* Prematurely abort blocking task for checking available disk space

* Prematurely abort blocking shard read operations

  All shard read operations, such as retrieve, scroll, facets and more can be safely aborted prematurely. Related to: <#7530>

* Prematurely abort blocking task for waiting on replica state

* Prematurely abort blocking task for waiting on transfer replica states

* Prematurely abort blocking task for loading segment

  This can safely be aborted before the task is started

* Prematurely abort blocking task waiting for replica states

* Prematurely abort blocking task for creating snapshot file

  Safe because it aborts before writing any snapshot files to disk

timvisee added 16 commits November 14, 2025 09:51

Prematurely abort blocking task in spawn_cancel_on_drop on drop

1036cd0

These tasks are intended to be cancellable. Now we prematurely abort the task if the future was dropped before the task is executed.

Prematurely abort blocking task in spawn_cancel_on_token on cancel

49c74bb

These tasks are intended to be cancellable. Now we prematurely abort the task if the cancellation token is triggered before the task is executed.

Prematurely abort blocking task for fetching telemetry

cc0f0e1

Prematurely abort stoppable task on drop, all are safe to abort early

1b17817

Make move_dir either move everything, or nothing at all

d9abd48

That is with the exception of file IO errors in which case data may be partially moved. Before this PR it was possible for the new target directory to be created without moving all data into it. Now we either do all, or nothing.

Prematurely abort task for creating full snapshot

f08a701

It is fine to either create it, or not at all.

Prematurely abort blocking task for waiting on consensus leader

68d3e7c

Prematurely abort blocking cardinality estimation and shard info tasks

fa7c91b

Prematurely abort blocking point deduplication task

8630ca7

Prematurely abort blocking task for checking available disk space

af57c2c

Prematurely abort blocking shard read operations

bf2a254

All shard read operations, such as retrieve, scroll, facets and more can be safely aborted prematurely. Related to: <#7530>

Prematurely abort blocking task for waiting on replica state

3e07f0d

Prematurely abort blocking task for waiting on transfer replica states

8134548

Prematurely abort blocking task for loading segment

514d08c

This can safely be aborted before the task is started

Prematurely abort blocking task waiting for replica states

f9b1af1

Prematurely abort blocking task for creating snapshot file

a6d858c

Safe because it aborts before writing any snapshot files to disk
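The all-or-nothing behavior described for `move_dir` (commit d9abd48) can be sketched as follows. This is an assumption-laden illustration, not Qdrant's implementation: `move_dir_all_or_nothing` and its `cancelled` flag are hypothetical, and the sketch assumes source and target are on the same filesystem so that `fs::rename` works.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical sketch of an all-or-nothing directory move.
/// Returns Ok(false) if cancelled before any work was done,
/// and Ok(true) once everything has been moved.
pub fn move_dir_all_or_nothing(
    from: &Path,
    to: &Path,
    cancelled: &AtomicBool,
) -> io::Result<bool> {
    // Single cancellation check, before anything touches the disk.
    if cancelled.load(Ordering::SeqCst) {
        return Ok(false);
    }
    // From here on there are no further cancellation points, so the target
    // directory is never created without also moving the data into it.
    // An IO error can still leave a partially moved state.
    fs::create_dir_all(to)?;
    for entry in fs::read_dir(from)? {
        let entry = entry?;
        fs::rename(entry.path(), to.join(entry.file_name()))?;
    }
    fs::remove_dir(from)?;
    Ok(true)
}
```

The point of the design is that creating the target directory and moving the data sit behind the same cancellation check, rather than on opposite sides of one.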

timvisee requested review from agourlay, ffuugoo and generall November 14, 2025 09:23

timvisee added the bug and release:1.16.0 labels Nov 14, 2025

timvisee changed the title from "Audit spawn blocking" to "Audit all spawn blocking calls, prematurely abort them" Nov 14, 2025


qdrant deleted a comment from coderabbitai bot Nov 14, 2025

agourlay reviewed Nov 14, 2025


ffuugoo approved these changes Nov 14, 2025


agourlay approved these changes Nov 14, 2025


timvisee merged commit 1b6e525 into dev Nov 14, 2025
16 checks passed

timvisee deleted the audit-spawn-blocking branch November 14, 2025 12:29

timvisee mentioned this pull request Nov 14, 2025

Bump version to 1.16.0 #7535

Merged
