Skip to content

Audit all spawn blocking calls, prematurely abort them#7533

Merged
timvisee merged 16 commits intodevfrom
audit-spawn-blocking
Nov 14, 2025
Merged

Audit all spawn blocking calls, prematurely abort them#7533
timvisee merged 16 commits intodevfrom
audit-spawn-blocking

Conversation

@timvisee
Copy link
Member

@timvisee timvisee commented Nov 14, 2025

Similar to #7530, but for all other spawned blocking tasks.

Here I have audited all usages of spawning a blocking task. It turns out that we can prematurely cancel almost all of them. Some are in a similar vein as #7530, being for retrieve, scroll or facet operations.

Some of these tasks already use a cancellation token inside of them. But I still argue also prematurely aborting the task is better because it is free.

Other cases include aborting blocking tasks for fetching state, waiting on state, loading data.

I'd recommend to review this PR on a per-commit basis. I've separated each logical change into a dedicated commit. Each needs careful review.

I consider these changes critical as they might have significant effect on search performance under high load.

#7531 takes care of better stop flag handling inside blocking tasks.

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using cargo clippy --all --all-features command?

These tasks are intended to be cancellable. Now we prematurely abort the
task if the future was dropped before the task is executed.
These tasks are intended to be cancellable. Now we prematurely abort the
task if the cancellation token is triggered before the task is executed.
That is with the exception of file IO errors in which case data may be
partially moved.

Before this PR it was possible for the new target directory to be
created without moving all data into it. Now we either do all, or
nothing.
It is fine to either create it, or not at all.
All shard read operations, such as retrieve, scroll, facets and more can
be safely aborted prematurely.

Related to: <#7530>
This can safely be aborted before the task is started
Safe because it aborts before writing any snapshot files to disk
@timvisee timvisee added bug Something isn't working release:1.16.0 Pull requests that should be merged for the Qdrant 1.16.0 release. labels Nov 14, 2025
@timvisee timvisee changed the title Audit spawn blocking Audit all spawn blocking calls, prematurely abort them Nov 14, 2025
coderabbitai[bot]

This comment was marked as resolved.

@qdrant qdrant deleted a comment from coderabbitai bot Nov 14, 2025
})
.await??;
});
AbortOnDropHandle::new(handle).await??;
Copy link
Member

@agourlay agourlay Nov 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this one 🤔

I thought the snapshot should NOT be terminated even if the user drops the connection.

Copy link
Member Author

@timvisee timvisee Nov 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it would be better to also terminate the snapshot operation half way. But I suggest to do that in a separate PR.

This PR was only intended to abort pending blocking tasks before they start to prematurely drain a humongous queue of pending blocking tasks.

This is a blocking task, which we cannot abort/cancel without implementing a cancellation token in there.

A critical part in this snapshot logic is the proxying/unproxying of segments. All segments must be unproxied gracefully to recover the original segment holder state, even if cancelled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there should be one (or two??) spawn calls up the callstack from here, so it should be fine to cancel this task, but I can double-check. 😬

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a spawn on all call stacks.

For example, this path does not spawn:

async fn create_shard_snapshot(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

helpers::time_or_accept does spawn

@agourlay
Copy link
Member

agourlay commented Nov 14, 2025

I am not sure about some of those changes TBH.

If the idea is to drain a humongous queue of pending blocking tasks, then we only need to apply the AbortOnDropHandle transformation to high frequency operations.

That is, covering the read requests should be enough, I feel like applying the pattern everywhere might be an over reaction and introduce some subtle cancellation changes.

@timvisee
Copy link
Member Author

timvisee commented Nov 14, 2025

If the idea is to drain a humongous queue of pending blocking tasks, then we only need to apply the AbortOnDropHandle transformation to high frequency operations.

I see no point in distinguishing between the two as this change is practically free - except for a few extra lines of code.

I'd argue its the other way around. In many places we 'just' use spawn_blocking to run some blocking task, while we'd actually like to have the cancellation behavior. And so I don't see a point in throwing away this behavior for simple tasks. We never seem to have considered it properly.

Copy link
Member

@agourlay agourlay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got into a call with Tim and I finally got what I was missing!

The blocking tasks won't be aborted mid-way when the calling future drops.

It is only applied as a special case when trying to abort before starting the task, most likely after dequeuing it.

I am fine with this change although it does make things a bit more complicated to understand.

@timvisee
Copy link
Member Author

For future readers: here is documentation on spawn_blocking, and abort which states it only prevents a task from running if not started yet, not aborting it mid-way.

@timvisee timvisee merged commit 1b6e525 into dev Nov 14, 2025
16 checks passed
@timvisee timvisee deleted the audit-spawn-blocking branch November 14, 2025 12:29
timvisee added a commit that referenced this pull request Nov 14, 2025
* Prematurely abort blocking task in `spawn_cancel_on_drop` on drop

These tasks are intended to be cancellable. Now we prematurely abort the
task if the future was dropped before the task is executed.

* Prematurely abort blocking task in `spawn_cancel_on_token` on cancel

These tasks are intended to be cancellable. Now we prematurely abort the
task if the cancellation token is triggered before the task is executed.

* Prematurely abort blocking task for fetching telemetry

* Prematurely abort stoppable task on drop, all are safe to abort early

* Make `move_dir` either move everything, or nothing at all

That is with the exception of file IO errors in which case data may be
partially moved.

Before this PR it was possible for the new target directory to be
created without moving all data into it. Now we either do all, or
nothing.

* Prematurely abort task for creating full snapshot

It is fine to either create it, or not at all.

* Prematurely abort blocking task for waiting on consensus leader

* Prematurely abort blocking cardinality estimation and shard info tasks

* Prematurely abort blocking point deduplication task

* Prematurely abort blocking task for checking available disk space

* Prematurely abort blocking shard read operations

All shard read operations, such as retrieve, scroll, facets and more can
be safely aborted prematurely.

Related to: <#7530>

* Prematurely abort blocking task for waiting on replica state

* Prematurely abort blocking task for waiting on transfer replica states

* Prematurely abort blocking task for loading segment

This can safely be aborted before the task is started

* Prematurely abort blocking task waiting for replica states

* Prematurely abort blocking task for creating snapshot file

Safe because it aborts before writing any snapshot files to disk
@timvisee timvisee mentioned this pull request Nov 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working release:1.16.0 Pull requests that should be merged for the Qdrant 1.16.0 release.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants