Audit all spawn blocking calls, prematurely abort them#7533
Conversation
These tasks are intended to be cancellable. Now we prematurely abort the task if the future was dropped before the task is executed.
These tasks are intended to be cancellable. Now we prematurely abort the task if the cancellation token is triggered before the task is executed.
That is with the exception of file IO errors in which case data may be partially moved. Before this PR it was possible for the new target directory to be created without moving all data into it. Now we either do all, or nothing.
It is fine to either create it, or not at all.
All shard read operations, such as retrieve, scroll, facets and more can be safely aborted prematurely. Related to: <#7530>
This can safely be aborted before the task is started
Safe because it aborts before writing any snapshot files to disk
| }) | ||
| .await??; | ||
| }); | ||
| AbortOnDropHandle::new(handle).await??; |
There was a problem hiding this comment.
Not sure about this one 🤔
I thought the snapshot should NOT be terminated even if the user drops the connection.
There was a problem hiding this comment.
I agree it would be better to also terminate the snapshot operation half way. But I suggest to do that in a separate PR.
This PR was only intended to abort pending blocking tasks before they start to prematurely drain a humongous queue of pending blocking tasks.
This is a blocking task, which we cannot abort/cancel without implementing a cancellation token in there.
A critical part in this snapshot logic is the proxying/unproxying of segments. All segments must be unproxied gracefully to recover the original segment holder state, even if cancelled.
There was a problem hiding this comment.
I think there should be one (or two??) spawn calls up the callstack from here, so it should be fine to cancel this task, but I can double-check. 😬
There was a problem hiding this comment.
I don't see a spawn on all call stacks.
For example, this path does not spawn:
qdrant/src/actix/api/snapshot_api.rs
Line 378 in 7994816
There was a problem hiding this comment.
helpers::time_or_accept does spawn
|
I am not sure about some of those changes TBH. If the idea is to drain a humongous queue of pending blocking tasks, then we only need to apply the That is, covering the read requests should be enough, I feel like applying the pattern everywhere might be an over reaction and introduce some subtle cancellation changes. |
I see no point in distinguishing between the two as this change is practically free - except for a few extra lines of code. I'd argue its the other way around. In many places we 'just' use |
agourlay
left a comment
There was a problem hiding this comment.
Got into a call with Tim and I finally got what I was missing!
The blocking tasks won't be aborted mid-way when the calling future drops.
It is only applied as a special case when trying to abort before starting the task, most likely after dequeuing it.
I am fine with this change although it does make things a bit more complicated to understand.
|
For future readers: here is documentation on |
* Prematurely abort blocking task in `spawn_cancel_on_drop` on drop These tasks are intended to be cancellable. Now we prematurely abort the task if the future was dropped before the task is executed. * Prematurely abort blocking task in `spawn_cancel_on_token` on cancel These tasks are intended to be cancellable. Now we prematurely abort the task if the cancellation token is triggered before the task is executed. * Prematurely abort blocking task for fetching telemetry * Prematurely abort stoppable task on drop, all are safe to abort early * Make `move_dir` either move everything, or nothing at all That is with the exception of file IO errors in which case data may be partially moved. Before this PR it was possible for the new target directory to be created without moving all data into it. Now we either do all, or nothing. * Prematurely abort task for creating full snapshot It is fine to either create it, or not at all. * Prematurely abort blocking task for waiting on consensus leader * Prematurely abort blocking cardinality estimation and shard info tasks * Prematurely abort blocking point deduplication task * Prematurely abort blocking task for checking available disk space * Prematurely abort blocking shard read operations All shard read operations, such as retrieve, scroll, facets and more can be safely aborted prematurely. Related to: <#7530> * Prematurely abort blocking task for waiting on replica state * Prematurely abort blocking task for waiting on transfer replica states * Prematurely abort blocking task for loading segment This can safely be aborted before the task is started * Prematurely abort blocking task waiting for replica states * Prematurely abort blocking task for creating snapshot file Safe because it aborts before writing any snapshot files to disk
Similar to #7530, but for all other spawned blocking tasks.
Here I have audited all usages of spawning a blocking task. It turns out that we can prematurely cancel almost all of them. Some are in a similar vein as #7530, being for retrieve, scroll or facet operations.
Some of these tasks already use a cancellation token inside of them. But I still argue also prematurely aborting the task is better because it is free.
Other cases include aborting blocking tasks for fetching state, waiting on state, loading data.
I'd recommend to review this PR on a per-commit basis. I've separated each logical change into a dedicated commit. Each needs careful review.
I consider these changes critical as they might have significant effect on search performance under high load.
#7531 takes care of better stop flag handling inside blocking tasks.
All Submissions:
devbranch. Did you create your branch fromdev?New Feature Submissions:
cargo +nightly fmt --allcommand prior to submission?cargo clippy --all --all-featurescommand?