Skip to content

Fix BufferedDynamicFlags flushing with concurrent drop/deletion#7621

Merged
agourlay merged 2 commits intodevfrom
safe-flush-deleted-buffered-dynamic-flags
Nov 27, 2025
Merged

Fix BufferedDynamicFlags flushing with concurrent drop/deletion#7621
agourlay merged 2 commits intodevfrom
safe-flush-deleted-buffered-dynamic-flags

Conversation

@agourlay
Copy link
Member

@agourlay agourlay commented Nov 26, 2025

Fixes flushing error happening when deleting a payload index concurrently with flushing.

2025-11-26T10:31:55.833087Z DEBUG collection::update_handler: Optimizer 'indexing' running on segments: [18, 19]
2025-11-26T10:31:55.834739Z DEBUG collection::collection_manager::optimizers::segment_optimizer: Available space: 801.99 GiB, needed for optimization: 110.93 MiB
2025-11-26T10:31:56.052782Z DEBUG segment::index::hnsw_index::hnsw: building HNSW for 10693 vectors with 8 CPUs
2025-11-26T10:32:00.428258Z  INFO actix_web::middleware::logger: 127.0.0.1 "GET /collections/benchmark HTTP/1.1" 200 462 "http://localhost:6333/dashboard" "Mozilla/5.0 (X11; Linux x86_64; rv:145.0) Gecko/20100101 Firefox/145.0" 0.009905
2025-11-26T10:32:04.295185Z DEBUG storage::content_manager::toc::collection_meta_ops: Drop payload index DropPayloadIndex { collection_name: "benchmark", field_name: JsonPath { first_key: "a", rest: [] } }
2025-11-26T10:32:06.021544Z  INFO actix_web::middleware::logger: 127.0.0.1 "DELETE /collections/benchmark/index/a?wait=true HTTP/1.1" 200 78 "http://localhost:6333/dashboard" "Mozilla/5.0 (X11; Linux x86_64; rv:145.0) Gecko/20100101 Firefox/145.0" 1.731689
2025-11-26T10:32:10.919258Z ERROR collection::update_handler: Failed to flush: Service runtime error: Failed to flush payload_index: Service runtime error: Failed to flush payload_index: Service runtime error: IO Error: failed to open file `./storage/collections/benchmark/0/segments/4fc17df6-0864-417e-a56b-fbe82bc7706f/payload_index/pabo24ma3euslum266ppk8us-a-null/has_values/flags_a.dat`: No such file or directory (os error 2)

This PR makes sure that drop and flushing are exclusive and uses a weak reference on storage to not trip over a deleted folder.

This was tested manually with bfb and adding/removing a keyword payload index.

@agourlay agourlay marked this pull request as ready for review November 26, 2025 15:51
@agourlay agourlay requested review from coszio and timvisee November 26, 2025 15:52
coderabbitai[bot]

This comment was marked as resolved.

@timvisee
Copy link
Member

timvisee commented Nov 26, 2025

Yeah, it makes sense to me that we see it on this particular storage/flusher.

See, I believe this problem only appears when we try to reopen the file. In almost all storage types we don't and keep a file handle. When we remove the file, the file stays accessible through then handle as long as we keep it and we can still write to it. The kernel silently cleans it up once all handles are closed.

This storage does indeed try to reopen the file when the number of flags grows, breaking the pattern. We reopen through set_len, which reopens here.

So our other payload indices suffer from a similar problem. We just don't see an error report for it, as all late writes are silently voided. I believe we should audit all our storage types for similar problems.

Just thinking out loud here.


impl Drop for BufferedDynamicFlags {
fn drop(&mut self) {
// Wait for all background flush operations to finish
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Wait for all background flush operations to finish
// Wait for all background flush operations to finish, and cancel future flushes

Copy link
Contributor

@coszio coszio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +63 to +64
// Weak reference to detect when the storage has been deleted
let flags_arc = Arc::downgrade(&self.storage);
Copy link
Contributor

@coszio coszio Nov 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did similarly for #7624, this is an improvement so that we don't start flushing a dropped component 👍

Comment on lines +23 to +24
/// Lock to prevent concurrent flush and drop
pub is_alive_flush_lock: Arc<Mutex<bool>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some consideration, I am not sure this will actually prevent trying to interact with an inexistent file/directory. In PayloadIndex::drop_index we actually remove the directory BEFORE dropping the index.

This lock only makes the dropping wait for the end of the flush, but the files may already be deleted externally.

Please correct me if I missed something 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. That is a problem!

We must only delete files once we unload the payload index structure.

I am quite sure it's important to only delete after also releasing all files handles. Because Windows doesn't allow us to delete files which we still have open. Though then I'm also a bit puzzled on why we haven't seen issues about this. Maybe it needs a specific setup, like explicitly deleting a payload index which may not be very common.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought:

image

I do remember quite a few reports where Windows users complained about 'permission denied' errors. It may very well be caused by this.

Copy link
Member Author

@agourlay agourlay Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks I did push a fix to order the drop properly 34d031d

This lock only makes the dropping wait for the end of the flush, but the files may already be deleted externally.

My guess it that the original patch worked for me locally because the flush would happen after the drop anyway.

There is not a lot of time to squeeze the flusher between the FS remove and the implicit drop.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI started an audit here #7626

@qdrant qdrant deleted a comment from coderabbitai bot Nov 27, 2025
// Keep the guard till the end of the flush to prevent concurrent drop/flushes
let is_alive_flush_guard = is_alive_flush_lock.lock();

if !*is_alive_flush_guard {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to avoid this check by also using a weak reference to the lock. Then it would also simplify the mutex to just hold a unit struct ()

@agourlay agourlay merged commit 3babcfb into dev Nov 27, 2025
15 checks passed
@agourlay agourlay deleted the safe-flush-deleted-buffered-dynamic-flags branch November 27, 2025 13:18
@coszio coszio mentioned this pull request Nov 27, 2025
timvisee pushed a commit that referenced this pull request Dec 3, 2025
* Fix BufferedDynamicFlags flushing with concurrent drop/deletion

* make sure to drop before fs remove
@timvisee timvisee mentioned this pull request Dec 3, 2025
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants