Fix BufferedDynamicFlags flushing with concurrent drop/deletion by agourlay · Pull Request #7621 · qdrant/qdrant

agourlay · 2025-11-26T15:45:59Z

Fixes flushing error happening when deleting a payload index concurrently with flushing.

2025-11-26T10:31:55.833087Z DEBUG collection::update_handler: Optimizer 'indexing' running on segments: [18, 19]
2025-11-26T10:31:55.834739Z DEBUG collection::collection_manager::optimizers::segment_optimizer: Available space: 801.99 GiB, needed for optimization: 110.93 MiB
2025-11-26T10:31:56.052782Z DEBUG segment::index::hnsw_index::hnsw: building HNSW for 10693 vectors with 8 CPUs
2025-11-26T10:32:00.428258Z  INFO actix_web::middleware::logger: 127.0.0.1 "GET /collections/benchmark HTTP/1.1" 200 462 "http://localhost:6333/dashboard" "Mozilla/5.0 (X11; Linux x86_64; rv:145.0) Gecko/20100101 Firefox/145.0" 0.009905
2025-11-26T10:32:04.295185Z DEBUG storage::content_manager::toc::collection_meta_ops: Drop payload index DropPayloadIndex { collection_name: "benchmark", field_name: JsonPath { first_key: "a", rest: [] } }
2025-11-26T10:32:06.021544Z  INFO actix_web::middleware::logger: 127.0.0.1 "DELETE /collections/benchmark/index/a?wait=true HTTP/1.1" 200 78 "http://localhost:6333/dashboard" "Mozilla/5.0 (X11; Linux x86_64; rv:145.0) Gecko/20100101 Firefox/145.0" 1.731689
2025-11-26T10:32:10.919258Z ERROR collection::update_handler: Failed to flush: Service runtime error: Failed to flush payload_index: Service runtime error: Failed to flush payload_index: Service runtime error: IO Error: failed to open file `./storage/collections/benchmark/0/segments/4fc17df6-0864-417e-a56b-fbe82bc7706f/payload_index/pabo24ma3euslum266ppk8us-a-null/has_values/flags_a.dat`: No such file or directory (os error 2)

This PR makes sure that drop and flushing are exclusive and uses a weak reference on storage to not trip over a deleted folder.

This was tested manually with bfb and adding/removing a keyword payload index.

timvisee · 2025-11-26T16:23:03Z

Yeah, it makes sense to me that we see it on this particular storage/flusher.

See, I believe this problem only appears when we try to reopen the file. In almost all storage types we don't and keep a file handle. When we remove the file, the file stays accessible through then handle as long as we keep it and we can still write to it. The kernel silently cleans it up once all handles are closed.

This storage does indeed try to reopen the file when the number of flags grows, breaking the pattern. We reopen through set_len, which reopens here.

So our other payload indices suffer from a similar problem. We just don't see an error report for it, as all late writes are silently voided. I believe we should audit all our storage types for similar problems.

Just thinking out loud here.

timvisee · 2025-11-26T16:23:52Z

lib/segment/src/common/flags/buffered_dynamic_flags.rs


+impl Drop for BufferedDynamicFlags {
+    fn drop(&mut self) {
+        // Wait for all background flush operations to finish


Suggested change

// Wait for all background flush operations to finish

// Wait for all background flush operations to finish, and cancel future flushes

coszio

See https://github.com/qdrant/qdrant/pull/7621/files#r2566368975

coszio · 2025-11-26T20:04:22Z

lib/segment/src/common/flags/buffered_dynamic_flags.rs

+        // Weak reference to detect when the storage has been deleted
+        let flags_arc = Arc::downgrade(&self.storage);


I did similarly for #7624, this is an improvement so that we don't start flushing a dropped component 👍

coszio · 2025-11-26T20:10:27Z

lib/segment/src/common/flags/buffered_dynamic_flags.rs

+    /// Lock to prevent concurrent flush and drop
+    pub is_alive_flush_lock: Arc<Mutex<bool>>,


After some consideration, I am not sure this will actually prevent trying to interact with an inexistent file/directory. In PayloadIndex::drop_index we actually remove the directory BEFORE dropping the index.

This lock only makes the dropping wait for the end of the flush, but the files may already be deleted externally.

Please correct me if I missed something 🤔

Good call. That is a problem!

We must only delete files once we unload the payload index structure.

I am quite sure it's important to only delete after also releasing all files handles. Because Windows doesn't allow us to delete files which we still have open. Though then I'm also a bit puzzled on why we haven't seen issues about this. Maybe it needs a specific setup, like explicitly deleting a payload index which may not be very common.

On second thought:

I do remember quite a few reports where Windows users complained about 'permission denied' errors. It may very well be caused by this.

Thanks I did push a fix to order the drop properly 34d031d

This lock only makes the dropping wait for the end of the flush, but the files may already be deleted externally.

My guess it that the original patch worked for me locally because the flush would happen after the drop anyway.

There is not a lot of time to squeeze the flusher between the FS remove and the implicit drop.

FYI started an audit here #7626

coszio · 2025-11-27T12:06:29Z

lib/segment/src/common/flags/buffered_dynamic_flags.rs

+            // Keep the guard till the end of the flush to prevent concurrent drop/flushes
+            let is_alive_flush_guard = is_alive_flush_lock.lock();
+
+            if !*is_alive_flush_guard {


We should be able to avoid this check by also using a weak reference to the lock. Then it would also simplify the mutex to just hold a unit struct ()

* Fix BufferedDynamicFlags flushing with concurrent drop/deletion * make sure to drop before fs remove

Fix BufferedDynamicFlags flushing with concurrent drop/deletion

80646eb

agourlay marked this pull request as ready for review November 26, 2025 15:51

agourlay requested review from coszio and timvisee November 26, 2025 15:52

This comment was marked as resolved.

Sign in to view

agourlay mentioned this pull request Nov 26, 2025

[WIP] Fix payload index Flusher for deleted storage #7617

Closed

2 tasks

qdrant deleted a comment from coderabbitai bot Nov 26, 2025

timvisee approved these changes Nov 26, 2025

View reviewed changes

coszio requested changes Nov 26, 2025

View reviewed changes

make sure to drop before fs remove

34d031d

qdrant deleted a comment from coderabbitai bot Nov 27, 2025

coszio approved these changes Nov 27, 2025

View reviewed changes

agourlay merged commit 3babcfb into dev Nov 27, 2025
15 checks passed

agourlay deleted the safe-flush-deleted-buffered-dynamic-flags branch November 27, 2025 13:18

coszio mentioned this pull request Nov 27, 2025

Fix wiping of gridstore #7627

Merged

agourlay mentioned this pull request Nov 27, 2025

Fix payload indexes drop/flush sequence #7626

Merged

timvisee pushed a commit that referenced this pull request Dec 3, 2025

Fix BufferedDynamicFlags flushing with concurrent drop/deletion (#7621)

9c4af00

* Fix BufferedDynamicFlags flushing with concurrent drop/deletion * make sure to drop before fs remove

timvisee mentioned this pull request Dec 3, 2025

Bump version to 1.16.2 #7680

Merged

1 task

claude bot mentioned this pull request Dec 4, 2025

chore(deps): update qdrant/qdrant docker tag to v1.16.2 cbcoutinho/nextcloud-mcp-server#372

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix BufferedDynamicFlags flushing with concurrent drop/deletion#7621

Fix BufferedDynamicFlags flushing with concurrent drop/deletion#7621
agourlay merged 2 commits intodevfrom
safe-flush-deleted-buffered-dynamic-flags

agourlay commented Nov 26, 2025 •

edited

Loading

Uh oh!

This comment was marked as resolved.

Uh oh!

timvisee commented Nov 26, 2025 •

edited

Loading

Uh oh!

timvisee Nov 26, 2025

Uh oh!

coszio left a comment

Uh oh!

coszio Nov 26, 2025 •

edited

Loading

Uh oh!

coszio Nov 26, 2025

Uh oh!

timvisee Nov 27, 2025

Uh oh!

timvisee Nov 27, 2025

Uh oh!

agourlay Nov 27, 2025 •

edited

Loading

Uh oh!

agourlay Nov 27, 2025

Uh oh!

coszio Nov 27, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	// Wait for all background flush operations to finish
	// Wait for all background flush operations to finish, and cancel future flushes

		// Weak reference to detect when the storage has been deleted
		let flags_arc = Arc::downgrade(&self.storage);

		/// Lock to prevent concurrent flush and drop
		pub is_alive_flush_lock: Arc<Mutex<bool>>,

Conversation

agourlay commented Nov 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment was marked as resolved.

Uh oh!

timvisee commented Nov 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

timvisee Nov 26, 2025

Choose a reason for hiding this comment

Uh oh!

coszio left a comment

Choose a reason for hiding this comment

Uh oh!

coszio Nov 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coszio Nov 26, 2025

Choose a reason for hiding this comment

Uh oh!

timvisee Nov 27, 2025

Choose a reason for hiding this comment

Uh oh!

timvisee Nov 27, 2025

Choose a reason for hiding this comment

Uh oh!

agourlay Nov 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

agourlay Nov 27, 2025

Choose a reason for hiding this comment

Uh oh!

coszio Nov 27, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

agourlay commented Nov 26, 2025 •

edited

Loading

timvisee commented Nov 26, 2025 •

edited

Loading

coszio Nov 26, 2025 •

edited

Loading

agourlay Nov 27, 2025 •

edited

Loading