Skip to content

Avoid flushing invalid Gridstore#7624

Merged
coszio merged 1 commit intodevfrom
weak-gridstore-flusher
Nov 26, 2025
Merged

Avoid flushing invalid Gridstore#7624
coszio merged 1 commit intodevfrom
weak-gridstore-flusher

Conversation

@coszio
Copy link
Contributor

@coszio coszio commented Nov 26, 2025

When creating and deleting payload indices, while running optimizations, the following error would pop up.

2025-11-26T13:09:45.764981Z ERROR qdrant::startup: Panic occurred in file lib/gridstore/src/blob.rs at line 25: Failed to deserialize Vec<ecow::EcoString>: ErrorImpl { code: Message("invalid type: integer `0`, expected a sequence"), offset: 0 }

This is a corruption is caused, according to my testing, by a flushing sequence which would try to apply the changes from an earlier instance of gridstore, to another one, potentially overwriting it with old data.

Mental model would go like this.

  1. Gridstore_1 provides flusher with pending updates
  2. Gridstore_1 gets deleted (by DropPayloadIndex) before flushing
  3. Gridstore_2 gets created (by CreatePayloadIndex), and receives new updates
  4. Flusher tries to apply Gridstore_1's pending changes into Gridstore_2

Since directories are the same, the flusher will actually be able to apply the changes, but without considering latest changes.

This PR uses weak references to ensure that the instance that provides the pending changes is the one being flushed. Otherwise, since the files are the same, flushing operation will actually corrupt data.

With this PR, we can see something that resembles the hypothesis.

2025-11-26T16:31:54.401344Z  INFO actix_web::middleware::logger: 127.0.0.1 "DELETE /collections/benchmark/index/a?wait=true HTTP/1.1" 200 71 "http://localhost:6333/dashboard" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:145.0) Gecko/20100101 Firefox/145.0" 0.523232
2025-11-26T16:31:55.111493Z DEBUG storage::content_manager::toc::collection_meta_ops: Create payload index CreatePayloadIndex { collection_name: "benchmark", field_name: JsonPath { first_key: "a", rest: [] }, field_schema: FieldType(Keyword) }
2025-11-26T16:31:56.579150Z  INFO actix_web::middleware::logger: 127.0.0.1 "PUT /collections/benchmark/index?wait=true HTTP/1.1" 200 75 "http://localhost:6333/dashboard" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:145.0) Gecko/20100101 Firefox/145.0" 1.467731
2025-11-26T16:31:57.437640Z DEBUG storage::content_manager::toc::collection_meta_ops: Drop payload index DropPayloadIndex { collection_name: "benchmark", field_name: JsonPath { first_key: "a", rest: [] } }
2025-11-26T16:31:58.096076Z DEBUG gridstore::gridstore: Aborted flushing on a dropped Gridstore instance

@qdrant qdrant deleted a comment from coderabbitai bot Nov 26, 2025
@timvisee
Copy link
Member

Yes, this can happen because we may reopen the tracker mmap file if it grew during flush here:

// Grow tracker file if it isn't big enough
if self.mmap.len() < end_offset {
self.mmap.flush().unwrap();
let new_size = end_offset.next_power_of_two();
create_and_ensure_length(&self.path, new_size).unwrap();
self.mmap = open_write_mmap(&self.path, AdviceSetting::from(TRACKER_MEM_ADVICE), false)
.unwrap();
}

In that case the 'rogue flush' would open and corrupt the tracker file for the new Gridstore.

@coszio coszio merged commit e0aabd6 into dev Nov 26, 2025
15 checks passed
@coszio coszio deleted the weak-gridstore-flusher branch November 26, 2025 20:01
@timvisee timvisee mentioned this pull request Dec 3, 2025
1 task
@timvisee timvisee mentioned this pull request Dec 5, 2025
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants