
id tracker persisted mappings offset #7877

Merged
timvisee merged 10 commits into dev from id-tracker-persisted-mappings-offset
Jan 9, 2026

Conversation

IvanPleshkov (Contributor) commented Jan 6, 2026

This PR fixes the ID tracker flushing bug and continues PR #6173.

Bug description

In production, there is an error during ID tracker loading:

2026-01-05T18:21:40.350395Z ERROR qdrant::startup: Panic occurred in file /qdrant/lib/collection/src/shards/replica_set/mod.rs at line 315: Failed to load local shard "./storage/collections/embeddings/0": Service internal error: Failed to load ID tracker mappings: Service runtime error: IO Error: Corrupted ID tracker mapping storage, got malformed mapping change byte 0x6D

Possible repro steps

There are no confirmed repro steps yet, but here is a possible scenario of how it could happen:

  1. Flushing is called.
  2. The ID tracker tries to flush its pending mapping data, but the disk runs out of space. The ID tracker appends the pending changes to the file, and the changes are written only partially(!!!).
  3. Flush returns an error, and the WAL is not cleared.
  4. Something happens that frees up disk space (the optimizer finishes its job, or another process cleans something up).
  5. The flush is retried. The WAL still holds non-empty pending changes; because there is now sufficient disk space, the pending changes are written successfully after the partially written data.
  6. Tada! We got a malformed ID tracker mapping file, which produces the error above after segment reloading.

In this scenario, the file shrinking from #6173 does not help, because the data is malformed but we never reach EOF.

Solution

As a solution, this PR tracks the number of successfully persisted bytes. In code, it's called persisted_mappings_size.

This persisted_mappings_size is initialized to the file size. It is increased after a successful flush, and each flush starts appending at the persisted_mappings_size position (using Seek).

How it affects the described bug scenario

In the possible repro scenario above, we don't update persisted_mappings_size if the flush failed. That means that at step 5 we won't append to the unfinished file; instead, we overwrite the partially written data.

Open questions

Do we need to shrink the file if persisted_mappings_size is not equal to the file size while flushing?

@IvanPleshkov IvanPleshkov force-pushed the id-tracker-persisted-mappings-offset branch from 3f1adab to 37ba63e Compare January 6, 2026 22:30
@IvanPleshkov IvanPleshkov requested a review from agourlay January 7, 2026 10:33
@IvanPleshkov IvanPleshkov marked this pull request as ready for review January 7, 2026 10:33

@IvanPleshkov IvanPleshkov marked this pull request as draft January 7, 2026 10:41
@IvanPleshkov IvanPleshkov marked this pull request as ready for review January 7, 2026 10:54
@IvanPleshkov IvanPleshkov marked this pull request as draft January 7, 2026 10:55

timvisee (Member) commented Jan 7, 2026

I wonder if a better alternative would be to try and grow the file to the exact required size on flush. If we don't have enough space, we'll abort without touching the file.

We currently grow the file automatically by writing to it, which means it's more likely to abort outside of entry boundaries.

One edge case with my suggestion is that we might have a bunch of zeroes at the end of the file. It is possible if we crash halfway through the flush, after we've already grown the file. But we can easily detect this, as 0x00 is not a valid entry (did this on purpose while building this tracker, yay!).

Speaking of which: a crash during flush is likely another scenario to reproduce the problem you described - writing a partial entry.

Thoughts?

@qdrant qdrant deleted a comment from coderabbitai bot Jan 7, 2026
@timvisee timvisee self-requested a review January 7, 2026 13:17
IvanPleshkov (Contributor, Author) commented Jan 7, 2026

> I wonder if a better alternative would be to try and grow the file to the exact required size on flush. If we don't have enough space, we'll abort without touching the file. […]

As we discussed with @agourlay, in general we cannot verify that the data in the file is correct. We can validate an entry, but if the malformed data is in the last bytes after the last entry, we cannot detect it: the external id may be any possible value. In the out-of-space case this is fine, but I'm not sure this proposal works in the crash case.

IvanPleshkov (Contributor, Author) commented Jan 7, 2026

> I wonder if a better alternative would be to try and grow the file to the exact required size on flush. If we don't have enough space, we'll abort without touching the file. […]

There is a draft PR (#7872) that collects all pending mappings into a single buffer and writes it in one shot. It contains the size estimation of the pending changes and is a partial implementation of your proposal; we can extend it to implement your idea.

@IvanPleshkov IvanPleshkov marked this pull request as ready for review January 8, 2026 10:44
```rust
// `persisted_mappings_size` may be less than actual file size.
let file_start_appending = persisted_mappings_size.load(std::sync::atomic::Ordering::Relaxed);
if file_start_appending <= file_len {
    file.seek(io::SeekFrom::Start(file_start_appending))
```
Member:

Idea: how about we truncate the file here instead (or do both). Then we also 'repair' the file for the next load if we don't finish this flush.

If we hit this case, we expect to always write the exact same sequence of bytes. And so it should not corrupt the file even more. But truncating may be a bit safer.

Member:

Implemented in 3175206

timvisee (Member) left a comment:

I like the idea, and the implementation looks sound.

I did extend your implementation a bit to clean up bad state more aggressively. It isn't strictly necessary, but in my opinion it's a nice-to-have.

Specifically, it now:

  1. On a failed flush, tries to truncate the file to what we had before the flush. This actively removes partial entries while we still know how to correctly recover the state on disk.
  2. If the file is too large when persisting new mappings, it doesn't just seek to the append point but also truncates everything after it.

@qdrant qdrant deleted a comment from coderabbitai bot Jan 8, 2026

@qdrant qdrant deleted a comment from coderabbitai bot Jan 8, 2026

```rust
// If persisting mappings failed, try to truncate mappings file to what we had before
// in a best effort to get rid of partially persisted mappings. We can safely ignore
// truncate errors because load should properly handle partial entries as well.
if let Err(err) = stored {
```
Member:

It seems store_mapping_changes can throw a bunch of errors; I am not sure about matching on any error returned here.

Maybe we could at least check whether the file length changed before trying to set it back?

Member:

If the file wasn't touched, the following 'truncation' will be a no-op.

Do you think it's still better to isolate the error paths?

Member:

On second thought, I think it is fine; I just wanted to have your opinion.
Isolating the error paths would be super verbose.

Member:

Yeah.

There is extra work to re-open the file to allow truncation. But it only happens in the error path.


```rust
// Get new persisted size as a position after writing all changes.
// Stream position is used here to handle cases where the pending changes
// are shorter than the non-persisted part.
let new_persisted_size = file.stream_position().map_err(|err| {
```
Member:

Did you test this stream_position thingy?

Is it supposed to give you the current cursor position after all the writes?

Member:

Yes.

https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=aca587b249281c64c5c478f42b890273

In this case it gives us the size of the file in bytes, or the position we write the first new byte to.

@timvisee timvisee merged commit ec4b98b into dev Jan 9, 2026
15 checks passed
@timvisee timvisee deleted the id-tracker-persisted-mappings-offset branch January 9, 2026 10:38
IvanPleshkov (Contributor, Author) commented Jan 9, 2026

I tested this PR locally using tmpfs (a filesystem in RAM) with a small disk size (8 MB). The test code with the repro steps is below. It shows the same error, Corrupted ID tracker mapping storage, got malformed mapping change byte 0x00. This PR fixes the problem.

```rust
#[test]
fn mutable_id_tracker_overflow() {
    // First, create a file of 1MB size to simulate later disk space cleaning.
    let tmp_file_path = Path::new("/mnt/tiny/tmp_file");
    if fs::exists(tmp_file_path).unwrap() {
        fs::remove_file(tmp_file_path).unwrap();
    }
    let mut tmp_file = fs::File::create(tmp_file_path).unwrap();
    tmp_file.write_all(&vec![0; 1024 * 1024]).unwrap();
    tmp_file.flush().unwrap();
    tmp_file.sync_all().unwrap();
    drop(tmp_file);

    // Then, create the ID tracker and fill it while possible.
    let path = Path::new("/mnt/tiny/id_tracker_overflow_test");
    if fs::exists(path).unwrap() {
        fs::remove_dir_all(path).unwrap();
    }
    fs::create_dir_all(path).unwrap();
    let mut id_tracker = make_mutable_tracker(path);
    for i in 0.. {
        let internal_id = id_tracker.total_point_count() as PointOffsetType;
        let point_id = PointIdType::NumId(internal_id as u64);
        if let Err(err) = id_tracker.set_link(point_id, internal_id) {
            println!("Stopped at iteration {i} due to error: {err}");
            break;
        }

        // Flush every 1000 points.
        if i % 1000 == 0 {
            if let Err(err) = id_tracker.mapping_flusher()() {
                println!("Flushing stopped at iteration {i} due to error: {err}");
                break;
            }
        }
    }

    // Check that free space is not available.
    let fs_stats = fs2::available_space(path).unwrap();
    println!("Available space before cleanup: {fs_stats} bytes");

    // Then, free up space by removing the temporary file.
    fs::remove_file(tmp_file_path).unwrap();

    // Flush the file deletion to ensure the OS registers the freed space.
    #[cfg(unix)]
    {
        use std::process::Command;
        // Force filesystem sync.
        let _ = Command::new("sync").output();
    }
    std::thread::sleep(std::time::Duration::from_secs(3)); // wait for the system to register the freed space

    // Check that free space is available.
    let fs_stats = fs2::available_space(path).unwrap();
    println!("Available space after cleanup: {fs_stats} bytes");

    // Add some more points to verify that the ID tracker can continue working.
    for i in 0..10 {
        let point_id = PointIdType::Uuid(Uuid::new_v4());
        let internal_id = id_tracker.total_point_count() as PointOffsetType;
        if let Err(err) = id_tracker.set_link(point_id, internal_id) {
            println!("Stopped at iteration {i} due to error: {err}");
            break;
        }
    }

    // Flush has to work at the end.
    id_tracker.mapping_flusher()().unwrap();

    // Reload the ID tracker to verify data integrity.
    drop(id_tracker);

    // The bug appears here with some probability.
    MutableIdTracker::open(path).unwrap();
}
```

timvisee (Member) commented Jan 9, 2026

> Corrupted ID tracker mapping storage, got malformed mapping change byte 0x00

Actually, this is critical.

This shows that we do get 0's in places where we don't expect it. It means the file was grown without actually writing bytes to it.

Turns out that message was from before the fix. No pending concerns then. 👍

generall pushed a commit that referenced this pull request Feb 9, 2026
* id tracker persisted mappings offset

* remove obsolete check

* review remarks

* review remarks

* When persisting mappings, truncate file that is larger than we expect

* If persisting mappings fails, truncate file to what we had before

This isn't necessary, but it is nice to clean up partial mappings.

* Fix typos

* Fix typos

* Just truncate the file, we don't have to seek anymore

* Rename length variable to mappings_expected_len

---------

Co-authored-by: timvisee <tim@visee.me>
@timvisee timvisee mentioned this pull request Feb 17, 2026