Mutable ID tracker: implement byte storage format for mappings#6166
Conversation
internal_to_external: &mut Vec<PointIdType>,
external_to_internal_num: &mut BTreeMap<u64, PointOffsetType>,
external_to_internal_uuid: &mut BTreeMap<Uuid, PointOffsetType>,
/// Store new mapping changes, appending them to the given file
The meat of all the store/load logic begins here.
It's all separated into functions like:
- store (append) mappings to file
- write mappings to writer
- write single mapping to writer
- load mappings from file
- read mappings from reader
- read single mapping from reader
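As a rough sketch of that layering (the `Entry` type and function names here are illustrative stand-ins, not the actual Qdrant API, which works on `PointIdType`/`PointOffsetType`), the store path delegates file → writer → single entry, and the load path mirrors it in reverse:

```rust
use std::io::{self, Read, Write};

// Hypothetical stand-in for a point mapping change.
#[derive(Debug, PartialEq)]
struct Entry {
    internal: u32,
    external: u64,
}

// Write all mappings to a writer; a file-level wrapper would open the
// file in append mode and pass in a buffered writer.
fn write_mappings<W: Write>(writer: &mut W, entries: &[Entry]) -> io::Result<()> {
    for entry in entries {
        write_mapping(writer, entry)?;
    }
    Ok(())
}

// Write a single mapping to the writer.
fn write_mapping<W: Write>(writer: &mut W, entry: &Entry) -> io::Result<()> {
    writer.write_all(&entry.internal.to_le_bytes())?;
    writer.write_all(&entry.external.to_le_bytes())?;
    Ok(())
}

// Read a single mapping from the reader.
fn read_mapping<R: Read>(reader: &mut R) -> io::Result<Entry> {
    let mut buf4 = [0u8; 4];
    reader.read_exact(&mut buf4)?;
    let mut buf8 = [0u8; 8];
    reader.read_exact(&mut buf8)?;
    Ok(Entry {
        internal: u32::from_le_bytes(buf4),
        external: u64::from_le_bytes(buf8),
    })
}

fn main() -> io::Result<()> {
    let entries = vec![
        Entry { internal: 1, external: 100 },
        Entry { internal: 2, external: 200 },
    ];
    let mut buf = Vec::new();
    write_mappings(&mut buf, &entries)?;
    let mut cursor = io::Cursor::new(buf);
    assert_eq!(read_mapping(&mut cursor)?, entries[0]);
    assert_eq!(read_mapping(&mut cursor)?, entries[1]);
    println!("roundtrip ok");
    Ok(())
}
```

Splitting the layers this way lets each level be unit-tested in isolation against an in-memory `Cursor` instead of a real file.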
Ok(())
}

fn load_versions(
Storage for versions starts here and is still using the same format. This will be taken care of in the next PR.
Force-pushed from a5f067f to 92bf3e7.
This comment was marked as resolved.
ffuugoo left a comment:
LGTM, though, why not use serde_cbor for serialization instead of a hand-rolled format? It would be a bit less efficient, but easier to maintain, and we already use CBOR in a number of places. 🤔
I didn't look at CBOR specifically, but did look at bincode. Looking at CBOR now, I can list a few arguments for why I'd still prefer the current approach:
std::iter::from_fn(move || match read_entry(&mut reader) {
    Ok(entry) => Some(Ok(entry)),
    // Done reading if end of file is reached
    Err(err) if err.kind() == io::ErrorKind::UnexpectedEof => None,
What would happen if we have a corrupted file where we ignored the last entry error, and then continue to write new records into it?
Thanks! That would indeed be a problem. I had it on my task list, but forgot it in this PR.
I'll try to add some logic to truncate it.
I'll handle this in a separate PR, as it isn't trivial to do and I don't want to make this PR more complicated. I've added it in the tracking issue.
Not exactly related to this PR, but those file names are already used in the immutable ID tracker. Let's change them to other unique names to avoid potential confusion.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
lib/segment/src/id_tracker/mutable_id_tracker.rs (1)
21-22: 🛠️ Refactor suggestion: Consider renaming files to avoid confusion with immutable ID tracker

As pointed out by generall in the PR comments, the file names id_tracker.mappings and id_tracker.versions are already in use by the immutable ID tracker. This could lead to confusion and potential issues. Consider using more specific file names like mutable_id_tracker.mappings and mutable_id_tracker.versions to clearly distinguish between the two implementations.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
lib/segment/src/id_tracker/mutable_id_tracker.rs (11 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (13)
- GitHub Check: test-snapshot-operations-s3-minio
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: test-low-resources
- GitHub Check: test-consistency
- GitHub Check: test-consensus-compose
- GitHub Check: test (macos-latest)
- GitHub Check: test-consensus
- GitHub Check: test (windows-latest)
- GitHub Check: Basic TLS/HTTPS tests
- GitHub Check: test (ubuntu-latest)
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
🔇 Additional comments (9)
lib/segment/src/id_tracker/mutable_id_tracker.rs (9)
32-41: Well-designed implementation of change type classification

The addition of the change_type method to the MappingChange enum provides a clean way to classify different mapping changes, improving code organization.
43-62: Good enum design with byte representation

The MappingChangeType enum with #[repr(u8)] is well-implemented. The from_byte method provides a robust way to convert from raw bytes to the enum type with proper error handling for invalid values.
113-127: Improved error handling and code organization

The refactored open method now uses dedicated loading functions, making the code more readable and maintainable while preserving proper error handling.
330-353: Well-structured persistence implementation with proper fsync

The implementation correctly:
- Opens the file in append mode
- Writes changes using a buffered writer
- Explicitly calls sync_all() to ensure durability

This ensures data integrity even in case of system crashes.
355-375: Clear documentation and implementation of write operations

The function is well-documented, explaining the file format, and the implementation properly handles the serialization of mapping changes.
384-405: Elegant implementation of error handling in iterator

The read_mappings_iter function uses take_while_inclusive to stop iteration after the first error, which is a clean approach to handling errors in an iterator-based API.
497-532: Comprehensive entry deserialization with robust error handling

The implementation correctly handles different types of mapping changes and has proper error handling for corrupted data.
534-575: Well-documented serialization format

The documentation clearly explains the serialization format with a detailed comment showing the byte layout, which is excellent for future maintainers.
906-978: Excellent test coverage for serialization edge cases

The tests comprehensively cover various scenarios including:
- Normal serialization/deserialization
- Empty input
- Invalid type bytes
- Malformed data
- Incomplete entries
- Multiple entries
This ensures the robustness of the storage format implementation.
std::iter::from_fn(move || match read_entry(&mut reader) {
    Ok(entry) => Some(Ok(entry)),
    // Done reading if end of file is reached
    Err(err) if err.kind() == io::ErrorKind::UnexpectedEof => None,
💡 Verification agent
❓ Verification inconclusive
Consider implementing file truncation for corrupted files
Per the comment from generall and your response, there's a need to handle corrupted files when writing new records. If a file is corrupted and the last entry was ignored, writing new records could lead to further corruption.
I don't see the truncation logic implemented yet - this should be addressed as you mentioned in your response.
Could you implement a mechanism to truncate corrupted files before appending new records? This would involve checking for file integrity before writing and truncating to the last valid entry if corruption is detected.
Below is a shell script to verify whether any file truncation logic is already present in the target file:
Once you've reviewed the results from these commands: if nothing relevant is found, that confirms the truncation mechanism for handling corrupted files has not yet been implemented. In that case, please consider updating your code before proceeding to append new records.
Implement file truncation for corrupted files in mutable_id_tracker
- The current logic only ignores errors (io::ErrorKind::UnexpectedEof) by returning None, without attempting to repair or truncate the file.
- If a file is corrupted and the last entry is ignored, appending new records risks further corrupting file data.
- Action Required: Introduce file integrity checks prior to writing new records and, upon detection of corruption, truncate the file at the last valid record.
🏁 Script executed:

#!/bin/bash
# Check for any file truncation markers, e.g. usage of "truncate" or "set_len",
# in the mutable_id_tracker.rs file.
rg -nH "truncate" lib/segment/src/id_tracker/mutable_id_tracker.rs
rg -nH "set_len" lib/segment/src/id_tracker/mutable_id_tracker.rs

Length of output: 132
Action Required: Implement File Truncation for Corrupted Files
After verifying that there is no evidence of file truncation logic (e.g., the absence of keywords like truncate or set_len in lib/segment/src/id_tracker/mutable_id_tracker.rs), it appears that the mechanism to repair corrupted files has not yet been implemented. When encountering a corruption (such as an UnexpectedEof error), the code should check the file's integrity and truncate it to the last valid entry before appending new records. This is necessary to prevent further data corruption.
- File: lib/segment/src/id_tracker/mutable_id_tracker.rs
- Location concerned: around line 399, where the error is handled.
- Suggested Improvement: Incorporate file integrity checking logic to detect corruption and truncate the file accordingly.
Please update the code to address this issue.
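One possible shape for the requested truncation, sketched with fixed-size entries for simplicity (the real entries are variable-size, since they carry either a u64 or a UUID external id, so an actual implementation would have to scan entry by entry; `ENTRY_SIZE`, the function name, and the layout are assumptions for illustration):

```rust
use std::fs::OpenOptions;
use std::io;

// Hypothetical fixed entry size; the real format is variable-length.
const ENTRY_SIZE: u64 = 8;

// Truncate the file to the last complete entry, dropping a partially
// written tail left behind by a crash mid-flush, so that subsequent
// appends start at a valid entry boundary.
fn truncate_to_last_valid(path: &str) -> io::Result<u64> {
    let file = OpenOptions::new().read(true).write(true).open(path)?;
    let len = file.metadata()?.len();
    // Keep only whole entries; drop a partially written trailing one
    let valid_len = len - (len % ENTRY_SIZE);
    if valid_len != len {
        file.set_len(valid_len)?;
    }
    Ok(valid_len)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("truncate_demo.bin");
    let path_s = path.to_str().unwrap();
    // Two complete entries plus 3 stray bytes simulating a torn write
    std::fs::write(path_s, vec![0u8; 2 * ENTRY_SIZE as usize + 3])?;
    let valid = truncate_to_last_valid(path_s)?;
    let len_after = std::fs::metadata(path_s)?.len();
    std::fs::remove_file(path_s)?;
    assert_eq!(valid, 16);
    assert_eq!(len_after, 16);
    println!("truncated to {valid} bytes");
    Ok(())
}
```

Running this once at open time, before any append, would make the "ignore the incomplete last entry" read behavior safe to combine with further writes.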
Merging now. The review comments are handled in a follow-up PR.
* Extract common points mappings check into function
* Fully rework load/store for point mapping changes, use byte format
* Add extra serialize/deserialize test for new format
* Fuse iterator
* Add another test case
* Take mutable reader when reading point mapping entry
* Flush versions in the same way we flush mappings now
* Don't explicitly fuse reader, it's not necessary
Tracked in: #6157
Depends on: #6158
Implements a custom byte-based storage format for persisting mutable ID tracker mappings. It also improves how we handle reading/writing/loading/storing these mappings in general.
Note that this does not really touch point version storage yet. That'll be done in the following PR.
The following comment in code describes the storage format:
The implementation is based on what we have in the immutable ID tracker, though it has been extended a bit to also support deletions.
It is resilient against partially written files due to a crash during flush. The reader ignores the very last entry if it's incomplete, so that we can still recover with the WAL. If any other entries are corrupted, an error is returned.
This includes tests to ensure serialization and deserialization are sound, even in case of corrupted data.
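The recovery behavior described above can be sketched like this. The 4-byte fixed entries and the `read_entries` helper are simplifications, not the real format: a clean EOF ends iteration, while an EOF in the middle of an entry means an incomplete trailing entry that is silently dropped so the WAL can replay it:

```rust
use std::io::{self, Read};

// Simplified reader: entries are 4-byte little-endian u32 values.
// A partial trailing entry (torn write) is ignored rather than erroring.
fn read_entries(mut reader: impl Read) -> io::Result<Vec<u32>> {
    let mut entries = Vec::new();
    loop {
        let mut buf = [0u8; 4];
        // read_exact reports UnexpectedEof both at a clean entry
        // boundary and mid-entry; either way we stop and drop the tail
        match reader.read_exact(&mut buf) {
            Ok(()) => entries.push(u32::from_le_bytes(buf)),
            Err(err) if err.kind() == io::ErrorKind::UnexpectedEof => break,
            Err(err) => return Err(err),
        }
    }
    Ok(entries)
}

fn main() -> io::Result<()> {
    // Two complete entries followed by two stray bytes from a torn write
    let bytes = [1, 0, 0, 0, 2, 0, 0, 0, 9, 9];
    let entries = read_entries(&bytes[..])?;
    assert_eq!(entries, vec![1, 2]);
    println!("recovered {} entries", entries.len());
    Ok(())
}
```

Note this sketch cannot tell a torn last entry apart from corruption in the middle of the file; the actual implementation returns an error for the latter, which requires tracking whether the EOF occurred at an entry boundary.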
Tasks

All Submissions:
- Did you create your branch from dev?

New Feature Submissions:
- Did you run the cargo +nightly fmt --all command prior to submission?
- Did you run the cargo clippy --all --all-features command?

Changes to Core Features: