Move SerdeWal from collection into shard crate#7146
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
lib/shard/src/wal.rs (1)
58-58: Doc fix: “does not guarantee durability”The sentence currently says “does guarantee durability”, which reverses the meaning.
Apply this diff:
- /// Write a record to the WAL but does guarantee durability. + /// Write a record to the WAL but does not guarantee durability.lib/collection/src/update_handler.rs (1)
831-842: Flush error handling only catches thread panics, not IO errors from flush
flush_async()returns astd::thread::JoinHandle<std::io::Result<()>>. The current code checks onlyjoin()’s panic path (Err(_)) and ignoresOk(Err(io_err)), potentially proceeding after a failed WAL flush.Apply this diff to handle both failure modes and also fix a small variable naming typo:
- trace!("Attempting flushing"); - let wal_flash_job = wal.lock().await.flush_async(); - - if let Err(err) = wal_flash_job.join() { - error!("Failed to flush wal: {err:?}"); - segments - .write() - .report_optimizer_error(WalError::WriteWalError(format!( - "WAL flush error: {err:?}" - ))); - continue; - } + trace!("Attempting flushing"); + let wal_flush_job = wal.lock().await.flush_async(); + + match wal_flush_job.join() { + Ok(Ok(())) => {} // all good + Ok(Err(err)) => { + error!("Failed to flush wal: {err:?}"); + segments + .write() + .report_optimizer_error(WalError::WriteWalError(format!( + "WAL flush IO error: {err:?}" + ))); + continue; + } + Err(err) => { + error!("WAL flush thread panicked: {err:?}"); + segments + .write() + .report_optimizer_error(WalError::WriteWalError(format!( + "WAL flush thread panic: {err:?}" + ))); + continue; + } + }
🧹 Nitpick comments (5)
lib/shard/Cargo.toml (1)
25-25: Prefer workspace-managed dependency for consistencyMost deps use
{ workspace = true }. Consider aligningrmp-serdeto the workspace to avoid drift and simplify upgrades.Apply this diff if
rmp-serdeis already defined in the workspace:-rmp-serde = "~1.3" +rmp-serde = { workspace = true }lib/shard/src/wal.rs (3)
60-65: Avoid panic on serialization failure in write()
serde_cbor::to_vec(...).unwrap()will panic on malformed/unsupported data. Prefer returningWalError::WriteWalErrorinstead of panicking.Apply this diff:
- // ToDo: Replace back to faster rmp, once this https://github.com/serde-rs/serde/issues/2055 solved - let binary_entity = serde_cbor::to_vec(&entity).unwrap(); + // ToDo: Replace back to faster rmp, once this https://github.com/serde-rs/serde/issues/2055 solved + let binary_entity = serde_cbor::to_vec(&entity) + .map_err(|err| WalError::WriteWalError(format!("failed to serialize record: {err}")))?; self.wal .append(&binary_entity) .map_err(|err| WalError::WriteWalError(format!("{err:?}")))
85-93: Deserialization failures currently hard-panicBoth
entry(...).expect(...)andfrom_slice(...).expect(...)will crash on IO or format errors (e.g., partial writes, corruption, or version skew not covered by the fallback). Consider mapping toWalErrorand letting callers decide how to handle bad records.If you want a minimally invasive tweak without changing external behavior, at least switch the
expectmessages to include the index and path for more actionable crash reports. I can provide a full error-propagating diff if desired.
166-172: Error variant mismatch when persisting first-index fileA failure to write
first-indexis classified asTruncateWalError. That’s misleading; it’s an IO/write error.Apply this diff:
- .map_err(|err| { - WalError::TruncateWalError(format!("failed to write first-index file: {err:?}")) - })?; + .map_err(|err| { + WalError::WriteWalError(format!("failed to write first-index file: {err:?}")) + })?;lib/collection/src/update_handler.rs (1)
831-833: Nit: rename wal_flash_job -> wal_flush_jobTypo in variable name; rename improves readability and greppability.
(Handled by the diff above.)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (9)
lib/collection/src/lib.rs(0 hunks)lib/collection/src/operations/types.rs(1 hunks)lib/collection/src/shards/local_shard/mod.rs(1 hunks)lib/collection/src/shards/local_shard/shard_ops.rs(1 hunks)lib/collection/src/update_handler.rs(1 hunks)lib/collection/src/wal_delta.rs(5 hunks)lib/shard/Cargo.toml(1 hunks)lib/shard/src/lib.rs(1 hunks)lib/shard/src/wal.rs(5 hunks)
💤 Files with no reviewable changes (1)
- lib/collection/src/lib.rs
🧰 Additional context used
🧬 Code graph analysis (1)
lib/shard/src/wal.rs (2)
lib/collection/src/shards/local_shard/mod.rs (2)
new(168-250)new(1229-1234)lib/collection/src/wal_delta.rs (1)
new(33-43)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: test-consistency
- GitHub Check: e2e-tests
- GitHub Check: lint
- GitHub Check: storage-compat-test
- GitHub Check: test-consensus-compose
- GitHub Check: integration-tests-consensus
🔇 Additional comments (12)
lib/shard/Cargo.toml (1)
28-28: LGTM: required deps added for moved WAL module
serde_cbor,thiserror, andwaladditions make sense for the newshard::walsurface.Also applies to: 31-31, 34-34
lib/shard/src/wal.rs (2)
136-159: Visibility change of ack() is appropriate for cross-crate usageMaking
ackpublic is necessary now thatcollectioncalls it throughshard::wal. No concerns.
24-24: Underscored PhantomData field rename is a nice cleanupRenaming to
_recordremoves the unused-field lint without altering behavior.Also applies to: 54-55
lib/collection/src/update_handler.rs (1)
18-18: Import hop to shard::wal is correct
WalErrornow lives undershard::wal; this update matches the crate move.lib/collection/src/operations/types.rs (1)
41-41: LGTM: WalError path updatedImporting
WalErrorfromshard::walaligns with the module move. The existingFrom<WalError> for CollectionErrorcontinues to work unchanged.lib/shard/src/lib.rs (1)
6-6: Good: expose shard::wal as a public moduleThis is the right place for WAL now that it conceptually belongs to shards.
lib/collection/src/shards/local_shard/shard_ops.rs (1)
75-85: Updated match arm for ClockRejected is correct after the move.Catching
Err(shard::wal::WalError::ClockRejected)preserves existing behavior and surfacesUpdateStatus::ClockRejectedas before.Given the
Err(err) => return Err(err.into())path now receivesshard::wal::WalError, please confirm aFrom<WalError> for CollectionErrorexists (otherwise this won’t compile). You can reuse the script from the other comment (step 3).lib/collection/src/wal_delta.rs (5)
4-4: Import path updated to shard::wal::SerdeWal — aligns with crate reorg.This matches the new ownership of the WAL implementation in the shard crate.
63-64: Returning WalError::ClockRejected at the boundary is correct.The early rejection on clock drift continues to be enforced before writing to WAL. No behavioral change, just the new path.
132-144: Test helper append_from signature migrated to shard::wal::Result — OK.The test-only API tracks the new result type; downstream tests already unwrap on it.
272-272: Tests import SerdeWal from shard::wal — consistent with the move.All test fixtures here now use the new path; looks good.
53-70: Verification Complete: lock_and_write call-sites updatedAll usages of the revised
lock_and_writesignature have been adjusted to the newshard::wal::Resultreturn type:
- lib/collection/src/shards/local_shard/shard_ops.rs – the main call in
LocalShard::updatenow matches on
Err(shard::wal::WalError::ClockRejected)as intended.- lib/collection/src/wal_delta.rs – internal operations and test scaffolding invoke
.await.unwrap()on theResult, aligning with the updated API.No remaining call-sites lack the proper handling or unwrapping of the new
Resulttype. The implementation and tests are consistent with the API change.
24272f6 to
e2dc630
Compare
Another one for Qdrant on Edge. Only moving files and minimal cleanup, no code changes.
All Submissions:
devbranch. Did you create your branch fromdev?New Feature Submissions:
cargo +nightly fmt --allcommand prior to submission?cargo clippy --all --all-featurescommand?Changes to Core Features: