feat: replicate control requests through WAL via state machine#882
feat: replicate control requests through WAL via state machine#882mattisonchao merged 20 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR extends Oxia’s replication/WAL format to carry control-plane commands (feature enablement) and introduces a shared state-machine layer to apply both write and control log entries consistently across leader and follower paths, enabling negotiated activation of the DB checksum feature.
Changes:
- Add
ControlRequest{FeatureEnableRequest}support toLogEntryValueand implement application logic via a newcontroller/statemachinepackage. - Wire leaders/followers to apply WAL entries through the state machine; leaders can now propose control entries to enable features post-election.
- Rename negotiated feature
FEATURE_FINGERPRINTtoFEATURE_DB_CHECKSUMand expose checksum/feature state via controllers and tests.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
tests/control/control_request_test.go |
Adds an integration-style test for feature enablement and checksum consistency across replicas. |
oxiad/dataserver/server.go |
Exposes GetShardDirector() for test/introspection access to shard controllers. |
oxiad/dataserver/database/db.go |
Adds feature enable tracking and conditional checksum persistence/read APIs on the DB. |
oxiad/dataserver/controller/statemachine/state_machine.go |
New shared apply-layer for proposals and WAL log entries (write + control). |
oxiad/dataserver/controller/statemachine/state_machine_test.go |
Unit tests for applying proposals and WAL entries (write + feature enable). |
oxiad/dataserver/controller/statemachine/proposal.go |
Proposal abstraction for write and control entries and WAL serialization. |
oxiad/dataserver/controller/statemachine/proposal_test.go |
Unit tests for proposal creation and WAL encoding. |
oxiad/dataserver/controller/lead/leader_controller.go |
Refactors write path to propose(), replays WAL via state machine, and proposes feature-enable control entries post-election; exposes checksum. |
oxiad/dataserver/controller/lead/leader_controller_test.go |
Updates feature negotiation tests for the renamed DB checksum feature. |
oxiad/dataserver/controller/follow/follower_controller.go |
Applies committed WAL entries via state machine; exposes checksum and feature-enabled state. |
oxiad/coordinator/controller/shard_controller_test.go |
Updates coordinator tests to expect negotiated DB checksum feature. |
oxiad/coordinator/controller/shard_controller_election_test.go |
Updates feature negotiation unit tests for renamed feature. |
oxiad/common/feature/feature.go |
Updates supported feature list to FEATURE_DB_CHECKSUM. |
oxiad/common/feature/feature_test.go |
Removes prior supported-features tests (file deleted). |
oxiad/common/crc/crc.go |
Adds Checksum.IsZero() helper. |
common/proto/storage.proto |
Adds control_request to LogEntryValue and defines ControlRequest/FeatureEnableRequest. |
common/proto/storage.pb.go |
Regenerated Go bindings for storage.proto. |
common/proto/storage_vtproto.pb.go |
Regenerated vtproto bindings to support new messages/oneofs. |
common/proto/replication.proto |
Renames feature enum value to FEATURE_DB_CHECKSUM. |
common/proto/replication.pb.go |
Regenerated Go bindings for replication.proto. |
Comments suppressed due to low confidence (2)
oxiad/dataserver/controller/lead/leader_controller.go:849
- Timer started here is never finished on the early-return path (non-leader). This will skew latency metrics and may leak resources depending on the histogram implementation. Ensure timer.Done()/DoneCtx is called before every return.
func (lc *leaderController) propose(ctx context.Context, proposalSupplier func(offset int64) statemachine.Proposal, cb concurrent.Callback[statemachine.ApplyResponse]) {
timer := lc.writeLatencyHisto.Timer()
lc.Lock()
if err := checkStatusIsLeader(lc.status); err != nil {
lc.Unlock()
cb.OnCompleteError(err)
return
}
oxiad/dataserver/controller/lead/leader_controller.go:868
- If MarshalVT() fails, propose() returns without calling timer.Done/DoneCtx, leaving the latency timer open. Make sure to finish the timer on this error path as well.
value, err := entryValue.MarshalVT()
if err != nil {
lc.Unlock()
cb.OnCompleteError(err)
return
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 21 out of 21 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
oxiad/dataserver/controller/lead/leader_controller.go:869
propose()starts a latency timer, but on early-return paths (eg not leader at lines 846-850, or MarshalVT error at 865-869) the timer is never completed. This can skew latency metrics and potentially leak timer resources depending on the implementation. Suggest callingtimer.DoneCtx(ctx)(ortimer.Done()) before each early return.
func (lc *leaderController) propose(ctx context.Context, proposalSupplier func(offset int64) statemachine.Proposal, cb concurrent.Callback[statemachine.ApplyResponse]) {
timer := lc.writeLatencyHisto.Timer()
lc.Lock()
if err := checkStatusIsLeader(lc.status); err != nil {
lc.Unlock()
cb.OnCompleteError(err)
return
}
newOffset := lc.quorumAckTracker.NextOffset()
walLog := lc.wal
tracker := lc.quorumAckTracker
term := lc.term
proposal := proposalSupplier(newOffset)
lc.log.Debug("propose a proposal", slog.Any("proposal", proposal))
entryValue := proto.LogEntryValueFromVTPool()
defer entryValue.ReturnToVTPool()
proposal.ToLogEntry(entryValue)
value, err := entryValue.MarshalVT()
if err != nil {
lc.Unlock()
cb.OnCompleteError(err)
return
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
fa596b5 to
f681fc5
Compare
…anges Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
f681fc5 to
ed7a78c
Compare
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Each proposal type now knows how to apply itself via an Apply method, eliminating the type-switch in ApplyProposal and the applyWriteRequest wrapper. Callers invoke proposal.Apply() directly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ApplyResponse belongs with the Proposal interface that returns it. applyControlRequest was a 5-line helper used in only two places — inline it into ControlProposal.Apply and ApplyLogEntry. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…test The feature-enable proposal is fire-and-forget after BecomeLeader, so followers may not have applied it yet when the assertion runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ReadChecksum reads an atomic.Pointer, so a read lock is sufficient. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace vague "propose a proposal" with "Appending proposal to WAL" and log useful fields (offset, timestamp) instead of unexported struct. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Motivation
Enable the leader to replicate control commands (e.g., feature enablement) through the WAL alongside regular writes, so that features negotiated between replicas are applied consistently across all nodes via the replication log rather than in-memory state.
Modification
ControlRequestandFeatureEnableRequestprotobuf messages to theLogEntryValueoneof, allowing control commands to be serialized into WAL entries alongside write requests.statemachinepackage with aProposalinterface (WriteProposal,ControlProposal) where each proposal type knows how to serialize itself to a log entry and apply itself to the DB.ApplyLogEntryfunction for the follower/replay path that deserializes and applies WAL entries.writetoproposeto reflect the generalized proposal flow, splitBecomeLeaderto propose feature enablement outside the lock, and delegate toproposal.Apply()after quorum ack.statemachine.ApplyLogEntry, hold read lock during apply, and exposeIsFeatureEnabled/Checksummethods.EnableFeature(FEATURE_DB_CHECKSUM)— checksum is only computed when the feature is enabled or a non-zero checksum already exists.Checksum,ApplyLogEntryon follower,IsFeatureEnabledon leader).FEATURE_FINGERPRINTtoFEATURE_DB_CHECKSUM.