
feat: replicate control requests through WAL via state machine#882

Merged
mattisonchao merged 20 commits into main from feat.control.request on Feb 22, 2026

Conversation

@mattisonchao mattisonchao commented Feb 9, 2026

Motivation

Enable the leader to replicate control commands (e.g., feature enablement) through the WAL alongside regular writes, so that features negotiated between replicas are applied consistently across all nodes via the replication log rather than in-memory state.

Modification

  • Add ControlRequest and FeatureEnableRequest protobuf messages to the LogEntryValue oneof, allowing control commands to be serialized into WAL entries alongside write requests.
  • Introduce a statemachine package with a Proposal interface (WriteProposal, ControlProposal) where each proposal type knows how to serialize itself to a log entry and apply itself to the DB.
  • Add ApplyLogEntry function for the follower/replay path that deserializes and applies WAL entries.
  • Refactor leader controller: rename write to propose to reflect the generalized proposal flow, split BecomeLeader to propose feature enablement outside the lock, and delegate to proposal.Apply() after quorum ack.
  • Refactor follower controller: replace inline write-processing with statemachine.ApplyLogEntry, hold read lock during apply, and expose IsFeatureEnabled/Checksum methods.
  • Make DB checksum computation opt-in via EnableFeature(FEATURE_DB_CHECKSUM) — checksum is only computed when the feature is enabled or a non-zero checksum already exists.
  • Use read locks instead of exclusive locks for read-only operations (Checksum, ApplyLogEntry on follower, IsFeatureEnabled on leader).
  • Rename FEATURE_FINGERPRINT to FEATURE_DB_CHECKSUM.
  • Add unit tests for state machine and proposal logic, and an integration test verifying feature enablement replication and checksum consistency across 3 replicas.
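The proposal abstraction described above can be sketched roughly as follows. This is an illustrative, self-contained example: the type names (`LogEntry`, `fakeDB`) and field layouts are assumptions for the sketch, not Oxia's actual types, but it shows the shape of an interface where each proposal serializes itself to a WAL entry and applies itself to the DB.

```go
package main

import "fmt"

// Illustrative stand-ins for the WAL entry and DB; not Oxia's real types.
type LogEntry struct {
	Offset int64
	Value  []byte
}

type ApplyResponse struct {
	Offset int64
}

type fakeDB struct {
	data     map[string]string
	features map[string]bool
}

// Proposal: each proposal type knows how to serialize itself into a
// WAL entry and how to apply itself to the DB, so the leader's quorum
// path and the follower's replay path share one apply layer.
type Proposal interface {
	ToLogEntry() LogEntry
	Apply(db *fakeDB) ApplyResponse
}

// WriteProposal carries a regular key/value write.
type WriteProposal struct {
	Offset     int64
	Key, Value string
}

func (w WriteProposal) ToLogEntry() LogEntry {
	return LogEntry{Offset: w.Offset, Value: []byte("write:" + w.Key + "=" + w.Value)}
}

func (w WriteProposal) Apply(db *fakeDB) ApplyResponse {
	db.data[w.Key] = w.Value
	return ApplyResponse{Offset: w.Offset}
}

// ControlProposal carries a control command such as feature enablement.
type ControlProposal struct {
	Offset  int64
	Feature string
}

func (c ControlProposal) ToLogEntry() LogEntry {
	return LogEntry{Offset: c.Offset, Value: []byte("ctrl:enable:" + c.Feature)}
}

func (c ControlProposal) Apply(db *fakeDB) ApplyResponse {
	db.features[c.Feature] = true
	return ApplyResponse{Offset: c.Offset}
}

func main() {
	db := &fakeDB{data: map[string]string{}, features: map[string]bool{}}
	proposals := []Proposal{
		WriteProposal{Offset: 1, Key: "a", Value: "1"},
		ControlProposal{Offset: 2, Feature: "FEATURE_DB_CHECKSUM"},
	}
	// Both proposal kinds flow through the same interface.
	for _, p := range proposals {
		p.Apply(db)
	}
	fmt.Println(db.data["a"], db.features["FEATURE_DB_CHECKSUM"]) // 1 true
}
```

Because control entries implement the same interface as writes, a follower-side `ApplyLogEntry` only needs to decode the entry into the right proposal type and call `Apply`, with no type-switch in the caller.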


Copilot AI left a comment


Pull request overview

This PR extends Oxia’s replication/WAL format to carry control-plane commands (feature enablement) and introduces a shared state-machine layer to apply both write and control log entries consistently across leader and follower paths, enabling negotiated activation of the DB checksum feature.

Changes:

  • Add ControlRequest{FeatureEnableRequest} support to LogEntryValue and implement application logic via a new controller/statemachine package.
  • Wire leaders/followers to apply WAL entries through the state machine; leaders can now propose control entries to enable features post-election.
  • Rename negotiated feature FEATURE_FINGERPRINT to FEATURE_DB_CHECKSUM and expose checksum/feature state via controllers and tests.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 10 comments.

Summary per file:

  • tests/control/control_request_test.go: Adds an integration-style test for feature enablement and checksum consistency across replicas.
  • oxiad/dataserver/server.go: Exposes GetShardDirector() for test/introspection access to shard controllers.
  • oxiad/dataserver/database/db.go: Adds feature enable tracking and conditional checksum persistence/read APIs on the DB.
  • oxiad/dataserver/controller/statemachine/state_machine.go: New shared apply-layer for proposals and WAL log entries (write + control).
  • oxiad/dataserver/controller/statemachine/state_machine_test.go: Unit tests for applying proposals and WAL entries (write + feature enable).
  • oxiad/dataserver/controller/statemachine/proposal.go: Proposal abstraction for write and control entries and WAL serialization.
  • oxiad/dataserver/controller/statemachine/proposal_test.go: Unit tests for proposal creation and WAL encoding.
  • oxiad/dataserver/controller/lead/leader_controller.go: Refactors the write path to propose(), replays WAL via the state machine, and proposes feature-enable control entries post-election; exposes checksum.
  • oxiad/dataserver/controller/lead/leader_controller_test.go: Updates feature negotiation tests for the renamed DB checksum feature.
  • oxiad/dataserver/controller/follow/follower_controller.go: Applies committed WAL entries via the state machine; exposes checksum and feature-enabled state.
  • oxiad/coordinator/controller/shard_controller_test.go: Updates coordinator tests to expect the negotiated DB checksum feature.
  • oxiad/coordinator/controller/shard_controller_election_test.go: Updates feature negotiation unit tests for the renamed feature.
  • oxiad/common/feature/feature.go: Updates the supported feature list to FEATURE_DB_CHECKSUM.
  • oxiad/common/feature/feature_test.go: Removes prior supported-features tests (file deleted).
  • oxiad/common/crc/crc.go: Adds a Checksum.IsZero() helper.
  • common/proto/storage.proto: Adds control_request to LogEntryValue and defines ControlRequest/FeatureEnableRequest.
  • common/proto/storage.pb.go: Regenerated Go bindings for storage.proto.
  • common/proto/storage_vtproto.pb.go: Regenerated vtproto bindings to support the new messages/oneofs.
  • common/proto/replication.proto: Renames the feature enum value to FEATURE_DB_CHECKSUM.
  • common/proto/replication.pb.go: Regenerated Go bindings for replication.proto.
Comments suppressed due to low confidence (2)

oxiad/dataserver/controller/lead/leader_controller.go:849

  • The timer started here is never finished on the early-return path (non-leader). This will skew latency metrics and may leak resources, depending on the histogram implementation. Ensure timer.Done()/DoneCtx is called before every return.
func (lc *leaderController) propose(ctx context.Context, proposalSupplier func(offset int64) statemachine.Proposal, cb concurrent.Callback[statemachine.ApplyResponse]) {
	timer := lc.writeLatencyHisto.Timer()
	lc.Lock()
	if err := checkStatusIsLeader(lc.status); err != nil {
		lc.Unlock()
		cb.OnCompleteError(err)
		return
	}

oxiad/dataserver/controller/lead/leader_controller.go:868

  • If MarshalVT() fails, propose() returns without calling timer.Done/DoneCtx, leaving the latency timer open. Make sure to finish the timer on this error path as well.
	value, err := entryValue.MarshalVT()
	if err != nil {
		lc.Unlock()
		cb.OnCompleteError(err)
		return
	}



Copilot AI left a comment


Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

oxiad/dataserver/controller/lead/leader_controller.go:869

  • propose() starts a latency timer, but on early-return paths (e.g., not leader at lines 846-850, or a MarshalVT error at 865-869) the timer is never completed. This can skew latency metrics and potentially leak timer resources, depending on the implementation. Suggest calling timer.DoneCtx(ctx) (or timer.Done()) before each early return.
func (lc *leaderController) propose(ctx context.Context, proposalSupplier func(offset int64) statemachine.Proposal, cb concurrent.Callback[statemachine.ApplyResponse]) {
	timer := lc.writeLatencyHisto.Timer()
	lc.Lock()
	if err := checkStatusIsLeader(lc.status); err != nil {
		lc.Unlock()
		cb.OnCompleteError(err)
		return
	}
	newOffset := lc.quorumAckTracker.NextOffset()
	walLog := lc.wal
	tracker := lc.quorumAckTracker
	term := lc.term
	proposal := proposalSupplier(newOffset)

	lc.log.Debug("propose a proposal", slog.Any("proposal", proposal))

	entryValue := proto.LogEntryValueFromVTPool()
	defer entryValue.ReturnToVTPool()

	proposal.ToLogEntry(entryValue)

	value, err := entryValue.MarshalVT()
	if err != nil {
		lc.Unlock()
		cb.OnCompleteError(err)
		return
	}


@mattisonchao mattisonchao changed the title from "feat: support control request for shard" to "feat: support control request in replication state machine" on Feb 22, 2026
…anges

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mattisonchao and others added 3 commits February 23, 2026 00:44
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Each proposal type now knows how to apply itself via an Apply method,
eliminating the type-switch in ApplyProposal and the applyWriteRequest
wrapper. Callers invoke proposal.Apply() directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ApplyResponse belongs with the Proposal interface that returns it.
applyControlRequest was a 5-line helper used in only two places — inline
it into ControlProposal.Apply and ApplyLogEntry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mattisonchao and others added 3 commits February 23, 2026 01:23
…test

The feature-enable proposal is fire-and-forget after BecomeLeader, so
followers may not have applied it yet when the assertion runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ReadChecksum reads an atomic.Pointer, so a read lock is sufficient.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace vague "propose a proposal" with "Appending proposal to WAL" and
log useful fields (offset, timestamp) instead of unexported struct.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mattisonchao mattisonchao changed the title from "feat: support control request in replication state machine" to "feat: replicate control requests through WAL via state machine" on Feb 22, 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mattisonchao mattisonchao added this to the 0.16 milestone Feb 22, 2026
@mattisonchao mattisonchao merged commit baa33eb into main Feb 22, 2026
11 of 12 checks passed
@mattisonchao mattisonchao deleted the feat.control.request branch February 22, 2026 17:56
