Skip to content

feat: add checksum gauge metric and move checksumInterval to storage level#890

Merged
mattisonchao merged 11 commits intomainfrom
feat/checksum-gauge-metric
Feb 23, 2026
Merged

feat: add checksum gauge metric and move checksumInterval to storage level#890
mattisonchao merged 11 commits intomainfrom
feat/checksum-gauge-metric

Conversation

@mattisonchao
Copy link
Copy Markdown
Member

@mattisonchao mattisonchao commented Feb 23, 2026

Motivation

The RecordChecksumRequest WAL entry was a no-op marker that never actually recorded the DB checksum as a metric.

Modification

  • Add SyncGauge metric type (non-callback based) to common/metric/gauge.go
  • Add RecordChecksumRequest to the ControlRequest proto oneof
  • Add oxia_dataserver_db_checksum gauge to both leader and follower controllers
  • ControlProposal.Apply() and ApplyLogEntry() now read and return the DB checksum via ApplyResponse.Checksum when processing a RecordChecksumRequest
  • Leader records the gauge after proposal.Apply() in the generic propose path
  • Follower records the gauge after ApplyLogEntry() in the committed entries path
  • Add checksum_scheduler.go to periodically trigger checksum recording
  • Add scheduler.checksum.interval config option
  • Add integration test verifying checksum metric appears with correct labels and changes after additional writes

@mattisonchao mattisonchao self-assigned this Feb 23, 2026
@mattisonchao mattisonchao force-pushed the feat/checksum-gauge-metric branch 2 times, most recently from 84a2884 to 5d3d342 Compare February 23, 2026 06:00
…level

Move ChecksumInterval config from DatabaseOptions to StorageOptions as
checksum recording is a storage-level concern. Add SyncGauge metric type
and record oxia_dataserver_db_checksum on both leader and follower after
applying RecordChecksumRequest entries through the WAL.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mattisonchao mattisonchao force-pushed the feat/checksum-gauge-metric branch from 5d3d342 to c03b107 Compare February 23, 2026 06:04
mattisonchao and others added 10 commits February 23, 2026 14:12
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Record the WAL commit offset as a dynamic attribute on the checksum
gauge so replicas can be compared at the same offset.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add TestControlRequestRecordChecksum to verify the db_checksum gauge
metric is correctly exposed via the Prometheus endpoint with expected
labels (shard, namespace, commit_offset) and a non-zero value.

Refactor mock.NewServer into NewServerWithOptions to allow tests to
customize dataserver options (e.g. checksum interval).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a second round of puts + checksum recording to the integration test
to verify that the checksum gauge produces a different value with a
higher commit offset after new writes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add t.Helper() call and rename unused parameter to _.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change the default checksum scheduler interval from 1m to 5m. Allow
disabling the scheduler by setting the interval to 0s or a negative
value (e.g. -1s).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… request oneof

Add nil check on lc.db in leader's IsFeatureEnabled to prevent panic
if called after close. Use switch on proto oneof type instead of
if/if chains in ControlProposal.Apply and ApplyLogEntry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mattisonchao
Copy link
Copy Markdown
Member Author

PR Review Summary

Overview

This PR implements the checksum gauge metric end-to-end: from proto definition through WAL replication to Prometheus-visible metrics on both leader and follower paths, with a periodic scheduler and integration test coverage.

Reviewed Areas

  • Proto changes (storage.proto): RecordChecksumRequest added to ControlRequest oneof — clean, backward-compatible.
  • Metric infrastructure (common/metric/gauge.go): New SyncGauge type for non-callback gauge recording. OTel deduplicates Int64Gauge by name, so no cleanup needed on re-election.
  • Leader path (leader_controller.go): Checksum recorded after proposal.Apply() in the generic propose callback. ProposeRecordChecksum creates the control request correctly.
  • Follower path (follower_controller.go): Checksum recorded after ApplyLogEntry() in the committed entries loop. Correctly synchronized under rwMutex.RLock.
  • State machine (proposal.go, state_machine.go): Both use switch on proto oneof type to dispatch FeatureEnable vs RecordChecksum. ApplyResponse.Checksum is returned as a pointer — nil when not a checksum entry.
  • Scheduler (checksum_scheduler.go): Periodic ticker over all leaders, gated by IsFeatureEnabled. Gracefully handles fenced leaders (propose fails with logged warning). Disabled when interval <= 0.
  • Config (option.go, dataserver.yaml, dataserver.json): Default 5m interval, disableable via 0s or negative value. Duration regex updated to allow negative values.
  • Integration test (record_checksum_test.go): Verifies metric appears with correct labels and non-zero value after first recording, then verifies a different checksum with higher commit offset after additional writes.

Bugs Found & Fixed During Review

  1. Leader IsFeatureEnabled nil pointerlc.db is set to nil in close() under Lock(), but a concurrent caller could acquire RLock() after close completes. Added nil guard. (fixed in 821e6f5)
  2. if/if pattern for proto oneofControlProposal.Apply and ApplyLogEntry used two if statements for the oneof variants. Functionally correct (oneof guarantees mutual exclusion), but not future-proof. Refactored to switch. (fixed in 821e6f5)

No Outstanding Issues

  • Follower db is never nil outside of InstallSnapshot (which holds the write lock), so no nil guard needed there.
  • int64(checksum) conversion from uint32 is safe — no overflow.
  • commit-offset attribute (kebab-case) correctly maps to commit_offset in Prometheus.

🤖 Generated with Claude Code

@mattisonchao mattisonchao merged commit 9df2e2d into main Feb 23, 2026
9 checks passed
@mattisonchao mattisonchao deleted the feat/checksum-gauge-metric branch February 23, 2026 14:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant