
fix: handle metadata leadership loss gracefully instead of panicking#935

Merged
mattisonchao merged 16 commits into main from fix/metadata-bad-version-no-panic
Mar 9, 2026
Conversation

@mattisonchao
Member

@mattisonchao mattisonchao commented Mar 8, 2026

Summary

When the metadata provider detects a version conflict during Store(), it previously panicked. This PR replaces that with graceful error handling and coordinator lifecycle management:

  • Return error instead of panic: All metadata providers now return ErrMetadataBadVersion on version conflicts instead of panicking
  • Leadership loss channel: WaitToBecomeLeader() now returns a <-chan struct{} that is closed when leadership is lost (configmap lease lost, raft leader change). Returns nil for providers without leader election (memory, file)
  • Coordinator restart: GrpcServer monitors the leadership-lost channel. When fired, it closes the current coordinator and creates a new one (which blocks on WaitToBecomeLeader until re-elected)
  • Version conflict retry: status_resource.go re-reads the current version on ErrMetadataBadVersion and retries

Changes by file

| File | Change |
| --- | --- |
| metadata.go | WaitToBecomeLeader() returns (<-chan struct{}, error) |
| metadata_configmap.go | Return error instead of panic; close channel on OnStoppedLeading |
| metadata_raft.go | Return error instead of panic; monitor goroutine closes channel on leader loss |
| metadata_memory.go | Return error instead of panic; returns nil channel |
| metadata_file.go | Return error instead of panic; returns nil channel |
| status_resource.go | handleStoreError re-reads version on bad version, retries |
| server.go | monitorLeaderLoss goroutine for coordinator lifecycle |
| coordinator.go | Pass through leadership-lost channel from WaitToBecomeLeader |
| test files | Update for new NewCoordinator and WaitToBecomeLeader signatures |

Test plan

  • All metadata provider tests pass (memory, file, configmap, raft)
  • Full build passes across all packages
  • Deploy to kind cluster and test leader failover

Closes #934

@mattisonchao mattisonchao force-pushed the fix/metadata-bad-version-no-panic branch 3 times, most recently from dda22cc to 9a1f4ed on March 8, 2026, 15:48
@mattisonchao mattisonchao self-assigned this Mar 8, 2026
@mattisonchao mattisonchao force-pushed the fix/metadata-bad-version-no-panic branch from 9a1f4ed to dd0e154 on March 8, 2026, 15:53
@mattisonchao mattisonchao changed the title from "fix: return error instead of panic on metadata bad version" to "fix: handle metadata leadership loss gracefully instead of panicking" on Mar 8, 2026
mattisonchao and others added 13 commits on March 9, 2026, 10:39
Replace panic(ErrMetadataBadVersion) with error returns in all
metadata providers.

On version conflict in Store(), the configmap and raft providers
check their internal leader status (atomic.Bool):
- If leadership is lost: return ErrLeadershipLost (fatal)
- If still leader: return ErrMetadataBadVersion (retryable)

The status_resource handles these errors:
- ErrLeadershipLost: log.Fatal — process exits, Kubernetes restarts
- ErrMetadataBadVersion: re-read current version from metadata,
  retry the store operation

The configmap provider tracks leadership via OnStartedLeading/
OnStoppedLeading callbacks. The raft provider monitors LeaderCh().
Both default isLeader=true for backward compatibility with
standalone deployments that skip WaitToBecomeLeader().

Fixes #934
When the metadata provider detects a version conflict during Store(),
it now checks whether the provider still holds the lease. If the lease
is lost, it returns ErrLeadershipLost (permanent, non-retryable).
If the lease is still held, it returns ErrMetadataBadVersion (retryable).

The metadata provider exposes a LeadershipLostCh() channel that is
closed when the lease is lost. GrpcServer monitors this channel and
gracefully closes the coordinator, waits to become leader again, then
creates a new coordinator instance.

Closes #934

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
A conflict error on ConfigMap Upsert means another coordinator wrote
concurrently — a strong signal of leadership loss even before the
lease callback fires. Use CompareAndSwap-guarded signalLeadershipLost
to safely close the channel exactly once.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The LeadershipLostCh channel handles coordinator restart on leadership
loss. Store() only needs to return ErrMetadataBadVersion — no need
for hasLease tracking or a separate error type.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of a separate LeadershipLostCh() method, WaitToBecomeLeader()
now returns the channel directly. This ensures the channel is always
initialized when leadership is acquired and eliminates the separate
interface method.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Memory and file providers intentionally return nil channel since
they don't support leader election.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Re-reading the version and retrying could overwrite valid data written
by a new leader. Instead, stop retrying and let the LeadershipLostCh
trigger a full coordinator restart with clean state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Mattison Zhao <mattisonchao@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Mattison Zhao <mattisonchao@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mattisonchao mattisonchao force-pushed the fix/metadata-bad-version-no-panic branch from 9297be3 to 9cd0a6e on March 9, 2026, 02:39
Signed-off-by: Mattison Zhao <mattisonchao@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mattisonchao mattisonchao force-pushed the fix/metadata-bad-version-no-panic branch from 964ca69 to 19addf0 on March 9, 2026, 02:53
Signed-off-by: Mattison Zhao <mattisonchao@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mattisonchao mattisonchao force-pushed the fix/metadata-bad-version-no-panic branch from 211ab73 to f63ba45 on March 9, 2026, 03:15
- Fix nil dereference in configmap Store() on non-conflict errors
- Protect coordinator field with RWMutex against data race in monitorLease
- Close metadata provider in GrpcServer.Close() to unblock WaitToBecomeLeader
- Check ctx before recreating coordinator to avoid work during shutdown

Signed-off-by: Mattison Zhao <mattisonchao@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mattisonchao mattisonchao force-pushed the fix/metadata-bad-version-no-panic branch from f63ba45 to 7e7715f on March 9, 2026, 03:25
@mattisonchao
Member Author

Self-Review

Changes Summary

  1. Panic removal: All panic(ErrMetadataBadVersion) in metadata providers (memory, file, configmap, raft) replaced with proper error returns
  2. WaitToBecomeLeader signature: Changed from error to (<-chan struct{}, error) — returns a channel closed on leadership loss
  3. handleStoreError: Treats ErrMetadataBadVersion as backoff.Permanent to stop retry loops — avoids dirty writes from re-reading version and retrying
  4. monitorLease goroutine: Watches leadership lost channel, closes old coordinator and recreates on leadership loss
  5. coordinatorMu: Protects coordinator field against data race between monitorLease and Close()
  6. metadataProvider.Close(): Now called in GrpcServer.Close() (was previously missing)
  7. Configmap Store() bug fix: Non-conflict K8s errors previously fell through to nil cm dereference — now properly returned

Known Edge Cases

  • monitorLease blocked in NewCoordinator during shutdown: If leadership is lost during normal operation and monitorLease enters WaitToBecomeLeader, a concurrent Close() call would block on wg.Wait(). This is a very narrow race window (leadership loss + immediate shutdown) and the original code had no leader loss handling at all (it panicked). Acceptable for a follow-up if needed.
  • Raft monitor goroutine: If raft shuts down and closes LeaderCh(), the monitor goroutine exits via range without closing leadershipLostCh. No practical impact since raft Close() handles cleanup.

@mattisonchao mattisonchao merged commit 155ac18 into main on Mar 9, 2026
11 of 12 checks passed
@mattisonchao mattisonchao deleted the fix/metadata-bad-version-no-panic branch on March 9, 2026, 03:53


Closes issue #934: Coordinator panics on metadata bad version instead of graceful leader stepdown