
fix: handle metadata leadership loss gracefully instead of panicking#935

Merged
mattisonchao merged 16 commits into main from fix/metadata-bad-version-no-panic
Mar 9, 2026
Conversation

@mattisonchao
Member

@mattisonchao mattisonchao commented Mar 8, 2026

Summary

When the metadata provider detects a version conflict during Store(), it previously panicked. This PR replaces that with graceful error handling and coordinator lifecycle management:

  • Return error instead of panic: All metadata providers now return ErrMetadataBadVersion on version conflicts instead of panicking
  • Leadership loss channel: WaitToBecomeLeader() now returns a <-chan struct{} that is closed when leadership is lost (configmap lease lost, raft leader change). Returns nil for providers without leader election (memory, file)
  • Coordinator restart: GrpcServer monitors the leadership-lost channel. When fired, it closes the current coordinator and creates a new one (which blocks on WaitToBecomeLeader until re-elected)
  • Version conflict retry: status_resource.go re-reads the current version on ErrMetadataBadVersion and retries

Changes by file

| File | Change |
| --- | --- |
| metadata.go | WaitToBecomeLeader() returns (<-chan struct{}, error) |
| metadata_configmap.go | Return error instead of panic; close channel on OnStoppedLeading |
| metadata_raft.go | Return error instead of panic; monitor goroutine closes channel on leader loss |
| metadata_memory.go | Return error instead of panic; returns nil channel |
| metadata_file.go | Return error instead of panic; returns nil channel |
| status_resource.go | handleStoreError re-reads version on bad version, retries |
| server.go | monitorLeaderLoss goroutine for coordinator lifecycle |
| coordinator.go | Pass through leadership-lost channel from WaitToBecomeLeader |
| test files | Update for new NewCoordinator and WaitToBecomeLeader signatures |

Test plan

  • All metadata provider tests pass (memory, file, configmap, raft)
  • Full build passes across all packages
  • Deploy to kind cluster and test leader failover

Closes #934

@mattisonchao mattisonchao force-pushed the fix/metadata-bad-version-no-panic branch 3 times, most recently from dda22cc to 9a1f4ed on March 8, 2026, 15:48
@mattisonchao mattisonchao self-assigned this Mar 8, 2026
@mattisonchao mattisonchao force-pushed the fix/metadata-bad-version-no-panic branch from 9a1f4ed to dd0e154 on March 8, 2026, 15:53
@mattisonchao mattisonchao changed the title from "fix: return error instead of panic on metadata bad version" to "fix: handle metadata leadership loss gracefully instead of panicking" on Mar 8, 2026
mattisonchao and others added 13 commits on March 9, 2026, 10:39
Replace panic(ErrMetadataBadVersion) with error returns in all
metadata providers.

On version conflict in Store(), the configmap and raft providers
check their internal leader status (atomic.Bool):
- If leadership is lost: return ErrLeadershipLost (fatal)
- If still leader: return ErrMetadataBadVersion (retryable)

The status_resource handles these errors:
- ErrLeadershipLost: log.Fatal — process exits, Kubernetes restarts
- ErrMetadataBadVersion: re-read current version from metadata,
  retry the store operation

The configmap provider tracks leadership via OnStartedLeading/
OnStoppedLeading callbacks. The raft provider monitors LeaderCh().
Both default isLeader=true for backward compatibility with
standalone deployments that skip WaitToBecomeLeader().

Fixes #934
When the metadata provider detects a version conflict during Store(),
it now checks whether the provider still holds the lease. If the lease
is lost, it returns ErrLeadershipLost (permanent, non-retryable).
If the lease is still held, it returns ErrMetadataBadVersion (retryable).

The metadata provider exposes a LeadershipLostCh() channel that is
closed when the lease is lost. GrpcServer monitors this channel and
gracefully closes the coordinator, waits to become leader again, then
creates a new coordinator instance.

Closes #934

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
A conflict error on ConfigMap Upsert means another coordinator wrote
concurrently — a strong signal of leadership loss even before the
lease callback fires. Use CompareAndSwap-guarded signalLeadershipLost
to safely close the channel exactly once.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The LeadershipLostCh channel handles coordinator restart on leadership
loss. Store() only needs to return ErrMetadataBadVersion — no need
for hasLease tracking or a separate error type.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of a separate LeadershipLostCh() method, WaitToBecomeLeader()
now returns the channel directly. This ensures the channel is always
initialized when leadership is acquired and eliminates the separate
interface method.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Memory and file providers intentionally return nil channel since
they don't support leader election.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Re-reading the version and retrying could overwrite valid data written
by a new leader. Instead, stop retrying and let the LeadershipLostCh
trigger a full coordinator restart with clean state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Mattison Zhao <mattisonchao@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Mattison Zhao <mattisonchao@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mattisonchao mattisonchao force-pushed the fix/metadata-bad-version-no-panic branch from 9297be3 to 9cd0a6e on March 9, 2026, 02:39
Signed-off-by: Mattison Zhao <mattisonchao@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mattisonchao mattisonchao force-pushed the fix/metadata-bad-version-no-panic branch from 964ca69 to 19addf0 on March 9, 2026, 02:53
Signed-off-by: Mattison Zhao <mattisonchao@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mattisonchao mattisonchao force-pushed the fix/metadata-bad-version-no-panic branch from 211ab73 to f63ba45 on March 9, 2026, 03:15
- Fix nil dereference in configmap Store() on non-conflict errors
- Protect coordinator field with RWMutex against data race in monitorLease
- Close metadata provider in GrpcServer.Close() to unblock WaitToBecomeLeader
- Check ctx before recreating coordinator to avoid work during shutdown

Signed-off-by: Mattison Zhao <mattisonchao@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mattisonchao mattisonchao force-pushed the fix/metadata-bad-version-no-panic branch from f63ba45 to 7e7715f on March 9, 2026, 03:25
@mattisonchao
Member Author

Self-Review

Changes Summary

  1. Panic removal: All panic(ErrMetadataBadVersion) in metadata providers (memory, file, configmap, raft) replaced with proper error returns
  2. WaitToBecomeLeader signature: Changed from error to (<-chan struct{}, error) — returns a channel closed on leadership loss
  3. handleStoreError: Treats ErrMetadataBadVersion as backoff.Permanent to stop retry loops — avoids dirty writes from re-reading version and retrying
  4. monitorLease goroutine: Watches leadership lost channel, closes old coordinator and recreates on leadership loss
  5. coordinatorMu: Protects coordinator field against data race between monitorLease and Close()
  6. metadataProvider.Close(): Now called in GrpcServer.Close() (was previously missing)
  7. Configmap Store() bug fix: Non-conflict K8s errors previously fell through to nil cm dereference — now properly returned

Known Edge Cases

  • monitorLease blocked in NewCoordinator during shutdown: If leadership is lost during normal operation and monitorLease enters WaitToBecomeLeader, a concurrent Close() call would block on wg.Wait(). This is a very narrow race window (leadership loss + immediate shutdown) and the original code had no leader loss handling at all (it panicked). Acceptable for a follow-up if needed.
  • Raft monitor goroutine: If raft shuts down and closes LeaderCh(), the monitor goroutine exits via range without closing leadershipLostCh. No practical impact since raft Close() handles cleanup.

@mattisonchao mattisonchao merged commit 155ac18 into main on Mar 9, 2026
11 of 12 checks passed
@mattisonchao mattisonchao deleted the fix/metadata-bad-version-no-panic branch on March 9, 2026, 03:53


Closes issue #934: Coordinator panics on metadata bad version instead of graceful leader stepdown