fix: handle metadata leadership loss gracefully instead of panicking#935
Merged
mattisonchao merged 16 commits intomainfrom Mar 9, 2026
Merged
fix: handle metadata leadership loss gracefully instead of panicking#935mattisonchao merged 16 commits intomainfrom
mattisonchao merged 16 commits intomainfrom
Conversation
dda22cc to
9a1f4ed
Compare
9a1f4ed to
dd0e154
Compare
Replace panic(ErrMetadataBadVersion) with error returns in all metadata providers. On version conflict in Store(), the configmap and raft providers check their internal leader status (atomic.Bool): - If leadership is lost: return ErrLeadershipLost (fatal) - If still leader: return ErrMetadataBadVersion (retryable) The status_resource handles these errors: - ErrLeadershipLost: log.Fatal — process exits, Kubernetes restarts - ErrMetadataBadVersion: re-read current version from metadata, retry the store operation The configmap provider tracks leadership via OnStartedLeading/ OnStoppedLeading callbacks. The raft provider monitors LeaderCh(). Both default isLeader=true for backward compatibility with standalone deployments that skip WaitToBecomeLeader(). Fixes #934
When the metadata provider detects a version conflict during Store(), it now checks whether the provider still holds the lease. If the lease is lost, it returns ErrLeadershipLost (permanent, non-retryable). If the lease is still held, it returns ErrMetadataBadVersion (retryable). The metadata provider exposes a LeadershipLostCh() channel that is closed when the lease is lost. GrpcServer monitors this channel and gracefully closes the coordinator, waits to become leader again, then creates a new coordinator instance. Closes #934 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
A conflict error on ConfigMap Upsert means another coordinator wrote concurrently — a strong signal of leadership loss even before the lease callback fires. Use CompareAndSwap-guarded signalLeadershipLost to safely close the channel exactly once. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The LeadershipLostCh channel handles coordinator restart on leadership loss. Store() only needs to return ErrMetadataBadVersion — no need for hasLease tracking or a separate error type. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of a separate LeadershipLostCh() method, WaitToBecomeLeader() now returns the channel directly. This ensures the channel is always initialized when leadership is acquired and eliminates the separate interface method. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Memory and file providers intentionally return nil channel since they don't support leader election. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Re-reading the version and retrying could overwrite valid data written by a new leader. Instead, stop retrying and let the LeadershipLostCh trigger a full coordinator restart with clean state. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Mattison Zhao <mattisonchao@gmail.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Mattison Zhao <mattisonchao@gmail.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
9297be3 to
9cd0a6e
Compare
Signed-off-by: Mattison Zhao <mattisonchao@gmail.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
964ca69 to
19addf0
Compare
Signed-off-by: Mattison Zhao <mattisonchao@gmail.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
211ab73 to
f63ba45
Compare
- Fix nil dereference in configmap Store() on non-conflict errors - Protect coordinator field with RWMutex against data race in monitorLease - Close metadata provider in GrpcServer.Close() to unblock WaitToBecomeLeader - Check ctx before recreating coordinator to avoid work during shutdown Signed-off-by: Mattison Zhao <mattisonchao@gmail.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
f63ba45 to
7e7715f
Compare
Member
Author
Self-ReviewChanges Summary
Known Edge Cases
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
When the metadata provider detects a version conflict during
Store(), it previously panicked. This PR replaces that with graceful error handling and coordinator lifecycle management:ErrMetadataBadVersionon version conflicts instead of panickingWaitToBecomeLeader()now returns a<-chan struct{}that is closed when leadership is lost (configmap lease lost, raft leader change). Returnsnilfor providers without leader election (memory, file)GrpcServermonitors the leadership-lost channel. When fired, it closes the current coordinator and creates a new one (which blocks onWaitToBecomeLeaderuntil re-elected)status_resource.gore-reads the current version onErrMetadataBadVersionand retriesChanges by file
metadata.goWaitToBecomeLeader()returns(<-chan struct{}, error)metadata_configmap.goOnStoppedLeadingmetadata_raft.gometadata_memory.gonilchannelmetadata_file.gonilchannelstatus_resource.gohandleStoreErrorre-reads version on bad version, retriesserver.gomonitorLeaderLossgoroutine for coordinator lifecyclecoordinator.goWaitToBecomeLeaderNewCoordinatorandWaitToBecomeLeadersignaturesTest plan
Closes #934