Description
When the coordinator's metadata `Store()` encounters a version mismatch (e.g., another coordinator wrote to the ConfigMap), it panics instead of gracefully stepping down from leadership:
```go
// metadata_configmap.go:139
if version != expectedVersion {
	panic(ErrMetadataBadVersion)
}

// metadata_configmap.go:146
if k8serrors.IsConflict(err) {
	panic(err)
}
```
This pattern exists across all metadata providers (configmap, file, memory, raft).
Impact
In sidecar mode (3 coordinators with Kubernetes Lease-based leader election), the failure mode is more disruptive:
- Leader loses the lease (e.g., slow renewal under pod pressure)
- New leader starts writing metadata
- Old leader's in-flight `Store()` hits version mismatch → panic → pod restart
- Instead of a graceful leadership transfer, we get a crash + restart cycle
With standalone coordinator (single replica), this is less visible since there's no competing writer.
Expected behavior
On `ErrMetadataBadVersion` or a K8s conflict:
- Return the error from `Store()` instead of panicking
- The coordinator detects it lost leadership
- Close all shard controllers gracefully
- Re-enter `WaitToBecomeLeader()` to become a candidate again
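The expected stepdown flow can be sketched as a loop; `waitToBecomeLeader`, `runAsLeader`, and `closeShardControllers` below are hypothetical stand-ins for the coordinator's real components, not its actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrMetadataBadVersion mirrors the sentinel error named in the issue.
var ErrMetadataBadVersion = errors.New("oxia: bad metadata version")

var attempts int // counts leadership terms for this demo

func waitToBecomeLeader() { fmt.Println("elected leader") }

func closeShardControllers() { fmt.Println("shard controllers closed") }

// runAsLeader stands in for the coordinator's main loop; for the demo it
// fails with a version mismatch on the first term only.
func runAsLeader() error {
	attempts++
	if attempts == 1 {
		return fmt.Errorf("store failed: %w", ErrMetadataBadVersion)
	}
	return nil
}

// leaderLoop shows the desired reaction to lost leadership: close the
// shard controllers gracefully and campaign again instead of panicking.
func leaderLoop(maxTerms int) {
	for i := 0; i < maxTerms; i++ {
		waitToBecomeLeader()
		if err := runAsLeader(); errors.Is(err, ErrMetadataBadVersion) {
			closeShardControllers() // graceful stepdown, not a crash
			continue                // re-enter the candidate state
		}
		return // leadership held without conflict
	}
}

func main() {
	leaderLoop(3)
}
```

The `continue` back into the election is what replaces the current panic-and-restart cycle: no pod restart, and in-flight shard controller state is torn down deliberately rather than aborted.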
Current workaround
The panic causes the container to restart, and Kubernetes eventually re-establishes the cluster. But this is disruptive — active shard controller operations are aborted, and recovery takes longer than a graceful stepdown.
Affected code
- `oxiad/coordinator/metadata/metadata_configmap.go` (lines 139, 146)
- `oxiad/coordinator/metadata/metadata_file.go`
- `oxiad/coordinator/metadata/metadata_memory.go`
- `oxiad/coordinator/metadata/metadata_raft.go`