Coordinator panics on metadata bad version instead of graceful leader stepdown #934

@mattisonchao

Description

When the coordinator's metadata Store() encounters a version mismatch (e.g., another coordinator wrote to the ConfigMap), it panics instead of gracefully stepping down from leadership:

// metadata_configmap.go:139
if version != expectedVersion {
    panic(ErrMetadataBadVersion)
}

// metadata_configmap.go:146
if k8serrors.IsConflict(err) {
    panic(err)
}

This pattern exists across all metadata providers (configmap, file, memory, raft).

Impact

In sidecar mode (three coordinators with Kubernetes Lease-based leader election), the failure is more likely and more disruptive:

  1. Leader loses the lease (e.g., slow renewal under pod pressure)
  2. New leader starts writing metadata
  3. Old leader's in-flight Store() hits version mismatch → panic → pod restart
  4. Instead of a graceful leadership transfer, we get a crash + restart cycle

With standalone coordinator (single replica), this is less visible since there's no competing writer.

Expected behavior

On ErrMetadataBadVersion or K8s conflict:

  1. Return the error from Store() instead of panicking
  2. The coordinator detects it lost leadership
  3. Close all shard controllers gracefully
  4. Re-enter WaitToBecomeLeader() to become a candidate again

Current workaround

The panic causes the container to restart, and Kubernetes eventually re-establishes the cluster. But this is disruptive — active shard controller operations are aborted, and recovery takes longer than a graceful stepdown.

Affected code

  • oxiad/coordinator/metadata/metadata_configmap.go (lines 139 and 146)
  • oxiad/coordinator/metadata/metadata_file.go
  • oxiad/coordinator/metadata/metadata_memory.go
  • oxiad/coordinator/metadata/metadata_raft.go
