
fix(job-orchestration): Handle garbage collector recovery file loss when k8s emptyDir volume is recycled #2260

@coderabbitai


Problem

When the garbage-collector pod crashes and Kubernetes recycles its emptyDir volume (mounted at /var/tmp), the recovery file written by archive_garbage_collector.py is lost. Without this recovery file, the garbage collector loses track of archives that were already scheduled for deletion, potentially leaving orphaned archives in storage that are never cleaned up.
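To make the failure mode concrete, here is a minimal sketch of the general recovery-file pattern. It is not the code in archive_garbage_collector.py: the function names, the JSON format, and the file layout are all assumptions made for illustration. The point is only that when the file is missing on restart, the collector has no record of what it still needs to delete.

```python
import json
from pathlib import Path
from typing import List


def schedule_deletions(recovery_file: Path, archive_ids: List[str]) -> None:
    """Hypothetical helper: record archive IDs before deleting them from storage."""
    recovery_file.write_text(json.dumps(archive_ids))


def recover_pending(recovery_file: Path) -> List[str]:
    """Hypothetical helper: return archive IDs left over from an interrupted run."""
    if not recovery_file.exists():
        # If the emptyDir volume was recycled, the file is gone and the
        # collector cannot tell that any deletions were still pending.
        return []
    return json.loads(recovery_file.read_text())
```

With this pattern, a restart that finds the file resumes the pending deletions, while a restart after the volume is recycled silently resumes nothing.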

This issue was identified during the review of #1834, which moved the recovery file from clp_config.logs_directory to clp_config.tmp_directory.

Steps to Reproduce

  1. Deploy CLP on Kubernetes.
  2. Start a garbage collection job.
  3. Crash the garbage-collector pod mid-run.
  4. Allow Kubernetes to recycle the emptyDir volume.
  5. Observe that the recovery file is gone; previously scheduled archives may not be cleaned up.

Expected Behaviour

The garbage collector should be resilient to pod restarts and volume recycling — orphaned archives should still be identified and cleaned up correctly.

Possible Solutions

  • Persist the recovery file to a durable volume (e.g., a PersistentVolumeClaim) rather than an emptyDir volume.
  • Redesign the garbage collection logic to be idempotent and not rely on a recovery file that may be lost across restarts.
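The second option could look roughly like the sketch below: rather than replaying a recovery file, each run recomputes the set of orphaned archives by comparing what is in storage against the archive IDs still considered live. Everything here is hypothetical (the function names, the one-entry-per-archive storage layout, and the source of `live_archive_ids`), and it assumes no archive is being written while the collector runs; a real implementation would likely need a grace period or similar guard for in-flight archives.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def find_orphaned_archives(storage_dir: Path, live_archive_ids: Iterable[str]) -> List[str]:
    """Return IDs present in storage but absent from the live set (assumed layout)."""
    live = set(live_archive_ids)
    return sorted(entry.name for entry in storage_dir.iterdir() if entry.name not in live)


def delete_archive(storage_dir: Path, archive_id: str) -> bool:
    """Delete one archive; deleting an already-missing archive is a no-op."""
    target = storage_dir / archive_id
    if not target.exists():
        return False
    if target.is_dir():
        shutil.rmtree(target)
    else:
        target.unlink()
    return True


def collect_garbage(storage_dir: Path, live_archive_ids: Iterable[str]) -> List[str]:
    """Remove every orphaned archive and return the IDs actually removed."""
    removed = []
    for archive_id in find_orphaned_archives(storage_dir, live_archive_ids):
        if delete_archive(storage_dir, archive_id):
            removed.append(archive_id)
    return removed
```

Because the orphan set is derived from current state on every run, a crash mid-run only delays cleanup: the next run finds whatever is left and removes it, with no local file to lose.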

References

Raised by @junhaoliao.
