Problem
When the garbage-collector pod crashes and Kubernetes recycles its emptyDir volume (mounted at /var/tmp), the recovery file written by archive_garbage_collector.py is lost. Without this recovery file, the garbage collector loses track of archives that were already scheduled for deletion, potentially leaving orphaned archives in storage that are never cleaned up.
This issue was identified during the review of #1834, which moved the recovery file from clp_config.logs_directory to clp_config.tmp_directory.
Steps to Reproduce
- Deploy CLP on Kubernetes.
- Start a garbage collection job.
- Crash the garbage-collector pod mid-run.
- Allow Kubernetes to recycle the emptyDir volume.
- Observe that the recovery file is gone; previously scheduled archives may not be cleaned up.
Expected Behaviour
The garbage collector should be resilient to pod restarts and volume recycling — orphaned archives should still be identified and cleaned up correctly.
Possible Solutions
- Persist the recovery file to a durable volume (e.g., a PersistentVolumeClaim) rather than an emptyDir volume.
- Redesign the garbage collection logic to be idempotent and not rely on a recovery file that may be lost across restarts.
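As a rough illustration of the second option, the sketch below records deletion intent in a durable metadata store before touching storage, so a restarted collector can resume from that store instead of a local recovery file. It is not the current archive_garbage_collector.py logic. The `archives` table, the `pending_delete` column, and the functions `mark_for_deletion` and `sweep` are all hypothetical, SQLite stands in for the real metadata database, and each archive is modelled as a single file for brevity.

```python
import sqlite3
import tempfile
from pathlib import Path


def mark_for_deletion(db, archive_ids):
    """Durably record which archives are scheduled for deletion (hypothetical schema)."""
    with db:  # commits on success, rolls back on error
        db.executemany(
            "UPDATE archives SET pending_delete = 1 WHERE id = ?",
            [(archive_id,) for archive_id in archive_ids],
        )


def sweep(db, storage_dir):
    """Delete every archive flagged pending_delete; safe to re-run after a crash."""
    rows = db.execute(
        "SELECT id FROM archives WHERE pending_delete = 1 ORDER BY id"
    ).fetchall()
    deleted = []
    for (archive_id,) in rows:
        # Remove from storage first. A file that is already gone is not an error,
        # which is what makes a repeated sweep harmless.
        (storage_dir / archive_id).unlink(missing_ok=True)
        # Drop the metadata row only after the storage delete has succeeded.
        with db:
            db.execute("DELETE FROM archives WHERE id = ?", (archive_id,))
        deleted.append(archive_id)
    return deleted


def demo():
    """Simulate a crash between the storage delete and the metadata update."""
    db = sqlite3.connect(":memory:")  # stand-in for the real metadata database
    db.execute(
        "CREATE TABLE archives ("
        "id TEXT PRIMARY KEY, pending_delete INTEGER NOT NULL DEFAULT 0)"
    )
    with tempfile.TemporaryDirectory() as tmp:
        storage = Path(tmp)
        for archive_id in ("a1", "a2", "a3"):
            (storage / archive_id).write_text("data")
            db.execute("INSERT INTO archives (id) VALUES (?)", (archive_id,))
        db.commit()

        mark_for_deletion(db, ["a1", "a2"])
        # Simulated crash: a1 was removed from storage, but its row was not.
        (storage / "a1").unlink()

        deleted = sweep(db, storage)  # the "restarted" collector resumes here
        remaining = sorted(p.name for p in storage.iterdir())
        deleted_again = sweep(db, storage)  # a second sweep finds nothing to do
    db.close()
    return deleted, remaining, deleted_again


if __name__ == "__main__":
    print(demo())
```

Because the pending state lives alongside the archive metadata rather than under /var/tmp, losing the emptyDir volume would not lose track of scheduled deletions in this design. Whether CLP's metadata database can carry such a flag is an open question for this issue.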
References
Raised by @junhaoliao.