Skip to content

fix(job-orchestration): Move garbage collector recovery file to temporary directory (fixes #1823).#1834

Merged
junhaoliao merged 12 commits into
y-scope:mainfrom
junhaoliao:docker-compose-tmp-tmpfs
May 7, 2026
Merged

fix(job-orchestration): Move garbage collector recovery file to temporary directory (fixes #1823).#1834
junhaoliao merged 12 commits into
y-scope:mainfrom
junhaoliao:docker-compose-tmp-tmpfs

Conversation

@junhaoliao

@junhaoliao junhaoliao commented Dec 26, 2025

Copy link
Copy Markdown
Member

Description

This PR moves the garbage collector recovery file from the logs directory to the temporary directory, addressing the issues raised in #1823.

The archive_garbage_collector.py was writing its recovery file to clp_config.logs_directory, which:

  • Violated the convention of writing temporary files to temporary directories
  • Caused issues because the garbage-collector service didn't have a writable logs directory mounted

This PR:

  • Changed the recovery file location from clp_config.logs_directory to clp_config.tmp_directory
  • Added the *volume_clp_tmp mount to the garbage-collector service in Docker Compose
  • Fixed the Helm chart garbage-collector deployment to mount the tmp volume at /var/tmp

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Verified that the recovery file path now uses clp_config.tmp_directory
    Confirmed that the garbage-collector service has the tmp volume * mounted in Docker Compose
  • Verified the Helm chart mounts the tmp volume at the correct path /var/tmp

Docker Compose deployment

task && cd build/clp-package && ./sbin/start-clp.sh
2026-01-14T23:21:12.117 INFO [controller] Started CLP.

All services started successfully including garbage-collector with the tmp volume mounted.

Helm chart deployments

Single-node setup

cd tools/deployment/package-helm && ./set-up-test.sh
pod/test-clp-garbage-collector-6bd59567d7-8b54g condition met
...
All jobs completed and services are ready.

Multi-node setup with shared worker nodes

cd tools/deployment/package-helm && ./set-up-multi-shared-test.sh
pod/test-clp-garbage-collector-6bd59567d7-875ws condition met
...
All jobs completed and services are ready.

Multi-node setup with dedicated worker nodes

cd tools/deployment/package-helm && ./set-up-multi-dedicated-test.sh
pod/test-clp-garbage-collector-6bd59567d7-fccqm condition met
...
All jobs completed and services are ready.

Summary by CodeRabbit

  • Chores
    • Garbage collector now stores recovery/temporary files in a dedicated temporary directory (mount path changed to /var/tmp) instead of the logs directory.
    • Deployment manifests and compose setup updated to include a shared temporary bind mount for the garbage collector.
    • Helm chart version incremented to 0.3.2-dev.4.

@coderabbitai

coderabbitai Bot commented Dec 26, 2025

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 73816a9c-51ea-4208-832d-f1f3b3a3f504

📥 Commits

Reviewing files that changed from the base of the PR and between a6bcbb8 and 4e469ca.

📒 Files selected for processing (2)
  • tools/deployment/package-helm/Chart.yaml
  • tools/deployment/package/docker-compose-all.yaml

Walkthrough

The archive garbage collector’s recovery file path is changed from the configured logs directory to the configured temporary directory. Deployment artifacts (Helm template and Docker Compose) are updated so the garbage-collector container mounts the shared tmp volume at the new path; chart version is bumped.

Changes

Garbage Collector: tmp recovery file + deployment mounts

Layer / File(s) Summary
Core Logic
components/job-orchestration/job_orchestration/garbage_collector/archive_garbage_collector.py
Recovery file is created in clp_config.tmp_directory instead of clp_config.logs_directory, filename unchanged ({ARCHIVE_GARBAGE_COLLECTOR_NAME}.tmp).
Helm Deployment Wiring
tools/deployment/package-helm/templates/garbage-collector-deployment.yaml
tmp volume mountPath in the garbage-collector container changed from /var/log to /var/tmp.
Compose Orchestration Wiring
tools/deployment/package/docker-compose-all.yaml
garbage-collector service volumes updated to include the shared temporary bind mount (*volume_clp_tmp) in addition to existing mounts.
Chart Metadata
tools/deployment/package-helm/Chart.yaml
Helm chart version incremented from 0.3.2-dev.3 to 0.3.2-dev.4.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: moving the garbage collector recovery file from the logs directory to the temporary directory, which is the primary modification across all affected files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread tools/deployment/package-helm/Chart.yaml Outdated
junhaoliao and others added 2 commits January 7, 2026 15:22
# Conflicts:
#	tools/deployment/package-helm/Chart.yaml
#	tools/deployment/package-helm/templates/garbage-collector-deployment.yaml
@junhaoliao

Copy link
Copy Markdown
Member Author

Since "using tmpfs instead of binding mount" is a different requirement from changing the GC file list directory to use clpConfig.tmp_directory, i believe that change should be submitted in a separate PR.

@junhaoliao junhaoliao changed the title fix(clp-package)!: Use tmpfs for temporary directory mounts in Docker Compose. fix(job-orchestration): Move garbage collector recovery file to temporary directory (fixes #1823). Jan 14, 2026
@junhaoliao junhaoliao marked this pull request as ready for review January 14, 2026 23:31
@junhaoliao junhaoliao requested a review from a team as a code owner January 14, 2026 23:31
@junhaoliao junhaoliao modified the milestones: Backlog, February 2026 Jan 19, 2026
@junhaoliao junhaoliao requested review from hoophalab and removed request for kirkrodrigues May 4, 2026 23:23
hoophalab
hoophalab previously approved these changes May 7, 2026

@hoophalab hoophalab left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

It looks that the garbage collector could leave orphaned archives in storage if the pod crashes and k8s recycles its emptyDir volume. Not saying we should fix this right now -- we can just ask coderabbit to create an issue

@junhaoliao

Copy link
Copy Markdown
Member Author

LGTM.

It looks that the garbage collector could leave orphaned archives in storage if the pod crashes and k8s recycles its emptyDir volume. Not saying we should fix this right now -- we can just ask coderabbit to create an issue

@coderabbitai create an issue to track

@coderabbitai

coderabbitai Bot commented May 7, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
gh issue create \
  --repo y-scope/clp \
  --title "fix(job-orchestration): Handle garbage collector recovery file loss when k8s emptyDir volume is recycled" \
  --body "## Problem

When the garbage-collector pod crashes and Kubernetes recycles its \`emptyDir\` volume (mounted at \`/var/tmp\`), the recovery file written by \`archive_garbage_collector.py\` is lost. Without this recovery file, the garbage collector loses track of archives that were already scheduled for deletion, potentially leaving orphaned archives in storage that are never cleaned up.

This issue was identified during the review of `#1834`, which moved the recovery file from \`clp_config.logs_directory\` to \`clp_config.tmp_directory\`.

## Steps to Reproduce

1. Deploy CLP on Kubernetes.
2. Start a garbage collection job.
3. Crash the garbage-collector pod mid-run.
4. Allow Kubernetes to recycle the \`emptyDir\` volume.
5. Observe that the recovery file is gone; previously scheduled archives may not be cleaned up.

## Expected Behaviour

The garbage collector should be resilient to pod restarts and volume recycling — orphaned archives should still be identified and cleaned up correctly.

## Possible Solutions

- Persist the recovery file to a durable volume (e.g., a PersistentVolumeClaim) rather than an \`emptyDir\` volume.
- Redesign the garbage collection logic to be idempotent and not rely on a recovery file that may be lost across restarts.

## References

- PR: https://github.com/y-scope/clp/pull/1834
- Comment: https://github.com/y-scope/clp/pull/1834#issuecomment-2575063900

Raised by `@junhaoliao`." \
  --assignee junhaoliao

Length of output: 43


@junhaoliao: Done! I've created issue #2260"fix(job-orchestration): Handle garbage collector recovery file loss when k8s emptyDir volume is recycled" — to track this. It's assigned to you and includes a description of the problem, steps to reproduce, and possible solutions (persisting to a durable volume or redesigning GC logic to be idempotent).

@junhaoliao junhaoliao merged commit aeb077c into y-scope:main May 7, 2026
26 checks passed
junhaoliao added a commit to junhaoliao/clp that referenced this pull request May 17, 2026
…rary directory (fixes y-scope#1823). (y-scope#1834)

Co-authored-by: hoophalab <200652805+hoophalab@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants