feat(helm): Add garbage collector deployment with retention period configuration.#1817
Conversation
WalkthroughAdds a garbage-collector component to the Helm chart: chart version bump, config keys for retention and sweep intervals, conditional Deployment and storage manifests, and test artifacts for garbage-collector logs. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…n_period values are cast to integers in configmap template
# Conflicts: # tools/deployment/package-helm/Chart.yaml
There was a problem hiding this comment.
Actionable comments posted: 2
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (7)
tools/deployment/package-helm/Chart.yaml(1 hunks)tools/deployment/package-helm/templates/configmap.yaml(3 hunks)tools/deployment/package-helm/templates/garbage-collector-deployment.yaml(1 hunks)tools/deployment/package-helm/templates/garbage-collector-logs-pv.yaml(1 hunks)tools/deployment/package-helm/templates/garbage-collector-logs-pvc.yaml(1 hunks)tools/deployment/package-helm/test.sh(2 hunks)tools/deployment/package-helm/values.yaml(2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-05-06T10:07:04.654Z
Learnt from: kirkrodrigues
Repo: y-scope/clp PR: 881
File: components/core/tools/scripts/lib_install/macos/install-all.sh:11-12
Timestamp: 2025-05-06T10:07:04.654Z
Learning: In CLP installation scripts, temporary directories with downloaded files should not be automatically cleaned up on failure (e.g., with EXIT traps) to preserve artifacts for debugging purposes.
Applied to files:
tools/deployment/package-helm/test.sh
📚 Learning: 2025-09-25T05:13:13.298Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.
Applied to files:
tools/deployment/package-helm/templates/configmap.yaml
🪛 YAMLlint (1.37.1)
tools/deployment/package-helm/templates/garbage-collector-logs-pvc.yaml
[error] 1-1: syntax error: expected the node content, but found '-'
(syntax)
[error] 8-8: too many spaces inside braces
(braces)
[error] 9-9: too many spaces inside braces
(braces)
tools/deployment/package-helm/templates/garbage-collector-deployment.yaml
[error] 1-1: syntax error: expected the node content, but found '-'
(syntax)
[error] 5-5: too many spaces inside braces
(braces)
[error] 5-5: too many spaces inside braces
(braces)
[error] 7-7: too many spaces inside braces
(braces)
[error] 13-13: too many spaces inside braces
(braces)
[error] 18-18: too many spaces inside braces
(braces)
[error] 21-21: too many spaces inside braces
(braces)
[error] 21-21: too many spaces inside braces
(braces)
[error] 24-24: too many spaces inside braces
(braces)
[error] 24-24: too many spaces inside braces
(braces)
[error] 25-25: too many spaces inside braces
(braces)
[error] 25-25: too many spaces inside braces
(braces)
[error] 26-26: too many spaces inside braces
(braces)
[error] 26-26: too many spaces inside braces
(braces)
[error] 32-32: too many spaces inside braces
(braces)
[error] 37-37: too many spaces inside braces
(braces)
[error] 46-46: too many spaces inside braces
(braces)
[error] 46-46: too many spaces inside braces
(braces)
[error] 51-51: too many spaces inside braces
(braces)
[error] 51-51: too many spaces inside braces
(braces)
[error] 56-56: too many spaces inside braces
(braces)
[error] 56-56: too many spaces inside braces
(braces)
[error] 68-68: too many spaces inside braces
(braces)
[error] 71-71: too many spaces inside braces
(braces)
[error] 73-73: too many spaces inside braces
(braces)
[error] 76-76: too many spaces inside braces
(braces)
[error] 78-78: too many spaces inside braces
(braces)
[error] 81-81: too many spaces inside braces
(braces)
[error] 93-93: too many spaces inside braces
(braces)
[error] 98-98: too many spaces inside braces
(braces)
[error] 103-103: too many spaces inside braces
(braces)
[error] 106-106: too many spaces inside braces
(braces)
[error] 106-106: too many spaces inside braces
(braces)
tools/deployment/package-helm/templates/garbage-collector-logs-pv.yaml
[error] 1-1: syntax error: expected the node content, but found '-'
(syntax)
[error] 10-10: too many spaces inside braces
(braces)
[error] 11-11: too many spaces inside braces
(braces)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: package-image
- GitHub Check: check-generated
🔇 Additional comments (14)
tools/deployment/package-helm/Chart.yaml (1)
3-3: LGTM!Standard version bump to reflect the new garbage collector feature.
tools/deployment/package-helm/templates/configmap.yaml (1)
30-34: LGTM!The garbage collector configuration block is correctly structured with appropriate typing for logging level and sweep intervals.
tools/deployment/package-helm/test.sh (2)
62-62: LGTM!Correctly adds the garbage collector log directory to match the PV hostPath configuration.
71-76: LGTM!The sample log file generation provides appropriate test data for garbage collector validation with realistic JSONL structure and dynamic timestamps.
tools/deployment/package-helm/values.yaml (2)
66-67: LGTM!The default 60-minute retention period for search results is reasonable and properly documented.
119-125: LGTM!The garbage collector configuration is well-structured with sensible defaults: 60-minute sweep interval for archives and 30 minutes for search results aligns with the retention periods.
tools/deployment/package-helm/templates/garbage-collector-logs-pvc.yaml (1)
1-9: LGTM!The conditional PVC creation correctly aligns with the deployment's conditional logic. The 5Gi capacity and ReadWriteOnce access mode are appropriate for a single-replica garbage collector's log storage.
Note: The YAMLlint warnings are false positives—they don't understand Helm template syntax.
tools/deployment/package-helm/templates/garbage-collector-logs-pv.yaml (1)
1-11: LGTM!The local PV configuration correctly constructs the hostPath from
logs_directoryand matches the conditional logic used in the PVC and deployment. The control-plane node role is appropriate for local development/testing scenarios.Note: YAMLlint warnings are false positives for Helm template syntax.
tools/deployment/package-helm/templates/garbage-collector-deployment.yaml (6)
1-1: LGTM!The conditional deployment logic correctly ensures the garbage collector is only created when at least one retention period is configured, maintaining consistency with the PV/PVC templates.
27-37: LGTM!The init containers properly ensure the database tables and results cache indices are created before the garbage collector starts, preventing race conditions during initialisation.
42-60: LGTM!Environment variables are correctly configured with database credentials sourced from Kubernetes secrets and appropriate paths for CLP components.
61-82: LGTM!The volume mount strategy is correct: the
tmpemptyDir at/var/logprovides general log space, while the PVC-backed mount at/var/log/garbage_collectorpersists garbage collector-specific logs. The more-specific mount takes precedence in Kubernetes, ensuring logs are persisted to the PVC as intended. Access to archives and streams volumes is necessary for the garbage collector to delete expired data.
83-87: LGTM!The command correctly invokes the Python garbage collector module with the config file path, and
-uensures unbuffered output for proper logging.
23-26: LGTM!The security context correctly uses
firstPartyuid/gid (1000:1000) for this first-party CLP service, consistent with the established pattern for non-infrastructure components.Based on learnings, this is appropriate since the garbage collector is a CLP service rather than an infrastructure service.
| - name: "tmp" | ||
| mountPath: "/var/log" |
There was a problem hiding this comment.
the archive_garbage_collector.py, called by garbage collector, writes to
which:
- Doesn't respect the ; otherwise, it would write to
/var/log/garbage_collector/archive-garbage-collector.tmpand we don't add to add this volume - Has violated our latest convention of writing temporary files to temporary directories
- Doesn't really need to be mapped onto the host.
@coderabbitai create an issue to track the request of letting archive_garbage_collector.py write to some other directory than /var/log. The docker compose orchestration should also be updated according to that
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Examine the archive_garbage_collector.py file to understand the issue
cat components/job-orchestration/job_orchestration/garbage_collector/archive_garbage_collector.py | head -220 | tail -20
# Also check the garbage_collector.py to see the log directory configuration
cat components/job-orchestration/job_orchestration/garbage_collector/garbage_collector.py | head -50
Length of output: 2576
🏁 Script executed:
#!/bin/bash
# Check for docker-compose files related to garbage collector
fd -t f "docker-compose" tools/deployment/package/ -x echo "File: {}" \; -x grep -l "garbage.collector" {} \;
# Search for garbage collector volume mounts in docker-compose files
rg -n "garbage.collector" -A 5 -B 2 tools/deployment/package/ --type yaml
Length of output: 6361
🏁 Script executed:
#!/bin/bash
# Create a GitHub issue to track this request
gh issue create \
--title "Move archive_garbage_collector.py temporary file writing from /var/log to temporary directory" \
--body "## Background
In PR #1817, we identified that \`archive_garbage_collector.py\` writes temporary files to an inappropriate location.
**Related PR**: #1817
**Related Comment**: https://github.com/y-scope/clp/pull/1817#discussion_r2634344014
## Problems
1. **Doesn't respect log directory configuration**: The \`archive_garbage_collector.py\` writes the recovery file using \`clp_config.logs_directory\` instead of the \`CLP_LOGS_DIR\` environment variable that is used by the logger. This means it would write to a different location than expected (e.g., would write to the directory specified in clp-config.yaml rather than \`/var/log/garbage_collector\`).
2. **Violates convention**: Writing temporary files to \`/var/log\` violates our latest convention of writing temporary files to temporary directories.
3. **Unnecessary host mapping**: The temporary recovery file doesn't need to be mapped onto the host filesystem.
## Affected Code
- [\`archive_garbage_collector.py:205\`](https://github.com/y-scope/clp/blob/c7c2ffcc13473e386223fb3a1b4e1d5d45267f7b/components/job-orchestration/job_orchestration/garbage_collector/archive_garbage_collector.py#L205): Creates recovery file using \`clp_config.logs_directory\`
- [\`garbage_collector.py:39\`](https://github.com/y-scope/clp/blob/c7c2ffcc13473e386223fb3a1b4e1d5d45267f7b/components/job-orchestration/job_orchestration/garbage_collector/garbage_collector.py#L39): Sets up logging using \`CLP_LOGS_DIR\` environment variable
## Required Changes
1. Update \`archive_garbage_collector.py\` to write the recovery file to a proper temporary directory instead of \`/var/log\`.
2. Update Docker Compose orchestration to reflect this change.
3. Update Helm chart orchestration if needed.
## Suggested Solution
Consider using a dedicated temporary directory (e.g., \`/tmp/clp/garbage_collector/\`) or passing a separate temporary directory configuration to the garbage collector.
---
Reported by: @junhaoliao" \
--assignee junhaoliao \
--repo y-scope/clp
Length of output: 42
@junhaoliao I've created issue #1823 to track this request: #1823
The issue documents all three problems you identified:
- Not respecting the logger's directory configuration (using
clp_config.logs_directoryinstead ofCLP_LOGS_DIR) - Violating the convention of writing temporary files to temporary directories
- Unnecessarily mapping the recovery file onto the host
The issue includes backlinks to this PR and comment, references the affected code locations, and outlines the required changes for both the Python code and Docker Compose orchestration. I've assigned it to you for tracking.
🧠 Learnings used
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1690
File: tools/deployment/package/docker-compose-all.yaml:424-427
Timestamp: 2025-12-04T03:31:55.239Z
Learning: In tools/deployment/package/docker-compose-all.yaml, the query-worker service writes to /var/data/streams (CLP_STREAM_OUTPUT_DIR_HOST mount), so this directory must remain read-write and should not be mounted with :ro flag.
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
tools/deployment/package-helm/templates/configmap.yaml(3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-25T05:13:13.298Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:217-223
Timestamp: 2025-09-25T05:13:13.298Z
Learning: The compression scheduler service in CLP runs with CLP_UID_GID (current user's UID:GID) rather than CLP_SERVICE_CONTAINER_UID_GID (999:999), unlike infrastructure services such as database, queue, redis, and results cache which run with the service container UID:GID.
Applied to files:
tools/deployment/package-helm/templates/configmap.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: package-image
- GitHub Check: check-generated
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (2)
tools/deployment/package-helm/templates/configmap.yaml (2)
14-18: LGTM! Conditional rendering prevents null-to-zero conversion.This implementation correctly addresses the previous concern about the
| intfilter converting null values to 0. By conditionally rendering the retention_period field, the template now properly handles null values to disable automatic deletion as intended.
81-85: LGTM! Consistent null handling for results_cache.retention_period.This implementation mirrors the fix applied to
archive_output.retention_periodand correctly prevents null-to-zero conversion. The conditional rendering ensures that null values properly disable automatic deletion of search results.
| garbage_collector: | ||
| logging_level: {{ .Values.clpConfig.garbage_collector.logging_level | quote }} | ||
| sweep_interval: | ||
| archive: {{ .Values.clpConfig.garbage_collector.sweep_interval.archive | int }} | ||
| search_result: {{ .Values.clpConfig.garbage_collector.sweep_interval.search_result | int }} |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for sweep_interval in the codebase
echo "=== Searching for sweep_interval in Python files ==="
find . -type f -name "*.py" | xargs grep -n "sweep_interval" 2>/dev/null | head -30
echo -e "\n=== Searching for sweep_interval in YAML files ==="
find . -type f -name "values.yaml" -o -name "*.yaml" | xargs grep -A 3 -B 3 "sweep_interval" 2>/dev/null | head -50
echo -e "\n=== Checking configmap.yaml for context around sweep_interval ==="
find . -path "*/templates/configmap.yaml" | xargs cat 2>/dev/null | head -100
echo -e "\n=== Searching for retention_period to see comparison pattern ==="
find . -type f \( -name "*.yaml" -o -name "*.yml" \) | xargs grep -A 3 -B 3 "retention_period" 2>/dev/null | head -50Repository: y-scope/clp
Length of output: 12720
Consider adding conditional rendering for sweep_interval consistency.
The sweep_interval fields currently lack the conditional pattern used for retention_period (lines 14–18, 81–85), which allows null values. While defaults exist in values.yaml (archive: 60, search_result: 30), explicitly setting either value to null would convert to 0 via the | int filter, causing the garbage collector to run continuously without delay.
If sweeping should be disableable via null (like retention_period), apply the same conditional rendering pattern. Otherwise, consider documenting that these values require numeric configuration and cannot be disabled.
Description
Note
This PR is part of the ongoing work for #1309. More PRs will be submitted until the Helm chart is complete and fully functional.
Add the garbage collector component to the Helm chart, enabling automatic cleanup of expired
archives and search results based on configurable retention periods.
Changes:
retention_periodconfiguration toresults_cachein values.yaml (default: 60 minutes)garbage_collectorconfiguration section in values.yaml withlogging_levelandsweep_intervalsettingsgarbage_collectorandresults_cache.retention_periodfields to configmap.yamlarchive_output.retention_periodorresults_cache.retention_periodis set)db-table-creatorandresults-cache-indices-creatorjobsThe garbage collector runs two async tasks:
archive_output.retention_periodresults_cache.retention_periodChecklist
breaking change.
Validation performed
0. Preparation
Configure retention periods in
values.yamlfor testing:1. Helm chart deployment
2. Log ingestion
Ingested sample logs at
/tmp/clp/samples/test-gc.jsonlusing the WebUI at http://localhost:30000/ingest .Observed ingestion completed successfully, and the "Uncompressed Size" and "Compressed Size"
became non-zero after compression: 170.0 B v.s. 683.0 B. (The negative compression ratio is
expected for tiny files at this scale.)
3. Perform a search query in the WebUI
Visited http://localhost:30000/search and executed a search with query string "message:login".
Observed results returned successfully.
4. Verify garbage collection of search results
Wait for 3 minutes, which should exceed the
results_cache.retention_periodof 1 minute. Also,the
sweep_interval.search_resultof 2 minute should have allowed the garbage collector to run.Use MongoDB shell to verify that the search results have been deleted:
Observed that no collection with name being any integer (which represents search result
collections) exists, indicating that the search results have been successfully garbage collected.
5. Verify garbage collection of archives
Repeated steps 0, 1, 2 with
Wait for 3 minutes, which should exceed the
archive_output.retention_periodof 1 minute. Also,the
sweep_interval.archiveof 2 minute should have allowed the garbage collector to run.Revisit the ingestion page at http://localhost:30000/ingest and check the "Uncompressed Size" and
"Compressed Size" again, which were both reset to zero, indicating that the archives have been
cleaned up.
Summary by CodeRabbit
New Features
Chores
Tests
✏️ Tip: You can customize this high-level summary in your review settings.