Skip to content

refactor(helm)!: Remove all host-path volume mounts.#2023

Merged
junhaoliao merged 20 commits into
y-scope:mainfrom
junhaoliao:helm-remove-host-path
Feb 27, 2026
Merged

refactor(helm)!: Remove all host-path volume mounts.#2023
junhaoliao merged 20 commits into
y-scope:mainfrom
junhaoliao:helm-remove-host-path

Conversation

@junhaoliao

@junhaoliao junhaoliao commented Feb 24, 2026

Copy link
Copy Markdown
Member
  • Log to stdout/stderr instead of host-path log PVs.
  • Use dynamic provisioning for persistent data (database, results-cache, shared-data).
  • Use emptyDir for ephemeral/temp volumes (Redis, staging, worker tmp).
  • Delete static PV/PVC templates and the local-storage StorageClass.

Description

Redirect all CLP service logs from host-path-backed PersistentVolumes to pod stdout/stderr, and
switch all persistent data storage from static host-path PVs to either dynamic provisioning via the
cluster's default StorageClass (database, results-cache, shared-data archives/streams) or emptyDir
(Redis data, staging/tmp volumes).

This eliminates all static PV/PVC template files and the custom local-storage StorageClass,
following Kubernetes best practices where logs are captured by kubectl logs, ephemeral service
state uses emptyDir, and persistent data uses dynamically provisioned volumes.

Summary of changes:

Source code (Python + Rust):

  • Make CLP_LOGS_DIR optional in all scheduler, garbage-collector, reducer, and MCP server
    components. When unset, services log only to stdout (console handler from get_logger()).
  • Rust services (api_server, log_ingestor): set_up_logging() falls back to stdout-only when
    CLP_LOGS_DIR is absent. Return type changed to Option<WorkerGuard>.

Helm chart — PV/PVC:

  • Delete 35 static PV/PVC template files (log, data, staging/tmp PVs and their PVCs) and the
    local-storage StorageClass template.
  • Remove clp.createStaticPv helper from _helpers.tpl. Update clp.createPvc to use the
    cluster's default StorageClass (remove hardcoded storageClassName and selector).
  • Shared-data PVCs (archives, streams) now use ReadWriteOnce with dynamic provisioning.
  • Database volumeClaimTemplates: remove storageClassName, selector, and
    persistentVolumeClaimRetentionPolicy.
  • Replace host-path PVCs with emptyDir for: compression-worker tmp, compression-worker
    staged-archives, query-worker staged-streams, redis data.
  • Results-cache (MongoDB): switch from static PV to volumeClaimTemplates with dynamic
    provisioning (10Gi RWO) for persistent data across pod restarts.
  • Remove volumeClaimTemplates and persistentVolumeClaimRetentionPolicy from redis statefulset.

Helm chart — logs and securityContext:

  • Remove CLP_LOGS_DIR env var and log volume mounts from 9 deployments.
  • Workers (compression-worker, query-worker) use CLP_WORKER_LOG_PATH=/dev/stdout for direct
    stdout logging. Retain CLP_LOGS_DIR=/tmp for Celery executor tasks (not yet updated).
  • Add emptyDir at /var/log for compression-scheduler and garbage-collector (used for
    user-failure logs and GC recovery files respectively).
  • Remove securityContext (runAsUser/runAsGroup/fsGroup) from all statefulsets (database, redis,
    results-cache, queue) and all first-party deployments/jobs (api-server, compression-scheduler,
    compression-worker, db-table-creator, garbage-collector, log-ingestor, mcp-server,
    query-scheduler, query-worker, reducer, results-cache-indices-creator, webui). The CLP Docker
    image already runs as the correct user.
  • Remove securityContext section entirely from values.yaml. Remove storage section and
    data_directory/logs_directory/tmp_directory entries (no longer referenced by any template).

Infrastructure configs (configmap):

  • Database: general_log is not enabled for either MariaDB or MySQL — the database process runs
    as a non-root user that cannot write to /dev/stdout. Error log (stderr) is sufficient.
  • Redis: logfile "" (logs to stdout).
  • MongoDB: remove destination: file and path from systemLog (defaults to stdout).
  • RabbitMQ: RABBITMQ_LOGS="-" and RABBITMQ_SASL_LOGS="-" (stdout).

Setup scripts:

  • create_clp_directories() now only creates $CLP_HOME/samples (all data directories are
    dynamically provisioned or emptyDir).

Impact Assessment

  • PV reduction: All static PV files and the local-storage StorageClass deleted. Remaining
    persistent storage (database, results-cache, shared-data archives/streams) uses dynamically
    provisioned PVCs via the cluster's default StorageClass.
  • Log observability: All service logs now visible via kubectl logs. Previously required
    exec-ing into containers or inspecting host-path directories.
  • Data durability: Database metadata, results-cache (MongoDB), and archive/stream data remain
    persistent via dynamically provisioned PVCs. Redis uses emptyDir and RabbitMQ has no persistent
    volume — their data is reconstructable and does not require persistence across pod restarts.
  • securityContext: Removed from all pods (first-party and third-party). The CLP Docker image
    already sets the correct user, and dynamically provisioned volumes / emptyDir don't require
    explicit UID/GID configuration.
  • Setup simplification: create_clp_directories() reduces to a single mkdir for the
    samples directory. No host-path data directories need to be pre-created.
  • Breaking: CLP_LOGS_DIR is no longer required for most services. Existing non-Helm
    deployments that set CLP_LOGS_DIR continue to work (file logging is additive).

PV/PVC inventory

Log PV/PVCs

All log PV/PVCs are removed. Services now log to stdout/stderr via kubectl logs.

Component Old PV file Old PVC source New state
api-server api-server-logs-pv.yaml *-logs-pvc.yaml Removed
compression-scheduler compression-scheduler-logs-pv.yaml *-logs-pvc.yaml Removed (emptyDir at /var/log for user-failure logs)
compression-worker compression-worker-logs-pv.yaml *-logs-pvc.yaml Removed
garbage-collector garbage-collector-logs-pv.yaml *-logs-pvc.yaml Removed (emptyDir at /var/log for GC recovery files)
log-ingestor log-ingestor-logs-pv.yaml *-logs-pvc.yaml Removed
mcp-server mcp-server-logs-pv.yaml *-logs-pvc.yaml Removed
query-scheduler query-scheduler-logs-pv.yaml *-logs-pvc.yaml Removed
query-worker query-worker-logs-pv.yaml *-logs-pvc.yaml Removed
reducer reducer-logs-pv.yaml *-logs-pvc.yaml Removed
database database-logs-pv.yaml volumeClaimTemplates in statefulset Removed
queue queue-logs-pv.yaml volumeClaimTemplates in statefulset Removed
redis redis-logs-pv.yaml volumeClaimTemplates in statefulset Removed
results-cache results-cache-logs-pv.yaml volumeClaimTemplates in statefulset Removed

Additionally, compression-scheduler-user-logs-pv.yaml and its PVC are removed (the emptyDir at
/var/log replaces it).

Data PV/PVCs

Component Old PV file Old PVC source New state
database database-data-pv.yaml volumeClaimTemplates in statefulset Dynamic provisioning (20Gi RWO)
redis redis-data-pv.yaml volumeClaimTemplates in statefulset emptyDir
results-cache results-cache-data-pv.yaml volumeClaimTemplates in statefulset Dynamic provisioning (10Gi RWO)
shared-data-archives shared-data-archives-pv.yaml *-pvc.yaml Dynamic provisioning (PVC, 50Gi RWO)
shared-data-streams shared-data-streams-pv.yaml *-pvc.yaml Dynamic provisioning (PVC, 20Gi RWO)

Staging/Tmp PV/PVCs

Component Old PV file Old PVC file New state
compression-worker tmp compression-worker-tmp-pv.yaml *-tmp-pvc.yaml emptyDir
compression-worker staged-archives compression-worker-staged-archives-pv.yaml *-pvc.yaml emptyDir
query-worker staged-streams query-worker-staged-streams-pv.yaml *-pvc.yaml emptyDir

Infrastructure

File Status Notes
storage-class.yaml Deleted local-storage StorageClass no longer needed
_helpers.tpl Updated Removed clp.createStaticPv; simplified clp.createPvc (no storageClassName/selector)

data_directory, logs_directory, tmp_directory usage

These three entries have been removed from values.yaml since no Helm template references them.
The configmap hardcodes the container-internal paths directly.

Removed values.yaml key Old default value Configmap (container path) Previously used for
clpConfig.data_directory /tmp/clp/var/data /var/data Host path for static data PVs
clpConfig.logs_directory /tmp/clp/var/log /var/log Host path for static log PVs
clpConfig.tmp_directory /tmp/clp/var/tmp /var/tmp Host path for static tmp PVs

The configmap hardcodes /var/data, /var/log, and /var/tmp as container-internal mount points.
These correspond to dynamically provisioned PVCs, emptyDir volumes, or in-container paths depending
on the component. The old values.yaml entries (/tmp/clp/var/...) were host paths consumed by the
now-deleted clp.createStaticPv helper.

Documentation updates

File Changes
docs/src/user-docs/quick-start/clp-json.md Remove CLP_HOME, mkdir -p host directory creation, $CLP_HOME kind mount, and --set clpConfig.data_directory/logs_directory/tmp_directory/archive_output.storage.directory/stream_output.storage.directory from helm install.
docs/src/user-docs/quick-start/clp-text.md Same as clp-json.
docs/src/user-docs/guides-k8s-deployment.md Remove "Storage for CLP package services' data and logs" section (static PVs, data_directory/logs_directory/tmp_directory, logging infra issue #1760 note). Remove mkdir -p host directory creation and directory helm flags from "Basic installation". Update uninstall warning to reference dynamically provisioned PVCs instead of static PVs. Remove stale logging-infra-issue link.
docs/src/dev-docs/design-deployment-orchestration.md Update Kubernetes storage description: dynamically provisioned PVCs for persistent data, emptyDir for ephemeral state, service logs to pod stdout/stderr.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

0. Build local Docker image

Task: Build the CLP package Docker image with source code changes.

Command:

task docker-images:package

Build completed successfully — image clp-package:dev-junhao-e83c.

1. Test with MariaDB (default)

Task: Deploy the full Helm chart with MariaDB and a locally-built image. Verify all pods start
with no errors, dynamic provisioning creates 4 PVCs via the standard StorageClass (kind's
default), and emptyDir log volumes are mounted on compression-worker and query-worker.

Command:

bash -c '
export CLP_HOME="/tmp/clp-validation"
export CLUSTER_NAME="clp-test"
source tools/deployment/package-helm/.set-up-common.sh
prepare_environment "${CLUSTER_NAME}"
generate_kind_config 0 | kind create cluster --name "${CLUSTER_NAME}" --config=-
kind load docker-image clp-package:dev-junhao-e83c --name "${CLUSTER_NAME}"
helm install test tools/deployment/package-helm \
  --set "image.clpPackage.repository=clp-package" \
  --set "image.clpPackage.tag=dev-junhao-e83c" \
  --set "image.clpPackage.pullPolicy=Never"
wait_for_cluster_ready
'

Output (2026-02-26 08:19 UTC):

NAME                                             READY   STATUS      RESTARTS   AGE
test-clp-api-server-f65f6f96-b7hjm               1/1     Running     0          116s
test-clp-compression-scheduler-fcf6c46fd-j2ldz   1/1     Running     0          116s
test-clp-compression-worker-9776b5d7-bzfdr       1/1     Running     0          116s
test-clp-database-0                              1/1     Running     0          116s
test-clp-db-table-creator-rfx49                  0/1     Completed   0          116s
test-clp-garbage-collector-5b54bb6d58-kpknz      1/1     Running     0          116s
test-clp-query-scheduler-57449b49f4-jlz9j        1/1     Running     0          116s
test-clp-query-worker-84f598cb66-76hqs           1/1     Running     0          116s
test-clp-queue-0                                 1/1     Running     0          116s
test-clp-redis-0                                 1/1     Running     0          116s
test-clp-reducer-6f4c8fc765-wnd76                1/1     Running     0          116s
test-clp-results-cache-0                         1/1     Running     0          116s
test-clp-results-cache-indices-creator-w2t9g     0/1     Completed   0          116s
test-clp-webui-7f57f545f9-shn2t                  1/1     Running     0          116s

All 14 pods running with 0 restarts.

1.1 Verify PVCs are dynamically provisioned

Command:

kubectl get pvc

Output:

NAME                                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
database-data-test-clp-database-0             Bound    pvc-abc4a0f2-8e0e-4836-b70d-6512f67b505e   20Gi       RWO            standard       <unset>                 2m22s
results-cache-data-test-clp-results-cache-0   Bound    pvc-3c651e11-710c-4814-a7a5-8cfc9ca02f0c   10Gi       RWO            standard       <unset>                 2m22s
test-clp-shared-data-archives                 Bound    pvc-7ec207f4-232f-4f1f-88fa-010af3d8e0c4   50Gi       RWO            standard       <unset>                 2m22s
test-clp-shared-data-streams                  Bound    pvc-702a4734-546c-4098-b370-e871f3330071   20Gi       RWO            standard       <unset>                 2m22s

All 4 PVCs are Bound with STORAGECLASS=standard.

1.2 Verify emptyDir log volumes are mounted

Commands:

kubectl exec deploy/test-clp-compression-worker -- ls -la /var/log/compression_worker
kubectl exec deploy/test-clp-query-worker -- ls -la /var/log/query_worker

Output:

# compression-worker
total 8
drwxrwxrwx 2 root root 4096 Feb 26 08:17 .
drwxr-xr-x 1 root root 4096 Feb 26 08:17 ..

# query-worker
total 8
drwxrwxrwx 2 root root 4096 Feb 26 08:17 .
drwxr-xr-x 1 root root 4096 Feb 26 08:17 ..

Both emptyDir log volumes are mounted and writable.

1.3 Verify compression and search (end-to-end)

Commands:

# Submit compression job via WebUI API
curl -s -X POST "http://localhost:30000/api/compress/" \
  -H "Content-Type: application/json" \
  -d '{"paths": ["/tmp/clp-validation/samples/postgresql"], "dataset": "default"}'

# Submit search job via WebUI API
curl -s -X POST "http://localhost:30000/api/search/query" \
  -H "Content-Type: application/json" \
  -d '{"queryString": "ERROR", "dataset": "default", "timestampBegin": 0, "timestampEnd": 9999999999999, "ignoreCase": false, "timeRangeBucketSizeMillis": 60000}'

Output:

Compression job 2 completed successfully:

2026-02-26 08:25:42,544 compression_scheduler [INFO] Dispatched job 2 with 1 tasks (0 remaining).
2026-02-26 08:25:45,174 compression_scheduler [INFO] Compression task job-2-task-2 completed in 2.62523 second(s).
2026-02-26 08:25:45,174 compression_scheduler [INFO] Job 2 succeeded (1 tasks completed).

Stderr log file written to emptyDir:

$ kubectl exec deploy/test-clp-compression-worker -- ls /var/log/compression_worker/
compression-job-2-task-2-stderr.log

Search jobs dispatched and query worker wrote log files to emptyDir:

$ kubectl exec deploy/test-clp-query-worker -- find /var/log/query_worker -type f
/var/log/query_worker/1/1-clo.log
/var/log/query_worker/2/2-clo.log

Search tasks returned return_code=1 due to missing timestamp column in the archive (pre-existing
archive format limitation, not related to this PR).

2. Test with MySQL

Task: Deploy the Helm chart with database.type=mysql and verify MySQL starts without errors.

Command:

bash -c '
export CLP_HOME="/tmp/clp-validation"
export CLUSTER_NAME="clp-test"
source tools/deployment/package-helm/.set-up-common.sh
prepare_environment "${CLUSTER_NAME}"
generate_kind_config 0 | kind create cluster --name "${CLUSTER_NAME}" --config=-
kind load docker-image clp-package:dev-junhao-e83c --name "${CLUSTER_NAME}"
helm install test tools/deployment/package-helm \
  --set "clpConfig.database.type=mysql" \
  --set "image.clpPackage.repository=clp-package" \
  --set "image.clpPackage.tag=dev-junhao-e83c" \
  --set "image.clpPackage.pullPolicy=Never"
wait_for_cluster_ready
'

Output (2026-02-26 08:30 UTC):

NAME                                             READY   STATUS      RESTARTS   AGE
test-clp-api-server-f65f6f96-kgjk4               1/1     Running     0          2m14s
test-clp-compression-scheduler-fcf6c46fd-bhk9s   1/1     Running     0          2m14s
test-clp-compression-worker-9776b5d7-qlkvg       1/1     Running     0          2m14s
test-clp-database-0                              1/1     Running     0          2m14s
test-clp-db-table-creator-26zp2                  0/1     Completed   0          2m14s
test-clp-garbage-collector-5b54bb6d58-pr97n      1/1     Running     0          2m14s
test-clp-query-scheduler-57449b49f4-rrplv        1/1     Running     0          2m14s
test-clp-query-worker-84f598cb66-qh9qr           1/1     Running     0          2m14s
test-clp-queue-0                                 1/1     Running     0          2m14s
test-clp-redis-0                                 1/1     Running     0          2m14s
test-clp-reducer-6f4c8fc765-k844p                1/1     Running     0          2m14s
test-clp-results-cache-0                         1/1     Running     0          2m14s
test-clp-results-cache-indices-creator-bbx2d     0/1     Completed   0          2m14s
test-clp-webui-7f57f545f9-rgb4w                  1/1     Running     0          2m14s

All 14 pods running with 0 restarts.

2.1 Verify PVCs are dynamically provisioned

Command:

kubectl get pvc

Output:

NAME                                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
database-data-test-clp-database-0             Bound    pvc-f0c70538-6994-474c-a04d-fc2fa88ddfff   20Gi       RWO            standard       <unset>                 2m26s
results-cache-data-test-clp-results-cache-0   Bound    pvc-4db5b4df-23e2-470d-9f47-b88ebfacaa25   10Gi       RWO            standard       <unset>                 2m26s
test-clp-shared-data-archives                 Bound    pvc-3be2628d-027e-4264-a8a2-3a7a26a3ff8a   50Gi       RWO            standard       <unset>                 2m26s
test-clp-shared-data-streams                  Bound    pvc-eaf4307f-dd55-4ada-a5d5-888a482fed2c   20Gi       RWO            standard       <unset>                 2m26s

All 4 PVCs are Bound with STORAGECLASS=standard.

3. Test Docker Compose flow (build/clp-package)

Task: Start all services via Docker Compose, verify file logging still works when CLP_LOGS_DIR
is set, and run a compression job.

Commands:

cd build/clp-package
bash sbin/start-clp.sh
bash sbin/compress.sh /tmp/clp-validation/samples/postgresql

Output (2026-02-26 08:37 UTC):

All services started and reported healthy. Compression completed successfully:

2026-02-26T08:37:36.510 INFO [compress] Compression job 1 submitted.
2026-02-26T08:37:39.518 INFO [compress] Compression finished.
2026-02-26T08:37:39.519 INFO [compress] Compressed 392.84MB into 10.22MB (38.44x). Speed: 152.21MB/s.

3.1 Verify log files are written (Docker Compose sets CLP_LOGS_DIR)

All services with CLP_LOGS_DIR set produce log files on the host-mounted var/log/ directory:

Service Log file(s) present
compression-scheduler compression_scheduler.log (job dispatch/completion)
compression-worker worker.log, compression-job-1-task-1-stderr.log
query-scheduler query_scheduler.log
garbage-collector garbage_collector.log, search-result-garbage-collector.log
reducer reducer.log (Python), reducer-{0..7}.log (subprocesses)
api-server api_server.log.2026-02-26-08 (hourly rotation)

Docker logs (docker logs <container>) also show the same output — confirming the additive
behaviour: when CLP_LOGS_DIR is set, services log to both stdout and file.

4. Test all-external services in Kubernetes (bundled guards, PR #1681 Scenario 4)

Task: Deploy with bundled: [] and all four services (database, queue, redis, results-cache)
running as external Docker containers. Verify no StatefulSets/Services are created for bundled
resources, and all CLP pods connect to external hosts.

Setup — start four external services on host (192.168.3.89):

docker run -d --name ext-k8s-mariadb -p 192.168.3.89:13306:3306 \
  -e MYSQL_ROOT_PASSWORD=root-pass -e MYSQL_DATABASE=clp-db \
  -e MYSQL_USER=clp-user -e MYSQL_PASSWORD=pass mariadb:10-jammy

docker run -d --name ext-k8s-rabbitmq -p 192.168.3.89:15672:5672 \
  -e RABBITMQ_DEFAULT_USER=clp-user -e RABBITMQ_DEFAULT_PASS=pass rabbitmq:3.9.8

docker run -d --name ext-k8s-redis -p 192.168.3.89:16379:6379 \
  redis:7.2.4 redis-server --requirepass 'pass'

docker run -d --name ext-k8s-mongodb -p 192.168.3.89:17017:27017 \
  mongo:7.0.1 mongod --replSet rs0 --bind_ip_all

Command:

helm install test tools/deployment/package-helm \
  --set "image.clpPackage.repository=clp-package" \
  --set "image.clpPackage.tag=dev-junhao-e83c" \
  --set "image.clpPackage.pullPolicy=Never" \
  --set 'clpConfig.bundled={}' \
  --set "clpConfig.database.host=192.168.3.89" \
  --set "clpConfig.database.port=13306" \
  --set "clpConfig.database.credentials.username=clp-user" \
  --set "clpConfig.database.credentials.password=pass" \
  --set "clpConfig.database.credentials.root_password=root-pass" \
  --set "clpConfig.queue.host=192.168.3.89" \
  --set "clpConfig.queue.port=15672" \
  --set "clpConfig.queue.credentials.username=clp-user" \
  --set "clpConfig.queue.credentials.password=pass" \
  --set "clpConfig.redis.host=192.168.3.89" \
  --set "clpConfig.redis.port=16379" \
  --set "clpConfig.redis.credentials.password=pass" \
  --set "clpConfig.results_cache.host=192.168.3.89" \
  --set "clpConfig.results_cache.port=17017"

Pod status (2026-02-26 13:43 UTC):

NAME                                              READY   STATUS      RESTARTS   AGE
test-clp-api-server-f65f6f96-2lwp9                1/1     Running     0          2m1s
test-clp-compression-scheduler-6754bb4d8d-m6m5d   1/1     Running     0          2m1s
test-clp-compression-worker-754dd997d7-sps7f      1/1     Running     0          2m1s
test-clp-db-table-creator-zkwpz                   0/1     Completed   0          2m1s
test-clp-garbage-collector-5b54bb6d58-qflvp       1/1     Running     0          2m1s
test-clp-query-scheduler-cfcfcb86d-67s6x          1/1     Running     0          2m1s
test-clp-query-worker-676b6f5b5d-dcz9h            1/1     Running     0          2m1s
test-clp-reducer-6f4c8fc765-w88hp                 1/1     Running     0          2m1s
test-clp-results-cache-indices-creator-fw5jx      0/1     Completed   0          2m1s
test-clp-webui-7f57f545f9-q68s5                   1/1     Running     0          2m1s

No database, queue, redis, or results-cache StatefulSets — all external.

4.1 Verify no bundled resources

$ kubectl get statefulsets
No resources found in default namespace.

$ kubectl get svc
NAME                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)           AGE
kubernetes                 ClusterIP   10.96.0.1       <none>        443/TCP           2m27s
test-clp-api-server        NodePort    10.96.232.152   <none>        3001:30301/TCP    2m6s
test-clp-query-scheduler   ClusterIP   None            <none>        7000/TCP          2m6s
test-clp-reducer           ClusterIP   None            <none>        14009/TCP,...     2m6s
test-clp-webui             NodePort    10.96.126.123   <none>        4000:30000/TCP    2m6s

$ kubectl get pvc
NAME                            STATUS   VOLUME      CAPACITY   ACCESS MODES   STORAGECLASS   AGE
test-clp-shared-data-archives   Bound    pvc-...     50Gi       RWO            standard       2m6s
test-clp-shared-data-streams    Bound    pvc-...     20Gi       RWO            standard       2m6s

Only 2 PVCs (shared-data), no database/results-cache PVCs.

4.2 Verify external service connections

compression-worker → external RabbitMQ + Redis:

.> transport:   amqp://clp-user:**@192.168.3.89:15672//
.> results:     redis://default:**@192.168.3.89:16379/1

query-scheduler → external MariaDB:

2026-02-26 13:42:45,928 search-job-handler [INFO] Connected to archive database 192.168.3.89:13306.
2026-02-26 13:42:45,928 search-job-handler [INFO] query_scheduler started.

All CLP pods connect to external services. Bundled guards correctly exclude all
StatefulSet/Service resources when bundled: [].

- Log to stdout/stderr instead of host-path log PVs.
- Use dynamic provisioning for persistent data (database, results-cache, shared-data).
- Use emptyDir for ephemeral/temp volumes (Redis, staging, worker tmp).
- Delete static PV/PVC templates and the local-storage StorageClass.
@junhaoliao junhaoliao requested a review from a team as a code owner February 24, 2026 14:18
@junhaoliao junhaoliao requested a review from hoophalab February 24, 2026 14:18
@coderabbitai

coderabbitai Bot commented Feb 24, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Centralizes logging into new clp-rust-utils logging APIs, removes many per-component file log handlers and CLP_HOME filesystem conventions, converts component logs to stdout/emptyDir, removes numerous static PV/PVC and pod securityContext entries from Helm charts, and updates Python/Rust services to use the centralized logging configurator.

Changes

Cohort / File(s) Summary
Rust logging library
components/clp-rust-utils/src/logging.rs, components/clp-rust-utils/src/lib.rs, components/clp-rust-utils/Cargo.toml
Add public logging module and set_up_logging(log_filename: &str) -> Option<WorkerGuard>; add tracing-appender/tracing-subscriber deps here.
Rust binaries (centralized logging usage)
components/api-server/src/bin/api_server.rs, components/log-ingestor/src/bin/log_ingestor.rs, components/api-server/Cargo.toml, components/log-ingestor/Cargo.toml
Remove local set_up_logging helpers and tracing deps; binaries call clp_rust_utils::logging::set_up_logging("<name>.log").
Python logging util
components/clp-py-utils/clp_py_utils/clp_logging.py
Add configure_logging(logger, component_name) to optionally attach a file handler when CLP_LOGS_DIR set; change set_logging_level to accept None.
Python services & job workers
components/clp-mcp-server/.../clp_mcp_server.py, components/job-orchestration/.../garbage_collector/*.py, .../reducer/reducer.py, .../scheduler/.../*.py, components/job-orchestration/.../garbage_collector/utils.py
Replace ad-hoc FileHandler/configure_logger usage with configure_logging(logger, component_name); remove log-directory/logging-level parameters from several public GC entrypoints; reducer supports optional per-reducer log files and closes file handles when used.
Helm helpers & chart metadata
tools/deployment/package-helm/templates/_helpers.tpl, tools/deployment/package-helm/Chart.yaml, tools/deployment/package-helm/templates/storage-class.yaml
Remove clp.createStaticPv helper; change clp.createPvc to rely on dynamic provisioning (omit storageClassName/selector); remove local-storage StorageClass; bump Chart version.
Deployment manifests — logs, volumes, securityContext
tools/deployment/package-helm/templates/*-deployment.yaml, *-statefulset.yaml, *-job.yaml (many files)
Remove pod-level securityContext entries, drop CLP_LOGS_DIR env vars and PVC-backed logs mounts; switch logs to stdout or emptyDir; simplify and remove logs PVCs/PVs.
PV/PVC templates removed or altered
tools/deployment/package-helm/templates/*-logs-*.yaml, *-*-pv.yaml, *-*-pvc.yaml (30+ files)
Delete numerous static PV and conditional PVC template blocks (logs, staged data, shared-data); adjust some PVC accessModes from ReadWriteMany → ReadWriteOnce.
Config, values & setup script
tools/deployment/package-helm/templates/configmap.yaml, tools/deployment/package-helm/values.yaml, tools/deployment/package-helm/.set-up-common.sh
Remove CLP_HOME/data/log/tmp directory setup and log-file config; drop securityContext defaults from values; simplify setup script to only create samples.
Documentation / quick-start
docs/src/.../design-deployment-orchestration.md, docs/src/user-docs/guides-k8s-deployment.md, docs/src/user-docs/quick-start/clp-json.md, docs/src/user-docs/quick-start/clp-text.md
Remove CLP_HOME and hostPath mount guidance; document dynamic PVCs/ephemeral state and logs-to-stdout behaviour; simplify quick-start steps and Helm flags.
Misc: Rust cargo updates
components/api-server/Cargo.toml, components/log-ingestor/Cargo.toml
Remove tracing-appender and tracing-subscriber from these crates (moved to clp-rust-utils).

Sequence Diagram(s)

sequenceDiagram
    participant Binary
    participant clp_rust_utils as clp_rust_utils::logging
    participant Env as Environment (CLP_LOGS_DIR)
    participant FS as Filesystem / Stdout

    Binary->>clp_rust_utils: set_up_logging("component.log")
    clp_rust_utils->>Env: read CLP_LOGS_DIR
    alt CLP_LOGS_DIR set
        clp_rust_utils->>FS: create RollingFileAppender at CLP_LOGS_DIR/component.log
        clp_rust_utils->>FS: create non-blocking writer + JSON formatter
        clp_rust_utils-->>Binary: return Some(WorkerGuard)
        Binary->>FS: logs -> file + stdout (JSON)
    else CLP_LOGS_DIR not set
        clp_rust_utils->>FS: configure stdout JSON logging only
        clp_rust_utils-->>Binary: return None
        Binary->>FS: logs -> stdout (JSON)
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 47.62% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely describes the main change—removing all host-path volume mounts from the Helm chart. It aligns with the comprehensive refactoring detailed in the PR objectives.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (8)
tools/deployment/package-helm/templates/compression-worker-deployment.yaml (1)

41-52: ⚠️ Potential issue | 🟡 Minor

Change CLP_LOGS_DIR from /tmp to /var/tmp to match the emptyDir mount.

The compression-worker writes stderr logs to files under CLP_LOGS_DIR (see compression_task.py line 424), but line 41 sets it to /tmp (container ephemeral layer) while line 52 mounts an emptyDir at /var/tmp. This mismatch causes logs to be lost on container restart. Align the environment variable with the mounted volume:

-            - name: "CLP_LOGS_DIR"
-              value: "/tmp"
+            - name: "CLP_LOGS_DIR"
+              value: "/var/tmp"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/deployment/package-helm/templates/compression-worker-deployment.yaml`
around lines 41 - 52, Set the CLP_LOGS_DIR env var to the mounted emptyDir path
so logs persist: change the CLP_LOGS_DIR value from "/tmp" to "/var/tmp" in the
deployment template to match the volumeMount mountPath "/var/tmp" (the volume
defined via include "clp.volumeName" for component_category
"compression-worker"). Ensure this aligns with how compression_task.py writes
stderr logs (referenced via CLP_LOGS_DIR).
tools/deployment/package-helm/templates/shared-data-streams-pvc.yaml (1)

1-8: ⚠️ Potential issue | 🟠 Major

RWO blocks multi-node filesystem storage for streams.

Line 7 restricts the streams PVC to ReadWriteOnce, which prevents multiple worker pods on different nodes from mounting the same shared filesystem. That breaks the multi-node filesystem storage path described in the docs. Consider keeping ReadWriteMany or making access modes configurable.

🔧 Suggested fix
-  "accessModes" (list "ReadWriteOnce")
+  "accessModes" (list "ReadWriteMany")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/deployment/package-helm/templates/shared-data-streams-pvc.yaml` around
lines 1 - 8, The PVC template currently hardcodes accessModes to ReadWriteOnce
in the clp.createPvc invocation for the "streams" shared-data component, which
prevents multi-node mounts; update the invocation in
templates/shared-data-streams-pvc.yaml to use ReadWriteMany (or expose
.Values.clpConfig.stream_output.storage.accessModes and default to
ReadWriteMany) instead of the fixed (list "ReadWriteOnce") so multiple worker
pods on different nodes can mount the same filesystem; ensure the change is
applied to the clp.createPvc call that supplies "component_category"
"shared-data" and "name" "streams".
tools/deployment/package-helm/templates/shared-data-archives-pvc.yaml (1)

1-8: ⚠️ Potential issue | 🟠 Major

RWO prevents archives PVC from being shared across workers.

Line 7 changes archives to ReadWriteOnce, which blocks multi-node access to the shared archives volume. For filesystem-based multi-node deployments this needs RWX (or an explicit RWX-capable StorageClass).

🔧 Suggested fix
-  "accessModes" (list "ReadWriteOnce")
+  "accessModes" (list "ReadWriteMany")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/deployment/package-helm/templates/shared-data-archives-pvc.yaml` around
lines 1 - 8, The PVC for archives is being created with accessModes set to
ReadWriteOnce via the clp.createPvc invocation, which prevents multi-node
mounts; update the call in shared-data-archives-pvc.yaml (the include
"clp.createPvc" invocation for component_category "shared-data" and name
"archives") so the accessModes list uses ReadWriteMany (or reads the value from
a new configurable .Values.clpConfig.archive_output.storage.accessMode) to allow
RWX mounts on multi-node filesystem deployments; ensure the change preserves
existing capacity and name parameters and falls back to ReadWriteOnce if RWX is
not requested.
tools/deployment/package-helm/templates/_helpers.tpl (1)

113-138: ⚠️ Potential issue | 🟠 Major

PVC helper should still allow explicit StorageClass selection.

With storageClassName/selector removed, users can’t target an RWX-capable StorageClass or clusters without a default StorageClass. This can leave PVCs Pending and blocks multi-node filesystem storage. Please keep an optional override even if the default is preferred.

🔧 Suggested fix (optional StorageClass override)
 spec:
   accessModes: {{ .accessModes }}
+  {{- with .storageClassName }}
+  storageClassName: {{ . | quote }}
+  {{- end }}
   resources:
     requests:
       storage: {{ .capacity }}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/deployment/package-helm/templates/_helpers.tpl` around lines 113 - 138,
The PVC helper clp.createPvc removed storageClassName/selector causing PVCs to
hang when no default or when RWX is required; re-add an optional override in the
template so callers can pass a storage class or selector (e.g., check for
.storageClassName and/or .selector on the template context and emit a
storageClassName: <value> line or selector: <map> block when present). Update
the clp.createPvc definition to conditionally render storageClassName (using
.storageClassName | quote) and/or selector (render the map from .selector or
.storageClassSelector under spec) while leaving behavior unchanged when those
keys are not provided; keep references to clp.fullname and clp.volumeName
unchanged.
docs/src/dev-docs/design-deployment-orchestration.md (1)

242-248: ⚠️ Potential issue | 🟡 Minor

Update the storage intro to reflect stdout logging.

The lead sentence still says services require persistent storage for logs, but the Kubernetes bullet now states logs are emitted to stdout/stderr. Please drop “logs” from the lead sentence to avoid mixed signals.

✍️ Suggested fix
-Services require persistent storage for logs, data, archives, and streams.
+Services require persistent storage for data, archives, and streams.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/src/dev-docs/design-deployment-orchestration.md` around lines 242 - 248,
The lead sentence currently reads "Services require persistent storage for logs,
data, archives, and streams." which conflicts with the Kubernetes note about
emitting logs to stdout/stderr; update that sentence to remove "logs" so it
reads something like "Services require persistent storage for data, archives,
and streams." Modify the opening line in the same paragraph (the one before the
"Docker Compose" and "Kubernetes" bullets) to remove the word "logs" and ensure
the rest of the bullets remain unchanged so the Kubernetes note about
stdout/stderr stays accurate.
docs/src/user-docs/guides-k8s-deployment.md (1)

128-145: ⚠️ Potential issue | 🟡 Minor

Clarify RWX requirement and configuration for shared storage.

This section states that workers must share filesystem paths, but it doesn’t mention that the underlying PVCs must support ReadWriteMany (or be backed by an RWX-capable StorageClass). Without that, multi-node mounts will fail. Please add a short note about RWX and how to select the appropriate StorageClass in Helm values.

✍️ Suggested doc tweak
-   For multi-node clusters using filesystem storage, the following directories **must** be
+   For multi-node clusters using filesystem storage, the following directories **must** be
    accessible from all worker nodes at the same paths. Without shared storage, compressed logs
    created by one worker cannot be searched by other workers.
+   Ensure the StorageClass backing the archives/streams PVCs supports ReadWriteMany (RWX) and is
+   selected in your Helm values if your cluster’s default StorageClass is not RWX-capable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/src/user-docs/guides-k8s-deployment.md` around lines 128 - 145, Update
the "Shared storage for workers" section to explicitly state that the underlying
PVCs must support ReadWriteMany (RWX) for multi-node mounts to work, and add a
short note pointing users to pick an RWX-capable StorageClass in their Helm
values; reference the existing directory keys (archive_output.storage.directory,
stream_output.storage.directory, logs_input.directory) and say that the
corresponding PVC/storageClass used by the chart must be RWX (and indicate the
Helm values field where to set the storageClass used by persistent volumes so
operators know to choose an RWX-backed StorageClass).
components/log-ingestor/src/bin/log_ingestor.rs (1)

46-82: 🧹 Nitpick | 🔵 Trivial

Extract set_up_logging into clp-rust-utils to eliminate cross-component duplication.

This function is virtually identical to set_up_logging in components/api-server/src/bin/api_server.rs (lines 43–79); the only difference is the log filename ("log_ingestor.log" vs "api_server.log"). A shared helper in clp-rust-utils that accepts the service name would prevent this duplication across multiple binary crates:

♻️ Sketch of a shared helper
// e.g. in clp-rust-utils/src/logging.rs
pub fn init_logging(
    service_name: &str,
) -> anyhow::Result<Option<tracing_appender::non_blocking::WorkerGuard>> {
    let log_file_name = format!("{service_name}.log");
    if let Ok(logs_directory) = std::env::var("CLP_LOGS_DIR") {
        let file_appender = RollingFileAppender::new(
            Rotation::HOURLY,
            std::path::Path::new(logs_directory.as_str()),
            &log_file_name,
        );
        let (non_blocking_writer, guard) = tracing_appender::non_blocking(file_appender);
        init_subscriber(std::io::stdout.and(non_blocking_writer));
        return Ok(Some(guard));
    }
    init_subscriber(std::io::stdout);
    Ok(None)
}

Additionally, the subscriber builder chain (event format, env filter, ansi) is repeated in both the file-logging and stdout-only branches within this function. Extracting a small helper that accepts a writer would collapse that internal duplication as well.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/log-ingestor/src/bin/log_ingestor.rs` around lines 46 - 82,
Extract the duplicated set_up_logging into clp-rust-utils as a public function
(e.g., logging::init_logging(service_name: &str) ->
anyhow::Result<Option<tracing_appender::non_blocking::WorkerGuard>>) that builds
the log filename from service_name, creates the RollingFileAppender and
tracing_appender::non_blocking writer when CLP_LOGS_DIR is set, and returns the
guard; inside clp-rust-utils also add a small helper (e.g., init_subscriber<W:
MakeWriter + 'static>(writer: W)) that configures the subscriber chain
(event_format with level/target/file/line/json, EnvFilter::from_default_env,
with_ansi(false)) and calls .with_writer(writer).init(); replace the local
set_up_logging functions in log_ingestor (and api_server) to call
logging::init_logging("log_ingestor") (or "api_server") so the file-vs-stdout
branching and subscriber setup are centralized.
tools/deployment/package-helm/templates/queue-statefulset.yaml (1)

1-49: ⚠️ Potential issue | 🟠 Major

Confirm whether RabbitMQ data should persist across pod restarts.

The StatefulSet has no volumeClaimTemplates and no explicit volume mount for /var/lib/rabbitmq. The queue will use emptyDir semantics, losing all broker state on pod restart or rescheduling.

The design documentation lists persistent storage for database, results-cache, archives, and streams, and explicitly mentions emptyDir for Redis and staging directories—but does not list the queue as either persistent or ephemeral. This is undocumented.

If ephemeral queues are intentional (workers re-declare queues on connect and tolerate in-flight message loss), this must be explicitly documented. Otherwise, add a volumeClaimTemplates section matching the pattern used in database-statefulset.yaml and results-cache-statefulset.yaml.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/deployment/package-helm/templates/queue-statefulset.yaml` around lines
1 - 49, StatefulSet "queue" currently has no volumeClaimTemplates or volumeMount
for /var/lib/rabbitmq, causing RabbitMQ data to be ephemeral; decide whether
persistence is required and if so add a volumeClaimTemplates block and a
corresponding volumeMount in the "queue" container to mount /var/lib/rabbitmq
(follow the same PVC pattern used by database-statefulset and
results-cache-statefulset), otherwise update documentation to explicitly state
the queue is ephemeral and explain broker restart/ message-loss behavior; adjust
the StatefulSet spec (replicas/selector/template) around the container named
"queue" and include the PVC name used by the volumeClaimTemplates so the mount
references the new claim.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/job-orchestration/job_orchestration/reducer/reducer.py`:
- Around line 71-90: The loop that spawns reducers (building
reducer_instance_cmd, opening log_file and calling subprocess.Popen) must be
wrapped in a try/finally so any exception during open() or Popen() still closes
resources: in the finally iterate reducer_log_files and call .close() for each
opened file and remove/close the optional logging_file_handler, and also
terminate/cleanup any partially-started processes recorded in reducers; update
code around reducer_log_files, reducers, reducer_instance_cmd and
subprocess.Popen to ensure files and handlers are closed regardless of
exceptions.

In
`@components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py`:
- Around line 426-432: The optional file logging code in
compression_scheduler.py should guard against missing/unwritable CLP_LOGS_DIR:
before creating logging.FileHandler for log_file, check if
os.path.isdir(logs_dir) and attempt to create it (os.makedirs(logs_dir,
exist_ok=True)) and validate write permissions; wrap FileHandler creation in a
try/except to catch OSError/PermissionError and on failure skip adding the
FileHandler (or fallback to console) while logging the failure via the existing
logger (created by get_logger()). Ensure references to CLP_LOGS_DIR, log_file,
logging.FileHandler, and logger in compression_scheduler.py are updated
accordingly.

In `@tools/deployment/package-helm/.set-up-common.sh`:
- Around line 10-13: Update the JSDoc-style comment for the
create_clp_directories() function to reflect its narrowed behavior: it now only
creates the "samples" subdirectory under CLP_HOME rather than all "CLP data".
Edit the comment to briefly state that the function creates the CLP samples
directory (e.g., "Creates the CLP samples directory at ${CLP_HOME}/samples") and
remove or replace the misleading "CLP data" phrasing so the comment matches the
mkdir -p "$CLP_HOME/samples" implementation.

In `@tools/deployment/package-helm/templates/query-worker-deployment.yaml`:
- Around line 106-110: The emptyDir for the staged-streams volume currently has
no sizeLimit; update the Helm template that defines the volume (the include
"clp.volumeName" with dict
"component_category":"query-worker","name":"staged-streams") to add a sizeLimit
to the emptyDir (e.g., "5Gi") or wire it to a values key so operators can
configure it, ensuring the emptyDir: block includes sizeLimit to prevent
unbounded node disk usage.

In `@tools/deployment/package-helm/templates/redis-statefulset.yaml`:
- Around line 63-70: The templated volume block triggers YAMLlint brace-spacing
errors; fix by normalizing Helm template whitespace inside the mustaches for the
volume name and configMap name: collapse the multiline dict into a single
expression and ensure single spaces immediately after '{{' and before '}}' for
the include calls (references: include "clp.volumeName" and include
"clp.fullname"), so the volume name and configMap name lines render with
properly spaced braces and no extra line breaks.

In `@tools/deployment/package-helm/templates/results-cache-statefulset.yaml`:
- Around line 65-68: The templates currently set resources.requests.storage to a
smaller hard-coded value which will cause Helm upgrades to fail when an existing
PVC is larger; change the template so volumeClaimTemplates (or the field where
resources.requests.storage is defined) uses a configurable value (e.g.,
.Values.persistence.size) and fall back to the existing PVC size via the Helm
lookup function if present (use lookup to read the current PersistentVolumeClaim
and use its spec.resources.requests.storage), or keep the previous default size
(e.g., 20Gi) to avoid shrinking; update the template to reference that value
instead of the hard-coded "10Gi".

---

Outside diff comments:
In `@components/log-ingestor/src/bin/log_ingestor.rs`:
- Around line 46-82: Extract the duplicated set_up_logging into clp-rust-utils
as a public function (e.g., logging::init_logging(service_name: &str) ->
anyhow::Result<Option<tracing_appender::non_blocking::WorkerGuard>>) that builds
the log filename from service_name, creates the RollingFileAppender and
tracing_appender::non_blocking writer when CLP_LOGS_DIR is set, and returns the
guard; inside clp-rust-utils also add a small helper (e.g., init_subscriber<W:
MakeWriter + 'static>(writer: W)) that configures the subscriber chain
(event_format with level/target/file/line/json, EnvFilter::from_default_env,
with_ansi(false)) and calls .with_writer(writer).init(); replace the local
set_up_logging functions in log_ingestor (and api_server) to call
logging::init_logging("log_ingestor") (or "api_server") so the file-vs-stdout
branching and subscriber setup are centralized.

In `@docs/src/dev-docs/design-deployment-orchestration.md`:
- Around line 242-248: The lead sentence currently reads "Services require
persistent storage for logs, data, archives, and streams." which conflicts with
the Kubernetes note about emitting logs to stdout/stderr; update that sentence
to remove "logs" so it reads something like "Services require persistent storage
for data, archives, and streams." Modify the opening line in the same paragraph
(the one before the "Docker Compose" and "Kubernetes" bullets) to remove the
word "logs" and ensure the rest of the bullets remain unchanged so the
Kubernetes note about stdout/stderr stays accurate.

In `@docs/src/user-docs/guides-k8s-deployment.md`:
- Around line 128-145: Update the "Shared storage for workers" section to
explicitly state that the underlying PVCs must support ReadWriteMany (RWX) for
multi-node mounts to work, and add a short note pointing users to pick an
RWX-capable StorageClass in their Helm values; reference the existing directory
keys (archive_output.storage.directory, stream_output.storage.directory,
logs_input.directory) and say that the corresponding PVC/storageClass used by
the chart must be RWX (and indicate the Helm values field where to set the
storageClass used by persistent volumes so operators know to choose an
RWX-backed StorageClass).

In `@tools/deployment/package-helm/templates/_helpers.tpl`:
- Around line 113-138: The PVC helper clp.createPvc removed
storageClassName/selector causing PVCs to hang when no default or when RWX is
required; re-add an optional override in the template so callers can pass a
storage class or selector (e.g., check for .storageClassName and/or .selector on
the template context and emit a storageClassName: <value> line or selector:
<map> block when present). Update the clp.createPvc definition to conditionally
render storageClassName (using .storageClassName | quote) and/or selector
(render the map from .selector or .storageClassSelector under spec) while
leaving behavior unchanged when those keys are not provided; keep references to
clp.fullname and clp.volumeName unchanged.

In `@tools/deployment/package-helm/templates/compression-worker-deployment.yaml`:
- Around line 41-52: Set the CLP_LOGS_DIR env var to the mounted emptyDir path
so logs persist: change the CLP_LOGS_DIR value from "/tmp" to "/var/tmp" in the
deployment template to match the volumeMount mountPath "/var/tmp" (the volume
defined via include "clp.volumeName" for component_category
"compression-worker"). Ensure this aligns with how compression_task.py writes
stderr logs (referenced via CLP_LOGS_DIR).

In `@tools/deployment/package-helm/templates/queue-statefulset.yaml`:
- Around line 1-49: StatefulSet "queue" currently has no volumeClaimTemplates or
volumeMount for /var/lib/rabbitmq, causing RabbitMQ data to be ephemeral; decide
whether persistence is required and if so add a volumeClaimTemplates block and a
corresponding volumeMount in the "queue" container to mount /var/lib/rabbitmq
(follow the same PVC pattern used by database-statefulset and
results-cache-statefulset), otherwise update documentation to explicitly state
the queue is ephemeral and explain broker restart/ message-loss behavior; adjust
the StatefulSet spec (replicas/selector/template) around the container named
"queue" and include the PVC name used by the volumeClaimTemplates so the mount
references the new claim.

In `@tools/deployment/package-helm/templates/shared-data-archives-pvc.yaml`:
- Around line 1-8: The PVC for archives is being created with accessModes set to
ReadWriteOnce via the clp.createPvc invocation, which prevents multi-node
mounts; update the call in shared-data-archives-pvc.yaml (the include
"clp.createPvc" invocation for component_category "shared-data" and name
"archives") so the accessModes list uses ReadWriteMany (or reads the value from
a new configurable .Values.clpConfig.archive_output.storage.accessMode) to allow
RWX mounts on multi-node filesystem deployments; ensure the change preserves
existing capacity and name parameters and falls back to ReadWriteOnce if RWX is
not requested.

In `@tools/deployment/package-helm/templates/shared-data-streams-pvc.yaml`:
- Around line 1-8: The PVC template currently hardcodes accessModes to
ReadWriteOnce in the clp.createPvc invocation for the "streams" shared-data
component, which prevents multi-node mounts; update the invocation in
templates/shared-data-streams-pvc.yaml to use ReadWriteMany (or expose
.Values.clpConfig.stream_output.storage.accessModes and default to
ReadWriteMany) instead of the fixed (list "ReadWriteOnce") so multiple worker
pods on different nodes can mount the same filesystem; ensure the change is
applied to the clp.createPvc call that supplies "component_category"
"shared-data" and "name" "streams".

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7fb76c3 and d6e3362.

📒 Files selected for processing (73)
  • components/api-server/src/bin/api_server.rs
  • components/clp-mcp-server/clp_mcp_server/clp_mcp_server.py
  • components/job-orchestration/job_orchestration/garbage_collector/archive_garbage_collector.py
  • components/job-orchestration/job_orchestration/garbage_collector/garbage_collector.py
  • components/job-orchestration/job_orchestration/garbage_collector/search_result_garbage_collector.py
  • components/job-orchestration/job_orchestration/garbage_collector/utils.py
  • components/job-orchestration/job_orchestration/reducer/reducer.py
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
  • components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py
  • components/log-ingestor/src/bin/log_ingestor.rs
  • docs/src/dev-docs/design-deployment-orchestration.md
  • docs/src/user-docs/guides-k8s-deployment.md
  • docs/src/user-docs/quick-start/clp-json.md
  • docs/src/user-docs/quick-start/clp-text.md
  • tools/deployment/package-helm/.set-up-common.sh
  • tools/deployment/package-helm/Chart.yaml
  • tools/deployment/package-helm/templates/_helpers.tpl
  • tools/deployment/package-helm/templates/api-server-deployment.yaml
  • tools/deployment/package-helm/templates/api-server-logs-pv.yaml
  • tools/deployment/package-helm/templates/api-server-logs-pvc.yaml
  • tools/deployment/package-helm/templates/compression-scheduler-deployment.yaml
  • tools/deployment/package-helm/templates/compression-scheduler-logs-pv.yaml
  • tools/deployment/package-helm/templates/compression-scheduler-logs-pvc.yaml
  • tools/deployment/package-helm/templates/compression-scheduler-user-logs-pv.yaml
  • tools/deployment/package-helm/templates/compression-scheduler-user-logs-pvc.yaml
  • tools/deployment/package-helm/templates/compression-worker-deployment.yaml
  • tools/deployment/package-helm/templates/compression-worker-logs-pv.yaml
  • tools/deployment/package-helm/templates/compression-worker-logs-pvc.yaml
  • tools/deployment/package-helm/templates/compression-worker-staged-archives-pv.yaml
  • tools/deployment/package-helm/templates/compression-worker-staged-archives-pvc.yaml
  • tools/deployment/package-helm/templates/compression-worker-tmp-pv.yaml
  • tools/deployment/package-helm/templates/compression-worker-tmp-pvc.yaml
  • tools/deployment/package-helm/templates/configmap.yaml
  • tools/deployment/package-helm/templates/database-data-pv.yaml
  • tools/deployment/package-helm/templates/database-logs-pv.yaml
  • tools/deployment/package-helm/templates/database-statefulset.yaml
  • tools/deployment/package-helm/templates/db-table-creator-job.yaml
  • tools/deployment/package-helm/templates/garbage-collector-deployment.yaml
  • tools/deployment/package-helm/templates/garbage-collector-logs-pv.yaml
  • tools/deployment/package-helm/templates/garbage-collector-logs-pvc.yaml
  • tools/deployment/package-helm/templates/log-ingestor-deployment.yaml
  • tools/deployment/package-helm/templates/log-ingestor-logs-pv.yaml
  • tools/deployment/package-helm/templates/log-ingestor-logs-pvc.yaml
  • tools/deployment/package-helm/templates/mcp-server-deployment.yaml
  • tools/deployment/package-helm/templates/mcp-server-logs-pv.yaml
  • tools/deployment/package-helm/templates/mcp-server-logs-pvc.yaml
  • tools/deployment/package-helm/templates/query-scheduler-deployment.yaml
  • tools/deployment/package-helm/templates/query-scheduler-logs-pv.yaml
  • tools/deployment/package-helm/templates/query-scheduler-logs-pvc.yaml
  • tools/deployment/package-helm/templates/query-worker-deployment.yaml
  • tools/deployment/package-helm/templates/query-worker-logs-pv.yaml
  • tools/deployment/package-helm/templates/query-worker-logs-pvc.yaml
  • tools/deployment/package-helm/templates/query-worker-staged-streams-pv.yaml
  • tools/deployment/package-helm/templates/query-worker-staged-streams-pvc.yaml
  • tools/deployment/package-helm/templates/queue-logs-pv.yaml
  • tools/deployment/package-helm/templates/queue-statefulset.yaml
  • tools/deployment/package-helm/templates/redis-data-pv.yaml
  • tools/deployment/package-helm/templates/redis-logs-pv.yaml
  • tools/deployment/package-helm/templates/redis-statefulset.yaml
  • tools/deployment/package-helm/templates/reducer-deployment.yaml
  • tools/deployment/package-helm/templates/reducer-logs-pv.yaml
  • tools/deployment/package-helm/templates/reducer-logs-pvc.yaml
  • tools/deployment/package-helm/templates/results-cache-data-pv.yaml
  • tools/deployment/package-helm/templates/results-cache-indices-creator-job.yaml
  • tools/deployment/package-helm/templates/results-cache-logs-pv.yaml
  • tools/deployment/package-helm/templates/results-cache-statefulset.yaml
  • tools/deployment/package-helm/templates/shared-data-archives-pv.yaml
  • tools/deployment/package-helm/templates/shared-data-archives-pvc.yaml
  • tools/deployment/package-helm/templates/shared-data-streams-pv.yaml
  • tools/deployment/package-helm/templates/shared-data-streams-pvc.yaml
  • tools/deployment/package-helm/templates/storage-class.yaml
  • tools/deployment/package-helm/templates/webui-deployment.yaml
  • tools/deployment/package-helm/values.yaml
💤 Files with no reviewable changes (48)
  • tools/deployment/package-helm/templates/query-scheduler-deployment.yaml
  • tools/deployment/package-helm/templates/compression-worker-tmp-pv.yaml
  • tools/deployment/package-helm/templates/api-server-logs-pv.yaml
  • tools/deployment/package-helm/templates/shared-data-archives-pv.yaml
  • tools/deployment/package-helm/templates/webui-deployment.yaml
  • docs/src/user-docs/quick-start/clp-text.md
  • tools/deployment/package-helm/templates/compression-worker-logs-pv.yaml
  • tools/deployment/package-helm/templates/garbage-collector-logs-pvc.yaml
  • tools/deployment/package-helm/templates/compression-scheduler-user-logs-pvc.yaml
  • tools/deployment/package-helm/templates/compression-scheduler-logs-pvc.yaml
  • tools/deployment/package-helm/templates/query-worker-logs-pv.yaml
  • tools/deployment/package-helm/templates/query-worker-staged-streams-pvc.yaml
  • tools/deployment/package-helm/templates/log-ingestor-logs-pv.yaml
  • tools/deployment/package-helm/templates/shared-data-streams-pv.yaml
  • tools/deployment/package-helm/templates/compression-worker-staged-archives-pv.yaml
  • tools/deployment/package-helm/templates/queue-logs-pv.yaml
  • tools/deployment/package-helm/templates/compression-scheduler-user-logs-pv.yaml
  • tools/deployment/package-helm/templates/compression-scheduler-logs-pv.yaml
  • tools/deployment/package-helm/templates/results-cache-data-pv.yaml
  • tools/deployment/package-helm/templates/compression-worker-logs-pvc.yaml
  • tools/deployment/package-helm/templates/garbage-collector-logs-pv.yaml
  • tools/deployment/package-helm/templates/redis-logs-pv.yaml
  • tools/deployment/package-helm/templates/log-ingestor-logs-pvc.yaml
  • tools/deployment/package-helm/templates/mcp-server-deployment.yaml
  • docs/src/user-docs/quick-start/clp-json.md
  • tools/deployment/package-helm/templates/db-table-creator-job.yaml
  • tools/deployment/package-helm/templates/query-scheduler-logs-pv.yaml
  • tools/deployment/package-helm/templates/api-server-deployment.yaml
  • tools/deployment/package-helm/templates/reducer-logs-pvc.yaml
  • tools/deployment/package-helm/templates/reducer-deployment.yaml
  • tools/deployment/package-helm/templates/query-worker-staged-streams-pv.yaml
  • tools/deployment/package-helm/templates/api-server-logs-pvc.yaml
  • tools/deployment/package-helm/templates/compression-worker-tmp-pvc.yaml
  • tools/deployment/package-helm/templates/database-logs-pv.yaml
  • tools/deployment/package-helm/templates/storage-class.yaml
  • tools/deployment/package-helm/templates/query-worker-logs-pvc.yaml
  • tools/deployment/package-helm/templates/database-data-pv.yaml
  • tools/deployment/package-helm/templates/query-scheduler-logs-pvc.yaml
  • tools/deployment/package-helm/templates/results-cache-indices-creator-job.yaml
  • tools/deployment/package-helm/templates/mcp-server-logs-pv.yaml
  • tools/deployment/package-helm/values.yaml
  • tools/deployment/package-helm/templates/reducer-logs-pv.yaml
  • tools/deployment/package-helm/templates/redis-data-pv.yaml
  • tools/deployment/package-helm/templates/log-ingestor-deployment.yaml
  • tools/deployment/package-helm/templates/database-statefulset.yaml
  • tools/deployment/package-helm/templates/results-cache-logs-pv.yaml
  • tools/deployment/package-helm/templates/mcp-server-logs-pvc.yaml
  • tools/deployment/package-helm/templates/compression-worker-staged-archives-pvc.yaml

Comment on lines +71 to +90
reducer_log_files: list[TextIO] = []
concurrency = max(int(parsed_args.concurrency), 1)
for i in range(concurrency):
reducer_instance_cmd = reducer_cmd + [str(clp_config.reducer.base_port + i)]

log_file_path = logs_dir / ("reducer-" + str(i) + ".log")
log_file = open(log_file_path, "a")
reducer_stdout = sys.stdout
reducer_stderr = sys.stderr
if logs_dir is not None:
log_file_path = logs_dir / f"reducer-{i}.log"
log_file = open(log_file_path, "a")
reducer_log_files.append(log_file)
reducer_stdout = log_file
reducer_stderr = log_file

reducers.append(
subprocess.Popen(
reducer_instance_cmd,
close_fds=True,
stdout=log_file,
stderr=log_file,
stdout=reducer_stdout,
stderr=reducer_stderr,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "reducer.py" | grep job-orchestration | head -5

Repository: y-scope/clp

Length of output: 122


🏁 Script executed:

cat -n ./components/job-orchestration/job_orchestration/reducer/reducer.py | sed -n '1,150p'

Repository: y-scope/clp

Length of output: 4726


🏁 Script executed:

cat -n ./components/job-orchestration/job_orchestration/reducer/reducer.py | sed -n '19,114p' | grep -E "try:|finally:|except"

Repository: y-scope/clp

Length of output: 147


Ensure reducer log files and handlers are closed on exceptions.

If open() or subprocess.Popen() raises during the spawn loop (lines 73-92), file handles in reducer_log_files and the optional logging_file_handler won't be closed. Currently, the cleanup code (lines 104-111) is only reached if the loop completes successfully. Wrap lines 73-103 in a try/finally block to guarantee cleanup of all resources.

🧹 Proposed fix
     reducers = []
     reducer_log_files: list[TextIO] = []
     concurrency = max(int(parsed_args.concurrency), 1)
-    for i in range(concurrency):
-        reducer_instance_cmd = reducer_cmd + [str(clp_config.reducer.base_port + i)]
-
-        reducer_stdout = sys.stdout
-        reducer_stderr = sys.stderr
-        if logs_dir is not None:
-            log_file_path = logs_dir / f"reducer-{i}.log"
-            log_file = open(log_file_path, "a")
-            reducer_log_files.append(log_file)
-            reducer_stdout = log_file
-            reducer_stderr = log_file
-
-        reducers.append(
-            subprocess.Popen(
-                reducer_instance_cmd,
-                close_fds=True,
-                stdout=reducer_stdout,
-                stderr=reducer_stderr,
-            )
-        )
-
-    logger.info("Reducers started.")
-    logger.info(
-        f"Host={clp_config.reducer.host}"
-        f" Base port={clp_config.reducer.base_port}"
-        f" Concurrency={concurrency}"
-        f" Upsert Interval={parsed_args.upsert_interval}"
-    )
-    for i, reducer in enumerate(reducers):
-        reducer.communicate()
-        logger.info(f"reducer-{i} exited with returncode={reducer.returncode}")
-    for reducer_log_file in reducer_log_files:
-        reducer_log_file.close()
-
-    logger.error("All reducers terminated")
-
-    if logging_file_handler is not None:
-        logger.removeHandler(logging_file_handler)
-        logging_file_handler.close()
+    try:
+        for i in range(concurrency):
+            reducer_instance_cmd = reducer_cmd + [str(clp_config.reducer.base_port + i)]
+
+            reducer_stdout = sys.stdout
+            reducer_stderr = sys.stderr
+            if logs_dir is not None:
+                log_file_path = logs_dir / f"reducer-{i}.log"
+                log_file = open(log_file_path, "a")
+                reducer_log_files.append(log_file)
+                reducer_stdout = log_file
+                reducer_stderr = log_file
+
+            reducers.append(
+                subprocess.Popen(
+                    reducer_instance_cmd,
+                    close_fds=True,
+                    stdout=reducer_stdout,
+                    stderr=reducer_stderr,
+                )
+            )
+
+        logger.info("Reducers started.")
+        logger.info(
+            f"Host={clp_config.reducer.host}"
+            f" Base port={clp_config.reducer.base_port}"
+            f" Concurrency={concurrency}"
+            f" Upsert Interval={parsed_args.upsert_interval}"
+        )
+        for i, reducer in enumerate(reducers):
+            reducer.communicate()
+            logger.info(f"reducer-{i} exited with returncode={reducer.returncode}")
+
+        logger.error("All reducers terminated")
+    finally:
+        for reducer_log_file in reducer_log_files:
+            reducer_log_file.close()
+        if logging_file_handler is not None:
+            logger.removeHandler(logging_file_handler)
+            logging_file_handler.close()
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 74-74: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)


[error] 86-86: subprocess call: check for execution of untrusted input

(S603)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/job-orchestration/job_orchestration/reducer/reducer.py` around
lines 71 - 90, The loop that spawns reducers (building reducer_instance_cmd,
opening log_file and calling subprocess.Popen) must be wrapped in a try/finally
so any exception during open() or Popen() still closes resources: in the finally
iterate reducer_log_files and call .close() for each opened file and
remove/close the optional logging_file_handler, and also terminate/cleanup any
partially-started processes recorded in reducers; update code around
reducer_log_files, reducers, reducer_instance_cmd and subprocess.Popen to ensure
files and handlers are closed regardless of exceptions.

Comment on lines +10 to +13
# Creates required directories for CLP data
create_clp_directories() {
echo "Creating CLP directories at ${CLP_HOME}..."
mkdir -p "$CLP_HOME/var/"{data,log}/{database,queue,redis,results_cache} \
"$CLP_HOME/var/data/"{archives,streams,staged-archives,staged-streams} \
"$CLP_HOME/var/log/"{compression_scheduler,compression_worker,user} \
"$CLP_HOME/var/log/"{query_scheduler,query_worker,reducer} \
"$CLP_HOME/var/log/"{api_server,garbage_collector,log_ingestor,mcp_server} \
"$CLP_HOME/var/tmp" \
"$CLP_HOME/samples"
mkdir -p "$CLP_HOME/samples"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Update the function comment to reflect its narrowed scope.

The body now only creates the samples subdirectory, so the phrase "CLP data" in the JSDoc-style comment is a bit misleading.

✏️ Suggested comment wording
-# Creates required directories for CLP data
+# Creates required directories under CLP_HOME
 create_clp_directories() {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/deployment/package-helm/.set-up-common.sh` around lines 10 - 13, Update
the JSDoc-style comment for the create_clp_directories() function to reflect its
narrowed behavior: it now only creates the "samples" subdirectory under CLP_HOME
rather than all "CLP data". Edit the comment to briefly state that the function
creates the CLP samples directory (e.g., "Creates the CLP samples directory at
${CLP_HOME}/samples") and remove or replace the misleading "CLP data" phrasing
so the comment matches the mkdir -p "$CLP_HOME/samples" implementation.

Comment on lines +106 to +110
- name: {{ include "clp.volumeName" (dict
"component_category" "query-worker"
"name" "staged-streams"
) | nindent 10 }}
) | quote }}
emptyDir: {}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider setting a sizeLimit on the emptyDir for staged streams.

The emptyDir: {} for staged streams has no size limit. If the query worker stages large volumes of stream data before uploading to S3, this could consume significant node disk space and potentially affect other pods on the same node. A sizeLimit acts as a guardrail:

          emptyDir:
            sizeLimit: "5Gi"

This is non-blocking — just a resilience improvement to prevent unbounded disk usage on the node.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/deployment/package-helm/templates/query-worker-deployment.yaml` around
lines 106 - 110, The emptyDir for the staged-streams volume currently has no
sizeLimit; update the Helm template that defines the volume (the include
"clp.volumeName" with dict
"component_category":"query-worker","name":"staged-streams") to add a sizeLimit
to the emptyDir (e.g., "5Gi") or wire it to a values key so operators can
configure it, ensuring the emptyDir: block includes sizeLimit to prevent
unbounded node disk usage.

Comment on lines +63 to +70
- name: {{ include "clp.volumeName" (dict
"component_category" "redis"
"name" "data"
) | quote }}
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: "local-storage"
selector:
matchLabels:
{{- include "clp.selectorLabels" . | nindent 12 }}
app.kubernetes.io/component: "redis"
resources:
requests:
storage: "10Gi"
- metadata:
name: {{ include "clp.volumeName" (dict
"component_category" "redis"
"name" "logs"
) | quote }}
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: "local-storage"
selector:
matchLabels:
{{- include "clp.selectorLabels" . | nindent 12 }}
app.kubernetes.io/component: "redis"
resources:
requests:
storage: "5Gi"
emptyDir: {}
- name: "config"
configMap:
name: {{ include "clp.fullname" . }}-config

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix YAMLlint brace-spacing errors in the templated volume block.

Static analysis flagged braces on this block. Either adjust brace spacing or exempt Helm templates from that rule to keep linting green.

🔧 One possible brace-spacing adjustment
-        - name: {{ include "clp.volumeName" (dict
+        - name: {{include "clp.volumeName" (dict
             "component_category" "redis"
             "name" "data"
-          ) | quote }}
+          ) | quote}}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- name: {{ include "clp.volumeName" (dict
"component_category" "redis"
"name" "data"
) | quote }}
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: "local-storage"
selector:
matchLabels:
{{- include "clp.selectorLabels" . | nindent 12 }}
app.kubernetes.io/component: "redis"
resources:
requests:
storage: "10Gi"
- metadata:
name: {{ include "clp.volumeName" (dict
"component_category" "redis"
"name" "logs"
) | quote }}
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: "local-storage"
selector:
matchLabels:
{{- include "clp.selectorLabels" . | nindent 12 }}
app.kubernetes.io/component: "redis"
resources:
requests:
storage: "5Gi"
emptyDir: {}
- name: "config"
configMap:
name: {{ include "clp.fullname" . }}-config
- name: {{include "clp.volumeName" (dict
"component_category" "redis"
"name" "data"
) | quote}}
emptyDir: {}
- name: "config"
configMap:
name: {{ include "clp.fullname" . }}-config
🧰 Tools
🪛 YAMLlint (1.38.0)

[error] 63-63: too many spaces inside braces

(braces)


[error] 66-66: too many spaces inside braces

(braces)


[error] 70-70: too many spaces inside braces

(braces)


[error] 70-70: too many spaces inside braces

(braces)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/deployment/package-helm/templates/redis-statefulset.yaml` around lines
63 - 70, The templated volume block triggers YAMLlint brace-spacing errors; fix
by normalizing Helm template whitespace inside the mustaches for the volume name
and configMap name: collapse the multiline dict into a single expression and
ensure single spaces immediately after '{{' and before '}}' for the include
calls (references: include "clp.volumeName" and include "clp.fullname"), so the
volume name and configMap name lines render with properly spaced braces and no
extra line breaks.

Comment on lines 65 to +68
accessModes: ["ReadWriteOnce"]
storageClassName: "local-storage"
selector:
matchLabels:
{{- include "clp.selectorLabels" . | nindent 12 }}
app.kubernetes.io/component: "results-cache"
resources:
requests:
storage: "20Gi"
- metadata:
name: {{ include "clp.volumeName" (dict
"component_category" "results-cache"
"name" "logs"
) | quote }}
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: "local-storage"
selector:
matchLabels:
{{- include "clp.selectorLabels" . | nindent 12 }}
app.kubernetes.io/component: "results-cache"
resources:
requests:
storage: "5Gi"
storage: "10Gi"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Prevent PVC shrink on upgrade (K8s forbids size reduction).
Reducing the claim from 20Gi to 10Gi will be rejected on clusters with existing 20Gi PVCs, causing Helm upgrades to fail. Please keep the previous size, preserve the existing PVC size (e.g., via values or lookup), or document a required manual migration path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/deployment/package-helm/templates/results-cache-statefulset.yaml`
around lines 65 - 68, The templates currently set resources.requests.storage to
a smaller hard-coded value which will cause Helm upgrades to fail when an
existing PVC is larger; change the template so volumeClaimTemplates (or the
field where resources.requests.storage is defined) uses a configurable value
(e.g., .Values.persistence.size) and fall back to the existing PVC size via the
Helm lookup function if present (use lookup to read the current
PersistentVolumeClaim and use its spec.resources.requests.storage), or keep the
previous default size (e.g., 20Gi) to avoid shrinking; update the template to
reference that value instead of the hard-coded "10Gi".

@junhaoliao junhaoliao marked this pull request as draft February 24, 2026 14:37

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/clp-rust-utils/src/logging.rs`:
- Around line 13-14: The branch that handles CLP_LOGS_DIR currently treats an
empty string as a valid path, causing the appender to be rooted at "."; change
the check that reads std::env::var("CLP_LOGS_DIR") so it only proceeds when the
value is non-empty (e.g., use if let Ok(dir) = std::env::var("CLP_LOGS_DIR") and
guard with !dir.trim().is_empty()), then convert that non-empty string to a Path
with std::path::Path::new(dir.as_str()) and continue creating the file appender;
ensure you reference the CLP_LOGS_DIR env var and the local logs_directory
binding in logging.rs when implementing this guard.
- Around line 18-47: The duplicated tracing_subscriber setup should be extracted
into a shared builder/format variable: create the common event_format (the value
from
tracing_subscriber::fmt::format().with_level(true).with_target(false).with_file(true).with_line_number(true).json())
and the common EnvFilter/ANSI settings once, then call .with_writer(...) with
the differing writers (std::io::stdout.and(non_blocking_writer) in the branch
that returns Some(guard) and std::io::stdout in the other) and call .init() for
each; because Format is owned, clone the shared format/builder (use .clone())
when reusing it for the second call so ownership issues are avoided while
keeping the rest of the configuration identical.

In `@tools/deployment/package-helm/values.yaml`:
- Around line 17-18: Wrap the long comment that mentions filesystem storage and
PVC-backed directories so no line exceeds 100 characters; specifically split the
comment line that references archive_output.storage.type: "fs" and the sentence
about PVC-backed directories and examples (e.g., NFS/CephFS) into two or more
shorter comment lines so each is under the lint limit while preserving the
original wording and examples.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d6e3362 and 8928fdf.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • components/api-server/Cargo.toml
  • components/api-server/src/bin/api_server.rs
  • components/clp-rust-utils/Cargo.toml
  • components/clp-rust-utils/src/lib.rs
  • components/clp-rust-utils/src/logging.rs
  • components/log-ingestor/Cargo.toml
  • components/log-ingestor/src/bin/log_ingestor.rs
  • tools/deployment/package-helm/values.yaml
💤 Files with no reviewable changes (2)
  • components/api-server/Cargo.toml
  • components/log-ingestor/Cargo.toml

Comment on lines +13 to +14
if let Ok(logs_directory) = std::env::var("CLP_LOGS_DIR") {
let logs_directory = std::path::Path::new(logs_directory.as_str());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Empty CLP_LOGS_DIR will create a file appender rooted at the current working directory.

std::env::var returns Ok("") when the variable is set but empty, so this branch will attempt to write log files into ".". Consider checking for a non-empty value:

🛡️ Proposed fix
-    if let Ok(logs_directory) = std::env::var("CLP_LOGS_DIR") {
+    if let Ok(logs_directory) = std::env::var("CLP_LOGS_DIR").and_then(|v| {
+        if v.is_empty() {
+            Err(std::env::VarError::NotPresent)
+        } else {
+            Ok(v)
+        }
+    }) {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if let Ok(logs_directory) = std::env::var("CLP_LOGS_DIR") {
let logs_directory = std::path::Path::new(logs_directory.as_str());
if let Ok(logs_directory) = std::env::var("CLP_LOGS_DIR").and_then(|v| {
if v.is_empty() {
Err(std::env::VarError::NotPresent)
} else {
Ok(v)
}
}) {
let logs_directory = std::path::Path::new(logs_directory.as_str());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/clp-rust-utils/src/logging.rs` around lines 13 - 14, The branch
that handles CLP_LOGS_DIR currently treats an empty string as a valid path,
causing the appender to be rooted at "."; change the check that reads
std::env::var("CLP_LOGS_DIR") so it only proceeds when the value is non-empty
(e.g., use if let Ok(dir) = std::env::var("CLP_LOGS_DIR") and guard with
!dir.trim().is_empty()), then convert that non-empty string to a Path with
std::path::Path::new(dir.as_str()) and continue creating the file appender;
ensure you reference the CLP_LOGS_DIR env var and the local logs_directory
binding in logging.rs when implementing this guard.

Comment on lines +18 to +47
tracing_subscriber::fmt()
.event_format(
tracing_subscriber::fmt::format()
.with_level(true)
.with_target(false)
.with_file(true)
.with_line_number(true)
.json(),
)
.with_env_filter(tracing_subscriber::EnvFilter::from_default_env())
.with_ansi(false)
.with_writer(std::io::stdout.and(non_blocking_writer))
.init();
return Some(guard);
}

tracing_subscriber::fmt()
.event_format(
tracing_subscriber::fmt::format()
.with_level(true)
.with_target(false)
.with_file(true)
.with_line_number(true)
.json(),
)
.with_env_filter(tracing_subscriber::EnvFilter::from_default_env())
.with_ansi(false)
.with_writer(std::io::stdout)
.init();
None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

🧩 Analysis chain

🏁 Script executed:

find . -name "Cargo.toml" -path "*/clp-rust-utils/*" | head -5

Repository: y-scope/clp

Length of output: 93


🏁 Script executed:

cat components/clp-rust-utils/Cargo.toml | grep -A 5 "tracing"

Repository: y-scope/clp

Length of output: 262


🏁 Script executed:

cat -n components/clp-rust-utils/src/logging.rs | head -60

Repository: y-scope/clp

Length of output: 2293


🌐 Web query:

tracing_subscriber 0.3.22 Format type Clone implementation

💡 Result:

In tracing-subscriber 0.3.22, fmt::format::Format implements Clone conditionally:

impl<F: Clone, T: Clone> Clone for Format<F, T>

So Format<F, T> is Clone iff both the format style type F (e.g. Full, Compact, Pretty, Json) and the timer type T implement Clone. [1]

Source: the rustdoc “Trait Implementations” section for Format. [1]

[1] https://docs.rs/tracing-subscriber/latest/tracing_subscriber/fmt/format/struct.Format.html


🌐 Web query:

tracing_subscriber fmt format() builder Clone trait

💡 Result:

In tracing-subscriber (e.g., 0.3.22), tracing_subscriber::fmt::SubscriberBuilder does not implement Clone (its documented trait impls are Debug, Default, and From, but not Clone). [1]

What is Clone is the formatter returned by tracing_subscriber::fmt::format() (fmt::format::Format implements Clone when its inner types do). [2]

So the usual pattern is: clone the Format (and any other clonable config you own), and create a fresh fmt()/Subscriber::builder() each time, rather than trying to clone the SubscriberBuilder. [1][2]

Sources:
[1] SubscriberBuilder trait impls (no Clone) — docs.rs tracing_subscriber::fmt::SubscriberBuilder
[2] Format trait impls (Clone) — docs.rs tracing_subscriber::fmt::format::Format


Extract duplicated subscriber configuration to reduce duplication.

The event format, env filter, and ANSI settings are identical between both branches—only the writer differs. The Format type from tracing_subscriber implements Clone, so the shared configuration can be extracted. Note that when reusing the format variable, you'll need to .clone() it since the builder methods take ownership:

♻️ Proposed refactor
 pub fn set_up_logging(log_filename: &str) -> Option<WorkerGuard> {
+    let event_format = tracing_subscriber::fmt::format()
+        .with_level(true)
+        .with_target(false)
+        .with_file(true)
+        .with_line_number(true)
+        .json();
+
     if let Ok(logs_directory) = std::env::var("CLP_LOGS_DIR") {
         let logs_directory = std::path::Path::new(logs_directory.as_str());
         let file_appender =
             RollingFileAppender::new(Rotation::HOURLY, logs_directory, log_filename);
         let (non_blocking_writer, guard) = tracing_appender::non_blocking(file_appender);
         tracing_subscriber::fmt()
-            .event_format(
-                tracing_subscriber::fmt::format()
-                    .with_level(true)
-                    .with_target(false)
-                    .with_file(true)
-                    .with_line_number(true)
-                    .json(),
-            )
+            .event_format(event_format.clone())
             .with_env_filter(tracing_subscriber::EnvFilter::from_default_env())
             .with_ansi(false)
             .with_writer(std::io::stdout.and(non_blocking_writer))
             .init();
         return Some(guard);
     }
 
     tracing_subscriber::fmt()
-        .event_format(
-            tracing_subscriber::fmt::format()
-                .with_level(true)
-                .with_target(false)
-                .with_file(true)
-                .with_line_number(true)
-                .json(),
-        )
+        .event_format(event_format.clone())
         .with_env_filter(tracing_subscriber::EnvFilter::from_default_env())
         .with_ansi(false)
         .with_writer(std::io::stdout)
         .init();
     None
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/clp-rust-utils/src/logging.rs` around lines 18 - 47, The
duplicated tracing_subscriber setup should be extracted into a shared
builder/format variable: create the common event_format (the value from
tracing_subscriber::fmt::format().with_level(true).with_target(false).with_file(true).with_line_number(true).json())
and the common EnvFilter/ANSI settings once, then call .with_writer(...) with
the differing writers (std::io::stdout.and(non_blocking_writer) in the branch
that returns Some(guard) and std::io::stdout in the other) and call .init() for
each; because Format is owned, clone the shared format/builder (use .clone())
when reusing it for the second call so ownership issues are avoided while
keeping the rest of the configuration identical.

Comment thread tools/deployment/package-helm/values.yaml Outdated
@junhaoliao junhaoliao marked this pull request as ready for review February 24, 2026 16:01

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (4)
components/job-orchestration/job_orchestration/reducer/reducer.py (1)

65-97: ⚠️ Potential issue | 🟡 Minor

Opened reducer log files are leaked if an exception occurs mid-loop.

If open() (line 72) or subprocess.Popen (line 78) raises on iteration N, all file handles opened during iterations 0…N-1 (and on the current iteration for open()) are already in reducer_log_files but the cleanup loop at lines 96-97 is never reached. Wrap the spawn loop and the communicate() block in a try/finally to guarantee cleanup.

🛡️ Proposed fix
     reducers = []
     reducer_log_files: list[TextIO] = []
     concurrency = max(int(parsed_args.concurrency), 1)
-    for i in range(concurrency):
-        reducer_instance_cmd = reducer_cmd + [str(clp_config.reducer.base_port + i)]
-
-        reducer_stdout = sys.stdout
-        reducer_stderr = sys.stderr
-        if logs_dir is not None:
-            log_file_path = logs_dir / f"reducer-{i}.log"
-            log_file = open(log_file_path, "a")
-            reducer_log_files.append(log_file)
-            reducer_stdout = log_file
-            reducer_stderr = log_file
-
-        reducers.append(
-            subprocess.Popen(
-                reducer_instance_cmd,
-                close_fds=True,
-                stdout=reducer_stdout,
-                stderr=reducer_stderr,
-            )
-        )
-
-    logger.info("Reducers started.")
-    logger.info(
-        f"Host={clp_config.reducer.host}"
-        f" Base port={clp_config.reducer.base_port}"
-        f" Concurrency={concurrency}"
-        f" Upsert Interval={parsed_args.upsert_interval}"
-    )
-    for i, reducer in enumerate(reducers):
-        reducer.communicate()
-        logger.info(f"reducer-{i} exited with returncode={reducer.returncode}")
-    for reducer_log_file in reducer_log_files:
-        reducer_log_file.close()
-
-    logger.error("All reducers terminated")
+    try:
+        for i in range(concurrency):
+            reducer_instance_cmd = [*reducer_cmd, str(clp_config.reducer.base_port + i)]
+
+            reducer_stdout = sys.stdout
+            reducer_stderr = sys.stderr
+            if logs_dir is not None:
+                log_file_path = logs_dir / f"reducer-{i}.log"
+                log_file = open(log_file_path, "a")
+                reducer_log_files.append(log_file)
+                reducer_stdout = log_file
+                reducer_stderr = log_file
+
+            reducers.append(
+                subprocess.Popen(
+                    reducer_instance_cmd,
+                    close_fds=True,
+                    stdout=reducer_stdout,
+                    stderr=reducer_stderr,
+                )
+            )
+
+        logger.info("Reducers started.")
+        logger.info(
+            f"Host={clp_config.reducer.host}"
+            f" Base port={clp_config.reducer.base_port}"
+            f" Concurrency={concurrency}"
+            f" Upsert Interval={parsed_args.upsert_interval}"
+        )
+        for i, reducer in enumerate(reducers):
+            reducer.communicate()
+            logger.info(f"reducer-{i} exited with returncode={reducer.returncode}")
+
+        logger.error("All reducers terminated")
+    finally:
+        for reducer_log_file in reducer_log_files:
+            reducer_log_file.close()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/job-orchestration/job_orchestration/reducer/reducer.py` around
lines 65 - 97, The reducer log file handles are leaked if open() or
subprocess.Popen raises during the spawn loop; wrap the whole spawn-and-wait
section (the loop that builds reducer_instance_cmd, opens files into
reducer_log_files, appends subprocess.Popen into reducers, and the subsequent
for i, reducer in enumerate(reducers): reducer.communicate() block) in a
try/finally so cleanup always runs; in the finally iterate over reducers to
gracefully stop/communicate/kill any started subprocesses (e.g., call
terminate() and then communicate() or kill() if needed) and then iterate over
reducer_log_files to close every open file handle, ensuring both
reducer_log_files and reducers are referenced in the finally block so no file or
process remains leaked.
tools/deployment/package-helm/templates/query-worker-deployment.yaml (1)

106-110: Consider setting a sizeLimit on the emptyDir for staged streams.

The emptyDir: {} for staged-streams has no size limit, which could lead to unbounded node disk usage when staging large volumes of stream data for S3 upload.

          emptyDir:
            sizeLimit: "5Gi"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/deployment/package-helm/templates/query-worker-deployment.yaml` around
lines 106 - 110, The emptyDir for the staged-streams volume currently has no
size limit which can exhaust node disk; update the deployment template where the
volume name is generated via include "clp.volumeName" with
dict("component_category":"query-worker","name":"staged-streams") to replace the
bare emptyDir: {} with an emptyDir that specifies a sizeLimit (e.g., "5Gi") or
make it configurable via a Helm value, so the staged-streams emptyDir has an
explicit size cap.
tools/deployment/package-helm/templates/compression-worker-deployment.yaml (1)

41-44: Same CLP_LOGS_DIR without backing volume issue as the query-worker.

As noted in the query-worker review, CLP_LOGS_DIR is set to /var/log/compression_worker but the corresponding volume has been removed. Any file writes to this path will go to the container overlay. The same remediation applies: remove the env var or add an emptyDir mount.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/deployment/package-helm/templates/compression-worker-deployment.yaml`
around lines 41 - 44, The CLP_LOGS_DIR environment variable in the compression
worker deployment points to /var/log/compression_worker but there is no backing
volume; either remove the CLP_LOGS_DIR env entry or add an emptyDir volume and
mount it into the compression worker container at /var/log/compression_worker.
To fix, edit the compression-worker container spec: if keeping CLP_LOGS_DIR, add
a volumeMount with mountPath "/var/log/compression_worker" and a matching
volumes entry (type emptyDir) in the pod spec; otherwise delete the env var
CLP_LOGS_DIR (and any related references) so logs default to CLP_WORKER_LOG_PATH
(/dev/stdout). Ensure the changes target the compression-worker container
definition and the CLP_LOGS_DIR env name.
components/clp-py-utils/clp_py_utils/clp_logging.py (1)

43-63: ⚠️ Potential issue | 🟠 Major

Guard FileHandler setup to avoid startup failure when CLP_LOGS_DIR is bad.

If CLP_LOGS_DIR points to a missing or unwritable path, logging.FileHandler will raise and abort startup. Consider a safe fallback to stdout (Line 57 onward).

🛡️ Suggested hardening
     logs_dir = os.getenv("CLP_LOGS_DIR")
     if logs_dir:
-        log_file = pathlib.Path(logs_dir) / f"{component_name}.log"
-        logging_file_handler = logging.FileHandler(filename=log_file, encoding="utf-8")
-        logging_file_handler.setFormatter(get_logging_formatter())
-        logger.addHandler(logging_file_handler)
+        try:
+            log_file = pathlib.Path(logs_dir) / f"{component_name}.log"
+            logging_file_handler = logging.FileHandler(filename=log_file, encoding="utf-8")
+            logging_file_handler.setFormatter(get_logging_formatter())
+            logger.addHandler(logging_file_handler)
+        except OSError:
+            logger.exception(
+                "Failed to set up file logging for %s; continuing with stdout.", component_name
+            )
     set_logging_level(logger, os.getenv("CLP_LOGGING_LEVEL"))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/clp-py-utils/clp_py_utils/clp_logging.py` around lines 43 - 63,
The FileHandler creation in configure_logging can raise if CLP_LOGS_DIR is
missing or unwritable; wrap the Path/ FileHandler setup in a try/except and
harden it by ensuring the directory exists (use
pathlib.Path(...).parent.mkdir(parents=True, exist_ok=True)) before creating
logging.FileHandler, and on exception fall back to adding a std stream handler
(logging.StreamHandler(sys.stdout)) and emit a warning via logger.warning so
startup doesn't abort; keep calls to get_logging_formatter() and
set_logging_level(logger, os.getenv("CLP_LOGGING_LEVEL")) intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/clp-py-utils/clp_py_utils/clp_logging.py`:
- Around line 31-37: The warning call in set_logging_level eagerly formats the
message; change logger.warning(f"Invalid logging level: {level}, using INFO as
default") to use lazy formatting (logger.warning("Invalid logging level: %s,
using INFO as default", level)) so the string interpolation is deferred when the
warning level is disabled; keep the existing behavior of falling back to
logger.setLevel(logging.INFO) and the checks against get_args(LoggingLevel).

In `@components/job-orchestration/job_orchestration/reducer/reducer.py`:
- Line 66: Replace the list concatenation that builds reducer_instance_cmd with
iterable unpacking to follow RUF005: instead of reducer_instance_cmd =
reducer_cmd + [str(clp_config.reducer.base_port + i)], construct
reducer_instance_cmd using unpacking of reducer_cmd and the new string element
(referencing reducer_instance_cmd, reducer_cmd, clp_config.reducer.base_port,
and i) so the resulting list is created with [*reducer_cmd,
str(clp_config.reducer.base_port + i)] style.

In `@tools/deployment/package-helm/templates/compression-worker-deployment.yaml`:
- Around line 87-91: Add a sizeLimit to the emptyDir volumes for the tmp and
staged-archives volumes to prevent unbounded disk usage; locate the volume
blocks where name is generated with {{ include "clp.volumeName" ... "tmp" }} and
the staged-archives entry and add a sizeLimit (e.g. "1Gi" or another value tuned
to expected workload) under emptyDir: so the emptyDir declaration uses sizeLimit
to cap disk usage for compression-worker intermediate data and S3 staging.

In `@tools/deployment/package-helm/templates/query-worker-deployment.yaml`:
- Around line 41-44: CLP_LOGS_DIR is set to /var/log/query_worker but there is
no corresponding volumeMount/volume, so container writes go to the ephemeral
overlay; either remove CLP_LOGS_DIR if file logs aren't needed or add a volume
and mount it at /var/log/query_worker (e.g., an emptyDir or a PVC) in the
Deployment spec for the container that defines CLP_LOGS_DIR and
CLP_WORKER_LOG_PATH; ensure the volume name used in volumeMounts matches the
volume entry and replicate the same change for the other worker deployment that
also sets CLP_LOGS_DIR.

---

Duplicate comments:
In `@components/clp-py-utils/clp_py_utils/clp_logging.py`:
- Around line 43-63: The FileHandler creation in configure_logging can raise if
CLP_LOGS_DIR is missing or unwritable; wrap the Path/ FileHandler setup in a
try/except and harden it by ensuring the directory exists (use
pathlib.Path(...).parent.mkdir(parents=True, exist_ok=True)) before creating
logging.FileHandler, and on exception fall back to adding a std stream handler
(logging.StreamHandler(sys.stdout)) and emit a warning via logger.warning so
startup doesn't abort; keep calls to get_logging_formatter() and
set_logging_level(logger, os.getenv("CLP_LOGGING_LEVEL")) intact.

In `@components/job-orchestration/job_orchestration/reducer/reducer.py`:
- Around line 65-97: The reducer log file handles are leaked if open() or
subprocess.Popen raises during the spawn loop; wrap the whole spawn-and-wait
section (the loop that builds reducer_instance_cmd, opens files into
reducer_log_files, appends subprocess.Popen into reducers, and the subsequent
for i, reducer in enumerate(reducers): reducer.communicate() block) in a
try/finally so cleanup always runs; in the finally iterate over reducers to
gracefully stop/communicate/kill any started subprocesses (e.g., call
terminate() and then communicate() or kill() if needed) and then iterate over
reducer_log_files to close every open file handle, ensuring both
reducer_log_files and reducers are referenced in the finally block so no file or
process remains leaked.

In `@tools/deployment/package-helm/templates/compression-worker-deployment.yaml`:
- Around line 41-44: The CLP_LOGS_DIR environment variable in the compression
worker deployment points to /var/log/compression_worker but there is no backing
volume; either remove the CLP_LOGS_DIR env entry or add an emptyDir volume and
mount it into the compression worker container at /var/log/compression_worker.
To fix, edit the compression-worker container spec: if keeping CLP_LOGS_DIR, add
a volumeMount with mountPath "/var/log/compression_worker" and a matching
volumes entry (type emptyDir) in the pod spec; otherwise delete the env var
CLP_LOGS_DIR (and any related references) so logs default to CLP_WORKER_LOG_PATH
(/dev/stdout). Ensure the changes target the compression-worker container
definition and the CLP_LOGS_DIR env name.

In `@tools/deployment/package-helm/templates/query-worker-deployment.yaml`:
- Around line 106-110: The emptyDir for the staged-streams volume currently has
no size limit which can exhaust node disk; update the deployment template where
the volume name is generated via include "clp.volumeName" with
dict("component_category":"query-worker","name":"staged-streams") to replace the
bare emptyDir: {} with an emptyDir that specifies a sizeLimit (e.g., "5Gi") or
make it configurable via a Helm value, so the staged-streams emptyDir has an
explicit size cap.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8928fdf and e0e0c6f.

📒 Files selected for processing (12)
  • components/clp-mcp-server/clp_mcp_server/clp_mcp_server.py
  • components/clp-py-utils/clp_py_utils/clp_logging.py
  • components/job-orchestration/job_orchestration/garbage_collector/archive_garbage_collector.py
  • components/job-orchestration/job_orchestration/garbage_collector/garbage_collector.py
  • components/job-orchestration/job_orchestration/garbage_collector/search_result_garbage_collector.py
  • components/job-orchestration/job_orchestration/garbage_collector/utils.py
  • components/job-orchestration/job_orchestration/reducer/reducer.py
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
  • components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py
  • tools/deployment/package-helm/templates/compression-worker-deployment.yaml
  • tools/deployment/package-helm/templates/query-worker-deployment.yaml
  • tools/deployment/package-helm/values.yaml
💤 Files with no reviewable changes (1)
  • components/job-orchestration/job_orchestration/garbage_collector/utils.py

Comment on lines +31 to 37
def set_logging_level(logger: logging.Logger, level: str | None):
if level is None:
logger.setLevel(logging.INFO)
return
if level not in get_args(LoggingLevel):
logger.warning(f"Invalid logging level: {level}, using INFO as default")
logger.setLevel(logging.INFO)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n components/clp-py-utils/clp_py_utils/clp_logging.py | sed -n '25,42p'

Repository: y-scope/clp

Length of output: 671


🏁 Script executed:

rg -n "logger\.(debug|info|warning|error|exception)" components/clp-py-utils/clp_py_utils/clp_logging.py

Repository: y-scope/clp

Length of output: 138


Use lazy formatting for the warning log.

This avoids eager string interpolation when the warning level is disabled. Line 36 requires adjustment to comply with codebase logging standards.

♻️ Suggested change
-        logger.warning(f"Invalid logging level: {level}, using INFO as default")
+        logger.warning("Invalid logging level: %s, using INFO as default", level)
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 36-36: Logging statement uses f-string

(G004)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/clp-py-utils/clp_py_utils/clp_logging.py` around lines 31 - 37,
The warning call in set_logging_level eagerly formats the message; change
logger.warning(f"Invalid logging level: {level}, using INFO as default") to use
lazy formatting (logger.warning("Invalid logging level: %s, using INFO as
default", level)) so the string interpolation is deferred when the warning level
is disabled; keep the existing behavior of falling back to
logger.setLevel(logging.INFO) and the checks against get_args(LoggingLevel).

reducer_log_files: list[TextIO] = []
concurrency = max(int(parsed_args.concurrency), 1)
for i in range(concurrency):
reducer_instance_cmd = reducer_cmd + [str(clp_config.reducer.base_port + i)]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Prefer iterable unpacking over list concatenation (RUF005).

♻️ Proposed refactor
-        reducer_instance_cmd = reducer_cmd + [str(clp_config.reducer.base_port + i)]
+        reducer_instance_cmd = [*reducer_cmd, str(clp_config.reducer.base_port + i)]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
reducer_instance_cmd = reducer_cmd + [str(clp_config.reducer.base_port + i)]
reducer_instance_cmd = [*reducer_cmd, str(clp_config.reducer.base_port + i)]
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 66-66: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/job-orchestration/job_orchestration/reducer/reducer.py` at line
66, Replace the list concatenation that builds reducer_instance_cmd with
iterable unpacking to follow RUF005: instead of reducer_instance_cmd =
reducer_cmd + [str(clp_config.reducer.base_port + i)], construct
reducer_instance_cmd using unpacking of reducer_cmd and the new string element
(referencing reducer_instance_cmd, reducer_cmd, clp_config.reducer.base_port,
and i) so the resulting list is created with [*reducer_cmd,
str(clp_config.reducer.base_port + i)] style.

Comment on lines +87 to +91
- name: {{ include "clp.volumeName" (dict
"component_category" "compression-worker"
"name" "tmp"
) | nindent 10 }}
) | quote }}
emptyDir: {}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider adding sizeLimit to emptyDir volumes for tmp and staged-archives.

Both the tmp volume (line 91) and the staged-archives volume (line 106) use emptyDir: {} without a sizeLimit. Compression workers can generate significant intermediate data — especially when staging archives for S3 upload or writing to /var/tmp during compression tasks. Without a limit, a misbehaving or high-throughput workload could exhaust node disk space and affect co-located pods.

🛡️ Suggested improvement
           ) | quote }}
-          emptyDir: {}
+          emptyDir:
+            sizeLimit: "10Gi"

Apply similar limits (tuned to expected workloads) to both emptyDir declarations.

Also applies to: 102-106

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/deployment/package-helm/templates/compression-worker-deployment.yaml`
around lines 87 - 91, Add a sizeLimit to the emptyDir volumes for the tmp and
staged-archives volumes to prevent unbounded disk usage; locate the volume
blocks where name is generated with {{ include "clp.volumeName" ... "tmp" }} and
the staged-archives entry and add a sizeLimit (e.g. "1Gi" or another value tuned
to expected workload) under emptyDir: so the emptyDir declaration uses sizeLimit
to cap disk usage for compression-worker intermediate data and S3 staging.

Comment on lines 41 to +44
- name: "CLP_LOGS_DIR"
value: "/var/log/query_worker"
- name: "CLP_WORKER_LOG_PATH"
value: "/var/log/query_worker/worker.log"
value: "/dev/stdout"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check references to CLP_LOGS_DIR in Python and shell code
echo "=== CLP_LOGS_DIR references (Python) ==="
rg 'CLP_LOGS_DIR' --type py -C 2

echo ""
echo "=== CLP_LOGS_DIR references (Shell) ==="
rg 'CLP_LOGS_DIR' --type sh -C 2

echo ""
echo "=== CLP_LOGS_DIR references (YAML) ==="
rg 'CLP_LOGS_DIR' --type yaml -C 2

Repository: y-scope/clp

Length of output: 11195


🏁 Script executed:

#!/bin/bash
# Examine the query-worker-deployment.yaml file to see volumes and volume mounts
echo "=== Checking query-worker-deployment.yaml structure ==="
if [ -f "tools/deployment/package-helm/templates/query-worker-deployment.yaml" ]; then
  echo "File exists. Total lines:"
  wc -l "tools/deployment/package-helm/templates/query-worker-deployment.yaml"
  echo ""
  echo "=== Environment variables section (lines 40-50) ==="
  sed -n '40,50p' "tools/deployment/package-helm/templates/query-worker-deployment.yaml"
  echo ""
  echo "=== Volumes section (lines 95-120) ==="
  sed -n '95,120p' "tools/deployment/package-helm/templates/query-worker-deployment.yaml"
  echo ""
  echo "=== Volume mounts section (lines 50-75) ==="
  sed -n '50,75p' "tools/deployment/package-helm/templates/query-worker-deployment.yaml"
else
  echo "File not found"
fi

Repository: y-scope/clp

Length of output: 2659


🏁 Script executed:

#!/bin/bash
# Check compression-worker-deployment.yaml for same pattern
echo "=== Checking compression-worker-deployment.yaml ==="
if [ -f "tools/deployment/package-helm/templates/compression-worker-deployment.yaml" ]; then
  echo "File exists. Checking for CLP_LOGS_DIR:"
  rg -n 'CLP_LOGS_DIR' "tools/deployment/package-helm/templates/compression-worker-deployment.yaml" -C 2
else
  echo "File not found"
fi

Repository: y-scope/clp

Length of output: 415


CLP_LOGS_DIR is set but has no backing volume mount in the Helm deployment.

CLP_LOGS_DIR is configured to /var/log/query_worker (line 42), but there is no corresponding volumeMount or volume for this path. The worker task code (extract_stream_task.py, fs_search_task.py, compression_task.py) actively reads and uses this directory to write log files. Without a volume mount, these writes land in the container's writable overlay layer — consuming node disk without limits and lost on container restart.

In docker-compose, this path is correctly backed by the clp-logs volume, but it is missing in the Helm deployment.

Either:

  1. Remove the CLP_LOGS_DIR env var if file-based logging is not needed, or
  2. Add an emptyDir or persistent volume mount at /var/log/query_worker.

The same pattern exists in compression-worker-deployment.yaml (lines 41–42).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/deployment/package-helm/templates/query-worker-deployment.yaml` around
lines 41 - 44, CLP_LOGS_DIR is set to /var/log/query_worker but there is no
corresponding volumeMount/volume, so container writes go to the ephemeral
overlay; either remove CLP_LOGS_DIR if file logs aren't needed or add a volume
and mount it at /var/log/query_worker (e.g., an emptyDir or a PVC) in the
Deployment spec for the container that defines CLP_LOGS_DIR and
CLP_WORKER_LOG_PATH; ensure the volume name used in volumeMounts matches the
volume entry and replicate the same change for the other worker deployment that
also sets CLP_LOGS_DIR.

# Conflicts:
#	tools/deployment/package-helm/templates/database-data-pv.yaml
#	tools/deployment/package-helm/templates/database-logs-pv.yaml
#	tools/deployment/package-helm/templates/database-statefulset.yaml
#	tools/deployment/package-helm/templates/db-table-creator-job.yaml
#	tools/deployment/package-helm/templates/queue-logs-pv.yaml
#	tools/deployment/package-helm/templates/queue-statefulset.yaml
#	tools/deployment/package-helm/templates/redis-data-pv.yaml
#	tools/deployment/package-helm/templates/redis-logs-pv.yaml
#	tools/deployment/package-helm/templates/redis-statefulset.yaml
#	tools/deployment/package-helm/templates/results-cache-data-pv.yaml
#	tools/deployment/package-helm/templates/results-cache-indices-creator-job.yaml
#	tools/deployment/package-helm/templates/results-cache-logs-pv.yaml
#	tools/deployment/package-helm/templates/results-cache-statefulset.yaml

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tools/deployment/package-helm/Chart.yaml`:
- Line 3: The Chart.yaml version was only incremented in the pre-release counter
("version: \"0.1.4-dev.4\"") despite this PR being a breaking change (removes
the local-storage StorageClass and many PV/PVC templates and values), so update
the Chart.yaml version to reflect semver-breaking changes—bump at least the
minor segment (e.g., "0.2.0-dev.0") or the major if your release policy requires
it (e.g., "1.0.0-dev.0") so consumers are correctly signaled; ensure the new
version string replaces the existing "0.1.4-dev.4" value in Chart.yaml.
- Line 3: This PR is a breaking change but the Chart.yaml "version" field
remains a pre-release patch; update the Chart.yaml version value (the version
key in Chart.yaml) to bump at least the MINOR for a breaking change—for example
change "version: \"0.1.4-dev.4\"" to "version: \"0.2.0-dev.0\"" (or to your
team's chosen stable-release signaling scheme) so the chart semver reflects the
breaking change.

In
`@tools/deployment/package-helm/templates/compression-scheduler-deployment.yaml`:
- Around line 88-93: The emptyDir for the volume named via include
"clp.volumeName" with component_category "compression-scheduler" and name "log"
is unbounded; update the volume spec (the block that currently shows emptyDir:
{}) to include a sizeLimit (e.g. set emptyDir: { sizeLimit: "<appropriate-size>"
}) to cap disk usage and prevent node disk pressure, choosing a size appropriate
for compression-scheduler logs/artifacts; ensure the change only affects that
specific volume entry and keep the "config" volume unchanged.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e0e0c6f and b948b7c.

📒 Files selected for processing (8)
  • docs/src/user-docs/quick-start/clp-json.md
  • docs/src/user-docs/quick-start/clp-text.md
  • tools/deployment/package-helm/Chart.yaml
  • tools/deployment/package-helm/templates/_helpers.tpl
  • tools/deployment/package-helm/templates/compression-scheduler-deployment.yaml
  • tools/deployment/package-helm/templates/configmap.yaml
  • tools/deployment/package-helm/templates/query-scheduler-deployment.yaml
  • tools/deployment/package-helm/values.yaml
💤 Files with no reviewable changes (3)
  • docs/src/user-docs/quick-start/clp-text.md
  • docs/src/user-docs/quick-start/clp-json.md
  • tools/deployment/package-helm/templates/query-scheduler-deployment.yaml

Comment thread tools/deployment/package-helm/Chart.yaml Outdated
Comment on lines +88 to 93
- name: {{ include "clp.volumeName" (dict
"component_category" "compression-scheduler"
"name" "logs"
) | nindent 10 }}
- {{- include "clp.pvcVolume" (dict
"root" .
"component_category" "compression-scheduler"
"name" "user-logs"
) | nindent 10 }}
"name" "log"
) | quote }}
emptyDir: {}
- name: "config"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider bounding the emptyDir size for better node-level resilience.

Line 92 currently creates an unbounded scratch volume. Adding a sizeLimit helps avoid disk pressure from unexpected log/artifact growth.

♻️ Proposed hardening
         - name: {{ include "clp.volumeName" (dict
             "component_category" "compression-scheduler"
             "name" "log"
           ) | quote }}
-          emptyDir: {}
+          emptyDir:
+            sizeLimit: "1Gi"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- name: {{ include "clp.volumeName" (dict
"component_category" "compression-scheduler"
"name" "logs"
) | nindent 10 }}
- {{- include "clp.pvcVolume" (dict
"root" .
"component_category" "compression-scheduler"
"name" "user-logs"
) | nindent 10 }}
"name" "log"
) | quote }}
emptyDir: {}
- name: "config"
- name: {{ include "clp.volumeName" (dict
"component_category" "compression-scheduler"
"name" "log"
) | quote }}
emptyDir:
sizeLimit: "1Gi"
- name: "config"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tools/deployment/package-helm/templates/compression-scheduler-deployment.yaml`
around lines 88 - 93, The emptyDir for the volume named via include
"clp.volumeName" with component_category "compression-scheduler" and name "log"
is unbounded; update the volume spec (the block that currently shows emptyDir:
{}) to include a sizeLimit (e.g. set emptyDir: { sizeLimit: "<appropriate-size>"
}) to cap disk usage and prevent node disk pressure, choosing a size appropriate
for compression-scheduler logs/artifacts; ensure the change only affects that
specific volume entry and keep the "config" volume unchanged.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
tools/deployment/package-helm/Chart.yaml (1)

3-3: LGTM — version bump correctly addresses the breaking-change semver concern.

Advancing from 0.1.4-dev.4 to 0.2.0-dev.0 resets the MINOR segment as recommended for a breaking change and resets the pre-release counter, which is the expected convention.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/deployment/package-helm/Chart.yaml` at line 3, The Chart.yaml version
was correctly bumped to "0.2.0-dev.0" to reflect the breaking change; no code
changes are required—confirm the version string in Chart.yaml (the version
field) is exactly "0.2.0-dev.0" and keep the change as-is before merging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tools/deployment/package-helm/Chart.yaml`:
- Line 3: The Chart.yaml version was correctly bumped to "0.2.0-dev.0" to
reflect the breaking change; no code changes are required—confirm the version
string in Chart.yaml (the version field) is exactly "0.2.0-dev.0" and keep the
change as-is before merging.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b948b7c and ac11c5a.

📒 Files selected for processing (1)
  • tools/deployment/package-helm/Chart.yaml

Comment thread components/clp-rust-utils/src/logging.rs
Comment thread components/job-orchestration/job_orchestration/reducer/reducer.py
Comment on lines -52 to -56
- name: {{ include "clp.volumeName" (dict
"component_category" "query-worker"
"name" "logs"
) | quote }}
mountPath: "/var/log/query_worker"

@hoophalab hoophalab Feb 26, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm: this writes all %d-clo.log to a pod's folder, which is not persistent and is only accessible by kubectl exec to that pod?

Comment thread docs/src/user-docs/guides-k8s-deployment.md Outdated
Comment thread tools/deployment/package-helm/templates/compression-worker-logs-pvc.yaml Outdated
junhaoliao and others added 6 commits February 26, 2026 15:55
Co-authored-by: hoophalab <200652805+hoophalab@users.noreply.github.com>
Co-authored-by: hoophalab <200652805+hoophalab@users.noreply.github.com>
The compression-worker and query-worker set CLP_LOGS_DIR to
/var/log/compression_worker and /var/log/query_worker respectively, but
no volume was mounted at those paths, causing FileNotFoundError when
tasks try to write stderr log files. Add emptyDir volumes for both.
The {{- if has "..." .Values.clpConfig.bundled }} guards from PR y-scope#1681
were lost when resolving merge conflicts with main. Restore them for:
- database-statefulset.yaml
- db-table-creator-job.yaml (initContainers)
- queue-statefulset.yaml
- redis-statefulset.yaml
- results-cache-statefulset.yaml
- results-cache-indices-creator-job.yaml (initContainers)

Without these guards, Helm renders bundled third-party resources even
when configured to use external services.
The merge incorrectly hardcoded the MongoDB URI using clp.fullname
instead of the clp.resultsCacheHost/clp.resultsCachePort helpers
introduced in PR y-scope#1681 for external service support.
Replace the detailed directory listing and NFS setup instructions with a
concise note about manually provisioning PVs and updating accessModes
when using filesystem storage.
@junhaoliao junhaoliao requested a review from hoophalab February 26, 2026 14:55
hoophalab
hoophalab previously approved these changes Feb 26, 2026

@hoophalab hoophalab left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Validations:

  1. docker compose logs are the same as before
  2. helm logs are written to stdout
  3. compressing and querying in webui on helm deploys works as before

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants