feat(clp-package)!: Add support for specifying a temporary directory; Remove hack in var/data conditional mounts (resolves #1478).#1480
Conversation
# Conflicts: # components/clp-package-utils/clp_package_utils/scripts/start_clp.py
…onment variable handling
…onfig directory in controller methods
…rt paths in start_clp.py
…eflect private usage
# Conflicts: # docs/src/user-docs/guides-multi-node.md
Co-authored-by: Junhao Liao <junhao.liao@yscope.com>
…Remove hack in `var/data` conditional mounts (resolves y-scope#1478).
WalkthroughIntroduces a new tmp_directory configuration for temporary runtime data, updates config classes and validations, assigns tmp_directory into worker configs and startup initialization, switches compression tasks to use tmp_directory, and updates Docker Compose to bind-mount the host tmp directory (removing the prior archive mount hack). Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant CLPConfig
participant StartScript
participant Controller
participant DockerCompose
participant Worker
participant CompressionTask
rect #f0f8ff
User->>CLPConfig: edit clp-config.yml (tmp_directory)
CLPConfig-->>User: tmp_directory set (var/tmp)
end
rect #f7fff0
StartScript->>CLPConfig: load_config_file()
CLPConfig->>CLPConfig: validate_tmp_dir()
StartScript->>StartScript: create tmp_directory (host)
end
rect #fff7f0
Controller->>CLPConfig: generate_worker_config()
Controller->>DockerCompose: set env CLP_TMP_DIR_HOST=host tmp path
DockerCompose->>Worker: bind-mount /var/tmp <- CLP_TMP_DIR_HOST
end
rect #f0fff7
Worker->>CompressionTask: run_clp()
CompressionTask->>Worker: use tmp_directory for intermediate files
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
# Conflicts: # components/clp-package-utils/clp_package_utils/controller.py # components/clp-package-utils/clp_package_utils/general.py # components/clp-package-utils/clp_package_utils/scripts/start_clp.py # components/clp-py-utils/clp_py_utils/clp_config.py # tools/deployment/package/docker-compose.base.yaml
|
@sitaowang1998 could you help review this before passing it onto @kirkrodrigues ? |
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (7)
components/clp-package-utils/clp_package_utils/controller.py(1 hunks)components/clp-package-utils/clp_package_utils/general.py(2 hunks)components/clp-package-utils/clp_package_utils/scripts/start_clp.py(2 hunks)components/clp-py-utils/clp_py_utils/clp_config.py(6 hunks)components/job-orchestration/job_orchestration/executor/compress/compression_task.py(2 hunks)components/package-template/src/etc/clp-config.yml(1 hunks)tools/deployment/package/docker-compose.base.yaml(1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-10-17T19:59:25.596Z
Learnt from: junhaoliao
PR: y-scope/clp#1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.
Applied to files:
components/clp-package-utils/clp_package_utils/general.py
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
PR: y-scope/clp#1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.
Applied to files:
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
📚 Learning: 2025-10-07T07:54:32.427Z
Learnt from: junhaoliao
PR: y-scope/clp#1178
File: components/clp-py-utils/clp_py_utils/clp_config.py:47-47
Timestamp: 2025-10-07T07:54:32.427Z
Learning: In components/clp-py-utils/clp_py_utils/clp_config.py, the CONTAINER_AWS_CONFIG_DIRECTORY constant is intentionally set to pathlib.Path("/") / ".aws" (i.e., `/.aws`) rather than a user-specific home directory. This hardcoded path is part of the container orchestration design.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
🧬 Code graph analysis (2)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (4)
validate_aws_config_dir(714-742)validate_data_dir(696-700)validate_logs_dir(702-706)validate_tmp_dir(708-712)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
components/clp-py-utils/clp_py_utils/core.py (2)
make_config_path_absolute(42-53)validate_path_could_be_dir(65-72)
🪛 Ruff (0.14.1)
components/clp-py-utils/clp_py_utils/clp_config.py
712-712: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
712-712: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
- GitHub Check: antlr-code-committed (ubuntu-24.04)
🔇 Additional comments (14)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (2)
340-340: LGTM: Correct usage of tmp_directory for temporary artifacts.The change from
data_directorytotmp_directorycorrectly aligns with the PR objective to separate temporary runtime data from persistent data storage.
378-378: LGTM: Appropriate location for temporary log list file.Using
tmp_dirfor the temporary logs list file is correct, as this file is generated per-task and doesn't need to persist beyond the compression operation.components/clp-package-utils/clp_package_utils/controller.py (1)
715-715: LGTM: Environment variable properly exposes tmp_directory to containers.The addition of
CLP_TMP_DIR_HOSTfollows the established pattern for exposing host directories to Docker Compose containers and correctly referencesclp_config.tmp_directory.components/package-template/src/etc/clp-config.yml (1)
133-137: LGTM: Clear and comprehensive documentation for tmp_directory.The documentation properly describes the purpose, creation behavior, and constraints of the
tmp_directoryconfiguration option, consistent with the documentation style fordata_directoryandlogs_directory.components/clp-package-utils/clp_package_utils/scripts/start_clp.py (2)
60-63: LGTM: Validation order updated correctly.The early call to
validate_aws_config_dir()and the addition ofvalidate_tmp_dir()properly extend validation coverage to the new tmp_directory configuration field.
72-72: LGTM: tmp_directory creation added consistently.The creation of
tmp_directoryfollows the same pattern asdata_directoryandlogs_directory, ensuring the temporary directory exists before CLP components start.components/clp-py-utils/clp_py_utils/clp_config.py (5)
60-60: LGTM: Appropriate default for tmp_directory.The default path
var/tmpis a sensible choice for temporary runtime data, following conventions for temporary directory placement.
621-621: LGTM: tmp_directory field added consistently to CLPConfig.The new field follows the same pattern as
data_directoryandlogs_directory, with appropriate type annotation and default value.
646-646: LGTM: Path resolution for tmp_directory added correctly.The
make_config_path_absolutecall fortmp_directoryis correctly placed alongside similar calls fordata_directoryandlogs_directory.
795-795: LGTM: Container path transformation for tmp_directory.The transformation correctly maps
tmp_directoryto/var/tmpin the container, consistent with the default path structure.
813-813: LGTM: WorkerConfig updated to use tmp_directory.The replacement of
data_directorywithtmp_directoryin WorkerConfig correctly reflects the shift to using a dedicated temporary directory for worker operations.tools/deployment/package/docker-compose.base.yaml (1)
265-271: LGTM: Docker Compose mounts simplified as intended.The changes successfully remove the conditional mount hack by:
- Providing direct bind mounts for archive directories (lines 265, 268)
- Adding a dedicated tmp directory mount (lines 270-271) that defaults to
./var/tmpThis aligns with the PR objective to eliminate the
/var/databind-mount indirection and resolves issue #1478.components/clp-package-utils/clp_package_utils/general.py (2)
328-328: LGTM: Worker configuration correctly updated to use tmp_directory.The assignment of
worker_config.tmp_directoryfromclp_config.tmp_directoryproperly propagates the new temporary directory configuration to worker processes, replacing the previousdata_directoryusage.
443-443: LGTM: Validation added for tmp_directory to prevent container path conflicts.The
validate_path_for_container_mountcall fortmp_directoryis correctly placed alongside similar validations fordata_directoryandlogs_directory, ensuring the path doesn't conflict with container-internal paths.
| def validate_tmp_dir(self): | ||
| try: | ||
| validate_path_could_be_dir(self.tmp_directory) | ||
| except ValueError as ex: | ||
| raise ValueError(f"tmp_directory is invalid: {ex}") |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Consider improving exception handling for consistency.
The validation method is functionally correct, but could be improved for better exception handling practices.
Apply this diff to improve exception chaining and message handling:
def validate_tmp_dir(self):
try:
validate_path_could_be_dir(self.tmp_directory)
except ValueError as ex:
- raise ValueError(f"tmp_directory is invalid: {ex}")
+ raise ValueError(f"tmp_directory is invalid: {ex}") from exNote: This is an optional refinement. The existing code is functional, and this pattern is consistent with validate_data_dir() and validate_logs_dir() in the same file, which also don't use exception chaining.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def validate_tmp_dir(self): | |
| try: | |
| validate_path_could_be_dir(self.tmp_directory) | |
| except ValueError as ex: | |
| raise ValueError(f"tmp_directory is invalid: {ex}") | |
| def validate_tmp_dir(self): | |
| try: | |
| validate_path_could_be_dir(self.tmp_directory) | |
| except ValueError as ex: | |
| raise ValueError(f"tmp_directory is invalid: {ex}") from ex |
🧰 Tools
🪛 Ruff (0.14.1)
712-712: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
712-712: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In components/clp-py-utils/clp_py_utils/clp_config.py around lines 708 to 712,
the validate_tmp_dir method currently catches ValueError and re-raises a new
ValueError without exception chaining; update the raise to preserve the original
exception context by using "raise ValueError(f'tmp_directory is invalid: {ex}')
from ex" so the original stack trace is retained and debugging is easier.
sitaowang1998
left a comment
There was a problem hiding this comment.
Just from this PR, it looks like the temporary directory is used exclusively inside the worker container, so why do we need to mount it?
for the background, the compression scheduler write an input files catalog to a temp location and pass the location to the clp binaries: as for whether to write to it's true that we don't need persistence of such catalog across service (i.e., |
Correct me if I'm wrong, but I thinks it is the compression worker who writes the input files catalog to a temp location and then call clp binaries to read from the created file. |
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
components/clp-package-utils/clp_package_utils/controller.py(1 hunks)tools/deployment/package/docker-compose.base.yaml(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: package-image
- GitHub Check: build (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (2)
components/clp-package-utils/clp_package_utils/controller.py (1)
721-721: LGTM! Environment variable follows the established pattern.The addition of
CLP_TMP_DIR_HOSTcorrectly exports the tmp_directory configuration to Docker Compose, following the same pattern as other directory environment variables. The placement in the Paths section is appropriate and consistent with the codebase structure.tools/deployment/package/docker-compose.base.yaml (1)
265-265: These environment variables are already properly exported and initialized—no issues found.The verification confirms all three variables (
CLP_TMP_DIR_HOST,CLP_ARCHIVE_OUTPUT_DIR_HOST,CLP_STAGED_ARCHIVE_OUTPUT_DIR_HOST) follow a sound initialization pattern:
Variable export: Populated in
controller.py(lines 721, 735–737) and written to the.envfile inclp_home(line 757). Docker Compose automatically loads this.envwhen invoked from that directory.Directory initialization: Created in
start_clp.py(lines 72–74) withmkdir(parents=True, exist_ok=True)beforecontroller.start()is called.Path validation: Validated through
clp_configproperties derived from the loaded configuration file, ensuring directories are properly set up before Docker Compose invocation.Likely an incorrect or invalid review comment.
| volumes: | ||
| - *volume_clp_config_readonly | ||
| - *volume_clp_logs | ||
| - "${CLP_ARCHIVE_OUTPUT_DIR_HOST:-empty}:/var/data/archives" |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Standardize mount syntax for consistency.
Lines 265 and 268 use shorthand string syntax for mounts, while lines 269–271 use explicit YAML object syntax. While both are valid Docker Compose syntax, this inconsistency within the same volumes list reduces clarity. Consider adopting one style consistently.
For example, convert lines 265 and 268 to match the explicit style:
volumes:
- *volume_clp_config_readonly
- *volume_clp_logs
- - "${CLP_ARCHIVE_OUTPUT_DIR_HOST:-empty}:/var/data/archives"
- "${CLP_AWS_CONFIG_DIR_HOST:-empty}:/.aws:ro"
- "${CLP_LOGS_INPUT_DIR_HOST:-empty}:${CLP_LOGS_INPUT_DIR_CONTAINER:-/mnt/logs}"
- - "${CLP_STAGED_ARCHIVE_OUTPUT_DIR_HOST:-empty}:/var/data/staged-archives"
+ - type: "bind"
+ source: "${CLP_ARCHIVE_OUTPUT_DIR_HOST:-empty}"
+ target: "/var/data/archives"
+ - type: "bind"
+ source: "${CLP_STAGED_ARCHIVE_OUTPUT_DIR_HOST:-empty}"
+ target: "/var/data/staged-archives"
- type: "bind"
source: "${CLP_TMP_DIR_HOST:-./var/tmp}"
target: "/var/tmp"Alternatively, if the shorthand syntax is preferred, consolidate all mounts to that format.
Also applies to: 268-271
🤖 Prompt for AI Agents
In tools/deployment/package/docker-compose.base.yaml around lines 265 and
268–271, the volumes list mixes shorthand string mounts and explicit YAML object
mounts; convert the shorthand entries on lines 265 and 268 into the same
explicit object form used on 269–271. Replace each shorthand
"${ENV_VAR:-default}:/container/path" with an object using type: bind, source:
the environment-expanded host path, and target: the container path (and include
read_only: true if the matching explicit entries are read-only), so all volume
entries use the explicit YAML mapping style consistently.
It's useful for debugging when things go wrong, so I'd prefer to persist it for now. |
… Remove hack in `var/data` conditional mounts (resolves y-scope#1478). (y-scope#1480)
Description
This change adds support for a configurable temporary directory used by CLP runtime
components and removes the prior conditional mount hack around
/var/datain the Docker Compose packageorchestration.
This change is BREAKING because now temporary runtime data is stored in the configurable tmp_directory (default
var/tmp, mounted as/var/tmpin the container) instead of thedata_directory(/var/data), andWorkerConfignow usestmp_directoryfor temp artifacts.Summary of changes:
clp_py_utils.clp_configCLP_DEFAULT_TMP_DIRECTORY_PATHdefaulting tovar/tmp.tmp_directorytoCLPConfigwith absolute-path transformation,validate_tmp_dir(),and container translation in
transform_for_container().WorkerConfigto usetmp_directory(replacingdata_directory) for temp files.CLP_TMP_DIR_HOSTenv for compose.*-log-paths.txt) to be created underworker_config.tmp_directory.tmp_directoryoption and constraints./var/datamount targets; added a straightforwardbind mount for
${CLP_TMP_DIR_HOST:-./var/tmp} -> /var/tmpand direct mounts for archives andstaged archives. This eliminates the
/tmpindirection hack.Checklist
breaking change.
Validation performed
Summary by CodeRabbit
New Features
Configuration Changes
Improvements