fix(clp-package): Resolve relative paths and symlinks in host input and output paths passed through the package's config or CLI arguments (fixes #1558).#1559
Conversation
y-scope#1558); Resolve symlinks recursively in `resolve_host_path_in_container`.
WalkthroughThis change introduces a new Changes
Sequence DiagramsequenceDiagram
participant User as User Input
participant Script as compress.py/<br/>decompress.py
participant resolve_host_path as resolve_host_path()
participant resolve_symlinks as _resolve_symlinks_<br/>in_path()
participant Container as resolve_host_path_<br/>in_container()
User->>Script: Provide host path (relative or absolute)
Script->>resolve_host_path: resolve_host_path(Path)
activate resolve_host_path
resolve_host_path->>Container: Call resolve_host_path_in_container()
activate Container
Container->>Container: Expand user tilde<br/>Convert relative→absolute<br/>(via CLP_PWD_HOST)
Container->>resolve_symlinks: _resolve_symlinks_in_path()
activate resolve_symlinks
resolve_symlinks->>resolve_symlinks: Walk components,<br/>follow symlinks<br/>(with inode tracking)
alt Symlinks resolved
resolve_symlinks-->>Container: Resolved Path
else Resolution failed
resolve_symlinks-->>Container: None
end
deactivate resolve_symlinks
alt Got resolved symlinks
Container-->>resolve_host_path: Container path (resolved)
else Fallback to mount root
Container-->>resolve_host_path: Container path (mount-rooted)
end
deactivate Container
resolve_host_path->>resolve_host_path: Convert to host-space<br/>(rooted at /)
resolve_host_path-->>Script: Host Path
deactivate resolve_host_path
Script->>Script: Use resolved path
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Possibly related issues
Pre-merge checks and finishing touches✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…ath_in_container`.
…ath_in_container`.
resolve_host_path_in_container.resolve_host_path_in_container.
There was a problem hiding this comment.
see the validation steps for the rationale of adding those extra handling in resolve_host_path_in_container
kirkrodrigues
left a comment
There was a problem hiding this comment.
For the PR title, how about:
fix(clp-package): Resolve symlinks in host input and output paths passed through the package's config or CLI arguments (fixes #1558).
Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
…ymlink handling logic.
There was a problem hiding this comment.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/clp-py-utils/clp_py_utils/core.py(2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
🪛 Ruff (0.14.3)
components/clp-py-utils/clp_py_utils/core.py
81-81: Avoid specifying long messages outside the exception class
(TRY003)
108-108: Consider moving this statement to an else block
(TRY300)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: package-image
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (4)
components/clp-py-utils/clp_py_utils/core.py (4)
1-1: LGTM!The
osimport is necessary for accessing theCLP_PWD_HOSTenvironment variable.
68-74: LGTM!The docstring clearly describes the function's behaviour and appropriately references
resolve_host_pathfor resolution details.
78-83: Verify that.absolute()is necessary.After joining a relative
host_pathwithpwd_host, the result should already be absolute ifpwd_hostitself is absolute. The explicit.absolute()call on line 83 might be redundant.If you want to be defensive against cases where
pwd_hostmight be relative, consider validating thatpwd_hostis absolute instead:if not host_path.is_absolute(): pwd_host = os.getenv("CLP_PWD_HOST") if pwd_host is None: raise ValueError("Relative host path provided but CLP_PWD_HOST is not set.") + pwd_host = pathlib.Path(pwd_host) + if not pwd_host.is_absolute(): + raise ValueError("CLP_PWD_HOST must be an absolute path.") - host_path = pathlib.Path(pwd_host) / host_path - host_path = host_path.absolute() + host_path = pwd_host / host_path
116-128: LGTM!The new
resolve_host_pathhelper is well-designed with a clear docstring. The implementation correctly delegates toresolve_host_path_in_containerand translates the result back to host path space.
There was a problem hiding this comment.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/clp-py-utils/clp_py_utils/core.py(3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1414
File: tools/docker-images/clp-package/Dockerfile:20-24
Timestamp: 2025-10-13T03:32:19.293Z
Learning: In the clp repository's Dockerfiles (e.g., tools/docker-images/clp-package/Dockerfile), ENV directives should be split into separate lines for readability rather than consolidated to reduce layer count. This is especially true for PATH modifications, as agreed upon in PR #1166. Later ENV settings may depend on earlier ones (e.g., referencing CLP_HOME).
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-10-22T21:02:31.113Z
Learning: Repository y-scope/clp: Maintain deterministic CI/builds for Rust; add a check to verify Cargo.lock is in sync with Cargo.toml without updating dependencies (non-mutating verification in clp-rust-checks workflow).
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1466
File: .github/workflows/clp-rust-checks.yaml:14-15
Timestamp: 2025-10-22T21:14:12.225Z
Learning: Repository y-scope/clp: In GitHub Actions workflows (e.g., .github/workflows/clp-rust-checks.yaml), YAML anchors/aliases are acceptable and preferred to avoid duplication; if actionlint flags an alias node (e.g., on push.paths) as an error, treat it as a tool limitation and do not require inlining unless the team asks to silence the warning.
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1261
File: docs/src/dev-guide/components-core/manylinux-2-28-deps-install.md:24-24
Timestamp: 2025-08-25T06:32:48.313Z
Learning: In the CLP project documentation, when linking to scripts or other files within the repository, use relative paths (e.g., components/core/tools/scripts/...) rather than commit-pinned GitHub URLs to ensure docs and referenced files always belong to the same commit and stay synchronized.
Learnt from: haiqi96
Repo: y-scope/clp PR: 1144
File: components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py:164-170
Timestamp: 2025-08-13T15:07:37.767Z
Learning: In the CLP codebase, the ArchiveOutput.get_directory() method already returns a resolved Path object, so additional .resolve() calls on its return value are redundant.
Learnt from: haiqi96
Repo: y-scope/clp PR: 1144
File: components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py:164-170
Timestamp: 2025-08-13T15:07:37.767Z
Learning: In the CLP codebase, the ArchiveOutput.get_directory() method already returns a resolved Path object, so additional .resolve() calls on its return value are redundant.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-package-utils/clp_package_utils/controller.py:315-315
Timestamp: 2025-10-17T19:59:25.596Z
Learning: In components/clp-package-utils/clp_package_utils/controller.py, worker log directories (compression_worker, query_worker, reducer) created via `mkdir()` do not need `_chown_paths_if_root()` calls because directories are created with the same owner as the script caller. This differs from infrastructure service directories (database, queue, Redis, results cache) which do require explicit ownership changes.
Learnt from: haiqi96
Repo: y-scope/clp PR: 1144
File: components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py:164-170
Timestamp: 2025-08-13T15:07:37.767Z
Learning: In the CLP codebase, the ArchiveOutput.get_directory() method already returns a resolved/absolute Path object because FsStorage.make_config_paths_absolute() processes the directory path during configuration loading, making additional .resolve() calls redundant.
Learnt from: haiqi96
Repo: y-scope/clp PR: 1144
File: components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py:164-170
Timestamp: 2025-08-13T15:07:37.767Z
Learning: In the CLP codebase, the ArchiveOutput.get_directory() method already returns a resolved/absolute Path object because FsStorage.make_config_paths_absolute() processes the directory path during configuration loading, making additional .resolve() calls redundant.
🪛 Ruff (0.14.3)
components/clp-py-utils/clp_py_utils/core.py
96-96: Avoid specifying long messages outside the exception class
(TRY003)
144-144: Consider moving this statement to an else block
(TRY300)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: package-image
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (2)
components/clp-py-utils/clp_py_utils/core.py (2)
1-1: LGTM!The
osimport is necessary for accessing environment variables in the updated resolution logic.
68-81: LGTM!The new
resolve_host_pathfunction provides a clean public API for host-space path resolution. The implementation correctly delegates toresolve_host_path_in_containerand strips the/mnt/hostprefix to return the resolved host path.
| def resolve_host_path_in_container(host_path: pathlib.Path) -> pathlib.Path: | ||
| """ | ||
| Translates a host path to its container-mount equivalent. It also resolves a single level of | ||
| symbolic link if the host path itself is a symlink. | ||
| Resolves a host path and translates it into its container-mount equivalent absolute path. See | ||
| `resolve_host_path` for details about the resolution. | ||
|
|
||
| :param host_path: The host path. | ||
| :return: The translated path. | ||
| :return: The translated path (with /mnt/host prefix). | ||
| """ | ||
| host_path = host_path.absolute() | ||
| translated_path = CONTAINER_DIR_FOR_HOST_ROOT / host_path.relative_to("/") | ||
| host_path = host_path.expanduser() | ||
|
|
||
| try: | ||
| if not translated_path.is_symlink(): | ||
| return translated_path | ||
|
|
||
| link_target = translated_path.readlink() | ||
| if link_target.is_absolute(): | ||
| return CONTAINER_DIR_FOR_HOST_ROOT / link_target.relative_to("/") | ||
| else: | ||
| # If the symlink points to a relative path, resolve it relative to the symlink's parent. | ||
| return (translated_path.parent / link_target).resolve() | ||
| except OSError: | ||
| # Ignore if reading the symlink fails (e.g., broken link or permission error). | ||
| pass | ||
| if not host_path.is_absolute(): | ||
| pwd_host = os.getenv("CLP_PWD_HOST") | ||
| if pwd_host is None: | ||
| raise ValueError("Relative host path provided but CLP_PWD_HOST is not set.") | ||
| host_path = pathlib.Path(pwd_host) / host_path | ||
| host_path = host_path.absolute() | ||
|
|
||
| return translated_path | ||
| resolved_path = _resolve_symlinks_in_path(host_path) | ||
| if resolved_path is not None: | ||
| return resolved_path | ||
|
|
||
| return CONTAINER_DIR_FOR_HOST_ROOT / host_path.relative_to("/") |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
LGTM!
The enhanced resolution logic correctly handles tilde expansion, relative paths via CLP_PWD_HOST, and delegates symlink resolution to the new helper function. The fallback behaviour is appropriate.
Optional: Consider extracting the error message (line 96).
Static analysis suggests defining the error message outside the exception class for reusability, though the current inline approach is acceptable for this single use case.
🧰 Tools
🪛 Ruff (0.14.3)
96-96: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In components/clp-py-utils/clp_py_utils/core.py around lines 83 to 104, extract
the inline error string used when CLP_PWD_HOST is missing into a named constant
(e.g., RELATIVE_PATH_PWD_HOST_ERROR = "Relative host path provided but
CLP_PWD_HOST is not set.") at module scope, then use that constant when raising
the ValueError so the message is reusable and easier to test/maintain (replace
the inline string in raise ValueError(...) with the constant).
| def _resolve_symlinks_in_path(host_path: pathlib.Path) -> pathlib.Path | None: | ||
| """ | ||
| Resolves symlinks in a path by walking through each component and following symlinks. | ||
|
|
||
| :param host_path: The host path. | ||
| :return: The resolved path with all symlinks followed, or None on error. | ||
| """ | ||
| try: | ||
| visited_symlink_inodes = set() | ||
| current_path = CONTAINER_DIR_FOR_HOST_ROOT | ||
| for part in host_path.relative_to("/").parts: | ||
| current_path = current_path / part | ||
|
|
||
| while current_path.is_symlink(): | ||
| stat = current_path.lstat() | ||
| if stat.st_ino in visited_symlink_inodes: | ||
| break | ||
| visited_symlink_inodes.add(stat.st_ino) | ||
|
|
||
| link_target = current_path.readlink() | ||
| if link_target.is_absolute(): | ||
| current_path = CONTAINER_DIR_FOR_HOST_ROOT / link_target.relative_to("/") | ||
| else: | ||
| # If the symlink points to a relative path, resolve it relative to the symlink's | ||
| # parent. | ||
| current_path = (current_path.parent / link_target).resolve() | ||
|
|
||
| return current_path | ||
| except OSError: | ||
| # Ignore if reading the symlink fails (e.g., broken link or permission error). | ||
| return None |
There was a problem hiding this comment.
Critical: Remove .resolve() usage that contradicts PR objectives.
Line 142 uses .resolve(), which contradicts the PR's core objective of removing .resolve() usages that could misinterpret paths inside bootstrap containers. This call resolves symlinks in the container's filesystem context rather than the host's context, bypassing the cycle detection mechanism implemented in this helper.
As noted in the previous review, the .resolve() call should be replaced with manual path normalization.
Apply this diff to manually normalize the relative symlink target:
else:
# If the symlink points to a relative path, resolve it relative to the symlink's
# parent.
- current_path = (current_path.parent / link_target).resolve()
+ normalized_path_str = os.path.normpath(str(current_path.parent / link_target))
+ current_path = pathlib.Path(normalized_path_str)This approach normalizes the path (handling .. and . components) without resolving symlinks, allowing the outer loop to continue handling any additional symlinks encountered.
Optional: Consider restructuring the try-except for clarity.
Static analysis suggests moving the return current_path statement (line 144) to an else block to follow the TRY300 style guideline, though the current structure is acceptable.
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.3)
144-144: Consider moving this statement to an else block
(TRY300)
🤖 Prompt for AI Agents
components/clp-py-utils/clp_py_utils/core.py around lines 117-147: the use of
.resolve() on line ~142 must be removed because it resolves symlinks in the
container FS and bypasses the helper's cycle-detection; replace that .resolve()
call with manual path normalization: compute the candidate path as
current_path.parent / link_target, then normalize its parts by processing each
component (skip '.'; on '..' remove the previous component if any; otherwise
append the part) to produce a normalized path without following symlinks, and
assign that normalized path to current_path so the outer loop can continue
detecting and handling any subsequent symlinks; keep the visited_symlink_inodes
logic intact. Optionally, you can move the final return current_path into an
else block of the try for clearer flow.
|
Retested following the "Validation performed" steps |
kirkrodrigues
left a comment
There was a problem hiding this comment.
See title suggested in previous review.
resolve_host_path_in_container.…nd output paths passed through the package's config or CLI arguments (fixes y-scope#1558). (y-scope#1559) Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
Description
This change strengthens host-path handling and removes
.resolve()usages that could misinterpret paths inside bootstrap containers of the sbin scripts.Summary of changes:
resolve_host_path_in_containerto:~to the user’s home directory.CLP_PWD_HOST(with a clear error if unset)./mnt/host.resolve_host_path(host_path: Path) -> Path, returning the fully resolved host-space path (without the/mnt/hostprefix).resolve_host_path(...)forlogs_input.directorywhen building container configs.Path(...).resolve()withresolve_host_path(...)for input paths and lists.extraction_dirbefore translation; pass through toresolve_host_path_in_container(...).--extraction-dirto"."instead of relying onCLP_PWD_HOSTat arg-parse time (resolution now happens in core).resolve_host_path(extraction_dir)as the source for the docker bind mount.Checklist
breaking change.
Validation performed
Tested
compress.sh- input paths that are relative, contains ~, symlinks, or regular but whose parent is a symlink can now be properly ingestedTested the decompression script in various directories
Reset and restarted the package with below settings that set the input directory to a symlink one
->
->
which is expected. The rationale can be seen in the clp-config.yml comment.
Summary by CodeRabbit
Release Notes