fix(clp-package): Move FileMetadata into a module that's safe to import from scripts run outside the execution container (fixes #895).#1105
Conversation
…rt from scripts run outside the execution container (fixes y-scope#895).
WalkthroughThe changes relocate the Changes
Suggested reviewers
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
components/clp-py-utils/clp_py_utils/compression.py(1 hunks)components/clp-py-utils/clp_py_utils/core.py(1 hunks)components/clp-py-utils/clp_py_utils/s3_utils.py(1 hunks)components/job-orchestration/job_orchestration/scheduler/compress/partition.py(1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.
Learnt from: haiqi96
PR: y-scope/clp#594
File: components/clp-package-utils/clp_package_utils/scripts/del_archives.py:56-65
Timestamp: 2024-11-18T16:49:20.248Z
Learning: When reviewing wrapper scripts in `components/clp-package-utils/clp_package_utils/scripts/`, note that it's preferred to keep error handling simple without adding extra complexity.
Learnt from: jackluo923
PR: y-scope/clp#718
File: components/core/tools/scripts/utils/create-debian-package.py:41-41
Timestamp: 2025-02-12T22:24:17.723Z
Learning: For the clp-core Debian package creation script, strict version format validation is considered unnecessary complexity and should be avoided.
Learnt from: jackluo923
PR: y-scope/clp#1054
File: components/core/tools/scripts/lib_install/musllinux_1_2/install-prebuilt-packages.sh:27-32
Timestamp: 2025-07-07T17:41:15.655Z
Learning: In CLP installation scripts, consistency across platform scripts is prioritized over defensive programming improvements. For example, when extracting Task binaries with tar in `install-prebuilt-packages.sh`, the extraction pattern should remain consistent with other platform scripts rather than adding defensive flags like `--strip-components=1` to handle potential tarball layout changes.
Learnt from: kirkrodrigues
PR: y-scope/clp#881
File: components/core/tools/scripts/lib_install/ubuntu-jammy/install-prebuilt-packages.sh:35-41
Timestamp: 2025-05-06T09:48:55.408Z
Learning: For installation scripts in the CLP project, prefer explicit error handling over automatic dependency resolution (like `apt-get install -f`) when installing packages to give users more control over their system.
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Learnt from: haiqi96
PR: y-scope/clp#619
File: components/core/src/clp/clp/decompression.cpp:313-313
Timestamp: 2024-12-05T16:32:21.507Z
Learning: In the C++ `FileDecompressor` implementations at `components/core/src/clp/clp/FileDecompressor.cpp` and `components/core/src/glt/glt/FileDecompressor.cpp`, the `temp_output_path` variable and associated logic are used to handle multiple compressed files with the same name, and should be kept. This logic is separate from the temporary output directory code removed in PR #619 and is necessary for proper file handling.
Learnt from: haiqi96
PR: y-scope/clp#634
File: components/job-orchestration/job_orchestration/executor/compress/fs_compression_task.py:0-0
Timestamp: 2024-12-16T21:49:06.483Z
Learning: In `fs_compression_task.py`, when refactoring the archive processing loop in Python, the `src_archive_file.unlink()` operation should remain outside of the `process_archive` function.
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
components/job-orchestration/job_orchestration/scheduler/compress/partition.py (2)
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Learnt from: haiqi96
PR: y-scope/clp#634
File: components/job-orchestration/job_orchestration/executor/compress/fs_compression_task.py:0-0
Timestamp: 2024-12-16T21:49:06.483Z
Learning: In `fs_compression_task.py`, when refactoring the archive processing loop in Python, the `src_archive_file.unlink()` operation should remain outside of the `process_archive` function.
components/clp-py-utils/clp_py_utils/compression.py (4)
Learnt from: haiqi96
PR: y-scope/clp#634
File: components/job-orchestration/job_orchestration/executor/compress/fs_compression_task.py:0-0
Timestamp: 2024-12-16T21:49:06.483Z
Learning: In `fs_compression_task.py`, when refactoring the archive processing loop in Python, the `src_archive_file.unlink()` operation should remain outside of the `process_archive` function.
Learnt from: haiqi96
PR: y-scope/clp#619
File: components/core/src/clp/clp/decompression.cpp:313-313
Timestamp: 2024-12-05T16:32:21.507Z
Learning: In the C++ `FileDecompressor` implementations at `components/core/src/clp/clp/FileDecompressor.cpp` and `components/core/src/glt/glt/FileDecompressor.cpp`, the `temp_output_path` variable and associated logic are used to handle multiple compressed files with the same name, and should be kept. This logic is separate from the temporary output directory code removed in PR #619 and is necessary for proper file handling.
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
🧬 Code Graph Analysis (3)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
components/clp-py-utils/clp_py_utils/core.py (1)
FileMetadata(7-22)
components/job-orchestration/job_orchestration/scheduler/compress/partition.py (1)
components/clp-py-utils/clp_py_utils/core.py (1)
FileMetadata(7-22)
components/clp-py-utils/clp_py_utils/compression.py (1)
components/clp-py-utils/clp_py_utils/core.py (1)
FileMetadata(7-22)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (3)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
25-25: No legacyFileMetadataimports remainVerified with ripgrep that there are zero occurrences of:
from clp_py_utils.compression import FileMetadata— all imports have been successfully migrated.
components/job-orchestration/job_orchestration/scheduler/compress/partition.py (1)
11-11: Import update looks rightThe switch to
clp_py_utils.corekeeps this module decoupled from the heavy compression dependencies. No issues spotted.components/clp-py-utils/clp_py_utils/compression.py (1)
6-7: Good decouplingReplacing the local definition with an import from
coreremoves duplication and avoids the Levenshtein→C-extension issue. 👍
| class FileMetadata: | ||
| __slots__ = ("path", "size", "estimated_uncompressed_size") | ||
|
|
||
| def __init__(self, path: pathlib.Path, size: int): | ||
| self.path = path | ||
| self.size = size | ||
| self.estimated_uncompressed_size = size | ||
|
|
||
| filename = path.name | ||
| if any(filename.endswith(extension) for extension in [".gz", ".gzip", ".tgz", ".tar.gz"]): | ||
| self.estimated_uncompressed_size *= 13 | ||
| elif any( | ||
| filename.endswith(extension) | ||
| for extension in [".zstd", ".zstandard", ".tar.zstd", ".tar.zstandard"] | ||
| ): | ||
| self.estimated_uncompressed_size *= 8 | ||
|
|
There was a problem hiding this comment.
🧹 Nitpick (assertive)
Replace magic numbers & make extension check case-insensitive
Two quick wins:
- The multiplication factors (13, 8) are magic numbers that will be hard to trace later.
filename.endswith()is case-sensitive; “.GZ” or “.Tgz” would be missed.
Consider extracting both the constants and extension sets, then lower-casing once:
-GZIP_RATIO = 13
-ZSTD_RATIO = 8
-GZIP_EXTS = (".gz", ".gzip", ".tgz", ".tar.gz")
-ZSTD_EXTS = (".zstd", ".zstandard", ".tar.zstd", ".tar.zstandard")
def __init__(self, path: pathlib.Path, size: int):
...
- filename = path.name
- if any(filename.endswith(extension) for extension in GZIP_EXTS):
- self.estimated_uncompressed_size *= GZIP_RATIO
- elif any(filename.endswith(extension) for extension in ZSTD_EXTS):
- self.estimated_uncompressed_size *= ZSTD_RATIO
+ filename = path.name.lower()
+ if any(filename.endswith(ext) for ext in GZIP_EXTS):
+ self.estimated_uncompressed_size *= GZIP_RATIO
+ elif any(filename.endswith(ext) for ext in ZSTD_EXTS):
+ self.estimated_uncompressed_size *= ZSTD_RATIOThis improves readability and robustness without altering behaviour.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In components/clp-py-utils/clp_py_utils/core.py around lines 7 to 23, replace
the magic numbers 13 and 8 with named constants to improve code clarity. Also,
make the file extension checks case-insensitive by converting the filename to
lowercase once before checking extensions. Extract the extension lists into
constants for better readability and maintainability.
FileMetadata into module that's safe to import from scripts run outside the execution container (fixes #895).FileMetadata into a module that's safe to import from scripts run outside the execution container (fixes #895).
…port from scripts run outside the execution container (fixes y-scope#895). (y-scope#1105)
Description
As #895 describes, starting the package on certain operating systems results in an error due to importing a transitive dependency,
Levenshtein.levenshtein_cpp. This native module is likely incompatible with the operating system that the package is running on, resulting in the error.To resolve the issue, this PR moves the direct dependency (
FileMetadata) that's necessary when starting CLP intoclp_py_utils.core, that should be safe to import from scripts that run outside the execution container.Checklist
breaking change.
Validation performed
Summary by CodeRabbit
Refactor
FileMetadataclass for consistent file metadata handling across modules.FileMetadata.Chores