feat(clp-package)!: Refactor input log config so that S3 source locations are specified in the config file rather than through the CLI.#788
Conversation
WalkthroughThe pull request updates the handling of log input configuration across CLP components. Changes include modifying the Changes
Possibly related PRs
Suggested reviewers
Tip ⚡💬 Agentic Chat (Pro Plan, General Availability)
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
docs/src/user-guide/guides-using-object-storage/clp-config.md (1)
30-32: Consistent sentence structure could be improvedThree consecutive bullet points begin with the same pattern. Consider varying the sentence structure to improve readability.
* `<region-code>` is the AWS region [code][aws-region-codes] for the bucket. * `<bucket-name>` is the bucket's name. - * `<key-prefix>` is the prefix of all logs you wish to compress and should be the same as the + * For `<key-prefix>`, specify the prefix of all logs you wish to compress. This should match the `<all-logs-prefix>` value from the [compression IAM policy][compression-iam-policy].🧰 Tools
🪛 LanguageTool
[style] ~30-~30: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...s the bucket's name. *<key-prefix>is the prefix of all logs you wish to comp...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
59-60: Simplified S3 path handling lacks clarityThe logic for S3 input has been simplified to write the first path directly instead of constructing a URL, but the code lacks clarity about what this path represents.
Consider adding a comment to explain the meaning of this path:
with open(container_logs_list_path, "w") as container_logs_list_file: + # Write the key prefix path - will be appended to the S3 key_prefix in the config container_logs_list_file.write(f"{parsed_args.paths[0]}\n")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
components/clp-package-utils/clp_package_utils/general.py(3 hunks)components/clp-package-utils/clp_package_utils/scripts/compress.py(8 hunks)components/clp-package-utils/clp_package_utils/scripts/native/compress.py(6 hunks)components/clp-package-utils/clp_package_utils/scripts/start_clp.py(2 hunks)components/clp-py-utils/clp_py_utils/clp_config.py(7 hunks)components/clp-py-utils/clp_py_utils/s3_utils.py(0 hunks)components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py(1 hunks)components/package-template/src/etc/clp-config.yml(1 hunks)docs/src/user-guide/guides-using-object-storage/clp-config.md(2 hunks)docs/src/user-guide/guides-using-object-storage/clp-usage.md(2 hunks)docs/src/user-guide/quick-start-compression/json.md(1 hunks)docs/src/user-guide/quick-start-compression/text.md(1 hunks)
💤 Files with no reviewable changes (1)
- components/clp-py-utils/clp_py_utils/s3_utils.py
🧰 Additional context used
🧬 Code Definitions (5)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
s3_get_object_metadata(29-65)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
StorageType(52-54)
components/clp-package-utils/clp_package_utils/general.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (2)
StorageType(52-54)validate_logs_input_config(597-602)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (2)
CLPConfig(561-697)validate_logs_input_config(597-602)components/job-orchestration/job_orchestration/scheduler/job_config.py (2)
S3InputConfig(30-38)FsInputConfig(23-27)components/clp-py-utils/clp_py_utils/create-db-tables.py (1)
main(18-43)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
components/clp-py-utils/clp_py_utils/core.py (2)
make_config_path_absolute(24-35)validate_path_could_be_dir(47-54)
🪛 LanguageTool
docs/src/user-guide/guides-using-object-storage/clp-config.md
[style] ~30-~30: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...s the bucket's name. * <key-prefix> is the prefix of all logs you wish to comp...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
🔇 Additional comments (33)
docs/src/user-guide/quick-start-compression/text.md (1)
6-6: Command syntax simplified by removing thefssubcommand.The removal of the
fssubcommand from the compression command simplifies the usage pattern, aligning with the PR's objective to eliminate fs/s3 subcommands for a unified command structure.docs/src/user-guide/quick-start-compression/json.md (1)
6-6: Command syntax simplified by removing thefssubcommand.The updated command syntax allows for direct specification of the timestamp key and paths without requiring the
fssubcommand. This change makes the command structure more intuitive and simpler to use.components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (1)
137-137: Improved terminology consistency from "URL" to "path".The error message and documentation have been updated to use "path" instead of "URL" which provides more consistent terminology when referring to S3 object locations. This aligns with how S3 objects are conceptually addressed in the codebase.
Also applies to: 143-143
components/package-template/src/etc/clp-config.yml (1)
5-7: Configuration structure updated to support different storage types.The configuration has been changed from a simple string-based
input_logs_directoryto a structuredlogs_inputobject with explicittypeanddirectoryfields. This change aligns with the PR's objective to replaceinput_logs_dirwith a union ofFsStorageandS3Storage.components/clp-package-utils/clp_package_utils/scripts/start_clp.py (2)
601-605: Conditional mount for input logs directory based on storage typeThis change appropriately ensures that the input logs directory is only mounted when using the file system storage type for the compression scheduler, which is a logical improvement. This aligns with the changes in the configuration structure that now support different storage types.
748-749: Consistent application of storage type check for worker componentGood implementation of the same storage type conditional check in the worker component, maintaining consistency with the scheduler component changes. This ensures that input log directories are only mounted when the storage type is FS.
components/clp-package-utils/clp_package_utils/general.py (3)
23-23: Added StorageType importGood addition of the StorageType import which is needed for the storage type checks added in this file.
220-227: Conditional configuration of input logs directoryThis change correctly implements the conditional configuration of input logs directory based on storage type. The container configuration is now properly set up only when using file system storage, which aligns with the changes in start_clp.py.
499-499: Updated validation method callThe validation method call has been correctly updated to use
validate_logs_input_config()which reflects the new configuration structure that supports multiple storage types.docs/src/user-guide/guides-using-object-storage/clp-usage.md (4)
8-15: Simplified command syntax for S3 compressionThe documentation has been appropriately updated to reflect the new simplified command syntax. The removal of the
s3subcommand and AWS credentials file option makes the interface more straightforward.
17-18: Updated prefix description with config referenceGood update to the prefix description that now correctly references the logs-input s3_config, providing users with a clear link to the relevant configuration documentation.
21-22: Consistent terminology in limitation noteThe limitation note has been correctly updated to use "prefix" terminology instead of "URL," maintaining consistency with the new configuration approach.
32-33: Added helpful reference linksThe additional reference links to compression IAM policy and logs-input S3 configuration enhance the documentation and help users find related information.
docs/src/user-guide/guides-using-object-storage/clp-config.md (2)
9-33: New section for input logs configurationThis new section excellently documents how to configure CLP for compressing logs from S3. The YAML example and detailed explanations for each configuration property provide clear guidance for users.
🧰 Tools
🪛 LanguageTool
[style] ~30-~30: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...s the bucket's name. *<key-prefix>is the prefix of all logs you wish to comp...(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
104-104: Added reference link for compression IAM policyGood addition of the compression IAM policy reference link which complements the new input logs configuration section.
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (4)
144-150: Improved S3 configuration structureThe refactoring to use
s3_configdirectly fromclp_config.logs_inputinstead of parsing URLs and command-line arguments improves code maintainability and aligns with the project's objective to streamline configuration handling.
173-208: Clean command line argument simplificationThe refactoring eliminates subparsers for FS and S3 input types, simplifying the command-line interface as intended in the PR objectives. The configuration file now drives the input type determination, which is a more structured approach.
208-208: Updated validation method callThe validation method call has been updated to
validate_logs_input_config()which aligns with the changes in theCLPConfigclass where this method has been introduced to validate the newlogs_inputstructure.
219-219: Properly updated function callThe call to
_generate_clp_io_confighas been correctly updated to include the newclp_configparameter, maintaining consistency with the function signature change.components/clp-package-utils/clp_package_utils/scripts/compress.py (5)
33-33: Configuration-driven input typeUsing
clp_config.logs_input.typeinstead of command-line arguments to determine the input type aligns with the PR objective to unify configuration handling. This change makes the code more maintainable and consistent.
180-182: Conditional Docker mount for FS inputThe Docker mount for
input_logs_diris now only applied when the input type is FS, which aligns with the PR objective. This change correctly avoids mounting unnecessary volumes when using S3 as the input source.
195-195: Correctly updated log list generationThe call to
_generate_logs_listhas been updated to include theclp_configparameter, maintaining consistency with the function signature change.
201-201: Properly updated command generationThe call to
_generate_compress_cmdhas been updated to include theclp_configparameter, ensuring consistent behavior with the function signature changes.
117-120:Details
❓ Verification inconclusive
More restrictive S3 input validation
The validation for S3 input is now more restrictive, enforcing that only one key prefix can be specified and that path list files are not supported. This is consistent with the changes in the configuration structure, but users of the previous API may be impacted.
Verify that this change has been properly documented for users:
🏁 Script executed:
#!/bin/bash # Check if this change is documented in README or documentation files grep -r "key prefix" --include="*.md" components/ grep -r "path list" --include="*.md" components/Length of output: 102
Action Required: Verify Documentation for S3 Input Validation Changes
The new S3 validation changes – enforcing exactly one key prefix and disallowing path list files – appear to have been implemented correctly in the code. However, our initial documentation search in Markdown files beneath the
components/directory did not return any references to “key prefix” or “path list.” Please manually verify if these changes have been documented elsewhere (for example, in a central README or docs folder). If no documentation exists, kindly update the user documentation accordingly.components/clp-py-utils/clp_py_utils/clp_config.py (9)
386-388: Simplified S3Storage serializationThe
dump_to_primitive_dictmethod has been simplified to just return the dictionary representation of the S3Storage object, which is sufficient since it doesn't contain any path attributes that need special handling.
391-406: Good class hierarchy for S3 storageThe introduction of
OutputS3Storageas a subclass ofS3Storagecreates a cleaner separation between input and output storage configurations. This improves the modularity of the code and makes it easier to understand the different use cases.
409-411: Useful InputFsStorage classThe addition of
InputFsStoragewith a default directory path of "/" makes the configuration more consistent and easier to work with, following the same pattern as the output storage classes.
421-426: Updated inheritance to use new OutputS3StorageThe
ArchiveS3StorageandStreamS3Storageclasses now correctly inherit fromOutputS3Storageinstead ofS3Storage, aligning with the new class hierarchy and ensuring they have the necessarystaging_directoryattribute.
429-443: Updated storage utility functionsThe utility functions
_get_directory_from_storage_configand_set_directory_for_storage_confighave been updated to use the more specificOutputS3Storagetype instead ofS3Storage, reflecting the new class hierarchy.
564-564: Restructured logs input configurationThe
input_logs_directoryattribute has been replaced withlogs_input, which is a union ofInputFsStorageandS3Storage. This change supports the PR objective to provide a more unified and flexible way to configure both FS and S3 input sources.
588-589: Conditional path resolutionThe
make_config_paths_absolutemethod now conditionally resolves paths only for FS input, which is appropriate since S3 storage doesn't require local path resolution.
597-602: Improved validation methodThe new
validate_logs_input_configmethod provides focused validation for the logs input configuration, checking the validity of the directory only for FS input types. This ensures that invalid configurations are caught early.
690-690: Updated serialization for logs inputThe
dump_to_primitive_dictmethod has been updated to include thelogs_inputconfiguration in the output, ensuring that all necessary configuration is properly serialized.
…compress.py Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
|
High level comment: We might need to ensure the same method can work for cloud. Some changes that I can think of:
|
| #input_logs_directory: "/" | ||
| #logs_input: | ||
| # type: "fs" | ||
| # directory: "/" |
There was a problem hiding this comment.
Unrelated this PR. We might want to confirm the behavior if the directory is not "/".
For example, if the directory is set to "/a/b/c", and we want to compress "/a/b/c/d", do we type full path or relative path for compress.py?
There was a problem hiding this comment.
For "fs" you type full path
| def validate_logs_input_config(self): | ||
| if StorageType.FS == self.logs_input.type: | ||
| try: | ||
| validate_path_could_be_dir(self.logs_input.directory) |
There was a problem hiding this comment.
Note to myself and Nit: The difference here is that we don't enforce the directory to exist. Need to think if there's way to deal with the inconsistent behavior
There was a problem hiding this comment.
Thinking about this again, let's add an extra boolean argument, such as "require_exists" to indicate if the directory must exist.
This will ensure the check behavior for worker's input log directory stays unchanged after this PR.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
7-7: Update import to be more explicitThe change from specific imports to a more general
Listimport makes the code less explicit about what types are being used. Consider using more specific imports when possible, such asfrom typing import List, Tupleif Tuple is also needed.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
components/clp-package-utils/clp_package_utils/scripts/compress.py(8 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
- GitHub Check: lint-check (macos-latest)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: build (macos-latest)
🔇 Additional comments (8)
components/clp-package-utils/clp_package_utils/scripts/compress.py (8)
31-31: Good addition of input_type parameterAdding the
input_typeparameter to_generate_logs_listmakes the function more flexible and allows it to process different input types correctly. This supports the PR's objective of unifying command handling for different storage types.
65-69: Simplified function signatureThe function signature has been reformatted but with the same parameters. Consider removing unnecessary type annotations if they're not being used in the function body.
106-118: Enhanced S3 input validationAdding
clp_configparameter and improving validation for S3 input is a good change. The validation now correctly ensures only one key prefix is specified and path list files aren't supported, which aligns with the S3 storage model.
133-146: Improved command-line arguments structureThe command-line argument structure has been simplified by removing separate subcommands for different input types. This creates a more unified interface for users and simplifies the codebase, aligning with the PR objective to eliminate separate
fsands3subcommands.
162-162: Good use of configuration valueDirectly using
clp_config.logs_input.typeinstead of parsing from command-line arguments simplifies the code and makes it more maintainable. This addresses a previous review comment about using the configuration value.
177-179: Conditionally mount input logs directoryThe conditional mounting of the Docker input logs directory only when using the
fstype is a good change that aligns with the PR objective. This ensures that resources are only mounted when necessary, improving efficiency.
192-192: Correct propagation of input typePassing
clp_config.logs_input.typeto_generate_logs_listis consistent with the function's updated signature. This ensures that the logs list is generated correctly based on the input type.
59-59:Details
❓ Verification inconclusive
Verify S3 path handling
For S3 input, only the first path from
parsed_args.pathsis written to the logs list. Ensure this is the intended behaviour and that multiple paths aren't expected for S3 input (which would be consistent with the validation in_validate_s3_input_args).
🏁 Script executed:
#!/bin/bash # Check if there are any callers expecting multiple S3 paths rg "paths\s*=\s*\[.*,.*\].*InputType\.S3" -g "*.py"Length of output: 53
S3 Path Handling – Please Confirm the Intended Behaviour
After running the search for cases where multiple S3 paths might be passed, no evidence was found suggesting that any part of the codebase expects more than one S3 path (consistent with the logic in
_validate_s3_input_args). This supports the current approach of writing only the first element ofparsed_args.pathsto the logs. However, since the search did not yield any results, please manually verify that the intended design for S3 inputs is to only handle a single path.
- File:
components/clp-package-utils/clp_package_utils/scripts/compress.py- Line: 59
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (1)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
1-211: Consider responding to previous review comments.A previous review comment asked if
clp_config.logs_input.typecould be used directly instead of passing the entireclp_config. This change has been implemented in some places but could be extended to other parts of the code for consistency.🧰 Tools
🪛 GitHub Actions: clp-lint
[error] 1-1: Ruff would reformat the file. Please run 'ruff check .' to fix formatting issues.
🧹 Nitpick comments (1)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
1-211: Fix formatting issues.The pipeline failure indicates that Ruff would reformat the file. Please run 'ruff check .' to fix these formatting issues.
🧰 Tools
🪛 GitHub Actions: clp-lint
[error] 1-1: Ruff would reformat the file. Please run 'ruff check .' to fix formatting issues.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
components/clp-package-utils/clp_package_utils/scripts/compress.py(8 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
components/clp-package-utils/clp_package_utils/scripts/compress.py (2)
components/job-orchestration/job_orchestration/scheduler/job_config.py (1)
InputType(11-13)components/clp-py-utils/clp_py_utils/clp_config.py (1)
StorageEngine(47-49)
🪛 GitHub Actions: clp-lint
components/clp-package-utils/clp_package_utils/scripts/compress.py
[error] 1-1: Ruff would reformat the file. Please run 'ruff check .' to fix formatting issues.
🔇 Additional comments (10)
components/clp-package-utils/clp_package_utils/scripts/compress.py (10)
7-7: Import update matches function signature changes.The import has been changed from
TupletoListto reflect the updated function signatures in the file.
28-32: Good enhancement to the function signature.Adding the explicit
input_typeparameter makes the function more modular and clearer in its purpose. This change aligns with the PR objective of reworking the input configuration.
65-69: Function signature updated for command generation.The signature has been updated to be more explicit about the parameters. The function no longer handles AWS credentials as those are now part of the configuration.
107-118: Improved validation logic for S3 inputs.The function now takes a specific
storage_engineparameter instead of the entire config object, following the principle of least privilege. The validation logic has been simplified and properly enforces the constraints for S3 input.
133-146: Command line argument definitions simplified.The command line interface has been streamlined to remove AWS-specific arguments, aligning with the PR objective of unifying arguments for both input types.
162-162: Direct access to input type from configuration.The input type is now obtained directly from
clp_config.logs_input.typeinstead of through parsed arguments, making the code more straightforward and consistent with the configuration structure changes.
166-166: Proper validation call with specific parameter.The updated validation function is correctly called with the specific
storage_engineparameter instead of the entire config object.
177-179: Efficient conditional Docker mounting.The Docker mount for input logs directory is now only applied when necessary (for FS input type), improving efficiency. This change aligns with the PR objective regarding Docker mount handling.
192-192: Properly passing input type to generate logs list.The function call correctly passes the input type from the configuration, ensuring consistent handling of both input types.
59-59:Details
❓ Verification inconclusive
Verify S3 path handling is correct.
The S3 input handling has been simplified to write only the first path from
parsed_args.paths. Please ensure this correctly handles all potential S3 input scenarios.
🏁 Script executed:
#!/bin/bash # Check how paths[0] is used in other parts of the codebase for S3 inputs rg -A 3 -B 3 "parsed_args.paths\[0\]" --type pyLength of output: 784
S3 Input Handling – Verification Required
After reviewing the code incomponents/clp-package-utils/clp_package_utils/scripts/compress.py, it appears that for S3 inputs only the first path (parsed_args.paths[0]) is used. The grep output confirms that this is the sole handling for S3 paths.Please ensure that this approach is in line with the intended design. In particular, if the system is expected to support multiple S3 paths, additional logic may be required rather than processing only the first element. If a single S3 path is all that’s intended, then consider adding a comment or documentation to clarify this behaviour.
haiqi96
left a comment
There was a problem hiding this comment.
Looks good to me, pending Kirk's review
Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
docs/src/user-guide/guides-using-object-storage/clp-usage.md (1)
35-37:⚠️ Potential issueUnused reference link definition
The
compression-iam-policylink reference is defined but not used in the document.-[compression-iam-policy]: ./object-storage-config.md#configuration-for-compressionAlternatively, if this reference is intended to be used elsewhere in the document, add a reference to it in the appropriate location.
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
35-35: Link and image reference definitions should be needed
Unused link or image reference definition: "compression-iam-policy"(MD053, link-image-reference-definitions)
🧹 Nitpick comments (2)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
597-606: Validation method implements required directory checksThe
validate_logs_input_configmethod properly checks that the directory exists and is a valid directory, but only for the FS storage type.Consider adding a boolean parameter to control whether directory existence is checked, as mentioned in a previous comment:
"Thinking about this again, let's add an extra boolean argument, such as "require_exists" to indicate if the directory must exist.
This will ensure the check behavior for worker's input log directory stays unchanged after this PR."
- def validate_logs_input_config(self): + def validate_logs_input_config(self, require_exists: bool = True): if StorageType.FS == self.logs_input.type: # NOTE: This can't be a pydantic validator since input_logs_dir might be a # package-relative path that will only be resolved after pydantic validation input_logs_dir = self.logs_input.directory - if not input_logs_dir.exists(): + if require_exists and not input_logs_dir.exists(): raise ValueError(f"logs_input.directory '{input_logs_dir}' doesn't exist.") - if not input_logs_dir.is_dir(): + if require_exists and not input_logs_dir.is_dir(): raise ValueError(f"logs_input.directory '{input_logs_dir}' is not a directory.")components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
105-119: S3 input validation improvedThe validation for S3 input has been simplified and now clearly enforces the constraints for S3 usage.
Consider making the error message more specific by directly referencing "paths" instead of "key prefix":
if len(parsed_args.paths) != 1: - args_parser.error(f"Only one key prefix can be specified for input type {InputType.S3}.") + args_parser.error(f"Only one path prefix can be specified for input type {InputType.S3}.")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
components/clp-package-utils/clp_package_utils/scripts/compress.py(8 hunks)components/clp-package-utils/clp_package_utils/scripts/native/compress.py(6 hunks)components/clp-package-utils/clp_package_utils/scripts/start_clp.py(2 hunks)components/clp-py-utils/clp_py_utils/clp_config.py(7 hunks)docs/src/user-guide/guides-using-object-storage/clp-usage.md(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- components/clp-package-utils/clp_package_utils/scripts/start_clp.py
- components/clp-package-utils/clp_package_utils/scripts/native/compress.py
🧰 Additional context used
🧬 Code Graph Analysis (2)
components/clp-package-utils/clp_package_utils/scripts/compress.py (2)
components/job-orchestration/job_orchestration/scheduler/job_config.py (1)
InputType(11-13)components/clp-py-utils/clp_py_utils/clp_config.py (1)
StorageEngine(47-49)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
components/clp-py-utils/clp_py_utils/core.py (1)
make_config_path_absolute(24-35)
🪛 markdownlint-cli2 (0.17.2)
docs/src/user-guide/guides-using-object-storage/clp-usage.md
35-35: Link and image reference definitions should be needed
Unused link or image reference definition: "compression-iam-policy"
(MD053, link-image-reference-definitions)
🔇 Additional comments (19)
components/clp-py-utils/clp_py_utils/clp_config.py (9)
386-388: Simple and effective implementationThis new
dump_to_primitive_dictmethod in the S3Storage class handles serialization of the object to a primitive dictionary, which is necessary for configuration dumping.
391-406: Good separation of concerns with the new class hierarchyCreating this
OutputS3Storageclass that inherits fromS3Storageis a clean approach to separate input and output storage configurations, allowing thestaging_directoryto be specifically tied to output operations.
409-411: Good default configurationThe
InputFsStorageclass provides a sensible default directory value of/which makes it easier to create new configurations.
421-426: Class hierarchy update is appropriateChanging
ArchiveS3StorageandStreamS3Storageto inherit fromOutputS3Storageinstead of directly fromS3Storageis a logical consequence of the new class structure.
429-439: Function signature update is consistent with the class changesThe updated parameter type for
_get_directory_from_storage_configcorrectly reflects the new class hierarchy, usingOutputS3Storageinstead ofS3Storage.
441-451: Function signature update is consistent with the class changesThe updated parameter type for
_set_directory_for_storage_configcorrectly reflects the new class hierarchy, usingOutputS3Storageinstead ofS3Storage.
564-564: Core configuration change aligns with PR objectivesReplacing
input_logs_directorywith the more flexiblelogs_inputunion type enables support for both filesystem and S3 storage types, which is a key objective of this PR.
588-596: Path resolution logic properly updatedThe conditional check for
StorageType.FSbefore callingmake_config_paths_absoluteis appropriate since S3 input doesn't need path resolution.
693-693: Dictionary conversion updated to include logs_inputThe
dump_to_primitive_dictmethod is correctly updated to include the newlogs_inputconfiguration in the output.components/clp-package-utils/clp_package_utils/scripts/compress.py (7)
31-32: Function signature improvedPassing
input_typedirectly as a parameter rather than extracting it fromparsed_argsis a better approach that increases code flexibility and clarity.
56-59: S3 input processing simplifiedThe S3 logs list generation now simply writes the first path instead of constructing a complex URL. This aligns with the PR objective of handling S3 parameters through the configuration rather than command-line arguments.
64-88: Command generation simplifiedThe
_generate_compress_cmdfunction has been streamlined by removing AWS credential parameters, which now come from the configuration. This makes the function simpler and more focused.
134-147: Command-line arguments streamlinedThe command-line arguments have been reorganized to present a unified set of arguments, removing the separation between
fsands3subcommands. This simplifies the user experience.
163-169: Input type determination simplifiedUsing
clp_config.logs_input.typedirectly is cleaner than extracting it from command-line arguments. This centralizes the configuration and aligns with the PR objective.
178-181: Conditional mounting of input logs directoryThe Docker mount for
input_logs_diris now only applied for thefstype, which aligns with the PR objectives and prevents unnecessary volume mounts.
193-193: Updated function callThe call to
_generate_logs_listhas been updated to pass the input type from the configuration, which is consistent with the updated function signature.docs/src/user-guide/guides-using-object-storage/clp-usage.md (3)
8-15: Command usage documentation updatedThe documentation has been updated to reflect the removal of the
s3subcommand, which aligns with the code changes incompress.py.
17-21: Parameter description is clear and helpfulThe explanation of the
<prefix>parameter clearly describes that it must be relative tologs-input.s3_config.key_prefixand includes a helpful example to illustrate the relationship.
24-31: Limitation description updatedThe description of the S3 compression limitations has been updated to refer to a "single prefix" instead of a "single URL", which is consistent with the code changes.
🧰 Tools
🪛 LanguageTool
[grammar] ~27-~27: Possible subject-verb agreement error detected.
Context: ... path to the log file. However, if that log file's path is a prefix of another log ...(THIS_THAT_AGR)
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
docs/src/user-guide/guides-using-object-storage/clp-usage.md (2)
34-36: Unused reference links in documentation.The reference links
[add-iam-policy]and[aws-region-codes]are defined but not used in the document. Additionally,[compression-iam-policy]is defined but not referenced.Consider removing these unused reference links or updating the document to use them where appropriate.
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
34-34: Link and image reference definitions should be needed
Unused link or image reference definition: "add-iam-policy"(MD053, link-image-reference-definitions)
35-35: Link and image reference definitions should be needed
Unused link or image reference definition: "aws-region-codes"(MD053, link-image-reference-definitions)
36-36: Link and image reference definitions should be needed
Unused link or image reference definition: "compression-iam-policy"(MD053, link-image-reference-definitions)
37-37: Missing empty line at end of file.The new reference link for logs-input S3 configuration has been added, but the file should end with an empty line according to your coding standards.
[logs-input-s3-config]: ./clp-config.md#configuration-for-input-logs +
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
components/clp-package-utils/clp_package_utils/scripts/compress.py(8 hunks)components/clp-package-utils/clp_package_utils/scripts/native/compress.py(6 hunks)components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py(1 hunks)components/package-template/src/etc/clp-config.yml(1 hunks)docs/src/user-guide/guides-using-object-storage/clp-usage.md(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
- components/package-template/src/etc/clp-config.yml
🧰 Additional context used
🧬 Code Graph Analysis (2)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (2)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
main(121-207)components/clp-package-utils/clp_package_utils/general.py (1)
get_clp_home(93-109)
components/clp-package-utils/clp_package_utils/scripts/compress.py (2)
components/job-orchestration/job_orchestration/scheduler/job_config.py (1)
InputType(11-13)components/clp-py-utils/clp_py_utils/clp_config.py (1)
StorageEngine(47-49)
🪛 LanguageTool
docs/src/user-guide/guides-using-object-storage/clp-usage.md
[grammar] ~28-~28: Possible subject-verb agreement error detected.
Context: ...3_config.key_prefix`). However, if that log file's path is a prefix of another log ...
(THIS_THAT_AGR)
🪛 markdownlint-cli2 (0.17.2)
docs/src/user-guide/guides-using-object-storage/clp-usage.md
34-34: Link and image reference definitions should be needed
Unused link or image reference definition: "add-iam-policy"
(MD053, link-image-reference-definitions)
35-35: Link and image reference definitions should be needed
Unused link or image reference definition: "aws-region-codes"
(MD053, link-image-reference-definitions)
36-36: Link and image reference definitions should be needed
Unused link or image reference definition: "compression-iam-policy"
(MD053, link-image-reference-definitions)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (21)
docs/src/user-guide/guides-using-object-storage/clp-usage.md (4)
8-9: Improved command reference for S3 log compression.The documentation now correctly reflects the updated approach for compressing logs from S3, using the
sbin/compress.shscript instead of a subcommand structure.
14-14: Command syntax simplified and aligned with new configuration approach.The command now only requires the timestamp key and prefix parameter, eliminating the need for explicitly providing AWS credentials.
17-21: Clear explanation of the prefix parameter with helpful example.Good documentation update that explains how the prefix is now relative to the configured
logs-input.s3_config.key_prefix, with a clear example that helps users understand the path construction.
24-31: Improved explanation of single log file compression.The note now correctly explains the limitations when compressing a single log file, clarifying that paths must be relative to the configured key prefix.
🧰 Tools
🪛 LanguageTool
[grammar] ~28-~28: Possible subject-verb agreement error detected.
Context: ...3_config.key_prefix`). However, if that log file's path is a prefix of another log ...(THIS_THAT_AGR)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (8)
13-13: Updated import to use CLPConfig instead of S3Credentials.This change appropriately supports the new configuration approach for logs input.
129-131: Function signature updated to use CLPConfig as the first parameter.The function now receives the CLPConfig object as its first parameter, which follows the project's convention of organizing parameters from stable to dynamic.
132-132: Input type now extracted from CLPConfig instead of arguments.This is a good refactoring that centralizes the configuration in the CLPConfig object.
142-142: Fixed missing 'raise' keyword in exception handling.Proper exception raising is now implemented, ensuring the function fails appropriately when too many key prefixes are provided.
144-150: Improved S3 configuration handling.The code now properly retrieves S3 configuration from the CLPConfig object's logs_input.s3_config instead of parsing URLs, which is more maintainable and consistent with the new configuration approach.
173-202: New unified main function replaces subcommand structure.The function has been reworked to eliminate separate subcommands for different input types, which simplifies the interface and makes the code more maintainable.
219-219: Updated call to _generate_clp_io_config with CLPConfig parameter.The call has been updated to include the CLPConfig object as the first parameter, aligning with the updated function signature.
208-209:Details
✅ Verification successful
Validate both logs input config and logs directory.
The code now calls both validation methods. However, there seems to be a potential issue:
validate_logs_dir()is still called aftervalidate_logs_input_config(), but based on the relevant code snippets, it appears thatvalidate_logs_input_config()might already handle directory validation for FS storage.Please verify if both validation calls are necessary, or if
validate_logs_dir()is now redundant sincevalidate_logs_input_config()might already cover this validation.
🏁 Script executed:
#!/bin/bash # Check implementation of both validation methods to determine if they're redundant rg -A 5 "validate_logs_input_config" --glob "*.py" rg -A 5 "validate_logs_dir" --glob "*.py"Length of output: 7982
Validation calls are necessary—both methods check distinct parts of the configuration.
After verifying the implementations, it appears that
validate_logs_input_config()is responsible for checking the existence of the directory specified in the logs input configuration (i.e.self.logs_input.directory) whenStorageType.FSis used, whereasvalidate_logs_dir()independently validates theself.logs_directoryproperty. Thus, even though they both deal with log directories, they serve different purposes and are both required.components/clp-package-utils/clp_package_utils/scripts/compress.py (9)
29-32: Function signature updated to receive input type as parameter.The function now takes the input type directly rather than extracting it from parsed_args, which is a clean improvement that makes the function more focused.
58-58: Simplified S3 path handling.For S3 input, the script now writes the first path directly instead of constructing a URL. This aligns with the new approach where S3 parameters are specified in the configuration rather than through command-line arguments.
64-68: Simplified compress command generation.The function signature has been simplified, removing AWS credential parameters. This aligns with the new approach where S3 credentials are specified in the configuration.
105-119: Simplified S3 input validation.The validation function has been updated to focus only on validating that S3 input is used with the correct storage engine and that only one key prefix is specified. The AWS credential validation has been removed as expected.
134-147: Simplified command-line arguments.The command-line arguments have been simplified to remove AWS credential options, making the interface cleaner and more consistent with the new configuration approach.
163-168: Input type now extracted from configuration.The code now gets the input type from CLPConfig.logs_input.type, which centralizes configuration and makes the code more maintainable.
178-180: Conditional mounting of input logs directory.This improvement ensures that the input logs directory is only mounted when file system input is used, which is more efficient.
193-193: Updated call to _generate_logs_list with input type parameter.The call has been updated to include the input type from the configuration, aligning with the updated function signature.
1-212:Details
❓ Verification inconclusive
Comprehensive refactoring to align with new configuration model.
The entire script has been effectively refactored to use the new logs_input configuration model, removing AWS credential handling from the script level and relying on the configuration object. This change simplifies the user interface and makes the code more maintainable.
One potential concern is to ensure that all documentation and existing automation scripts are updated to reflect these changes, as the command-line interface has changed significantly.
Verify that all documentation and automation scripts that use this script have been updated to reflect the new command-line interface.
🏁 Script executed:
#!/bin/bash # Search for references to the old CLI format with aws-credentials-file rg -A 2 "\-\-aws-credentials-file" --glob "*.md" --glob "*.sh" --glob "*.py"Length of output: 78
Action Required: Manual Verification of Documentation and Automation Scripts
The automated search for the deprecated
--aws-credentials-fileflag did not return any results, suggesting that legacy CLI options have been removed. However, because no output was produced from the search, please manually verify that all related documentation and automation scripts have been successfully updated to reflect the new command-line interface changes, especially those that interact with this script.
kirkrodrigues
left a comment
There was a problem hiding this comment.
For the PR title, how about:
feat(clp-package)!: Refactor input log config so that S3 sources are specified in the config file rather than through the CLI.
…ions are specified in the config file rather than through the CLI. (y-scope#788) Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
…ions are specified in the config file rather than through the CLI. (y-scope#788) Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
Description
This PR introduces changes to the log input directory for both FS and S3 input types, as well as changes to arguments for the command line compression scripts.
input_logs_dirfield withlogs_input, a Union ofFsStorageandS3Storage.input_logs_dirDocker mount to only mount if typefsis selected forlogs_input.fsands3subcommands and use the same arguments for either input type.s3_configoflogs_inputinstead of command line URL and arguments.Checklist
breaking change.
Validation performed
Manually validated package running in different configurations.
Summary by CodeRabbit