Skip to content

feat(clp-package): Add support for ingesting from custom S3 endpoints (resolves #1758).#1767

Merged
hoophalab merged 14 commits into
y-scope:mainfrom
hoophalab:compatible-pr
Dec 16, 2025
Merged

feat(clp-package): Add support for ingesting from custom S3 endpoints (resolves #1758).#1767
hoophalab merged 14 commits into
y-scope:mainfrom
hoophalab:compatible-pr

Conversation

@hoophalab

@hoophalab hoophalab commented Dec 10, 2025

Copy link
Copy Markdown
Contributor

Description

  1. Add a new endpoint_url field to both the S3InputConfig and S3Url classes.
  2. When the endpoint isn't amazonaws.com, both clp-s and the compression worker switch to path-style URLs.
  3. Update the regex to allow path-style urls. (tbh, we should switch to a true url parser in the future.)

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  1. run minio/localstack and setup a authentication key
  2. setup log_input section in clp-config.yaml
logs_input:
  type: "s3"
  aws_authentication:
    type: "credentials"
    credentials:
      access_key_id: "id"
      secret_access_key: "key"
  1. compress-from-s3.sh --timestamp-key timestamp --dataset default s3-key-prefix http://<endpoint>/test-bucket/log-file-name.log

Summary by CodeRabbit

  • New Features

    • Support for custom S3 endpoints propagated across S3 clients and operations.
    • S3 URL generation now emits virtual‑hosted or path‑style links.
  • Improvements

    • S3 config accepts optional endpoint and optional region for greater flexibility.
    • S3 parsing/generation is endpoint‑aware, enforces consistent endpoint/region/bucket, computes common key prefixes, and detects duplicate keys.
    • S3 deletion and upload flows respect configured endpoint.
  • Tests

    • Added unit tests covering both S3 URL styles.

✏️ Tip: You can customize this high-level summary in your review settings.

@hoophalab hoophalab requested a review from a team as a code owner December 10, 2025 23:22
@coderabbitai

coderabbitai Bot commented Dec 10, 2025

Copy link
Copy Markdown
Contributor

Walkthrough

Adds endpoint-aware S3 support across Python and C++: S3 URL parsing now yields endpoint_url and common key prefix; endpoint_url is propagated into S3 configs, client creation, URL generation, and delete APIs; C++ signer detects virtual-hosted vs path-style and preserves scheme/endpoint in canonicalization and presign formatting.

Changes

Cohort / File(s) Change Summary
S3 Configuration & Defaults
components/clp-py-utils/clp_py_utils/clp_config.py
S3Config gains `endpoint_url: str
S3 URL Parsing & Client (Python)
components/clp-py-utils/clp_py_utils/s3_utils.py
Added regex constants; _create_s3_client() accepts endpoint_url and optional region_code; parse_s3_url() now returns (endpoint_url, region_code, bucket_name, key_prefix); generate_s3_url(endpoint_url, region_code, bucket, object_key) replaces prior generator; S3 ops updated to be endpoint-aware; s3_delete_by_key_prefix() signature extended.
Compression Utilities (Python)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py
_parse_and_validate_s3_object_urls() signature changed to return (endpoint_url, region_code, bucket_name, key_prefix, key_list); validates uniform endpoint/region/bucket, detects duplicate keys, computes common key prefix; call sites updated to pass endpoint_url into S3InputConfig.
Dataset Management (Python)
components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py
_try_deleting_archives_from_s3() now passes s3_config.endpoint_url as the first argument to s3_delete_by_key_prefix().
Job Orchestration — Compression (Python)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
Replaced generate_s3_virtual_hosted_style_url with generate_s3_url; _generate_s3_logs_list() now calls generate_s3_url(endpoint_url, region_code, bucket, object_key).
Job Orchestration — Query (Python)
components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py, components/job-orchestration/job_orchestration/executor/query/fs_search_task.py
Imports and call sites updated to use generate_s3_url(endpoint_url, region_code, bucket, object_key).
S3 Delete Callsites (Python)
components/clp-package-utils/.../dataset_manager.py, components/clp-py-utils/.../s3_utils.py
s3_delete_by_key_prefix() now accepts endpoint_url and optional region_code; callers updated to include endpoint_url.
AWS C++ Constants
components/core/src/clp/aws/constants.hpp
Added regex constants: cSchemeRegex, cEndpointRegex, cRegionRegex, cBucketRegex, cKeyRegex.
AWS C++ URL Parsing & Signing (Headers)
components/core/src/clp/aws/AwsAuthenticationSigner.hpp
Added enum class Style { VirtualHost, Path }, m_style, m_scheme, and accessors get_style() and get_scheme(); included <cstdint>.
AWS C++ Signing Logic (Impl)
components/core/src/clp/aws/AwsAuthenticationSigner.cpp
URL parsing now detects virtual-hosted vs path-style, extracts endpoint/host/scheme/region/bucket/key; canonical request and presigned URL formatting vary by detected style and scheme.
Tests (C++)
components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp, components/core/CMakeLists.txt
New unit tests added for virtual-hosted and path-style URL parsing (including scheme/port) and registered in CMake.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant CLI as compress.py / tasks
  participant PyUtils as clp_py_utils.s3_utils
  participant S3 as S3 endpoint
  participant CppSigner as AwsAuthenticationSigner (C++)

  CLI->>PyUtils: _parse_and_validate_s3_object_urls(urls)
  PyUtils-->>CLI: (endpoint_url, region_code, bucket, key_prefix, keys)
  CLI->>PyUtils: call S3 ops with S3InputConfig(endpoint_url,...)
  PyUtils->>S3: create client(endpoint_url, region) -> request metadata/delete/put
  S3-->>PyUtils: responses (metadata/ack)
  CLI->>CppSigner: request presigned URL (bucket,key,endpoint,scheme)
  CppSigner->>CppSigner: detect Style (VirtualHost/Path), compute canonical request
  CppSigner-->>CLI: presigned URL (formatted per style and scheme)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Focus review on endpoint propagation and signature changes in:
    • _parse_and_validate_s3_object_urls and its call sites in compress.py.
    • s3_utils.py: regexes, parse_s3_url/generate_s3_url, _create_s3_client and all client-creation call sites.
    • C++ AwsAuthenticationSigner parsing/canonicalization for style/scheme correctness.
    • Updated tests and CMake registration.

Possibly related issues

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 43.48% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: adding support for custom S3 endpoints, which is the primary objective across all modified files.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 77eaa75 and 92cc898.

📒 Files selected for processing (4)
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py (6 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py (1 hunks)
  • components/clp-py-utils/clp_py_utils/clp_config.py (1 hunks)
  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py (2 hunks)
🧰 Additional context used
🧠 Learnings (12)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/core/src/clp/aws/constants.hpp:10-16
Timestamp: 2025-12-12T16:22:15.181Z
Learning: AWS S3 region codes follow the pattern `[region]-[zone]-[digit]` where the digit is always a single decimal digit (e.g., us-east-1, eu-west-3, ap-northeast-2). Custom S3-compatible endpoints (MinIO, LocalStack) may use different region naming conventions.
📚 Learning: 2025-12-10T23:32:24.363Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
  • components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py
📚 Learning: 2025-04-25T20:46:20.140Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-04-07T09:30:20.684Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 743
File: components/log-viewer-webui/server/src/plugins/S3Manager.ts:76-76
Timestamp: 2025-04-07T09:30:20.684Z
Learning: In the S3Manager plugin, an `isEnabled` option should be added and checked instead of checking if the region is null. This would make the plugin's behavior more explicit.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-11-17T22:58:50.056Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-11-17T22:58:50.056Z
Learning: In the y-scope/clp repository, when enabling new linting tools (ruff, mypy) on Python components, the team uses an incremental approach: first enable the tools with errors allowed (exit code 0), apply only safe auto-fixable fixes, then address remaining issues in follow-up PRs. During the initial enablement PR, reviews should focus on correctness of auto-fixes rather than flagging new code quality issues.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-12-12T16:22:15.181Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/core/src/clp/aws/constants.hpp:10-16
Timestamp: 2025-12-12T16:22:15.181Z
Learning: AWS S3 region codes follow the pattern `[region]-[zone]-[digit]` where the digit is always a single decimal digit (e.g., us-east-1, eu-west-3, ap-northeast-2). Custom S3-compatible endpoints (MinIO, LocalStack) may use different region naming conventions.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-12-12T21:14:28.267Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/s3_utils.py:226-230
Timestamp: 2025-12-12T21:14:28.267Z
Learning: In the CLP project (y-scope/clp), ensure that regex patterns are compiled inline within functions for Python files under components/clp-py-utils/clp_py_utils (not at module scope). Move any re.compile calls from the module top level into the functions that use them to maintain consistency with the codebase style. This applies to all .py files in that directory; use module-level imports only for imports, not for pre-compiling regex patterns.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py
📚 Learning: 2025-10-07T07:54:32.427Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-py-utils/clp_py_utils/clp_config.py:47-47
Timestamp: 2025-10-07T07:54:32.427Z
Learning: In components/clp-py-utils/clp_py_utils/clp_config.py, the CONTAINER_AWS_CONFIG_DIRECTORY constant is intentionally set to pathlib.Path("/") / ".aws" (i.e., `/.aws`) rather than a user-specific home directory. This hardcoded path is part of the container orchestration design.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py
📚 Learning: 2025-08-13T14:48:49.020Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 1144
File: components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py:106-114
Timestamp: 2025-08-13T14:48:49.020Z
Learning: For the dataset manager scripts in components/clp-package-utils/clp_package_utils/scripts/, the native script (native/dataset_manager.py) is designed to only be called through the wrapper script (dataset_manager.py), so dataset validation is only performed at the wrapper level rather than duplicating it in the native script.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py
📚 Learning: 2024-11-15T16:21:52.122Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py
🧬 Code graph analysis (2)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
  • generate_s3_url (257-284)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (2)
components/job-orchestration/job_orchestration/scheduler/job_config.py (1)
  • S3InputConfig (32-44)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
  • parse_s3_url (219-254)
🪛 Ruff (0.14.8)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py

258-261: Avoid specifying long messages outside the exception class

(TRY003)


264-267: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: package-image
  • GitHub Check: build (macos-15)
🔇 Additional comments (7)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (2)

29-33: LGTM!

The import change correctly reflects the API update from generate_s3_virtual_hosted_style_url to the more flexible generate_s3_url that supports both virtual-hosted and path-style URLs based on the endpoint configuration.


167-175: LGTM!

The function call correctly propagates endpoint_url from s3_input_config to enable custom S3 endpoint support. The parameter ordering matches the generate_s3_url signature, and the variable rename to s3_url appropriately reflects that both virtual-hosted and path-style URLs are now supported.

Note: The error handling concern raised in the previous review remains valid.

components/clp-py-utils/clp_py_utils/clp_config.py (1)

591-592: LGTM! Custom S3 endpoint support correctly implemented.

The addition of endpoint_url and making region_code optional properly supports custom S3-compatible endpoints (MinIO, LocalStack) that use path-style URLs without AWS region codes, whilst maintaining backwards compatibility with AWS S3 endpoints.

Based on learnings, this change was previously reviewed and confirmed safe for downstream consumers.

components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py (1)

144-144: LGTM! Endpoint URL correctly propagated to S3 delete operation.

The addition of s3_config.endpoint_url as the first parameter correctly aligns with the updated function signature in s3_utils.py, enabling S3 delete operations to work with custom endpoints.

components/clp-package-utils/clp_package_utils/scripts/native/compress.py (3)

175-188: LGTM! S3 object compression correctly handles endpoint URL.

The unpacking of 5 values from _parse_and_validate_s3_object_urls (including endpoint_url) and propagation to S3InputConfig is correct and consistent with the function's updated signature.


194-205: LGTM! S3 key prefix compression correctly handles endpoint URL.

The unpacking of 4 values from parse_s3_url (including endpoint_url) and propagation to S3InputConfig is correct and aligns with the function signature in s3_utils.py.


228-285: LGTM! S3 URL parsing and validation logic is comprehensive and correct.

The function correctly:

  • Parses endpoint_url from S3 URLs
  • Validates that all URLs share the same endpoint, region, and bucket
  • Detects duplicate keys
  • Computes the common key prefix
  • Returns the 5-tuple (endpoint_url, region_code, bucket_name, key_prefix, key_list)

The None-handling for AWS endpoints (where endpoint_url would be None) is correct, as equality comparisons will work properly.

Note: The static analysis hints (TRY003) about long exception messages are minor style concerns and do not affect correctness.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (1)

228-245: Return type annotation and docstring are inconsistent with actual return values.

  1. The return type on line 230 is tuple[str, str, str, str, list[str]], but endpoint_url and region_code can be None (as per parse_s3_url's return type tuple[str | None, str | None, str, str]).
  2. The docstring lists only 4 items in the return tuple but there are now 5, and endpoint_url is missing from the list.

Apply this diff to fix:

 def _parse_and_validate_s3_object_urls(
     urls: list[str],
-) -> tuple[str, str, str, str, list[str]]:
+) -> tuple[str | None, str | None, str, str, list[str]]:
     """
     Parses and validates S3 object URLs.

     The validation will ensure:
+    - All URLs have the same endpoint.
     - All URLs have the same region and bucket.
     - No duplicate keys among the URLs.
     - The URLs share a non-empty common prefix.

     :param urls:
     :return: A tuple containing:
-        - The region code.
+        - The endpoint URL (None for AWS default).
+        - The region code (may be None).
         - The bucket.
         - The common key prefix.
         - The list of keys.
     :raises ValueError: If the validation fails.
     """
components/clp-py-utils/clp_py_utils/s3_utils.py (1)

353-369: Docstring missing endpoint_url parameter documentation.

The function signature now includes endpoint_url: str | None as the first parameter, but the docstring doesn't document it.

Apply this diff:

 def s3_delete_by_key_prefix(
     endpoint_url: str | None,
     region_code: str | None,
     bucket_name: str,
     key_prefix: str,
     s3_auth: AwsAuthentication,
 ) -> None:
     """
     Deletes all objects under the S3 path `bucket_name`/`key_prefix`.

+    :param endpoint_url: The S3-compatible endpoint URL, or None for AWS default.
     :param region_code:
     :param bucket_name:
     :param key_prefix:
     :param s3_auth:
     :raises: ValueError if any parameter is invalid.
     :raises: Propagates `boto3.client.delete_objects`'s exceptions.
     """
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ca43337 and 784f8a0.

📒 Files selected for processing (10)
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py (5 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py (1 hunks)
  • components/clp-py-utils/clp_py_utils/clp_config.py (1 hunks)
  • components/clp-py-utils/clp_py_utils/s3_utils.py (8 hunks)
  • components/core/src/clp/aws/AwsAuthenticationSigner.cpp (4 hunks)
  • components/core/src/clp/aws/AwsAuthenticationSigner.hpp (3 hunks)
  • components/core/src/clp/aws/constants.hpp (1 hunks)
  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py (2 hunks)
  • components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py (2 hunks)
  • components/job-orchestration/job_orchestration/executor/query/fs_search_task.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/aws/constants.hpp
  • components/core/src/clp/aws/AwsAuthenticationSigner.hpp
  • components/core/src/clp/aws/AwsAuthenticationSigner.cpp
🧠 Learnings (10)
📓 Common learnings
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/job-orchestration/job_orchestration/scheduler/job_config.py:29-39
Timestamp: 2025-01-15T16:36:48.932Z
Learning: The S3 service in the clp codebase only supports AWS S3, where region_code is mandatory. Other S3-like services are not supported.
📚 Learning: 2025-01-13T21:18:54.629Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 662
File: components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py:167-186
Timestamp: 2025-01-13T21:18:54.629Z
Learning: The s3_put API in clp_py_utils.s3_utils internally handles timeout and retry mechanisms for S3 operations.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py
  • components/job-orchestration/job_orchestration/executor/query/fs_search_task.py
📚 Learning: 2024-11-15T16:21:52.122Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py
📚 Learning: 2025-08-13T14:48:49.020Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 1144
File: components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py:106-114
Timestamp: 2025-08-13T14:48:49.020Z
Learning: For the dataset manager scripts in components/clp-package-utils/clp_package_utils/scripts/, the native script (native/dataset_manager.py) is designed to only be called through the wrapper script (dataset_manager.py), so dataset validation is only performed at the wrapper level rather than duplicating it in the native script.

Applied to files:

  • components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py
📚 Learning: 2025-04-25T20:46:20.140Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
  • components/core/src/clp/aws/AwsAuthenticationSigner.cpp
  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-01-15T16:36:48.932Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/job-orchestration/job_orchestration/scheduler/job_config.py:29-39
Timestamp: 2025-01-15T16:36:48.932Z
Learning: The S3 service in the clp codebase only supports AWS S3, where region_code is mandatory. Other S3-like services are not supported.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py
📚 Learning: 2025-10-07T07:54:32.427Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-py-utils/clp_py_utils/clp_config.py:47-47
Timestamp: 2025-10-07T07:54:32.427Z
Learning: In components/clp-py-utils/clp_py_utils/clp_config.py, the CONTAINER_AWS_CONFIG_DIRECTORY constant is intentionally set to pathlib.Path("/") / ".aws" (i.e., `/.aws`) rather than a user-specific home directory. This hardcoded path is part of the container orchestration design.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py
📚 Learning: 2025-01-17T23:25:38.165Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 673
File: components/log-viewer-webui/server/src/S3Manager.js:0-0
Timestamp: 2025-01-17T23:25:38.165Z
Learning: In S3Manager.js, URL validation is handled by the URL constructor which throws TypeError for invalid URLs, making additional scheme (s3://) validation redundant.

Applied to files:

  • components/core/src/clp/aws/AwsAuthenticationSigner.cpp
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.

Applied to files:

  • components/core/src/clp/aws/AwsAuthenticationSigner.cpp
🧬 Code graph analysis (6)
components/core/src/clp/aws/AwsAuthenticationSigner.hpp (1)
components/core/src/clp_s/Schema.hpp (1)
  • m_schema (51-54)
components/job-orchestration/job_orchestration/executor/query/fs_search_task.py (1)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
  • generate_s3_url (264-288)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
  • generate_s3_url (264-288)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (2)
components/job-orchestration/job_orchestration/scheduler/job_config.py (1)
  • S3InputConfig (32-44)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
  • parse_s3_url (218-261)
components/core/src/clp/aws/AwsAuthenticationSigner.cpp (1)
components/core/src/clp/aws/AwsAuthenticationSigner.hpp (1)
  • s3_url (103-104)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • AwsAuthentication (547-577)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: package-image
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build (ubuntu-24.04)
  • GitHub Check: build (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (23)
components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py (2)

17-21: LGTM!

The import change correctly reflects the updated S3 utility API.


103-108: LGTM!

The call to generate_s3_url correctly passes endpoint_url as the first parameter, followed by region_code, bucket, and object_key. Error handling is appropriately preserved.

components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)

28-32: LGTM!

The import change correctly reflects the updated S3 utility API.

components/job-orchestration/job_orchestration/executor/query/fs_search_task.py (2)

15-19: LGTM!

The import change correctly reflects the updated S3 utility API.


80-85: LGTM!

The call to generate_s3_url correctly passes endpoint_url as the first parameter, with appropriate error handling preserved.

components/core/src/clp/aws/AwsAuthenticationSigner.hpp (4)

43-46: LGTM!

The Style enum cleanly distinguishes between virtual-hosted and path-style S3 URLs, which is essential for correct URL parsing and presigned URL generation with custom endpoints.


52-53: LGTM!

The accessor is appropriately marked [[nodiscard]] and returns by value for the enum type.


62-63: LGTM!

Returning std::string_view for the schema accessor is efficient and consistent with other accessors in this class.


65-71: LGTM!

The new private members m_style and m_schema properly extend the class state to support both URL styles and schema-aware URL generation.

components/core/src/clp/aws/AwsAuthenticationSigner.cpp (2)

149-170: LGTM!

The canonical request construction correctly differentiates between VirtualHost and Path styles:

  • VirtualHost: URI is /{key}
  • Path: URI is /{bucket}/{key}

This aligns with AWS Signature Version 4 requirements for path-style URLs.


173-216: LGTM!

The URL parsing logic correctly handles both virtual-hosted and path-style S3 URL formats using distinct regex patterns. The default region assignment (us-east-1) when the region is empty maintains backward compatibility with URLs that don't specify a region. Based on learnings, this aligns with the guidance for S3 URLs without region specifications.

components/core/src/clp/aws/constants.hpp (1)

10-16: Delegate bucket name validation to the AWS S3 API rather than implementing partial validation through regex patterns.

While the cBucketRegex pattern [a-z0-9.-]+ is indeed permissive (allowing buckets starting/ending with periods or hyphens, no length constraints, no consecutive period checks, etc.), implementing stricter regex validation in the constants is not the appropriate solution. AWS S3 bucket naming rules are extensive and subject to change. Instead, let the S3 API handle validation on actual requests. If this regex is used for URL parsing/extraction, a permissive pattern is acceptable.

Likely an incorrect or invalid review comment.

components/clp-package-utils/clp_package_utils/scripts/native/compress.py (3)

175-188: LGTM!

The unpacking of the 5-tuple and propagation of endpoint_url to S3InputConfig is correct and consistent with the updated _parse_and_validate_s3_object_urls signature.


189-205: LGTM!

The S3_KEY_PREFIX_COMPRESSION path correctly extracts endpoint_url from parse_s3_url and passes it to S3InputConfig.


250-284: LGTM!

The validation logic correctly ensures all URLs share the same endpoint, region, and bucket, and properly detects duplicate keys using a set.

components/clp-py-utils/clp_py_utils/s3_utils.py (7)

39-43: LGTM!

The regex constants are well-defined for parsing S3 URLs in both host-style and path-style formats.


188-215: LGTM!

The _create_s3_client function correctly accepts endpoint_url and passes it to the boto3 client, enabling connectivity to S3-compatible services like MinIO and LocalStack.


225-261: LGTM!

The URL parsing logic correctly handles both host-style and path-style URLs, with optional region codes, and properly determines whether to return a custom endpoint_url or None for AWS default endpoints.


264-288: LGTM!

The URL generation logic correctly produces virtual-hosted style URLs for AWS and path-style URLs for custom endpoints, aligning with the PR objectives for supporting MinIO/LocalStack.


304-308: LGTM!

Correctly passes endpoint_url and region_code to _create_s3_client.


343-345: LGTM!

Correctly passes endpoint_url to _create_s3_client.


407-409: LGTM!

Correctly passes endpoint_url to _create_s3_client.

components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py (1)

143-149: LGTM!

The call to s3_delete_by_key_prefix correctly passes s3_config.endpoint_url as the first argument, aligning with the updated function signature in s3_utils.py.

Comment thread components/clp-py-utils/clp_py_utils/clp_config.py
Comment thread components/clp-py-utils/clp_py_utils/s3_utils.py
Comment thread components/core/src/clp/aws/AwsAuthenticationSigner.cpp
@hoophalab hoophalab requested a review from gibber9809 December 10, 2025 23:31

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)

353-368: Update s3_delete_by_key_prefix docstring to reflect endpoint_url and optional region.

The function signature now includes endpoint_url and allows region_code to be None, but the docstring still has no :param endpoint_url: entry and leaves :param region_code: undocumented. That’s misleading for callers trying to understand how to use AWS versus custom S3-compatible endpoints.

Consider updating the docstring along these lines:

 def s3_delete_by_key_prefix(
-    endpoint_url: str | None,
-    region_code: str | None,
+    endpoint_url: str | None,
+    region_code: str | None,
     bucket_name: str,
     key_prefix: str,
     s3_auth: AwsAuthentication,
 ) -> None:
     """
     Deletes all objects under the S3 path `bucket_name`/`key_prefix`.
 
-    :param region_code:
-    :param bucket_name:
-    :param key_prefix:
-    :param s3_auth:
+    :param endpoint_url: Optional custom S3 endpoint base URL (e.g., `http://minio:9000`).
+        Use `None` for AWS S3 so the default AWS endpoint is used.
+    :param region_code: Optional AWS region. Must be non-`None` for real AWS S3 endpoints;
+        may be `None` for S3-compatible custom endpoints.
+    :param bucket_name: Name of the bucket to delete from.
+    :param key_prefix: Object key prefix under which all objects will be deleted.
+    :param s3_auth: AWS authentication configuration used to create the S3 client.

This keeps the behaviour unchanged but makes the intended usage much clearer, especially in the context of the new endpoint support. Based on learnings, this also aligns with the updated S3Config.region_code semantics (required only for AWS S3 endpoints, optional for custom ones).

Also applies to: 375-377

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 784f8a0 and e1cdf38.

📒 Files selected for processing (1)
  • components/clp-py-utils/clp_py_utils/s3_utils.py (8 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:19.818Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
📚 Learning: 2025-12-10T23:32:19.818Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:19.818Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-04-25T20:46:20.140Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
🔇 Additional comments (2)
components/clp-py-utils/clp_py_utils/s3_utils.py (2)

39-44: Regex building blocks are cohesive and consistent.

The shared regex fragments for schema, endpoint, region, bucket, and key prefix are well-factored and make the later URL patterns easier to reason about; no issues from a correctness or readability standpoint for the supported URL shapes.


188-215: Endpoint-aware client creation is consistent with config semantics.

Passing endpoint_url through _create_s3_client and allowing region_code to be None aligns with the S3Config design (AWS endpoints require a region; custom S3-compatible endpoints may omit it). All five call sites correctly pass endpoint_url and region_code in the intended combinations: AWS calls supply non-None region_code and leave endpoint_url as None, while MinIO/LocalStack calls set endpoint_url and may omit region_code. This matches how boto3.Session(..., region_name=region_code) and client("s3", endpoint_url=endpoint_url, ...) are intended to be used. No remaining callers use the old signature.

Comment thread components/clp-py-utils/clp_py_utils/s3_utils.py

@gibber9809 gibber9809 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed only the c++ side of this PR. I like the design, just a few nits + I think we should add some tests for the S3Url URL parsing at least.

Comment thread components/core/src/clp/aws/AwsAuthenticationSigner.hpp Outdated
Comment thread components/core/src/clp/aws/AwsAuthenticationSigner.hpp Outdated
Comment thread components/core/src/clp/aws/AwsAuthenticationSigner.cpp
Comment thread components/clp-py-utils/clp_py_utils/s3_utils.py
@hoophalab

Copy link
Copy Markdown
Contributor Author

Rabbit seems to have a point here, though I haven't read the python side of this PR carefully enough to be sure.

I confirm rabbit is right. Let me make some changes.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
components/clp-py-utils/clp_py_utils/s3_utils.py (2)

218-224: Docstring still doesn’t fully describe the 4‑tuple semantics.

You’ve updated it to a 4‑tuple, but it doesn’t document that endpoint_url is None for *.amazonaws.com URLs and that region_code may be None for AWS global endpoints and for non‑AWS endpoints (per S3Config). It also says bucket instead of bucket_name, which is slightly inconsistent with the implementation.

Based on learnings, region codes are required only for AWS S3, so spelling out those cases in the docstring will help future callers reason about the return values.


225-261: Non‑AWS s3.<domain> path‑style URLs still drop the s3. prefix and won’t round‑trip.

For URLs like https://s3.wasabisys.com/bucket/key and https://s3.us-west-1.wasabisys.com/bucket/key, path_style_url_regex captures endpoint="wasabisys.com" and region_code (None or us-west-1), so parse_s3_url sets endpoint_url = "https://wasabisys.com". Subsequent client creation and generate_s3_url then talk to wasabisys.com instead of s3[.region].wasabisys.com, which will break typical S3‑compatible providers that expose their service on s3.* hosts.

Given that only AWS needs a distinct region_code (custom endpoints don’t), a safer option for non‑AWS path‑style hosts is to preserve the full host (including any s3. + region prefix) as endpoint_url and treat region_code as None. One concrete way to do this is:

+from urllib.parse import urlsplit
@@
-    match = host_style_url_regex.match(s3_url)
-
-    if match is None:
+    match = host_style_url_regex.match(s3_url)
+    is_path_style = False
+
+    if match is None:
         path_style_url_regex = re.compile(
@@
-        )
-        match = path_style_url_regex.match(s3_url)
+        )
+        match = path_style_url_regex.match(s3_url)
+        is_path_style = True
@@
-    scheme = match.group("scheme")
-    endpoint = match.group("endpoint")
-    region_code = match.group("region_code")
-    bucket_name = match.group("bucket_name")
-
-    endpoint_url = f"{scheme}://{endpoint}" if endpoint != AWS_ENDPOINT else None
-    key_prefix = match.group("key_prefix")
+    scheme = match.group("scheme")
+    endpoint = match.group("endpoint")
+    region_code = match.group("region_code")
+    bucket_name = match.group("bucket_name")
+    key_prefix = match.group("key_prefix")
+
+    full_host = urlsplit(s3_url).netloc
+    if full_host.endswith(AWS_ENDPOINT):
+        # AWS: rely on region_code for routing; endpoint_url stays None.
+        endpoint_url = None
+    elif is_path_style and full_host.startswith("s3."):
+        # Non‑AWS `s3[.region].<domain>` endpoints: keep the full host and
+        # use path‑style addressing; region_code isn’t needed for them.
+        endpoint_url = f"{scheme}://{full_host}"
+        region_code = None
+    else:
+        endpoint_url = f"{scheme}://{endpoint}"
@@
-    return endpoint_url, region_code, bucket_name, key_prefix
+    return endpoint_url, region_code, bucket_name, key_prefix

This preserves s3.* for non‑AWS path‑style URLs, keeps AWS behaviour unchanged, and allows generate_s3_url to round‑trip both AWS and S3‑compatible path‑style URLs correctly. Based on learnings, clearing region_code for non‑AWS endpoints is consistent with the S3Config contract (region only required for AWS S3).

Please validate this behaviour with at least one AWS S3 URL and one S3‑compatible endpoint that uses s3.<domain> (e.g., Wasabi or a LocalStack/MinIO deployment behind such a host) to confirm correct round‑tripping and connectivity.

Also applies to: 264-289

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e1cdf38 and d11b881.

📒 Files selected for processing (4)
  • components/clp-py-utils/clp_py_utils/s3_utils.py (8 hunks)
  • components/core/src/clp/aws/AwsAuthenticationSigner.cpp (4 hunks)
  • components/core/src/clp/aws/AwsAuthenticationSigner.hpp (3 hunks)
  • components/core/src/clp/aws/constants.hpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/aws/AwsAuthenticationSigner.cpp
  • components/core/src/clp/aws/constants.hpp
  • components/core/src/clp/aws/AwsAuthenticationSigner.hpp
🧠 Learnings (6)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
📚 Learning: 2025-04-25T20:46:20.140Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.

Applied to files:

  • components/core/src/clp/aws/AwsAuthenticationSigner.cpp
  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-12-10T23:32:24.363Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.

Applied to files:

  • components/core/src/clp/aws/AwsAuthenticationSigner.cpp
  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-01-17T23:25:38.165Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 673
File: components/log-viewer-webui/server/src/S3Manager.js:0-0
Timestamp: 2025-01-17T23:25:38.165Z
Learning: In S3Manager.js, URL validation is handled by the URL constructor which throws TypeError for invalid URLs, making additional scheme (s3://) validation redundant.

Applied to files:

  • components/core/src/clp/aws/AwsAuthenticationSigner.cpp
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.

Applied to files:

  • components/core/src/clp/aws/AwsAuthenticationSigner.cpp
📚 Learning: 2025-09-15T22:20:40.750Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
🧬 Code graph analysis (1)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • AwsAuthentication (547-577)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: package-image
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: build (ubuntu-24.04)
  • GitHub Check: build (macos-15)
🔇 Additional comments (9)
components/core/src/clp/aws/AwsAuthenticationSigner.cpp (3)

150-158: LGTM! Canonical URI encoding correctly handles both URL styles.

The switch statement properly formats the URI based on the S3Url style:

  • VirtualHost includes only the key
  • Path includes both bucket and key

This aligns with AWS Signature Version 4 canonical request requirements for each style.


173-216: URL parsing logic is sound.

The constructor correctly:

  • Attempts to match both virtual-hosted-style and path-style URL patterns
  • Sets the appropriate style based on which pattern matches
  • Extracts all URL components from the regex match
  • Defaults the region to cDefaultRegion when empty (consistent with support for custom S3-compatible endpoints)

The extraction logic assumes all named groups exist in the match object. This is safe because boost::regex returns empty strings for optional groups, and the regex structure guarantees the named groups are present.

Based on learnings, the optional region handling correctly supports custom S3-compatible endpoints that don't require region codes.


248-272: LGTM! Presigned URL generation correctly respects URL style and scheme.

The switch statement properly formats the presigned URL:

  • VirtualHost places the bucket in the hostname, so the path is "/{key}"
  • Path places the bucket in the path, so the path is "/{bucket}/{key}"
  • Both correctly use the parsed scheme, supporting http, https, and s3 protocols

This is consistent with the canonical request URI formatting and AWS Signature Version 4 requirements.

components/core/src/clp/aws/AwsAuthenticationSigner.hpp (1)

5-5: LGTM! Header additions are clean and well-structured.

The new members and accessors correctly support S3 URL style handling:

  • Style enum class is properly scoped and uses uint8_t for efficient storage
  • get_style() and get_scheme() accessors follow existing patterns with [[nodiscard]] and const
  • m_style has default brace initialisation per the style guide
  • All additions integrate cleanly with the existing interface

Also applies to: 43-46, 52-52, 62-62, 65-65, 71-71

components/clp-py-utils/clp_py_utils/s3_utils.py (5)

39-43: Regex constants for S3 URL parsing look consistent.

Centralising the scheme/endpoint/region/bucket/key regex pieces here lines up cleanly with the host‑ and path‑style patterns below and should help keep parsing and validation in sync.


188-215: Endpoint‑aware _create_s3_client wiring looks correct.

Passing endpoint_url through to aws_session.client("s3", ...) and using region_name=region_code on the session matches boto3 usage and should support both AWS and custom S3‑compatible endpoints, assuming callers only omit region_code for non‑AWS endpoints.

Please double‑check that all AWS S3 call sites still always provide a non‑None region_code so boto3 doesn’t complain about a missing region.


304-308: s3_get_object_metadata now correctly honours endpoint_url.

Using _create_s3_client(s3_input_config.endpoint_url, s3_input_config.region_code, ...) aligns metadata fetches with the new endpoint‑aware config and should work for both AWS and custom S3 backends.

It would be helpful to have (or keep) an integration test hitting a non‑AWS endpoint here to guard against regressions.


343-345: s3_put uses the endpoint‑aware client consistently.

The _create_s3_client call now passes s3_config.endpoint_url and s3_config.region_code, matching the updated helper signature and ensuring uploads respect custom endpoints.

Please confirm all other _create_s3_client callers were updated to the new (endpoint_url, region_code, aws_authentication, boto3_config) ordering.


407-409: s3_delete_objects now respects endpoint_url as well.

Creating the client with s3_config.endpoint_url and s3_config.region_code keeps bulk deletions consistent with the rest of the S3 helpers.

As with the other helpers, please ensure this path is exercised against a custom S3 endpoint in tests or manual validation.

Comment thread components/clp-py-utils/clp_py_utils/s3_utils.py
Comment thread components/core/src/clp/aws/constants.hpp

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

♻️ Duplicate comments (1)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)

352-360: Docstring parameter descriptions are incomplete.

The docstring lists :param endpoint_url: and :param region_code: without descriptions. Consider adding brief descriptions as suggested in the past review.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d11b881 and dde3f61.

📒 Files selected for processing (2)
  • components/clp-py-utils/clp_py_utils/s3_utils.py (7 hunks)
  • components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp
🧠 Learnings (6)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.

Applied to files:

  • components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp
📚 Learning: 2025-12-10T23:32:24.363Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-04-25T20:46:20.140Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-09-15T22:20:40.750Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-12-12T16:22:05.883Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/core/src/clp/aws/constants.hpp:10-16
Timestamp: 2025-12-12T16:22:05.883Z
Learning: AWS S3 region codes follow the pattern `[region]-[zone]-[digit]` where the digit is always a single decimal digit (e.g., us-east-1, eu-west-3, ap-northeast-2). Custom S3-compatible endpoints (MinIO, LocalStack) may use different region naming conventions.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
🪛 Cppcheck (2.18.0)
components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp

[information] 1-1: Include file

(missingIncludeSystem)


[information] 2-2: Include file

(missingIncludeSystem)


[information] 3-3: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingInclude)


[information] 6-6: Include file

(missingInclude)

🔇 Additional comments (4)
components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp (1)

12-34: Good coverage of VirtualHost vs Path style parsing (including custom endpoint + port).

components/clp-py-utils/clp_py_utils/s3_utils.py (3)

39-45: LGTM!

The regex constants are well-defined and reusable. The REGION_CODE_REGEXP correctly matches single-digit AWS region suffixes. Based on learnings, AWS region codes follow the [region]-[zone]-[digit] pattern where the digit is always a single decimal digit.


189-216: LGTM!

The updated _create_s3_client correctly propagates endpoint_url to the boto3 client and handles all authentication types consistently.


296-300: LGTM!

The updated calls correctly pass endpoint_url and region_code from the config objects to _create_s3_client.

Comment thread components/clp-py-utils/clp_py_utils/s3_utils.py
Comment thread components/clp-py-utils/clp_py_utils/s3_utils.py
Comment thread components/clp-py-utils/clp_py_utils/s3_utils.py Outdated
Comment thread components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp Outdated
Comment thread components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp
@hoophalab hoophalab requested a review from gibber9809 December 12, 2025 21:07
Comment thread components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)

189-216: Add explicit path-style addressing for custom S3 endpoints.

The _create_s3_client function doesn't enforce s3.addressing_style=path for custom endpoints. While boto3's default "auto" mode should eventually fall back to path-style addressing, explicitly setting path-style prevents unnecessary virtual-hosted-style attempts against custom S3-compatible services (MinIO, LocalStack, etc.), which don't support that mode.

Update the client creation to merge a path-style config when a custom endpoint is provided:

 def _create_s3_client(
     endpoint_url: str | None,
     region_code: str | None,
     s3_auth: AwsAuthentication,
     boto3_config: Config | None = None,
 ) -> boto3.client:
@@
-    s3_client = aws_session.client("s3", endpoint_url=endpoint_url, config=boto3_config)
+    effective_config = boto3_config
+    if endpoint_url is not None:
+        # Custom S3-compatible endpoints generally require path-style addressing.
+        path_style_cfg = Config(s3={"addressing_style": "path"})
+        effective_config = (
+            path_style_cfg.merge(boto3_config) if boto3_config is not None else path_style_cfg
+        )
+
+    s3_client = aws_session.client("s3", endpoint_url=endpoint_url, config=effective_config)
     return s3_client
♻️ Duplicate comments (2)
components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp (1)

9-33: Split into SECTIONs and CAPTURE(url) to make failures self-identifying. Right now, a single failure doesn’t immediately tell which input URL broke parsing.

components/clp-py-utils/clp_py_utils/s3_utils.py (1)

348-365: Docstring params for s3_delete_by_key_prefix are still blank/out of sync.

This was raised previously; :param endpoint_url: / :param region_code: should clarify AWS-vs-custom semantics (and that region_code may be None for non-AWS endpoints).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dde3f61 and 0c41f7f.

📒 Files selected for processing (3)
  • components/clp-py-utils/clp_py_utils/s3_utils.py (7 hunks)
  • components/core/CMakeLists.txt (1 hunks)
  • components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp
🧠 Learnings (14)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
📚 Learning: 2024-11-29T22:50:17.206Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 595
File: components/core/tests/test-clp_s-end_to_end.cpp:109-110
Timestamp: 2024-11-29T22:50:17.206Z
Learning: In `components/core/tests/test-clp_s-end_to_end.cpp`, the success of `constructor.store()` is verified through `REQUIRE` statements and subsequent comparisons.

Applied to files:

  • components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp
📚 Learning: 2024-11-01T03:26:26.386Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 570
File: components/core/tests/test-ir_encoding_methods.cpp:376-399
Timestamp: 2024-11-01T03:26:26.386Z
Learning: In the test code (`components/core/tests/test-ir_encoding_methods.cpp`), exception handling for `msgpack::unpack` can be omitted because the Catch2 testing framework captures exceptions if they occur.

Applied to files:

  • components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp
📚 Learning: 2024-10-13T09:27:43.408Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 557
File: components/core/tests/test-ir_encoding_methods.cpp:1216-1286
Timestamp: 2024-10-13T09:27:43.408Z
Learning: In the unit test case `ffi_ir_stream_serialize_schema_tree_node_id` in `test-ir_encoding_methods.cpp`, suppressing the `readability-function-cognitive-complexity` warning is acceptable due to the expansion of Catch2 macros in C++ tests, and such test cases may not have readability issues.

Applied to files:

  • components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.

Applied to files:

  • components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp
📚 Learning: 2025-12-10T23:32:24.363Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-04-25T20:46:20.140Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-12-12T21:14:28.267Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/s3_utils.py:226-230
Timestamp: 2025-12-12T21:14:28.267Z
Learning: In the CLP project (y-scope/clp), ensure that regex patterns are compiled inline within functions for Python files under components/clp-py-utils/clp_py_utils (not at module scope). Move any re.compile calls from the module top level into the functions that use them to maintain consistency with the codebase style. This applies to all .py files in that directory; use module-level imports only for imports, not for pre-compiling regex patterns.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-09-15T22:20:40.750Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-12-12T16:22:05.883Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/core/src/clp/aws/constants.hpp:10-16
Timestamp: 2025-12-12T16:22:05.883Z
Learning: AWS S3 region codes follow the pattern `[region]-[zone]-[digit]` where the digit is always a single decimal digit (e.g., us-east-1, eu-west-3, ap-northeast-2). Custom S3-compatible endpoints (MinIO, LocalStack) may use different region naming conventions.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-11-17T22:58:50.056Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-11-17T22:58:50.056Z
Learning: In the y-scope/clp repository, when enabling new linting tools (ruff, mypy) on Python components, the team uses an incremental approach: first enable the tools with errors allowed (exit code 0), apply only safe auto-fixable fixes, then address remaining issues in follow-up PRs. During the initial enablement PR, reviews should focus on correctness of auto-fixes rather than flagging new code quality issues.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-08-24T17:10:44.565Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: taskfiles/lint.yaml:780-781
Timestamp: 2025-08-24T17:10:44.565Z
Learning: The CLP repository is gradually migrating from Black to Ruff for Python formatting. Some modules haven't been set up for Ruff yet, so there's intentional mixed usage of Black and Ruff formatters during this transition period. This should not be flagged as an inconsistency in PR reviews.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-10-13T03:24:35.074Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1405
File: components/clp-package-utils/pyproject.toml:5-15
Timestamp: 2025-10-13T03:24:35.074Z
Learning: In the y-scope/clp repository, the Python 3.9 to 3.10 version requirement change was intentionally deferred to a separate PR (after PR #1405) to reduce review effort, as decided in an offline discussion between junhaoliao and kirkrodrigues.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
🪛 Cppcheck (2.18.0)
components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp

[information] 1-1: Include file

(missingIncludeSystem)


[information] 2-2: Include file

(missingIncludeSystem)


[information] 3-3: Include file

(missingIncludeSystem)


[information] 5-5: Include file

(missingInclude)


[information] 6-6: Include file

(missingInclude)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: package-image
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build (macos-15)
  • GitHub Check: build (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (7)
components/core/CMakeLists.txt (1)

440-446: Unit test source wiring looks correct. Adding src/clp/aws/test/test_AwsAuthenticationSigner.cpp to SOURCE_FILES_unitTest should ensure it’s built/linked only with unitTest under CLP_BUILD_TESTING.

components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp (1)

26-32: Confirm the intended behaviour for “custom endpoint + no region” (default vs empty). The test currently requires cDefaultRegion for http://localhost:4566/...; please verify this matches the PR’s cross-language contract (especially if Python now treats region as optional for non-AWS endpoints).

components/clp-py-utils/clp_py_utils/s3_utils.py (5)

219-254: S3 URL parsing changes look consistent with endpoint-aware 4-tuple return.

Returning (endpoint_url, region_code, bucket_name, key_prefix) and treating AWS as endpoint_url=None aligns with the new config model (optional region for custom endpoints). Based on learnings, the inline regex compilation is also consistent with repo style.


256-284: generate_s3_url behaviour is reasonable, but requires callers to include s3. in endpoint_url if they want s3.<region>... hosts.

For non-AWS endpoints, the presence/absence of s3. in endpoint_url changes the generated host for region_code != None (e.g., https://s3.<region>.<endpoint> vs https://<region>.<endpoint>). Please confirm that’s intended for all supported providers (Wasabi/Spaces/etc.) and document it where configs are defined.


299-313: Call-site propagation of endpoint_url / region_code into _create_s3_client is correct.


338-340: Call-site propagation of endpoint_url / region_code into _create_s3_client is correct.


403-405: Call-site propagation of endpoint_url / region_code into _create_s3_client is correct.

Comment on lines +39 to +45
SCHEME_REGEXP = r"(?P<scheme>(http|https))"
S3_PREFIX_REGEXP = r"(?P<s3>s3)"
ENDPOINT_REGEXP = r"(?P<endpoint>[a-z0-9.-]+(\:[0-9]+)?)"
REGION_CODE_REGEXP = r"(?P<region_code>[a-z]+-[a-z]+-[0-9])"
BUCKET_NAME_REGEXP = r"(?P<bucket_name>[a-z0-9.-]+)"
KEY_PREFIX_REGEXP = r"(?P<key_prefix>[^?]+)"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

REGION_CODE_REGEXP is too strict; likely rejects valid AWS regions (e.g., us-gov-west-1).

REGION_CODE_REGEXP = r"(?P<region_code>[a-z]+-[a-z]+-[0-9])" won’t match regions with more than 2 hyphen-separated alpha segments, which can break parsing and URL generation for some AWS partitions/providers.

-REGION_CODE_REGEXP = r"(?P<region_code>[a-z]+-[a-z]+-[0-9])"
+# Allow additional hyphen-separated segments (e.g., us-gov-west-1).
+REGION_CODE_REGEXP = r"(?P<region_code>[a-z0-9]+(-[a-z0-9]+)*-[0-9])"
🤖 Prompt for AI Agents
components/clp-py-utils/clp_py_utils/s3_utils.py around lines 39 to 45: the
REGION_CODE_REGEXP is too strict and only matches two alpha segments plus a
single digit (rejects regions like us-gov-west-1); update the REGION_CODE_REGEXP
to accept one or more hyphen-separated alpha segments followed by a numeric
suffix (one or more digits) so multi-segment region codes are allowed, replace
the constant accordingly and run/adjust any tests that validate region parsing.

@gibber9809 gibber9809 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nits, but otherwise looks good to me.

Comment thread components/clp-py-utils/clp_py_utils/s3_utils.py Outdated
Comment on lines +219 to 224
def parse_s3_url(s3_url: str) -> tuple[str | None, str | None, str, str]:
"""
Parses the region_code, bucket, and key_prefix from the given S3 URL.
Parses the endpoint_url, region_code, bucket, and key_prefix from the given S3 URL.
:param s3_url: A host-style URL or path-style URL.
:return: A tuple of (region_code, bucket, key_prefix).
:return: A tuple of (endpoint_url, region_code, bucket, key_prefix).
:raise: ValueError if `s3_url` is not a valid host-style URL or path-style URL.

@gibber9809 gibber9809 Dec 12, 2025

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like it's a bit weird that we parse the endpoint url for aws as None. I can see that we're consistently using a None endpoint to indicate aws everywhere, but maybe it would be simpler to have a small EndpointUrl class that stores the endpoint string + a flag that indicates whether the endpoint is S3 or not?

It's also potentially fine to stick with the str | None approach if you prefer, I just think we need to update these docstrings to explain that None means AWS in that case .

@hoophalab hoophalab Dec 15, 2025

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the documentation. The behavior feels analogous to when we don't specify an AWS_ENDPOINT_URL env in the AWS CLI, or when we pass endpoint_url=None to a boto3.client(...) call.

Honestly, I also feel returning a None for endpoint_url is very weird. Maybe the logic should be applied in _create_s3_client? Open to discussion.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's probably fine to stick with None for now since it's at least consistent everywhere. If we find it ends up being too confusing we should definitely come back to this after the release though.

Comment thread components/core/src/clp/aws/constants.hpp Outdated
Comment thread components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)

348-372: LGTM - Delete operation correctly uses endpoint configuration.

The function signature and client creation are properly updated to support custom S3 endpoints.

Note: The docstring enhancement for endpoint_url and region_code parameters was flagged in a previous review.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 366897c and fbfd3a9.

📒 Files selected for processing (2)
  • components/clp-py-utils/clp_py_utils/s3_utils.py (7 hunks)
  • components/core/src/clp/aws/constants.hpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp/aws/constants.hpp
🧠 Learnings (12)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
📚 Learning: 2025-04-25T20:46:20.140Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.

Applied to files:

  • components/core/src/clp/aws/constants.hpp
  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-12-12T21:14:36.067Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/s3_utils.py:226-230
Timestamp: 2025-12-12T21:14:36.067Z
Learning: In the CLP project (y-scope/clp), regex patterns in components/clp-py-utils/clp_py_utils/s3_utils.py should be compiled inline within functions rather than at module level, to maintain consistency with the existing codebase style.

Applied to files:

  • components/core/src/clp/aws/constants.hpp
📚 Learning: 2025-12-10T23:32:24.363Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.

Applied to files:

  • components/core/src/clp/aws/constants.hpp
  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-12-12T16:22:05.883Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/core/src/clp/aws/constants.hpp:10-16
Timestamp: 2025-12-12T16:22:05.883Z
Learning: AWS S3 region codes are expected to follow the pattern <region>-<zone>-<digit> with a single decimal digit (e.g., us-east-1, eu-west-3, ap-northeast-2). In code handling region identifiers, validate against this pattern and treat region strings not matching it as invalid, while allowing the common exception that custom S3-compatible endpoints (e.g., MinIO, LocalStack) may use different region naming conventions. Apply this guideline to all relevant C++ header files under components/core/src/clp/aws (and similar AWS-related headers) to ensure consistent region string validation across the codebase.

Applied to files:

  • components/core/src/clp/aws/constants.hpp
📚 Learning: 2025-12-12T21:14:28.267Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/s3_utils.py:226-230
Timestamp: 2025-12-12T21:14:28.267Z
Learning: In the CLP project (y-scope/clp), ensure that regex patterns are compiled inline within functions for Python files under components/clp-py-utils/clp_py_utils (not at module scope). Move any re.compile calls from the module top level into the functions that use them to maintain consistency with the codebase style. This applies to all .py files in that directory; use module-level imports only for imports, not for pre-compiling regex patterns.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-09-15T22:20:40.750Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-12-12T16:22:15.181Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/core/src/clp/aws/constants.hpp:10-16
Timestamp: 2025-12-12T16:22:15.181Z
Learning: AWS S3 region codes follow the pattern `[region]-[zone]-[digit]` where the digit is always a single decimal digit (e.g., us-east-1, eu-west-3, ap-northeast-2). Custom S3-compatible endpoints (MinIO, LocalStack) may use different region naming conventions.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-11-17T22:58:50.056Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-11-17T22:58:50.056Z
Learning: In the y-scope/clp repository, when enabling new linting tools (ruff, mypy) on Python components, the team uses an incremental approach: first enable the tools with errors allowed (exit code 0), apply only safe auto-fixable fixes, then address remaining issues in follow-up PRs. During the initial enablement PR, reviews should focus on correctness of auto-fixes rather than flagging new code quality issues.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-08-24T17:10:44.565Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: taskfiles/lint.yaml:780-781
Timestamp: 2025-08-24T17:10:44.565Z
Learning: The CLP repository is gradually migrating from Black to Ruff for Python formatting. Some modules haven't been set up for Ruff yet, so there's intentional mixed usage of Black and Ruff formatters during this transition period. This should not be flagged as an inconsistency in PR reviews.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-10-13T03:24:35.074Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1405
File: components/clp-package-utils/pyproject.toml:5-15
Timestamp: 2025-10-13T03:24:35.074Z
Learning: In the y-scope/clp repository, the Python 3.9 to 3.10 version requirement change was intentionally deferred to a separate PR (after PR #1405) to reduce review effort, as decided in an offline discussion between junhaoliao and kirkrodrigues.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
🧬 Code graph analysis (1)
components/clp-py-utils/clp_py_utils/s3_utils.py (2)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • AwsAuthentication (547-577)
components/core/src/clp/aws/AwsAuthenticationSigner.hpp (1)
  • s3_url (103-104)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: package-image
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build (ubuntu-24.04)
  • GitHub Check: build (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (7)
components/clp-py-utils/clp_py_utils/s3_utils.py (7)

39-45: LGTM - Regex constants support flexible endpoint handling.

The relaxed REGION_CODE_REGEXP pattern supports custom S3-compatible endpoints (MinIO, LocalStack) that may use non-standard region naming conventions. The pattern works within the larger URL regex context where structured matching limits false positives.


189-216: LGTM - Client creation correctly propagates endpoint configuration.

The endpoint_url and region_code parameters are properly threaded through session creation and client instantiation for all authentication types.


219-253: LGTM - URL parsing correctly extracts endpoint and preserves s3. prefix.

The implementation properly handles both host-style and path-style URLs. The s3. prefix is correctly preserved in endpoint_url for non-AWS endpoints (line 249-250), addressing the previous round-trip concerns.


256-283: LGTM - URL generation handles AWS and custom endpoints correctly.

The function properly generates:

  • Virtual-hosted style URLs for AWS (bucket.s3.region.amazonaws.com)
  • Path-style URLs for custom endpoints (endpoint/bucket/key)

The s3. prefix and region code are correctly interpolated for non-AWS endpoints with region support.


299-303: LGTM - Endpoint configuration correctly propagated to client creation.


338-340: LGTM - Endpoint configuration correctly propagated for upload operations.


403-405: LGTM - Batch delete operations correctly use endpoint configuration.

Comment on lines +11 to +15
constexpr std::string_view cSchemeRegex{R"((?<scheme>(http|https)))"};
constexpr std::string_view cEndpointRegex{R"((?<endpoint>[a-z0-9.-]+(\:[0-9]+)?))"};
constexpr std::string_view cRegionRegex{R"((?<region>[a-z0-9\-]+))"};
constexpr std::string_view cBucketRegex{R"((?<bucket>[a-z0-9.-]+))"};
constexpr std::string_view cKeyRegex{R"((?<key>[^?]+))"};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Regex constants look good; confirm AWS‑specific region validation happens elsewhere

The new regex constants are internally consistent and match the intended S3 URL components (scheme, endpoint + optional port, region, bucket, key). The more permissive cRegionRegex ([a-z0-9\-]+) makes sense for custom S3‑compatible endpoints, but it no longer enforces the AWS <region>-<zone>-<digit> shape by itself. As long as AWS‑only flows (e.g., anything that truly depends on an AWS S3 region code, like Presto integration) still validate region strings against the stricter AWS pattern in their own logic, this shared regex set is a reasonable compromise for mixed AWS/custom endpoint support. Based on learnings, this keeps custom endpoints working without silently relaxing AWS‑specific correctness guarantees.

🤖 Prompt for AI Agents
In components/core/src/clp/aws/constants.hpp around lines 11 to 15, the shared
regexes deliberately relax region validation to support custom S3 endpoints;
ensure any AWS-specific consumers (e.g., Presto integration or other AWS-only
flows) perform stricter region validation elsewhere by adding a validation step
using the AWS region pattern (e.g., full region-zone-digit form) before relying
on cRegionRegex, returning/throwing a clear validation error when the region
does not match the stricter AWS pattern; keep the permissive shared constants
as-is for mixed endpoint support.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)

349-365: s3_delete_by_key_prefix docstring still doesn’t describe endpoint_url / region_code.

The signature now takes both endpoint_url and region_code, but the docstring only lists :param endpoint_url: and :param region_code: without explaining their roles, which is confusing given the None‑means‑AWS convention used elsewhere.

Consider tightening the docstring to mirror parse_s3_url / S3Config semantics, e.g.:

 def s3_delete_by_key_prefix(
     endpoint_url: str | None,
     region_code: str | None,
@@
-    :param endpoint_url:
-    :param region_code:
+    :param endpoint_url: Custom S3 endpoint URL (e.g., https://minio.example.com:9000),
+        or None to use the default AWS S3 endpoint.
+    :param region_code: AWS region code for AWS endpoints; None for S3-compatible
+        endpoints that don’t require a region.
     :param bucket_name:
     :param key_prefix:
     :param s3_auth:

This keeps the docstring consistent with how S3Config and _create_s3_client interpret these fields. Based on learnings, that clarity is important now that region_code is optional for non‑AWS endpoints.

Also applies to: 372-373

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fbfd3a9 and 77eaa75.

📒 Files selected for processing (1)
  • components/clp-py-utils/clp_py_utils/s3_utils.py (7 hunks)
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/core/src/clp/aws/constants.hpp:10-16
Timestamp: 2025-12-12T16:22:15.181Z
Learning: AWS S3 region codes follow the pattern `[region]-[zone]-[digit]` where the digit is always a single decimal digit (e.g., us-east-1, eu-west-3, ap-northeast-2). Custom S3-compatible endpoints (MinIO, LocalStack) may use different region naming conventions.
📚 Learning: 2025-12-10T23:32:24.363Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-12-12T21:14:28.267Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/s3_utils.py:226-230
Timestamp: 2025-12-12T21:14:28.267Z
Learning: In the CLP project (y-scope/clp), ensure that regex patterns are compiled inline within functions for Python files under components/clp-py-utils/clp_py_utils (not at module scope). Move any re.compile calls from the module top level into the functions that use them to maintain consistency with the codebase style. This applies to all .py files in that directory; use module-level imports only for imports, not for pre-compiling regex patterns.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-04-25T20:46:20.140Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-09-15T22:20:40.750Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-12-12T16:22:15.181Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/core/src/clp/aws/constants.hpp:10-16
Timestamp: 2025-12-12T16:22:15.181Z
Learning: AWS S3 region codes follow the pattern `[region]-[zone]-[digit]` where the digit is always a single decimal digit (e.g., us-east-1, eu-west-3, ap-northeast-2). Custom S3-compatible endpoints (MinIO, LocalStack) may use different region naming conventions.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-11-17T22:58:50.056Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-11-17T22:58:50.056Z
Learning: In the y-scope/clp repository, when enabling new linting tools (ruff, mypy) on Python components, the team uses an incremental approach: first enable the tools with errors allowed (exit code 0), apply only safe auto-fixable fixes, then address remaining issues in follow-up PRs. During the initial enablement PR, reviews should focus on correctness of auto-fixes rather than flagging new code quality issues.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-08-24T17:10:44.565Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: taskfiles/lint.yaml:780-781
Timestamp: 2025-08-24T17:10:44.565Z
Learning: The CLP repository is gradually migrating from Black to Ruff for Python formatting. Some modules haven't been set up for Ruff yet, so there's intentional mixed usage of Black and Ruff formatters during this transition period. This should not be flagged as an inconsistency in PR reviews.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-10-13T03:24:35.074Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1405
File: components/clp-package-utils/pyproject.toml:5-15
Timestamp: 2025-10-13T03:24:35.074Z
Learning: In the y-scope/clp repository, the Python 3.9 to 3.10 version requirement change was intentionally deferred to a separate PR (after PR #1405) to reduce review effort, as decided in an offline discussion between junhaoliao and kirkrodrigues.

Applied to files:

  • components/clp-py-utils/clp_py_utils/s3_utils.py
🧬 Code graph analysis (1)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • AwsAuthentication (547-577)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: package-image
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: build (macos-15)
  • GitHub Check: build (ubuntu-24.04)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (3)
components/clp-py-utils/clp_py_utils/s3_utils.py (3)

39-45: Regex fragments for S3 URL parsing look correct and flexible.

These constants compose cleanly into the host‑ and path‑style patterns below, and the relaxed REGION_CODE_REGEXP = r"(?P<region_code>[a-z0-9\-]+)" now admits extended AWS‑style region codes (e.g., us-gov-west-1) while still stopping at the literal . before the endpoint, so it won’t swallow any of the endpoint host. Based on learnings, this matches the intent to support both AWS and custom S3-compatible region naming.


189-216: Endpoint-aware S3 client creation is wired correctly.

Passing endpoint_url through to aws_session.client("s3", endpoint_url=endpoint_url, config=boto3_config) while keeping region_name=region_code on the Session preserves existing AWS behaviour and enables custom S3-compatible endpoints (MinIO/LocalStack, etc.) without changing auth branching. The updated signature is consistently used by all local callers.


287-304: All S3 operations now consistently propagate endpoint_url into client creation.

s3_get_object_metadata, s3_put, and s3_delete_objects all call _create_s3_client with (s3_config.endpoint_url, s3_config.region_code, s3_config.aws_authentication, boto3_config) in the new argument order, so metadata reads, uploads, and batched deletes should behave the same across AWS and custom endpoints. This is aligned with the updated S3Config shape from the config module. Based on learnings, this is the desired propagation of optional region_code and endpoint_url.

Also applies to: 338-341, 403-406

Comment on lines +219 to 255
def parse_s3_url(s3_url: str) -> tuple[str | None, str | None, str, str]:
"""
Parses the region_code, bucket, and key_prefix from the given S3 URL.
Parses the endpoint_url, region_code, bucket, and key_prefix from the given S3 URL.
:param s3_url: A host-style URL or path-style URL.
:return: A tuple of (region_code, bucket, key_prefix).
:return: A tuple of (endpoint_url, region_code, bucket, key_prefix). A value of None is returned
if the endpoint_url originates from AWS.
:raise: ValueError if `s3_url` is not a valid host-style URL or path-style URL.
"""
host_style_url_regex = re.compile(
r"https://(?P<bucket_name>[a-z0-9.-]+)\.s3(\.(?P<region_code>[a-z]+-[a-z]+-[0-9]))"
r"\.(?P<endpoint>[a-z0-9.-]+)/(?P<key_prefix>[^?]+).*"
rf"{SCHEME_REGEXP}://{BUCKET_NAME_REGEXP}\."
rf"{S3_PREFIX_REGEXP}\.({REGION_CODE_REGEXP}\.)?"
rf"{ENDPOINT_REGEXP}/{KEY_PREFIX_REGEXP}.*"
)
match = host_style_url_regex.match(s3_url)

if match is None:
path_style_url_regex = re.compile(
r"https://s3(\.(?P<region_code>[a-z]+-[a-z]+-[0-9]))\.(?P<endpoint>[a-z0-9.-]+)/"
r"(?P<bucket_name>[a-z0-9.-]+)/(?P<key_prefix>[^?]+).*"
rf"{SCHEME_REGEXP}://({S3_PREFIX_REGEXP}\."
rf"({REGION_CODE_REGEXP}\.)?)?{ENDPOINT_REGEXP}"
rf"/{BUCKET_NAME_REGEXP}/{KEY_PREFIX_REGEXP}.*"
)
match = path_style_url_regex.match(s3_url)

if match is None:
raise ValueError(f"Unsupported URL format: {s3_url}")

scheme = match.group("scheme")
endpoint = match.group("endpoint")
region_code = match.group("region_code")
bucket_name = match.group("bucket_name")
endpoint = match.group("endpoint")
key_prefix = match.group("key_prefix")

if AWS_ENDPOINT != endpoint:
raise ValueError(f"Unsupported endpoint: {endpoint}")
s3_prefix = "s3." if match.group("s3") is not None else ""
endpoint_url = f"{scheme}://{s3_prefix}{endpoint}" if endpoint != AWS_ENDPOINT else None
key_prefix = match.group("key_prefix")

return region_code, bucket_name, key_prefix
return endpoint_url, region_code, bucket_name, key_prefix

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

parse_s3_url / generate_s3_url now handle AWS and custom endpoints coherently.

The updated pair correctly covers:

  • AWS virtual-hosted URLs bucket.s3[.region].amazonaws.comendpoint_url=None, region_code as parsed, and regenerates back to virtual-hosted AWS URLs.
  • AWS path-style URLs s3[.region].amazonaws.com/bucket/key → parsed as AWS (endpoint_url=None) and normalised to virtual-hosted form on generation.
  • Non‑AWS endpoints, including s3.<domain> and s3.<region>.<domain> (e.g., Wasabi) and plain hosts like minio.example.com:9000, preserving any s3. prefix in endpoint_url and avoiding the previous prefix-loss / double‑s3 issues on round‑trip.
  • Endpoint parsing for generate_s3_url is correctly constrained to scheme://[s3.]endpoint[:port]/?, so misconfigured endpoint URLs with a path are rejected early.

Only a minor nit: the docstring sentence

“A value of None is returned if the endpoint_url originates from AWS.”

could be a bit clearer if rephrased along the lines of:

endpoint_url is None when the S3 endpoint host is amazonaws.com; otherwise it is the scheme://host[:port] of the S3-compatible endpoint.”

to make the None convention explicit and tie it to AWS_ENDPOINT. Based on learnings, this matches how endpoint_url and region_code are interpreted elsewhere in the config model.

Also applies to: 257-285

🤖 Prompt for AI Agents
In components/clp-py-utils/clp_py_utils/s3_utils.py around lines 219-255 (and
similarly lines 257-285), the docstring line “A value of None is returned if the
endpoint_url originates from AWS.” is ambiguous; replace it with a clearer
sentence such as: "`endpoint_url` is None when the S3 endpoint host is
amazonaws.com; otherwise `endpoint_url` is the scheme://host[:port] of the
S3-compatible endpoint." Update both occurrences so the None convention
explicitly ties to AWS_ENDPOINT and matches how endpoint_url and region_code are
used elsewhere.

@gibber9809 gibber9809 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. PR title is fine as well.

I also double-checked that regular S3 urls still work as expected with clp-s on the last commit, so things look good.

@junhaoliao junhaoliao left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code lgtm. posted a question

Comment on lines +40 to +43
S3_PREFIX_REGEXP = r"(?P<s3>s3)"
ENDPOINT_REGEXP = r"(?P<endpoint>[a-z0-9.-]+(\:[0-9]+)?)"
REGION_CODE_REGEXP = r"(?P<region_code>[a-z0-9\-]+)"
BUCKET_NAME_REGEXP = r"(?P<bucket_name>[a-z0-9.-]+)"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems those patterns prohibit UPPER CASE letters, which are techically allowed as per RFC952? ditto for the ones in C++ code

https://datatracker.ietf.org/doc/html/rfc952
ASSUMPTIONS -> 1.

A "name" (Net, Host, Gateway, or Domain name) is a text string up
to 24 characters drawn from the alphabet (A-Z), digits (0-9), minus
sign (-), and period (.). Note that periods are only allowed when
they serve to delimit components of "domain style names". (See
RFC-921, "Domain Name System Implementation Schedule", for
background). No blank or space characters are permitted as part of a
name. No distinction is made between upper and lower case. The first
character must be an alpha character. The last character must not be
a minus sign or period. A host which serves as a GATEWAY should have
"-GATEWAY" or "-GW" as part of its name. Hosts which do not serve as
Internet gateways should not use "-GATEWAY" and "-GW" as part of
their names. A host which is a TAC should have "-TAC" as the last
part of its host name, if it is a DoD host. Single character names
or nicknames are not allowed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coderabbitai Can you create an issue on this? 1. We need to verify if hostname is case insensitive and path is case sensitive. 2. endpoint_url = f"{scheme}://{s3_prefix}{endpoint}" if endpoint != AWS_ENDPOINT else None the comparison needs to be case insensitive.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Create GitHub issue for hostname case sensitivity and endpoint comparison

gh issue create \
  --repo y-scope/clp \
  --title "S3 URL parsing: verify hostname case sensitivity and fix endpoint comparison" \
  --assignee hoophalab \
  --body "## Context
Related to PR #1767 and [this comment](https://github.com/y-scope/clp/pull/1767#discussion_r2621220969).

@junhaoliao raised a concern that the regex patterns in the S3 URL parsing code prohibit uppercase letters, which are technically allowed per [RFC952](https://datatracker.ietf.org/doc/html/rfc952).

## Tasks

1. **Verify hostname and path case sensitivity behavior**
   - Confirm if hostname is case insensitive
   - Confirm if path is case sensitive
   - Update regex patterns accordingly if uppercase letters should be allowed in hostnames

2. **Fix endpoint comparison to be case insensitive**
   - The comparison \`endpoint != AWS_ENDPOINT\` in \`parse_s3_url\` (line 318 in \`components/clp-py-utils/clp_py_utils/s3_utils.py\`) needs to be case insensitive
   - Similar fixes may be needed in the C++ code

## Affected Files
- \`components/clp-py-utils/clp_py_utils/s3_utils.py\`
- \`components/core/src/clp/aws/constants.hpp\`
- \`components/core/src/clp/aws/AwsAuthenticationSigner.cpp\`

---
Requested by: @hoophalab
PR: #1767"

Length of output: 42


@hoophalab, I've created issue #1792 to track these items:

Issue: S3 URL parsing: verify hostname case sensitivity and fix endpoint comparison

The issue covers:

  1. Verifying hostname case insensitivity vs path case sensitivity behavior, and updating regex patterns if needed to allow uppercase letters per RFC952
  2. Making the endpoint != AWS_ENDPOINT comparison case insensitive in both Python and C++ code

The issue has been assigned to you and includes backlinks to this PR and comment thread.


🧬 Code Graph Analysis Results

[components/clp-py-utils/clp_py_utils/s3_utils.py: parse_s3_url function]

Summary:

  • Purpose: Parses an S3 URL to extract endpoint_url, region_code, bucket, and key_prefix.
  • Parameters: s3_url (str) - a host-style or path-style URL.
  • Returns: (endpoint_url: str | None, region_code: str | None, bucket: str, key_prefix: str)
  • Behavior:
    • Attempts host-style URL pattern matching first, then falls back to path-style pattern.
    • Determines endpoint_url as scheme://{s3_prefix}{endpoint} unless endpoint equals AWS_ENDPOINT, in which case endpoint_url is None.
    • s3_prefix is "s3." when the host contains the s3 prefix, otherwise "".
  • Important detail: The comparison of the endpoint to AWS_ENDPOINT is currently case-sensitive.
  • Change needed (per user comment):
    • Make the AWS endpoint comparison case-insensitive, e.g. endpoint.lower() != AWS_ENDPOINT.lower().

Relevant excerpt (as-is from the file):
def parse_s3_url(s3_url: str) -> tuple[str | None, str | None, str, str]:
"""
Parses the endpoint_url, region_code, bucket, and key_prefix from the given S3 URL.
:param s3_url: A host-style URL or path-style URL.
:return: A tuple of (endpoint_url, region_code, bucket, key_prefix). A value of None is returned
if the endpoint_url originates from AWS.
:raise: ValueError if s3_url is not a valid host-style URL or path-style URL.
"""
host_style_url_regex = re.compile(
rf"{SCHEME_REGEXP}://{BUCKET_NAME_REGEXP}."
rf"{S3_PREFIX_REGEXP}.({REGION_CODE_REGEXP}.)?"
rf"{ENDPOINT_REGEXP}/{KEY_PREFIX_REGEXP}.*"
)
match = host_style_url_regex.match(s3_url)

if match is None:
    path_style_url_regex = re.compile(
        rf"{SCHEME_REGEXP}://({S3_PREFIX_REGEXP}\."
        rf"({REGION_CODE_REGEXP}\.)?)?{ENDPOINT_REGEXP}"
        rf"/{BUCKET_NAME_REGEXP}/{KEY_PREFIX_REGEXP}.*"
    )
    match = path_style_url_regex.match(s3_url)

if match is None:
    raise ValueError(f"Unsupported URL format: {s3_url}")

scheme = match.group("scheme")
endpoint = match.group("endpoint")
region_code = match.group("region_code")
bucket_name = match.group("bucket_name")

s3_prefix = "s3." if match.group("s3") is not None else ""
endpoint_url = f"{scheme}://{s3_prefix}{endpoint}" if endpoint != AWS_ENDPOINT else None
key_prefix = match.group("key_prefix")

return endpoint_url, region_code, bucket_name, key_prefix

Note to reviewer: Update needed to compare endpoint and AWS_ENDPOINT in a case-insensitive manner, e.g. endpoint.lower() != AWS_ENDPOINT.lower(). This aligns with hostname case-insensitivity expectations.


🧠 Learnings used
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.

Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.

@hoophalab hoophalab merged commit bbc3b71 into y-scope:main Dec 16, 2025
24 checks passed
junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request Dec 16, 2025
davidlion pushed a commit to davidlion/clp that referenced this pull request Jan 17, 2026
junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026
junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants