feat(clp-package): Add support for ingesting from custom S3 endpoints (resolves #1758).#1767
Conversation
WalkthroughAdds endpoint-aware S3 support across Python and C++: S3 URL parsing now yields endpoint_url and common key prefix; endpoint_url is propagated into S3 configs, client creation, URL generation, and delete APIs; C++ signer detects virtual-hosted vs path-style and preserves scheme/endpoint in canonicalization and presign formatting. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant CLI as compress.py / tasks
participant PyUtils as clp_py_utils.s3_utils
participant S3 as S3 endpoint
participant CppSigner as AwsAuthenticationSigner (C++)
CLI->>PyUtils: _parse_and_validate_s3_object_urls(urls)
PyUtils-->>CLI: (endpoint_url, region_code, bucket, key_prefix, keys)
CLI->>PyUtils: call S3 ops with S3InputConfig(endpoint_url,...)
PyUtils->>S3: create client(endpoint_url, region) -> request metadata/delete/put
S3-->>PyUtils: responses (metadata/ack)
CLI->>CppSigner: request presigned URL (bucket,key,endpoint,scheme)
CppSigner->>CppSigner: detect Style (VirtualHost/Path), compute canonical request
CppSigner-->>CLI: presigned URL (formatted per style and scheme)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related issues
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: ASSERTIVE Plan: Pro 📒 Files selected for processing (4)
🧰 Additional context used🧠 Learnings (12)📓 Common learnings📚 Learning: 2025-12-10T23:32:24.363ZApplied to files:
📚 Learning: 2025-04-25T20:46:20.140ZApplied to files:
📚 Learning: 2025-04-07T09:30:20.684ZApplied to files:
📚 Learning: 2025-11-17T22:58:50.056ZApplied to files:
📚 Learning: 2025-12-12T16:22:15.181ZApplied to files:
📚 Learning: 2025-12-12T21:14:28.267ZApplied to files:
📚 Learning: 2025-09-28T15:00:22.170ZApplied to files:
📚 Learning: 2025-10-07T07:54:32.427ZApplied to files:
📚 Learning: 2025-08-13T14:48:49.020ZApplied to files:
📚 Learning: 2024-11-15T16:21:52.122ZApplied to files:
📚 Learning: 2025-01-16T16:58:43.190ZApplied to files:
🧬 Code graph analysis (2)components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (2)
🪛 Ruff (0.14.8)components/clp-package-utils/clp_package_utils/scripts/native/compress.py258-261: Avoid specifying long messages outside the exception class (TRY003) 264-267: Avoid specifying long messages outside the exception class (TRY003) ⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
🔇 Additional comments (7)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (1)
228-245: Return type annotation and docstring are inconsistent with actual return values.
- The return type on line 230 is
tuple[str, str, str, str, list[str]], butendpoint_urlandregion_codecan beNone(as perparse_s3_url's return typetuple[str | None, str | None, str, str]).- The docstring lists only 4 items in the return tuple but there are now 5, and
endpoint_urlis missing from the list.Apply this diff to fix:
def _parse_and_validate_s3_object_urls( urls: list[str], -) -> tuple[str, str, str, str, list[str]]: +) -> tuple[str | None, str | None, str, str, list[str]]: """ Parses and validates S3 object URLs. The validation will ensure: + - All URLs have the same endpoint. - All URLs have the same region and bucket. - No duplicate keys among the URLs. - The URLs share a non-empty common prefix. :param urls: :return: A tuple containing: - - The region code. + - The endpoint URL (None for AWS default). + - The region code (may be None). - The bucket. - The common key prefix. - The list of keys. :raises ValueError: If the validation fails. """components/clp-py-utils/clp_py_utils/s3_utils.py (1)
353-369: Docstring missingendpoint_urlparameter documentation.The function signature now includes
endpoint_url: str | Noneas the first parameter, but the docstring doesn't document it.Apply this diff:
def s3_delete_by_key_prefix( endpoint_url: str | None, region_code: str | None, bucket_name: str, key_prefix: str, s3_auth: AwsAuthentication, ) -> None: """ Deletes all objects under the S3 path `bucket_name`/`key_prefix`. + :param endpoint_url: The S3-compatible endpoint URL, or None for AWS default. :param region_code: :param bucket_name: :param key_prefix: :param s3_auth: :raises: ValueError if any parameter is invalid. :raises: Propagates `boto3.client.delete_objects`'s exceptions. """
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (10)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py(5 hunks)components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py(1 hunks)components/clp-py-utils/clp_py_utils/clp_config.py(1 hunks)components/clp-py-utils/clp_py_utils/s3_utils.py(8 hunks)components/core/src/clp/aws/AwsAuthenticationSigner.cpp(4 hunks)components/core/src/clp/aws/AwsAuthenticationSigner.hpp(3 hunks)components/core/src/clp/aws/constants.hpp(1 hunks)components/job-orchestration/job_orchestration/executor/compress/compression_task.py(2 hunks)components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py(2 hunks)components/job-orchestration/job_orchestration/executor/query/fs_search_task.py(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp/aws/constants.hppcomponents/core/src/clp/aws/AwsAuthenticationSigner.hppcomponents/core/src/clp/aws/AwsAuthenticationSigner.cpp
🧠 Learnings (10)
📓 Common learnings
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/job-orchestration/job_orchestration/scheduler/job_config.py:29-39
Timestamp: 2025-01-15T16:36:48.932Z
Learning: The S3 service in the clp codebase only supports AWS S3, where region_code is mandatory. Other S3-like services are not supported.
📚 Learning: 2025-01-13T21:18:54.629Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 662
File: components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py:167-186
Timestamp: 2025-01-13T21:18:54.629Z
Learning: The s3_put API in clp_py_utils.s3_utils internally handles timeout and retry mechanisms for S3 operations.
Applied to files:
components/job-orchestration/job_orchestration/executor/query/extract_stream_task.pycomponents/job-orchestration/job_orchestration/executor/query/fs_search_task.py
📚 Learning: 2024-11-15T16:21:52.122Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py
📚 Learning: 2025-08-13T14:48:49.020Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 1144
File: components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py:106-114
Timestamp: 2025-08-13T14:48:49.020Z
Learning: For the dataset manager scripts in components/clp-package-utils/clp_package_utils/scripts/, the native script (native/dataset_manager.py) is designed to only be called through the wrapper script (dataset_manager.py), so dataset validation is only performed at the wrapper level rather than duplicating it in the native script.
Applied to files:
components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py
📚 Learning: 2025-04-25T20:46:20.140Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.pycomponents/core/src/clp/aws/AwsAuthenticationSigner.cppcomponents/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-01-15T16:36:48.932Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 651
File: components/job-orchestration/job_orchestration/scheduler/job_config.py:29-39
Timestamp: 2025-01-15T16:36:48.932Z
Learning: The S3 service in the clp codebase only supports AWS S3, where region_code is mandatory. Other S3-like services are not supported.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_config.py
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.
Applied to files:
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
📚 Learning: 2025-10-07T07:54:32.427Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-py-utils/clp_py_utils/clp_config.py:47-47
Timestamp: 2025-10-07T07:54:32.427Z
Learning: In components/clp-py-utils/clp_py_utils/clp_config.py, the CONTAINER_AWS_CONFIG_DIRECTORY constant is intentionally set to pathlib.Path("/") / ".aws" (i.e., `/.aws`) rather than a user-specific home directory. This hardcoded path is part of the container orchestration design.
Applied to files:
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
📚 Learning: 2025-01-17T23:25:38.165Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 673
File: components/log-viewer-webui/server/src/S3Manager.js:0-0
Timestamp: 2025-01-17T23:25:38.165Z
Learning: In S3Manager.js, URL validation is handled by the URL constructor which throws TypeError for invalid URLs, making additional scheme (s3://) validation redundant.
Applied to files:
components/core/src/clp/aws/AwsAuthenticationSigner.cpp
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.
Applied to files:
components/core/src/clp/aws/AwsAuthenticationSigner.cpp
🧬 Code graph analysis (6)
components/core/src/clp/aws/AwsAuthenticationSigner.hpp (1)
components/core/src/clp_s/Schema.hpp (1)
m_schema(51-54)
components/job-orchestration/job_orchestration/executor/query/fs_search_task.py (1)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
generate_s3_url(264-288)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
generate_s3_url(264-288)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (2)
components/job-orchestration/job_orchestration/scheduler/job_config.py (1)
S3InputConfig(32-44)components/clp-py-utils/clp_py_utils/s3_utils.py (1)
parse_s3_url(218-261)
components/core/src/clp/aws/AwsAuthenticationSigner.cpp (1)
components/core/src/clp/aws/AwsAuthenticationSigner.hpp (1)
s3_url(103-104)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
AwsAuthentication(547-577)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: package-image
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build (ubuntu-24.04)
- GitHub Check: build (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (23)
components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py (2)
17-21: LGTM!The import change correctly reflects the updated S3 utility API.
103-108: LGTM!The call to
generate_s3_urlcorrectly passesendpoint_urlas the first parameter, followed byregion_code,bucket, andobject_key. Error handling is appropriately preserved.components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
28-32: LGTM!The import change correctly reflects the updated S3 utility API.
components/job-orchestration/job_orchestration/executor/query/fs_search_task.py (2)
15-19: LGTM!The import change correctly reflects the updated S3 utility API.
80-85: LGTM!The call to
generate_s3_urlcorrectly passesendpoint_urlas the first parameter, with appropriate error handling preserved.components/core/src/clp/aws/AwsAuthenticationSigner.hpp (4)
43-46: LGTM!The
Styleenum cleanly distinguishes between virtual-hosted and path-style S3 URLs, which is essential for correct URL parsing and presigned URL generation with custom endpoints.
52-53: LGTM!The accessor is appropriately marked
[[nodiscard]]and returns by value for the enum type.
62-63: LGTM!Returning
std::string_viewfor the schema accessor is efficient and consistent with other accessors in this class.
65-71: LGTM!The new private members
m_styleandm_schemaproperly extend the class state to support both URL styles and schema-aware URL generation.components/core/src/clp/aws/AwsAuthenticationSigner.cpp (2)
149-170: LGTM!The canonical request construction correctly differentiates between VirtualHost and Path styles:
- VirtualHost: URI is
/{key}- Path: URI is
/{bucket}/{key}This aligns with AWS Signature Version 4 requirements for path-style URLs.
173-216: LGTM!The URL parsing logic correctly handles both virtual-hosted and path-style S3 URL formats using distinct regex patterns. The default region assignment (
us-east-1) when the region is empty maintains backward compatibility with URLs that don't specify a region. Based on learnings, this aligns with the guidance for S3 URLs without region specifications.components/core/src/clp/aws/constants.hpp (1)
10-16: Delegate bucket name validation to the AWS S3 API rather than implementing partial validation through regex patterns.While the
cBucketRegexpattern[a-z0-9.-]+is indeed permissive (allowing buckets starting/ending with periods or hyphens, no length constraints, no consecutive period checks, etc.), implementing stricter regex validation in the constants is not the appropriate solution. AWS S3 bucket naming rules are extensive and subject to change. Instead, let the S3 API handle validation on actual requests. If this regex is used for URL parsing/extraction, a permissive pattern is acceptable.Likely an incorrect or invalid review comment.
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (3)
175-188: LGTM!The unpacking of the 5-tuple and propagation of
endpoint_urltoS3InputConfigis correct and consistent with the updated_parse_and_validate_s3_object_urlssignature.
189-205: LGTM!The
S3_KEY_PREFIX_COMPRESSIONpath correctly extractsendpoint_urlfromparse_s3_urland passes it toS3InputConfig.
250-284: LGTM!The validation logic correctly ensures all URLs share the same endpoint, region, and bucket, and properly detects duplicate keys using a set.
components/clp-py-utils/clp_py_utils/s3_utils.py (7)
39-43: LGTM!The regex constants are well-defined for parsing S3 URLs in both host-style and path-style formats.
188-215: LGTM!The
_create_s3_clientfunction correctly acceptsendpoint_urland passes it to the boto3 client, enabling connectivity to S3-compatible services like MinIO and LocalStack.
225-261: LGTM!The URL parsing logic correctly handles both host-style and path-style URLs, with optional region codes, and properly determines whether to return a custom
endpoint_urlorNonefor AWS default endpoints.
264-288: LGTM!The URL generation logic correctly produces virtual-hosted style URLs for AWS and path-style URLs for custom endpoints, aligning with the PR objectives for supporting MinIO/LocalStack.
304-308: LGTM!Correctly passes
endpoint_urlandregion_codeto_create_s3_client.
343-345: LGTM!Correctly passes
endpoint_urlto_create_s3_client.
407-409: LGTM!Correctly passes
endpoint_urlto_create_s3_client.components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py (1)
143-149: LGTM!The call to
s3_delete_by_key_prefixcorrectly passess3_config.endpoint_urlas the first argument, aligning with the updated function signature ins3_utils.py.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
353-368: Updates3_delete_by_key_prefixdocstring to reflectendpoint_urland optional region.The function signature now includes
endpoint_urland allowsregion_codeto beNone, but the docstring still has no:param endpoint_url:entry and leaves:param region_code:undocumented. That’s misleading for callers trying to understand how to use AWS versus custom S3-compatible endpoints.Consider updating the docstring along these lines:
def s3_delete_by_key_prefix( - endpoint_url: str | None, - region_code: str | None, + endpoint_url: str | None, + region_code: str | None, bucket_name: str, key_prefix: str, s3_auth: AwsAuthentication, ) -> None: """ Deletes all objects under the S3 path `bucket_name`/`key_prefix`. - :param region_code: - :param bucket_name: - :param key_prefix: - :param s3_auth: + :param endpoint_url: Optional custom S3 endpoint base URL (e.g., `http://minio:9000`). + Use `None` for AWS S3 so the default AWS endpoint is used. + :param region_code: Optional AWS region. Must be non-`None` for real AWS S3 endpoints; + may be `None` for S3-compatible custom endpoints. + :param bucket_name: Name of the bucket to delete from. + :param key_prefix: Object key prefix under which all objects will be deleted. + :param s3_auth: AWS authentication configuration used to create the S3 client.This keeps the behaviour unchanged but makes the intended usage much clearer, especially in the context of the new endpoint support. Based on learnings, this also aligns with the updated
S3Config.region_codesemantics (required only for AWS S3 endpoints, optional for custom ones).Also applies to: 375-377
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/clp-py-utils/clp_py_utils/s3_utils.py(8 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:19.818Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
📚 Learning: 2025-12-10T23:32:19.818Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:19.818Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-04-25T20:46:20.140Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
🔇 Additional comments (2)
components/clp-py-utils/clp_py_utils/s3_utils.py (2)
39-44: Regex building blocks are cohesive and consistent.The shared regex fragments for schema, endpoint, region, bucket, and key prefix are well-factored and make the later URL patterns easier to reason about; no issues from a correctness or readability standpoint for the supported URL shapes.
188-215: Endpoint-aware client creation is consistent with config semantics.Passing
endpoint_urlthrough_create_s3_clientand allowingregion_codeto beNonealigns with theS3Configdesign (AWS endpoints require a region; custom S3-compatible endpoints may omit it). All five call sites correctly passendpoint_urlandregion_codein the intended combinations: AWS calls supply non-Noneregion_codeand leaveendpoint_urlasNone, while MinIO/LocalStack calls setendpoint_urland may omitregion_code. This matches howboto3.Session(..., region_name=region_code)andclient("s3", endpoint_url=endpoint_url, ...)are intended to be used. No remaining callers use the old signature.
gibber9809
left a comment
There was a problem hiding this comment.
Reviewed only the c++ side of this PR. I like the design, just a few nits + I think we should add some tests for the S3Url URL parsing at least.
I confirm rabbit is right. Let me make some changes. |
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (2)
components/clp-py-utils/clp_py_utils/s3_utils.py (2)
218-224: Docstring still doesn’t fully describe the 4‑tuple semantics.You’ve updated it to a 4‑tuple, but it doesn’t document that
endpoint_urlisNonefor*.amazonaws.comURLs and thatregion_codemay beNonefor AWS global endpoints and for non‑AWS endpoints (perS3Config). It also saysbucketinstead ofbucket_name, which is slightly inconsistent with the implementation.Based on learnings, region codes are required only for AWS S3, so spelling out those cases in the docstring will help future callers reason about the return values.
225-261: Non‑AWSs3.<domain>path‑style URLs still drop thes3.prefix and won’t round‑trip.For URLs like
https://s3.wasabisys.com/bucket/keyandhttps://s3.us-west-1.wasabisys.com/bucket/key,path_style_url_regexcapturesendpoint="wasabisys.com"andregion_code(Noneorus-west-1), soparse_s3_urlsetsendpoint_url = "https://wasabisys.com". Subsequent client creation andgenerate_s3_urlthen talk towasabisys.cominstead ofs3[.region].wasabisys.com, which will break typical S3‑compatible providers that expose their service ons3.*hosts.Given that only AWS needs a distinct
region_code(custom endpoints don’t), a safer option for non‑AWS path‑style hosts is to preserve the full host (including anys3.+ region prefix) asendpoint_urland treatregion_codeasNone. One concrete way to do this is:+from urllib.parse import urlsplit @@ - match = host_style_url_regex.match(s3_url) - - if match is None: + match = host_style_url_regex.match(s3_url) + is_path_style = False + + if match is None: path_style_url_regex = re.compile( @@ - ) - match = path_style_url_regex.match(s3_url) + ) + match = path_style_url_regex.match(s3_url) + is_path_style = True @@ - scheme = match.group("scheme") - endpoint = match.group("endpoint") - region_code = match.group("region_code") - bucket_name = match.group("bucket_name") - - endpoint_url = f"{scheme}://{endpoint}" if endpoint != AWS_ENDPOINT else None - key_prefix = match.group("key_prefix") + scheme = match.group("scheme") + endpoint = match.group("endpoint") + region_code = match.group("region_code") + bucket_name = match.group("bucket_name") + key_prefix = match.group("key_prefix") + + full_host = urlsplit(s3_url).netloc + if full_host.endswith(AWS_ENDPOINT): + # AWS: rely on region_code for routing; endpoint_url stays None. + endpoint_url = None + elif is_path_style and full_host.startswith("s3."): + # Non‑AWS `s3[.region].<domain>` endpoints: keep the full host and + # use path‑style addressing; region_code isn’t needed for them. + endpoint_url = f"{scheme}://{full_host}" + region_code = None + else: + endpoint_url = f"{scheme}://{endpoint}" @@ - return endpoint_url, region_code, bucket_name, key_prefix + return endpoint_url, region_code, bucket_name, key_prefixThis preserves
s3.*for non‑AWS path‑style URLs, keeps AWS behaviour unchanged, and allowsgenerate_s3_urlto round‑trip both AWS and S3‑compatible path‑style URLs correctly. Based on learnings, clearingregion_codefor non‑AWS endpoints is consistent with theS3Configcontract (region only required for AWS S3).Please validate this behaviour with at least one AWS S3 URL and one S3‑compatible endpoint that uses
s3.<domain>(e.g., Wasabi or a LocalStack/MinIO deployment behind such a host) to confirm correct round‑tripping and connectivity.Also applies to: 264-289
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
components/clp-py-utils/clp_py_utils/s3_utils.py(8 hunks)components/core/src/clp/aws/AwsAuthenticationSigner.cpp(4 hunks)components/core/src/clp/aws/AwsAuthenticationSigner.hpp(3 hunks)components/core/src/clp/aws/constants.hpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp/aws/AwsAuthenticationSigner.cppcomponents/core/src/clp/aws/constants.hppcomponents/core/src/clp/aws/AwsAuthenticationSigner.hpp
🧠 Learnings (6)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
📚 Learning: 2025-04-25T20:46:20.140Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
Applied to files:
components/core/src/clp/aws/AwsAuthenticationSigner.cppcomponents/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-12-10T23:32:24.363Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Applied to files:
components/core/src/clp/aws/AwsAuthenticationSigner.cppcomponents/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-01-17T23:25:38.165Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 673
File: components/log-viewer-webui/server/src/S3Manager.js:0-0
Timestamp: 2025-01-17T23:25:38.165Z
Learning: In S3Manager.js, URL validation is handled by the URL constructor which throws TypeError for invalid URLs, making additional scheme (s3://) validation redundant.
Applied to files:
components/core/src/clp/aws/AwsAuthenticationSigner.cpp
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.
Applied to files:
components/core/src/clp/aws/AwsAuthenticationSigner.cpp
📚 Learning: 2025-09-15T22:20:40.750Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
🧬 Code graph analysis (1)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
AwsAuthentication(547-577)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: package-image
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
- GitHub Check: build (ubuntu-24.04)
- GitHub Check: build (macos-15)
🔇 Additional comments (9)
components/core/src/clp/aws/AwsAuthenticationSigner.cpp (3)
150-158: LGTM! Canonical URI encoding correctly handles both URL styles.The switch statement properly formats the URI based on the S3Url style:
- VirtualHost includes only the key
- Path includes both bucket and key
This aligns with AWS Signature Version 4 canonical request requirements for each style.
173-216: URL parsing logic is sound.The constructor correctly:
- Attempts to match both virtual-hosted-style and path-style URL patterns
- Sets the appropriate style based on which pattern matches
- Extracts all URL components from the regex match
- Defaults the region to
cDefaultRegionwhen empty (consistent with support for custom S3-compatible endpoints)The extraction logic assumes all named groups exist in the match object. This is safe because boost::regex returns empty strings for optional groups, and the regex structure guarantees the named groups are present.
Based on learnings, the optional region handling correctly supports custom S3-compatible endpoints that don't require region codes.
248-272: LGTM! Presigned URL generation correctly respects URL style and scheme.The switch statement properly formats the presigned URL:
- VirtualHost places the bucket in the hostname, so the path is "/{key}"
- Path places the bucket in the path, so the path is "/{bucket}/{key}"
- Both correctly use the parsed scheme, supporting http, https, and s3 protocols
This is consistent with the canonical request URI formatting and AWS Signature Version 4 requirements.
components/core/src/clp/aws/AwsAuthenticationSigner.hpp (1)
5-5: LGTM! Header additions are clean and well-structured.The new members and accessors correctly support S3 URL style handling:
Styleenum class is properly scoped and usesuint8_tfor efficient storageget_style()andget_scheme()accessors follow existing patterns with[[nodiscard]]andconstm_stylehas default brace initialisation per the style guide- All additions integrate cleanly with the existing interface
Also applies to: 43-46, 52-52, 62-62, 65-65, 71-71
components/clp-py-utils/clp_py_utils/s3_utils.py (5)
39-43: Regex constants for S3 URL parsing look consistent.Centralising the scheme/endpoint/region/bucket/key regex pieces here lines up cleanly with the host‑ and path‑style patterns below and should help keep parsing and validation in sync.
188-215: Endpoint‑aware_create_s3_clientwiring looks correct.Passing
endpoint_urlthrough toaws_session.client("s3", ...)and usingregion_name=region_codeon the session matches boto3 usage and should support both AWS and custom S3‑compatible endpoints, assuming callers only omitregion_codefor non‑AWS endpoints.Please double‑check that all AWS S3 call sites still always provide a non‑None
region_codeso boto3 doesn’t complain about a missing region.
304-308: s3_get_object_metadata now correctly honoursendpoint_url.Using
_create_s3_client(s3_input_config.endpoint_url, s3_input_config.region_code, ...)aligns metadata fetches with the new endpoint‑aware config and should work for both AWS and custom S3 backends.It would be helpful to have (or keep) an integration test hitting a non‑AWS endpoint here to guard against regressions.
343-345: s3_put uses the endpoint‑aware client consistently.The
_create_s3_clientcall now passess3_config.endpoint_urlands3_config.region_code, matching the updated helper signature and ensuring uploads respect custom endpoints.Please confirm all other
_create_s3_clientcallers were updated to the new(endpoint_url, region_code, aws_authentication, boto3_config)ordering.
407-409: s3_delete_objects now respectsendpoint_urlas well.Creating the client with
s3_config.endpoint_urlands3_config.region_codekeeps bulk deletions consistent with the rest of the S3 helpers.As with the other helpers, please ensure this path is exercised against a custom S3 endpoint in tests or manual validation.
There was a problem hiding this comment.
Actionable comments posted: 5
♻️ Duplicate comments (1)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
352-360: Docstring parameter descriptions are incomplete.The docstring lists
:param endpoint_url:and:param region_code:without descriptions. Consider adding brief descriptions as suggested in the past review.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
components/clp-py-utils/clp_py_utils/s3_utils.py(7 hunks)components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp
🧠 Learnings (6)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.
Applied to files:
components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp
📚 Learning: 2025-12-10T23:32:24.363Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-04-25T20:46:20.140Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-09-15T22:20:40.750Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-12-12T16:22:05.883Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/core/src/clp/aws/constants.hpp:10-16
Timestamp: 2025-12-12T16:22:05.883Z
Learning: AWS S3 region codes follow the pattern `[region]-[zone]-[digit]` where the digit is always a single decimal digit (e.g., us-east-1, eu-west-3, ap-northeast-2). Custom S3-compatible endpoints (MinIO, LocalStack) may use different region naming conventions.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
🪛 Cppcheck (2.18.0)
components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp
[information] 1-1: Include file
(missingIncludeSystem)
[information] 2-2: Include file
(missingIncludeSystem)
[information] 3-3: Include file
(missingIncludeSystem)
[information] 5-5: Include file
(missingInclude)
[information] 6-6: Include file
(missingInclude)
🔇 Additional comments (4)
components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp (1)
12-34: Good coverage of VirtualHost vs Path style parsing (including custom endpoint + port).components/clp-py-utils/clp_py_utils/s3_utils.py (3)
39-45: LGTM!The regex constants are well-defined and reusable. The
REGION_CODE_REGEXPcorrectly matches single-digit AWS region suffixes. Based on learnings, AWS region codes follow the[region]-[zone]-[digit]pattern where the digit is always a single decimal digit.
189-216: LGTM!The updated
_create_s3_clientcorrectly propagatesendpoint_urlto the boto3 client and handles all authentication types consistently.
296-300: LGTM!The updated calls correctly pass
endpoint_urlandregion_codefrom the config objects to_create_s3_client.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
189-216: Add explicit path-style addressing for custom S3 endpoints.The
_create_s3_clientfunction doesn't enforces3.addressing_style=pathfor custom endpoints. While boto3's default "auto" mode should eventually fall back to path-style addressing, explicitly setting path-style prevents unnecessary virtual-hosted-style attempts against custom S3-compatible services (MinIO, LocalStack, etc.), which don't support that mode.Update the client creation to merge a path-style config when a custom endpoint is provided:
def _create_s3_client( endpoint_url: str | None, region_code: str | None, s3_auth: AwsAuthentication, boto3_config: Config | None = None, ) -> boto3.client: @@ - s3_client = aws_session.client("s3", endpoint_url=endpoint_url, config=boto3_config) + effective_config = boto3_config + if endpoint_url is not None: + # Custom S3-compatible endpoints generally require path-style addressing. + path_style_cfg = Config(s3={"addressing_style": "path"}) + effective_config = ( + path_style_cfg.merge(boto3_config) if boto3_config is not None else path_style_cfg + ) + + s3_client = aws_session.client("s3", endpoint_url=endpoint_url, config=effective_config) return s3_client
♻️ Duplicate comments (2)
components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp (1)
9-33: Split intoSECTIONs andCAPTURE(url)to make failures self-identifying. Right now, a single failure doesn’t immediately tell which input URL broke parsing.components/clp-py-utils/clp_py_utils/s3_utils.py (1)
348-365: Docstring params fors3_delete_by_key_prefixare still blank/out of sync.This was raised previously;
:param endpoint_url:/:param region_code:should clarify AWS-vs-custom semantics (and thatregion_codemay beNonefor non-AWS endpoints).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
components/clp-py-utils/clp_py_utils/s3_utils.py(7 hunks)components/core/CMakeLists.txt(1 hunks)components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp
🧠 Learnings (14)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
📚 Learning: 2024-11-29T22:50:17.206Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 595
File: components/core/tests/test-clp_s-end_to_end.cpp:109-110
Timestamp: 2024-11-29T22:50:17.206Z
Learning: In `components/core/tests/test-clp_s-end_to_end.cpp`, the success of `constructor.store()` is verified through `REQUIRE` statements and subsequent comparisons.
Applied to files:
components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp
📚 Learning: 2024-11-01T03:26:26.386Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 570
File: components/core/tests/test-ir_encoding_methods.cpp:376-399
Timestamp: 2024-11-01T03:26:26.386Z
Learning: In the test code (`components/core/tests/test-ir_encoding_methods.cpp`), exception handling for `msgpack::unpack` can be omitted because the Catch2 testing framework captures exceptions if they occur.
Applied to files:
components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp
📚 Learning: 2024-10-13T09:27:43.408Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 557
File: components/core/tests/test-ir_encoding_methods.cpp:1216-1286
Timestamp: 2024-10-13T09:27:43.408Z
Learning: In the unit test case `ffi_ir_stream_serialize_schema_tree_node_id` in `test-ir_encoding_methods.cpp`, suppressing the `readability-function-cognitive-complexity` warning is acceptable due to the expansion of Catch2 macros in C++ tests, and such test cases may not have readability issues.
Applied to files:
components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp
📚 Learning: 2025-12-09T01:57:04.839Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1313
File: components/core/tests/test-SchemaSearcher.cpp:136-137
Timestamp: 2025-12-09T01:57:04.839Z
Learning: In C++ files under the components directory, keep forward declarations even when a definition follows in the same scope (such as in anonymous namespaces) to maintain consistency across multiple methods. This style reduces churn and aligns with the existing convention in this repository for both source and tests.
Applied to files:
components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp
📚 Learning: 2025-12-10T23:32:24.363Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-04-25T20:46:20.140Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-12-12T21:14:28.267Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/s3_utils.py:226-230
Timestamp: 2025-12-12T21:14:28.267Z
Learning: In the CLP project (y-scope/clp), ensure that regex patterns are compiled inline within functions for Python files under components/clp-py-utils/clp_py_utils (not at module scope). Move any re.compile calls from the module top level into the functions that use them to maintain consistency with the codebase style. This applies to all .py files in that directory; use module-level imports only for imports, not for pre-compiling regex patterns.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-09-15T22:20:40.750Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-12-12T16:22:05.883Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/core/src/clp/aws/constants.hpp:10-16
Timestamp: 2025-12-12T16:22:05.883Z
Learning: AWS S3 region codes follow the pattern `[region]-[zone]-[digit]` where the digit is always a single decimal digit (e.g., us-east-1, eu-west-3, ap-northeast-2). Custom S3-compatible endpoints (MinIO, LocalStack) may use different region naming conventions.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-11-17T22:58:50.056Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-11-17T22:58:50.056Z
Learning: In the y-scope/clp repository, when enabling new linting tools (ruff, mypy) on Python components, the team uses an incremental approach: first enable the tools with errors allowed (exit code 0), apply only safe auto-fixable fixes, then address remaining issues in follow-up PRs. During the initial enablement PR, reviews should focus on correctness of auto-fixes rather than flagging new code quality issues.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-08-24T17:10:44.565Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: taskfiles/lint.yaml:780-781
Timestamp: 2025-08-24T17:10:44.565Z
Learning: The CLP repository is gradually migrating from Black to Ruff for Python formatting. Some modules haven't been set up for Ruff yet, so there's intentional mixed usage of Black and Ruff formatters during this transition period. This should not be flagged as an inconsistency in PR reviews.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-10-13T03:24:35.074Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1405
File: components/clp-package-utils/pyproject.toml:5-15
Timestamp: 2025-10-13T03:24:35.074Z
Learning: In the y-scope/clp repository, the Python 3.9 to 3.10 version requirement change was intentionally deferred to a separate PR (after PR #1405) to reduce review effort, as decided in an offline discussion between junhaoliao and kirkrodrigues.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
🪛 Cppcheck (2.18.0)
components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp
[information] 1-1: Include file
(missingIncludeSystem)
[information] 2-2: Include file
(missingIncludeSystem)
[information] 3-3: Include file
(missingIncludeSystem)
[information] 5-5: Include file
(missingInclude)
[information] 6-6: Include file
(missingInclude)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: package-image
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build (macos-15)
- GitHub Check: build (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (7)
components/core/CMakeLists.txt (1)
440-446: Unit test source wiring looks correct. Addingsrc/clp/aws/test/test_AwsAuthenticationSigner.cpptoSOURCE_FILES_unitTestshould ensure it’s built/linked only withunitTestunderCLP_BUILD_TESTING.components/core/src/clp/aws/test/test_AwsAuthenticationSigner.cpp (1)
26-32: Confirm the intended behaviour for “custom endpoint + no region” (default vs empty). The test currently requirescDefaultRegionforhttp://localhost:4566/...; please verify this matches the PR’s cross-language contract (especially if Python now treats region as optional for non-AWS endpoints).components/clp-py-utils/clp_py_utils/s3_utils.py (5)
219-254: S3 URL parsing changes look consistent with endpoint-aware 4-tuple return.Returning
(endpoint_url, region_code, bucket_name, key_prefix)and treating AWS asendpoint_url=Nonealigns with the new config model (optional region for custom endpoints). Based on learnings, the inline regex compilation is also consistent with repo style.
256-284:generate_s3_urlbehaviour is reasonable, but requires callers to includes3.inendpoint_urlif they wants3.<region>...hosts.For non-AWS endpoints, the presence/absence of
s3.inendpoint_urlchanges the generated host forregion_code != None(e.g.,https://s3.<region>.<endpoint>vshttps://<region>.<endpoint>). Please confirm that’s intended for all supported providers (Wasabi/Spaces/etc.) and document it where configs are defined.
299-313: Call-site propagation ofendpoint_url/region_codeinto_create_s3_clientis correct.
338-340: Call-site propagation ofendpoint_url/region_codeinto_create_s3_clientis correct.
403-405: Call-site propagation ofendpoint_url/region_codeinto_create_s3_clientis correct.
| SCHEME_REGEXP = r"(?P<scheme>(http|https))" | ||
| S3_PREFIX_REGEXP = r"(?P<s3>s3)" | ||
| ENDPOINT_REGEXP = r"(?P<endpoint>[a-z0-9.-]+(\:[0-9]+)?)" | ||
| REGION_CODE_REGEXP = r"(?P<region_code>[a-z]+-[a-z]+-[0-9])" | ||
| BUCKET_NAME_REGEXP = r"(?P<bucket_name>[a-z0-9.-]+)" | ||
| KEY_PREFIX_REGEXP = r"(?P<key_prefix>[^?]+)" | ||
|
|
There was a problem hiding this comment.
REGION_CODE_REGEXP is too strict; likely rejects valid AWS regions (e.g., us-gov-west-1).
REGION_CODE_REGEXP = r"(?P<region_code>[a-z]+-[a-z]+-[0-9])" won’t match regions with more than 2 hyphen-separated alpha segments, which can break parsing and URL generation for some AWS partitions/providers.
-REGION_CODE_REGEXP = r"(?P<region_code>[a-z]+-[a-z]+-[0-9])"
+# Allow additional hyphen-separated segments (e.g., us-gov-west-1).
+REGION_CODE_REGEXP = r"(?P<region_code>[a-z0-9]+(-[a-z0-9]+)*-[0-9])"🤖 Prompt for AI Agents
components/clp-py-utils/clp_py_utils/s3_utils.py around lines 39 to 45: the
REGION_CODE_REGEXP is too strict and only matches two alpha segments plus a
single digit (rejects regions like us-gov-west-1); update the REGION_CODE_REGEXP
to accept one or more hyphen-separated alpha segments followed by a numeric
suffix (one or more digits) so multi-segment region codes are allowed, replace
the constant accordingly and run/adjust any tests that validate region parsing.
gibber9809
left a comment
There was a problem hiding this comment.
A few nits, but otherwise looks good to me.
| def parse_s3_url(s3_url: str) -> tuple[str | None, str | None, str, str]: | ||
| """ | ||
| Parses the region_code, bucket, and key_prefix from the given S3 URL. | ||
| Parses the endpoint_url, region_code, bucket, and key_prefix from the given S3 URL. | ||
| :param s3_url: A host-style URL or path-style URL. | ||
| :return: A tuple of (region_code, bucket, key_prefix). | ||
| :return: A tuple of (endpoint_url, region_code, bucket, key_prefix). | ||
| :raise: ValueError if `s3_url` is not a valid host-style URL or path-style URL. |
There was a problem hiding this comment.
I feel like it's a bit weird that we parse the endpoint url for aws as None. I can see that we're consistently using a None endpoint to indicate aws everywhere, but maybe it would be simpler to have a small EndpointUrl class that stores the endpoint string + a flag that indicates whether the endpoint is S3 or not?
It's also potentially fine to stick with the str | None approach if you prefer, I just think we need to update these docstrings to explain that None means AWS in that case .
There was a problem hiding this comment.
I've updated the documentation. The behavior feels analogous to when we don't specify an AWS_ENDPOINT_URL env in the AWS CLI, or when we pass endpoint_url=None to a boto3.client(...) call.
Honestly, I also feel returning a None for endpoint_url is very weird. Maybe the logic should be applied in _create_s3_client? Open to discussion.
There was a problem hiding this comment.
I think it's probably fine to stick with None for now since it's at least consistent everywhere. If we find it ends up being too confusing we should definitely come back to this after the release though.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
348-372: LGTM - Delete operation correctly uses endpoint configuration.The function signature and client creation are properly updated to support custom S3 endpoints.
Note: The docstring enhancement for
endpoint_urlandregion_codeparameters was flagged in a previous review.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
components/clp-py-utils/clp_py_utils/s3_utils.py(7 hunks)components/core/src/clp/aws/constants.hpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp/aws/constants.hpp
🧠 Learnings (12)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
📚 Learning: 2025-04-25T20:46:20.140Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
Applied to files:
components/core/src/clp/aws/constants.hppcomponents/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-12-12T21:14:36.067Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/s3_utils.py:226-230
Timestamp: 2025-12-12T21:14:36.067Z
Learning: In the CLP project (y-scope/clp), regex patterns in components/clp-py-utils/clp_py_utils/s3_utils.py should be compiled inline within functions rather than at module level, to maintain consistency with the existing codebase style.
Applied to files:
components/core/src/clp/aws/constants.hpp
📚 Learning: 2025-12-10T23:32:24.363Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Applied to files:
components/core/src/clp/aws/constants.hppcomponents/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-12-12T16:22:05.883Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/core/src/clp/aws/constants.hpp:10-16
Timestamp: 2025-12-12T16:22:05.883Z
Learning: AWS S3 region codes are expected to follow the pattern <region>-<zone>-<digit> with a single decimal digit (e.g., us-east-1, eu-west-3, ap-northeast-2). In code handling region identifiers, validate against this pattern and treat region strings not matching it as invalid, while allowing the common exception that custom S3-compatible endpoints (e.g., MinIO, LocalStack) may use different region naming conventions. Apply this guideline to all relevant C++ header files under components/core/src/clp/aws (and similar AWS-related headers) to ensure consistent region string validation across the codebase.
Applied to files:
components/core/src/clp/aws/constants.hpp
📚 Learning: 2025-12-12T21:14:28.267Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/s3_utils.py:226-230
Timestamp: 2025-12-12T21:14:28.267Z
Learning: In the CLP project (y-scope/clp), ensure that regex patterns are compiled inline within functions for Python files under components/clp-py-utils/clp_py_utils (not at module scope). Move any re.compile calls from the module top level into the functions that use them to maintain consistency with the codebase style. This applies to all .py files in that directory; use module-level imports only for imports, not for pre-compiling regex patterns.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-09-15T22:20:40.750Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-12-12T16:22:15.181Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/core/src/clp/aws/constants.hpp:10-16
Timestamp: 2025-12-12T16:22:15.181Z
Learning: AWS S3 region codes follow the pattern `[region]-[zone]-[digit]` where the digit is always a single decimal digit (e.g., us-east-1, eu-west-3, ap-northeast-2). Custom S3-compatible endpoints (MinIO, LocalStack) may use different region naming conventions.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-11-17T22:58:50.056Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-11-17T22:58:50.056Z
Learning: In the y-scope/clp repository, when enabling new linting tools (ruff, mypy) on Python components, the team uses an incremental approach: first enable the tools with errors allowed (exit code 0), apply only safe auto-fixable fixes, then address remaining issues in follow-up PRs. During the initial enablement PR, reviews should focus on correctness of auto-fixes rather than flagging new code quality issues.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-08-24T17:10:44.565Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: taskfiles/lint.yaml:780-781
Timestamp: 2025-08-24T17:10:44.565Z
Learning: The CLP repository is gradually migrating from Black to Ruff for Python formatting. Some modules haven't been set up for Ruff yet, so there's intentional mixed usage of Black and Ruff formatters during this transition period. This should not be flagged as an inconsistency in PR reviews.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-10-13T03:24:35.074Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1405
File: components/clp-package-utils/pyproject.toml:5-15
Timestamp: 2025-10-13T03:24:35.074Z
Learning: In the y-scope/clp repository, the Python 3.9 to 3.10 version requirement change was intentionally deferred to a separate PR (after PR #1405) to reduce review effort, as decided in an offline discussion between junhaoliao and kirkrodrigues.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
🧬 Code graph analysis (1)
components/clp-py-utils/clp_py_utils/s3_utils.py (2)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
AwsAuthentication(547-577)components/core/src/clp/aws/AwsAuthenticationSigner.hpp (1)
s3_url(103-104)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: package-image
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build (ubuntu-24.04)
- GitHub Check: build (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (7)
components/clp-py-utils/clp_py_utils/s3_utils.py (7)
39-45: LGTM - Regex constants support flexible endpoint handling.The relaxed
REGION_CODE_REGEXPpattern supports custom S3-compatible endpoints (MinIO, LocalStack) that may use non-standard region naming conventions. The pattern works within the larger URL regex context where structured matching limits false positives.
189-216: LGTM - Client creation correctly propagates endpoint configuration.The
endpoint_urlandregion_codeparameters are properly threaded through session creation and client instantiation for all authentication types.
219-253: LGTM - URL parsing correctly extracts endpoint and preserves s3. prefix.The implementation properly handles both host-style and path-style URLs. The
s3.prefix is correctly preserved inendpoint_urlfor non-AWS endpoints (line 249-250), addressing the previous round-trip concerns.
256-283: LGTM - URL generation handles AWS and custom endpoints correctly.The function properly generates:
- Virtual-hosted style URLs for AWS (
bucket.s3.region.amazonaws.com)- Path-style URLs for custom endpoints (
endpoint/bucket/key)The
s3.prefix and region code are correctly interpolated for non-AWS endpoints with region support.
299-303: LGTM - Endpoint configuration correctly propagated to client creation.
338-340: LGTM - Endpoint configuration correctly propagated for upload operations.
403-405: LGTM - Batch delete operations correctly use endpoint configuration.
| constexpr std::string_view cSchemeRegex{R"((?<scheme>(http|https)))"}; | ||
| constexpr std::string_view cEndpointRegex{R"((?<endpoint>[a-z0-9.-]+(\:[0-9]+)?))"}; | ||
| constexpr std::string_view cRegionRegex{R"((?<region>[a-z0-9\-]+))"}; | ||
| constexpr std::string_view cBucketRegex{R"((?<bucket>[a-z0-9.-]+))"}; | ||
| constexpr std::string_view cKeyRegex{R"((?<key>[^?]+))"}; |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Regex constants look good; confirm AWS‑specific region validation happens elsewhere
The new regex constants are internally consistent and match the intended S3 URL components (scheme, endpoint + optional port, region, bucket, key). The more permissive cRegionRegex ([a-z0-9\-]+) makes sense for custom S3‑compatible endpoints, but it no longer enforces the AWS <region>-<zone>-<digit> shape by itself. As long as AWS‑only flows (e.g., anything that truly depends on an AWS S3 region code, like Presto integration) still validate region strings against the stricter AWS pattern in their own logic, this shared regex set is a reasonable compromise for mixed AWS/custom endpoint support. Based on learnings, this keeps custom endpoints working without silently relaxing AWS‑specific correctness guarantees.
🤖 Prompt for AI Agents
In components/core/src/clp/aws/constants.hpp around lines 11 to 15, the shared
regexes deliberately relax region validation to support custom S3 endpoints;
ensure any AWS-specific consumers (e.g., Presto integration or other AWS-only
flows) perform stricter region validation elsewhere by adding a validation step
using the AWS region pattern (e.g., full region-zone-digit form) before relying
on cRegionRegex, returning/throwing a clear validation error when the region
does not match the stricter AWS pattern; keep the permissive shared constants
as-is for mixed endpoint support.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
349-365: s3_delete_by_key_prefix docstring still doesn’t describe endpoint_url / region_code.The signature now takes both
endpoint_urlandregion_code, but the docstring only lists:param endpoint_url:and:param region_code:without explaining their roles, which is confusing given theNone‑means‑AWS convention used elsewhere.Consider tightening the docstring to mirror
parse_s3_url/S3Configsemantics, e.g.:def s3_delete_by_key_prefix( endpoint_url: str | None, region_code: str | None, @@ - :param endpoint_url: - :param region_code: + :param endpoint_url: Custom S3 endpoint URL (e.g., https://minio.example.com:9000), + or None to use the default AWS S3 endpoint. + :param region_code: AWS region code for AWS endpoints; None for S3-compatible + endpoints that don’t require a region. :param bucket_name: :param key_prefix: :param s3_auth:This keeps the docstring consistent with how
S3Configand_create_s3_clientinterpret these fields. Based on learnings, that clarity is important now thatregion_codeis optional for non‑AWS endpoints.Also applies to: 372-373
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/clp-py-utils/clp_py_utils/s3_utils.py(7 hunks)
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/core/src/clp/aws/constants.hpp:10-16
Timestamp: 2025-12-12T16:22:15.181Z
Learning: AWS S3 region codes follow the pattern `[region]-[zone]-[digit]` where the digit is always a single decimal digit (e.g., us-east-1, eu-west-3, ap-northeast-2). Custom S3-compatible endpoints (MinIO, LocalStack) may use different region naming conventions.
📚 Learning: 2025-12-10T23:32:24.363Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-12-12T21:14:28.267Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/s3_utils.py:226-230
Timestamp: 2025-12-12T21:14:28.267Z
Learning: In the CLP project (y-scope/clp), ensure that regex patterns are compiled inline within functions for Python files under components/clp-py-utils/clp_py_utils (not at module scope). Move any re.compile calls from the module top level into the functions that use them to maintain consistency with the codebase style. This applies to all .py files in that directory; use module-level imports only for imports, not for pre-compiling regex patterns.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-04-25T20:46:20.140Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-09-15T22:20:40.750Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1125
File: components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py:267-291
Timestamp: 2025-09-15T22:20:40.750Z
Learning: For CLP compression jobs, the team has decided to fail the entire job immediately upon encountering any invalid input path, rather than continuing to process valid paths. This decision was made during PR #1125 development.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-12-12T16:22:15.181Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/core/src/clp/aws/constants.hpp:10-16
Timestamp: 2025-12-12T16:22:15.181Z
Learning: AWS S3 region codes follow the pattern `[region]-[zone]-[digit]` where the digit is always a single decimal digit (e.g., us-east-1, eu-west-3, ap-northeast-2). Custom S3-compatible endpoints (MinIO, LocalStack) may use different region naming conventions.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-08-15T21:48:40.228Z
Learnt from: quinntaylormitchell
Repo: y-scope/clp PR: 1181
File: docs/src/user-guide/guides-retention.md:68-73
Timestamp: 2025-08-15T21:48:40.228Z
Learning: In documentation for the CLP project, when suggesting formatting improvements for variables in explanatory text, the user quinntaylormitchell prefers to maintain existing sentence structures (like keeping "that" in "i.e., that the difference...") while applying monospace formatting to technical terms and variables for consistency.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-11-17T22:58:50.056Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 0
File: :0-0
Timestamp: 2025-11-17T22:58:50.056Z
Learning: In the y-scope/clp repository, when enabling new linting tools (ruff, mypy) on Python components, the team uses an incremental approach: first enable the tools with errors allowed (exit code 0), apply only safe auto-fixable fixes, then address remaining issues in follow-up PRs. During the initial enablement PR, reviews should focus on correctness of auto-fixes rather than flagging new code quality issues.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-08-24T17:10:44.565Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1100
File: taskfiles/lint.yaml:780-781
Timestamp: 2025-08-24T17:10:44.565Z
Learning: The CLP repository is gradually migrating from Black to Ruff for Python formatting. Some modules haven't been set up for Ruff yet, so there's intentional mixed usage of Black and Ruff formatters during this transition period. This should not be flagged as an inconsistency in PR reviews.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
📚 Learning: 2025-10-13T03:24:35.074Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1405
File: components/clp-package-utils/pyproject.toml:5-15
Timestamp: 2025-10-13T03:24:35.074Z
Learning: In the y-scope/clp repository, the Python 3.9 to 3.10 version requirement change was intentionally deferred to a separate PR (after PR #1405) to reduce review effort, as decided in an offline discussion between junhaoliao and kirkrodrigues.
Applied to files:
components/clp-py-utils/clp_py_utils/s3_utils.py
🧬 Code graph analysis (1)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
AwsAuthentication(547-577)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: package-image
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: build (macos-15)
- GitHub Check: build (ubuntu-24.04)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (3)
components/clp-py-utils/clp_py_utils/s3_utils.py (3)
39-45: Regex fragments for S3 URL parsing look correct and flexible.These constants compose cleanly into the host‑ and path‑style patterns below, and the relaxed
REGION_CODE_REGEXP = r"(?P<region_code>[a-z0-9\-]+)"now admits extended AWS‑style region codes (e.g.,us-gov-west-1) while still stopping at the literal.before the endpoint, so it won’t swallow any of the endpoint host. Based on learnings, this matches the intent to support both AWS and custom S3-compatible region naming.
189-216: Endpoint-aware S3 client creation is wired correctly.Passing
endpoint_urlthrough toaws_session.client("s3", endpoint_url=endpoint_url, config=boto3_config)while keepingregion_name=region_codeon theSessionpreserves existing AWS behaviour and enables custom S3-compatible endpoints (MinIO/LocalStack, etc.) without changing auth branching. The updated signature is consistently used by all local callers.
287-304: All S3 operations now consistently propagate endpoint_url into client creation.
s3_get_object_metadata,s3_put, ands3_delete_objectsall call_create_s3_clientwith(s3_config.endpoint_url, s3_config.region_code, s3_config.aws_authentication, boto3_config)in the new argument order, so metadata reads, uploads, and batched deletes should behave the same across AWS and custom endpoints. This is aligned with the updatedS3Configshape from the config module. Based on learnings, this is the desired propagation of optionalregion_codeandendpoint_url.Also applies to: 338-341, 403-406
| def parse_s3_url(s3_url: str) -> tuple[str | None, str | None, str, str]: | ||
| """ | ||
| Parses the region_code, bucket, and key_prefix from the given S3 URL. | ||
| Parses the endpoint_url, region_code, bucket, and key_prefix from the given S3 URL. | ||
| :param s3_url: A host-style URL or path-style URL. | ||
| :return: A tuple of (region_code, bucket, key_prefix). | ||
| :return: A tuple of (endpoint_url, region_code, bucket, key_prefix). A value of None is returned | ||
| if the endpoint_url originates from AWS. | ||
| :raise: ValueError if `s3_url` is not a valid host-style URL or path-style URL. | ||
| """ | ||
| host_style_url_regex = re.compile( | ||
| r"https://(?P<bucket_name>[a-z0-9.-]+)\.s3(\.(?P<region_code>[a-z]+-[a-z]+-[0-9]))" | ||
| r"\.(?P<endpoint>[a-z0-9.-]+)/(?P<key_prefix>[^?]+).*" | ||
| rf"{SCHEME_REGEXP}://{BUCKET_NAME_REGEXP}\." | ||
| rf"{S3_PREFIX_REGEXP}\.({REGION_CODE_REGEXP}\.)?" | ||
| rf"{ENDPOINT_REGEXP}/{KEY_PREFIX_REGEXP}.*" | ||
| ) | ||
| match = host_style_url_regex.match(s3_url) | ||
|
|
||
| if match is None: | ||
| path_style_url_regex = re.compile( | ||
| r"https://s3(\.(?P<region_code>[a-z]+-[a-z]+-[0-9]))\.(?P<endpoint>[a-z0-9.-]+)/" | ||
| r"(?P<bucket_name>[a-z0-9.-]+)/(?P<key_prefix>[^?]+).*" | ||
| rf"{SCHEME_REGEXP}://({S3_PREFIX_REGEXP}\." | ||
| rf"({REGION_CODE_REGEXP}\.)?)?{ENDPOINT_REGEXP}" | ||
| rf"/{BUCKET_NAME_REGEXP}/{KEY_PREFIX_REGEXP}.*" | ||
| ) | ||
| match = path_style_url_regex.match(s3_url) | ||
|
|
||
| if match is None: | ||
| raise ValueError(f"Unsupported URL format: {s3_url}") | ||
|
|
||
| scheme = match.group("scheme") | ||
| endpoint = match.group("endpoint") | ||
| region_code = match.group("region_code") | ||
| bucket_name = match.group("bucket_name") | ||
| endpoint = match.group("endpoint") | ||
| key_prefix = match.group("key_prefix") | ||
|
|
||
| if AWS_ENDPOINT != endpoint: | ||
| raise ValueError(f"Unsupported endpoint: {endpoint}") | ||
| s3_prefix = "s3." if match.group("s3") is not None else "" | ||
| endpoint_url = f"{scheme}://{s3_prefix}{endpoint}" if endpoint != AWS_ENDPOINT else None | ||
| key_prefix = match.group("key_prefix") | ||
|
|
||
| return region_code, bucket_name, key_prefix | ||
| return endpoint_url, region_code, bucket_name, key_prefix | ||
|
|
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
parse_s3_url / generate_s3_url now handle AWS and custom endpoints coherently.
The updated pair correctly covers:
- AWS virtual-hosted URLs
bucket.s3[.region].amazonaws.com→endpoint_url=None,region_codeas parsed, and regenerates back to virtual-hosted AWS URLs. - AWS path-style URLs
s3[.region].amazonaws.com/bucket/key→ parsed as AWS (endpoint_url=None) and normalised to virtual-hosted form on generation. - Non‑AWS endpoints, including
s3.<domain>ands3.<region>.<domain>(e.g., Wasabi) and plain hosts likeminio.example.com:9000, preserving anys3.prefix inendpoint_urland avoiding the previous prefix-loss / double‑s3issues on round‑trip. - Endpoint parsing for
generate_s3_urlis correctly constrained toscheme://[s3.]endpoint[:port]/?, so misconfigured endpoint URLs with a path are rejected early.
Only a minor nit: the docstring sentence
“A value of None is returned if the endpoint_url originates from AWS.”
could be a bit clearer if rephrased along the lines of:
“
endpoint_urlisNonewhen the S3 endpoint host isamazonaws.com; otherwise it is thescheme://host[:port]of the S3-compatible endpoint.”
to make the None convention explicit and tie it to AWS_ENDPOINT. Based on learnings, this matches how endpoint_url and region_code are interpreted elsewhere in the config model.
Also applies to: 257-285
🤖 Prompt for AI Agents
In components/clp-py-utils/clp_py_utils/s3_utils.py around lines 219-255 (and
similarly lines 257-285), the docstring line “A value of None is returned if the
endpoint_url originates from AWS.” is ambiguous; replace it with a clearer
sentence such as: "`endpoint_url` is None when the S3 endpoint host is
amazonaws.com; otherwise `endpoint_url` is the scheme://host[:port] of the
S3-compatible endpoint." Update both occurrences so the None convention
explicitly ties to AWS_ENDPOINT and matches how endpoint_url and region_code are
used elsewhere.
gibber9809
left a comment
There was a problem hiding this comment.
LGTM. PR title is fine as well.
I also double-checked that regular S3 urls still work as expected with clp-s on the last commit, so things look good.
junhaoliao
left a comment
There was a problem hiding this comment.
the code lgtm. posted a question
| S3_PREFIX_REGEXP = r"(?P<s3>s3)" | ||
| ENDPOINT_REGEXP = r"(?P<endpoint>[a-z0-9.-]+(\:[0-9]+)?)" | ||
| REGION_CODE_REGEXP = r"(?P<region_code>[a-z0-9\-]+)" | ||
| BUCKET_NAME_REGEXP = r"(?P<bucket_name>[a-z0-9.-]+)" |
There was a problem hiding this comment.
it seems those patterns prohibit UPPER CASE letters, which are techically allowed as per RFC952? ditto for the ones in C++ code
https://datatracker.ietf.org/doc/html/rfc952
ASSUMPTIONS -> 1.
A "name" (Net, Host, Gateway, or Domain name) is a text string up
to 24 characters drawn from the alphabet (A-Z), digits (0-9), minus
sign (-), and period (.). Note that periods are only allowed when
they serve to delimit components of "domain style names". (See
RFC-921, "Domain Name System Implementation Schedule", for
background). No blank or space characters are permitted as part of a
name. No distinction is made between upper and lower case. The first
character must be an alpha character. The last character must not be
a minus sign or period. A host which serves as a GATEWAY should have
"-GATEWAY" or "-GW" as part of its name. Hosts which do not serve as
Internet gateways should not use "-GATEWAY" and "-GW" as part of
their names. A host which is a TAC should have "-TAC" as the last
part of its host name, if it is a DoD host. Single character names
or nicknames are not allowed.
There was a problem hiding this comment.
@coderabbitai Can you create an issue on this? 1. We need to verify if hostname is case insensitive and path is case sensitive. 2. endpoint_url = f"{scheme}://{s3_prefix}{endpoint}" if endpoint != AWS_ENDPOINT else None the comparison needs to be case insensitive.
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Create GitHub issue for hostname case sensitivity and endpoint comparison
gh issue create \
--repo y-scope/clp \
--title "S3 URL parsing: verify hostname case sensitivity and fix endpoint comparison" \
--assignee hoophalab \
--body "## Context
Related to PR #1767 and [this comment](https://github.com/y-scope/clp/pull/1767#discussion_r2621220969).
@junhaoliao raised a concern that the regex patterns in the S3 URL parsing code prohibit uppercase letters, which are technically allowed per [RFC952](https://datatracker.ietf.org/doc/html/rfc952).
## Tasks
1. **Verify hostname and path case sensitivity behavior**
- Confirm if hostname is case insensitive
- Confirm if path is case sensitive
- Update regex patterns accordingly if uppercase letters should be allowed in hostnames
2. **Fix endpoint comparison to be case insensitive**
- The comparison \`endpoint != AWS_ENDPOINT\` in \`parse_s3_url\` (line 318 in \`components/clp-py-utils/clp_py_utils/s3_utils.py\`) needs to be case insensitive
- Similar fixes may be needed in the C++ code
## Affected Files
- \`components/clp-py-utils/clp_py_utils/s3_utils.py\`
- \`components/core/src/clp/aws/constants.hpp\`
- \`components/core/src/clp/aws/AwsAuthenticationSigner.cpp\`
---
Requested by: @hoophalab
PR: #1767"
Length of output: 42
@hoophalab, I've created issue #1792 to track these items:
Issue: S3 URL parsing: verify hostname case sensitivity and fix endpoint comparison
The issue covers:
- Verifying hostname case insensitivity vs path case sensitivity behavior, and updating regex patterns if needed to allow uppercase letters per RFC952
- Making the
endpoint != AWS_ENDPOINTcomparison case insensitive in both Python and C++ code
The issue has been assigned to you and includes backlinks to this PR and comment thread.
🧬 Code Graph Analysis Results
[components/clp-py-utils/clp_py_utils/s3_utils.py: parse_s3_url function]
Summary:
- Purpose: Parses an S3 URL to extract endpoint_url, region_code, bucket, and key_prefix.
- Parameters: s3_url (str) - a host-style or path-style URL.
- Returns: (endpoint_url: str | None, region_code: str | None, bucket: str, key_prefix: str)
- Behavior:
- Attempts host-style URL pattern matching first, then falls back to path-style pattern.
- Determines endpoint_url as scheme://{s3_prefix}{endpoint} unless endpoint equals AWS_ENDPOINT, in which case endpoint_url is None.
- s3_prefix is "s3." when the host contains the s3 prefix, otherwise "".
- Important detail: The comparison of the endpoint to AWS_ENDPOINT is currently case-sensitive.
- Change needed (per user comment):
- Make the AWS endpoint comparison case-insensitive, e.g. endpoint.lower() != AWS_ENDPOINT.lower().
Relevant excerpt (as-is from the file):
def parse_s3_url(s3_url: str) -> tuple[str | None, str | None, str, str]:
"""
Parses the endpoint_url, region_code, bucket, and key_prefix from the given S3 URL.
:param s3_url: A host-style URL or path-style URL.
:return: A tuple of (endpoint_url, region_code, bucket, key_prefix). A value of None is returned
if the endpoint_url originates from AWS.
:raise: ValueError if s3_url is not a valid host-style URL or path-style URL.
"""
host_style_url_regex = re.compile(
rf"{SCHEME_REGEXP}://{BUCKET_NAME_REGEXP}."
rf"{S3_PREFIX_REGEXP}.({REGION_CODE_REGEXP}.)?"
rf"{ENDPOINT_REGEXP}/{KEY_PREFIX_REGEXP}.*"
)
match = host_style_url_regex.match(s3_url)
if match is None:
path_style_url_regex = re.compile(
rf"{SCHEME_REGEXP}://({S3_PREFIX_REGEXP}\."
rf"({REGION_CODE_REGEXP}\.)?)?{ENDPOINT_REGEXP}"
rf"/{BUCKET_NAME_REGEXP}/{KEY_PREFIX_REGEXP}.*"
)
match = path_style_url_regex.match(s3_url)
if match is None:
raise ValueError(f"Unsupported URL format: {s3_url}")
scheme = match.group("scheme")
endpoint = match.group("endpoint")
region_code = match.group("region_code")
bucket_name = match.group("bucket_name")
s3_prefix = "s3." if match.group("s3") is not None else ""
endpoint_url = f"{scheme}://{s3_prefix}{endpoint}" if endpoint != AWS_ENDPOINT else None
key_prefix = match.group("key_prefix")
return endpoint_url, region_code, bucket_name, key_prefix
Note to reviewer: Update needed to compare endpoint and AWS_ENDPOINT in a case-insensitive manner, e.g. endpoint.lower() != AWS_ENDPOINT.lower(). This aligns with hostname case-insensitivity expectations.
🧠 Learnings used
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
Description
S3InputConfigandS3Urlclasses.endpointisn'tamazonaws.com, both clp-s and the compression worker switch to path-style URLs.Checklist
breaking change.
Validation performed
log_inputsection inclp-config.yamlcompress-from-s3.sh --timestamp-key timestamp --dataset default s3-key-prefix http://<endpoint>/test-bucket/log-file-name.logSummary by CodeRabbit
New Features
Improvements
Tests
✏️ Tip: You can customize this high-level summary in your review settings.