feat(presto-clp): Add support for custom S3 endpoint URLs and buckets.#1917
Conversation
WalkthroughAdds an AWS S3 domain constant and a helper to resolve S3 endpoint URLs; makes S3 endpoint and region optional, computes endpoint, and exports S3 bucket and computed endpoint as environment variables and CLP properties in the Presto CLP init script. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
335ee20 to
2be8401
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tools/deployment/presto-clp/scripts/init.py`:
- Around line 281-283: The current call that assigns s3_endpoint_url using
_get_required_config_value makes endpoint_url mandatory and breaks AWS configs;
change this to first attempt to read an optional endpoint (use a non-failing
getter or check for presence on clp_config using the s3_config_key), and if
absent, derive the endpoint from the bucket and region_code (ensure region_code
is validated/required when deriving an AWS-style endpoint). Update the logic
around s3_endpoint_url, s3_config_key, clp_config and clp_config_file_path so
endpoint_url is used when present, otherwise construct the AWS endpoint from
bucket + region_code, keeping region_code required only for the derived path.
| s3_endpoint_url = _get_required_config_value( | ||
| clp_config, f"{s3_config_key}.endpoint_url", clp_config_file_path | ||
| ) |
There was a problem hiding this comment.
Isn't the endpoint URL optional?
I think this needs to be _get_config_value, with some handling for the case where s3_endpoint_url is NULL? For context s3_endpoint_url is currently always null when the endpoint is actually s3 if I'm remembering correctly.
More specifically, looks like the right thing to do here might be to either:
- Use the definition of s3_end_point value from the old version of the code here whenever
s3_endpoint_urlis null, or - Update
s3_utils.pyto cover this use-case -- I think we basically want what thegenerate_s3_urlfunction does except
a. When the endpoint is null/s3 we want all of the output except the part with the object key
b. Otherwise we want all of the output except the part with the bucket and object key?
There was a problem hiding this comment.
Good catch, following the logic in s3_endpoint_url, here is the implemented logic:
- required config: bucket
- optional config: region and endpoint_url
- when endpoint_url is empty, constructs corresponding aws endpoint pending the existence of region.
- when endpoint_url is not empty, use the user provided url directly
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tools/deployment/presto-clp/scripts/init.py`:
- Around line 314-315: When region_code is None (and endpoint_url is also None)
the code returns a region-less S3 URL (https://{bucket_name}.s3.{AWS_ENDPOINT}/)
which can be misleading for buckets outside us-east-1; add a warning log right
before this return: use the module logger (or import logging and
getLogger(__name__)) and call logger.warning indicating that no region/endpoint
was provided and the code is falling back to the legacy global S3 endpoint
(include bucket_name in the message), so callers are aware of possible
cross-region routing; leave the return value unchanged.
gibber9809
left a comment
There was a problem hiding this comment.
PR now LGTM. For PR title how about:
feat(presto-clp): Add support for custom S3 endpoint URLs and buckets.
feat because it's adding new functionality to the presto integration, and removing the bit at the end of the original title because I think that the tag presto-clp is sufficient to tell us where the functionality is being added.
junhaoliao
left a comment
There was a problem hiding this comment.
the code looks correct to me
cc @hoophalab for awareness
| if endpoint_url is not None: | ||
| return endpoint_url | ||
|
|
||
| if region_code is None: |
There was a problem hiding this comment.
I understand this is needed to support non-AWS S3 endpoints. However, omitting a region is non-standard for AWS S3. Historically, some AWS SDKs would fall back to us-east-1 when no region was specified, but this behavior should not be relied upon.
Should we add a log message when neither endpoint_url nor region_code is provided (here)?
There was a problem hiding this comment.
e.g.
Neither 'endpoint_url' nor 'region_code' specified in S3 config. Using region-less S3 endpoint which may not work for all buckets.
There was a problem hiding this comment.
As suggested by @gibber9809 , this behavior is followed by generate_s3_url in s3_utils.py so that CLP package always make consistent assumption.
I am okay with adding the log message when neither 'endpoint_url' nor 'region_code' is specified in S3 config as a warnning. But should we keep the same behavior as the changes in this PR?
And thanks for the insight on this S3 support!
There was a problem hiding this comment.
Sorry for slow response. Yeah logging a warning in this case sounds reasonable. It probably also makes sense to log the same warning during the package startup whenever 'endpoint_url' and 'region_code' are missing from the package config for the same reason.
There was a problem hiding this comment.
I would be fine tackling this in a separate PR though that adds the warning both here and in the package startup script.
There was a problem hiding this comment.
@coderabbitai we will merge the current PR first and add the warning messages later. let's create an issue about adding a warning message both here and in the package startup script
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Create a GitHub issue for adding warning messages about missing S3 endpoint_url and region_code
gh issue create \
--repo y-scope/clp \
--title "Add warning when S3 config lacks both endpoint_url and region_code" \
--body "## Context
When neither \`endpoint_url\` nor \`region_code\` is specified in the S3 configuration, the code constructs a region-less S3 endpoint (e.g., \`https://{bucket}.s3.amazonaws.com\`). This behavior is non-standard for AWS S3 and may not work for all buckets, as AWS typically requires a region to be specified.
This behavior was implemented to maintain consistency with [\`generate_s3_url\`](https://github.com/y-scope/clp/blob/01d35a672c2f6fb892912121800b7a85f7448b23/components/clp-py-utils/clp_py_utils/s3_utils.py#L257) in \`s3_utils.py\`.
## Task
Add warning log messages in the following two locations when neither \`endpoint_url\` nor \`region_code\` is provided:
1. **Presto init script**: \`tools/deployment/presto-clp/scripts/init.py\` in the \`_resolve_s3_endpoint_url\` function
2. **Package startup script**: Where S3 configuration is processed during CLP package startup
## Suggested warning message
\`\`\`
Neither 'endpoint_url' nor 'region_code' specified in S3 config. Using region-less S3 endpoint which may not work for all buckets.
\`\`\`
## References
- PR `#1917`: https://github.com/y-scope/clp/pull/1917
- Discussion: https://github.com/y-scope/clp/pull/1917#discussion_r2734254388
- Requested by: `@junhaoliao`" \
--assignee junhaoliao
Length of output: 42
@junhaoliao I've created issue #1932 to track adding the warning messages for S3 configurations that lack both endpoint_url and region_code. The issue has been assigned to you and includes references back to this PR and discussion.
🧬 Code Graph Analysis Results
<no_relevant_code_snippets>
</no_relevant_code_snippets>
🧠 Learnings used
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR `#1767`, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1501
File: tools/deployment/presto-clp/scripts/init.py:10-13
Timestamp: 2025-10-27T07:07:37.901Z
Learning: In `tools/deployment/presto-clp/scripts/init.py`, the `DATABASE_COMPONENT_NAME` and `DATABASE_DEFAULT_PORT` constants are intentionally duplicated from `clp_py_utils.clp_config` because `clp_py_utils` is not installed in the Presto init script's runtime environment. The two flows are separate and this duplication is documented. There are plans to merge these flows after a future release.
Co-authored-by: Junhao Liao <junhao@junhao.ca>
Co-authored-by: Junhao Liao <junhao@junhao.ca>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tools/deployment/presto-clp/scripts/init.py`:
- Around line 311-312: The current check returns endpoint_url when it's not
None, which still allows an empty string; update the guard to reject
empty/whitespace values—e.g., replace "if endpoint_url is not None: return
endpoint_url" with a truthiness check like "if endpoint_url and
endpoint_url.strip(): return endpoint_url" (or validate URL format) so the
function/logic using the endpoint_url variable does not return an invalid empty
endpoint.
Co-authored-by: Junhao Liao <junhao@junhao.ca>
junhaoliao
left a comment
There was a problem hiding this comment.
Let's wait for @gibber9809 's response before merging: #1917 (comment)
|
Let's merge this PR until this PR is merged. It is pending on Velox CI reviewing for publishing the right image to CLP package. |
y-scope#1917) Co-authored-by: Junhao Liao <junhao@junhao.ca> Co-authored-by: Junhao Liao <junhao.liao@yscope.com>
Description
Summary
Previously, the config script always constructed the S3 endpoint as
https://{bucket}.s3.{region}.amazonaws.com/, which only works for AWS S3. This change allows configuring S3-compatible storage (e.g., MinIO) by reading theendpoint_urlandbucketfield fromclp-config.yaml.Checklist
breaking change.
Validation performed
SELECT CLP_GET_JSON_STRING() from clp.default.default limit 100CLP package's clp-config.yaml:
Velox's clp.properties:
Summary by CodeRabbit
New Features
Bug Fixes
Chores
✏️ Tip: You can customize this high-level summary in your review settings.