Skip to content

feat(presto-clp): Add support for custom S3 endpoint URLs and buckets.#1917

Merged
20001020ycx merged 16 commits into
y-scope:mainfrom
20001020ycx:2026-01-27-update-velox-conf
Feb 20, 2026
Merged

feat(presto-clp): Add support for custom S3 endpoint URLs and buckets.#1917
20001020ycx merged 16 commits into
y-scope:mainfrom
20001020ycx:2026-01-27-update-velox-conf

Conversation

@20001020ycx

@20001020ycx 20001020ycx commented Jan 27, 2026

Copy link
Copy Markdown
Contributor

Description

Summary

Previously, the config script always constructed the S3 endpoint as https://{bucket}.s3.{region}.amazonaws.com/, which only works for AWS S3. This change allows configuring S3-compatible storage (e.g., MinIO) by reading the endpoint_url and bucket field from clp-config.yaml.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • End to end test
    • Query: SELECT CLP_GET_JSON_STRING() from clp.default.default limit 100
    • Set up: official clp package v0.8.0, OSS version of Presto, velox with changes in this PR, and a locally set minio. The below velox clp.properties is the one generated using the script updated in this PR.
    • The result is returned as expected.

CLP package's clp-config.yaml:

package:
  storage_engine: "clp-s"
  query_engine: "presto"

archive_output:
  storage:
    # type: "fs"
    type: "s3"
    staging_directory: "var/data/staged-archives"
    s3_config:
      region_code: "us-east-1"
      endpoint_url: "http://172.26.105.44:9000"
      bucket: "logs"
      key_prefix: "archives/"
      aws_authentication:
        type: "credentials"
        credentials:
          access_key_id: "minioadmin"
          secret_access_key: "minioadmin"

Velox's clp.properties:

  connector.name=clp                                                                                                                                                                                                             
  clp.storage-type=s3                                                                                                                                                                                                            
  clp.s3-auth-provider=clp_package                                                                                                                                                                                               
  clp.s3-end-point=http://172.26.105.44:9000                                                                                                                                                                                     
  clp.s3-bucket=logs                                                                                                                                                                                                
  clp.s3-access-key-id=minioadmin                                                                                                                                                                                                
  clp.s3-secret-access-key=minioadmin   

Summary by CodeRabbit

  • New Features

    • Support for configuring a custom S3 endpoint for worker properties, with automatic fallback when none is provided.
    • S3 bucket is now exposed as an environment variable for worker configuration.
  • Bug Fixes

    • Improved S3 endpoint resolution to avoid unreliable region-only derivation across regions/providers.
  • Chores

    • Deployment initialization updated to align with the new S3 configuration approach.

✏️ Tip: You can customize this high-level summary in your review settings.

@20001020ycx 20001020ycx requested a review from a team as a code owner January 27, 2026 19:40
@coderabbitai

coderabbitai Bot commented Jan 27, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

Adds an AWS S3 domain constant and a helper to resolve S3 endpoint URLs; makes S3 endpoint and region optional, computes endpoint, and exports S3 bucket and computed endpoint as environment variables and CLP properties in the Presto CLP init script.

Changes

Cohort / File(s) Summary
S3 configuration and env mapping
tools/deployment/presto-clp/scripts/init.py
Added AWS_S3_DOMAIN = "amazonaws.com". Introduced `_resolve_s3_endpoint_url(endpoint_url: str

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately reflects the main change: adding support for custom S3 endpoint URLs and bucket configuration in the Presto-CLP deployment script.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@20001020ycx 20001020ycx force-pushed the 2026-01-27-update-velox-conf branch from 335ee20 to 2be8401 Compare January 27, 2026 19:42
@20001020ycx 20001020ycx changed the title feat(presto-clp): Support custom S3 endpoint URL and bucket in Velox configuration script. build(presto-clp): Support custom S3 endpoint URL and bucket in Velox configuration script. Jan 27, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tools/deployment/presto-clp/scripts/init.py`:
- Around line 281-283: The current call that assigns s3_endpoint_url using
_get_required_config_value makes endpoint_url mandatory and breaks AWS configs;
change this to first attempt to read an optional endpoint (use a non-failing
getter or check for presence on clp_config using the s3_config_key), and if
absent, derive the endpoint from the bucket and region_code (ensure region_code
is validated/required when deriving an AWS-style endpoint). Update the logic
around s3_endpoint_url, s3_config_key, clp_config and clp_config_file_path so
endpoint_url is used when present, otherwise construct the AWS endpoint from
bucket + region_code, keeping region_code required only for the derived path.

Comment thread tools/deployment/presto-clp/scripts/init.py Outdated
Comment on lines +284 to +286
s3_endpoint_url = _get_required_config_value(
clp_config, f"{s3_config_key}.endpoint_url", clp_config_file_path
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the endpoint URL optional?

I think this needs to be _get_config_value, with some handling for the case where s3_endpoint_url is NULL? For context s3_endpoint_url is currently always null when the endpoint is actually s3 if I'm remembering correctly.

More specifically, looks like the right thing to do here might be to either:

  1. Use the definition of s3_end_point value from the old version of the code here whenever s3_endpoint_url is null, or
  2. Update s3_utils.py to cover this use-case -- I think we basically want what the generate_s3_url function does except
    a. When the endpoint is null/s3 we want all of the output except the part with the object key
    b. Otherwise we want all of the output except the part with the bucket and object key?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, following the logic in s3_endpoint_url, here is the implemented logic:

  • required config: bucket
  • optional config: region and endpoint_url
  • when endpoint_url is empty, constructs corresponding aws endpoint pending the existence of region.
  • when endpoint_url is not empty, use the user provided url directly

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tools/deployment/presto-clp/scripts/init.py`:
- Around line 314-315: When region_code is None (and endpoint_url is also None)
the code returns a region-less S3 URL (https://{bucket_name}.s3.{AWS_ENDPOINT}/)
which can be misleading for buckets outside us-east-1; add a warning log right
before this return: use the module logger (or import logging and
getLogger(__name__)) and call logger.warning indicating that no region/endpoint
was provided and the code is falling back to the legacy global S3 endpoint
(include bucket_name in the message), so callers are aware of possible
cross-region routing; leave the return value unchanged.

Comment thread tools/deployment/presto-clp/scripts/init.py Outdated
gibber9809
gibber9809 previously approved these changes Jan 27, 2026

@gibber9809 gibber9809 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR now LGTM. For PR title how about:

feat(presto-clp): Add support for custom S3 endpoint URLs and buckets.

feat because it's adding new functionality to the presto integration, and removing the bit at the end of the original title because I think that the tag presto-clp is sufficient to tell us where the functionality is being added.

@20001020ycx 20001020ycx changed the title build(presto-clp): Support custom S3 endpoint URL and bucket in Velox configuration script. feat(presto-clp): Add support for custom S3 endpoint URLs and buckets. Jan 27, 2026
@junhaoliao junhaoliao self-requested a review January 27, 2026 23:10
@junhaoliao junhaoliao added this to the January 2026 milestone Jan 27, 2026
@junhaoliao junhaoliao requested a review from hoophalab January 27, 2026 23:30

@junhaoliao junhaoliao left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code looks correct to me

cc @hoophalab for awareness

if endpoint_url is not None:
return endpoint_url

if region_code is None:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand this is needed to support non-AWS S3 endpoints. However, omitting a region is non-standard for AWS S3. Historically, some AWS SDKs would fall back to us-east-1 when no region was specified, but this behavior should not be relied upon.

Should we add a log message when neither endpoint_url nor region_code is provided (here)?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g.
Neither 'endpoint_url' nor 'region_code' specified in S3 config. Using region-less S3 endpoint which may not work for all buckets.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As suggested by @gibber9809 , this behavior is followed by generate_s3_url in s3_utils.py so that CLP package always make consistent assumption.

I am okay with adding the log message when neither 'endpoint_url' nor 'region_code' is specified in S3 config as a warnning. But should we keep the same behavior as the changes in this PR?

And thanks for the insight on this S3 support!

@junhaoliao junhaoliao Jan 28, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see what @gibber9809 decides on

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for slow response. Yeah logging a warning in this case sounds reasonable. It probably also makes sense to log the same warning during the package startup whenever 'endpoint_url' and 'region_code' are missing from the package config for the same reason.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be fine tackling this in a separate PR though that adds the warning both here and in the package startup script.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@coderabbitai we will merge the current PR first and add the warning messages later. let's create an issue about adding a warning message both here and in the package startup script

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Create a GitHub issue for adding warning messages about missing S3 endpoint_url and region_code

gh issue create \
  --repo y-scope/clp \
  --title "Add warning when S3 config lacks both endpoint_url and region_code" \
  --body "## Context

When neither \`endpoint_url\` nor \`region_code\` is specified in the S3 configuration, the code constructs a region-less S3 endpoint (e.g., \`https://{bucket}.s3.amazonaws.com\`). This behavior is non-standard for AWS S3 and may not work for all buckets, as AWS typically requires a region to be specified.

This behavior was implemented to maintain consistency with [\`generate_s3_url\`](https://github.com/y-scope/clp/blob/01d35a672c2f6fb892912121800b7a85f7448b23/components/clp-py-utils/clp_py_utils/s3_utils.py#L257) in \`s3_utils.py\`.

## Task

Add warning log messages in the following two locations when neither \`endpoint_url\` nor \`region_code\` is provided:

1. **Presto init script**: \`tools/deployment/presto-clp/scripts/init.py\` in the \`_resolve_s3_endpoint_url\` function
2. **Package startup script**: Where S3 configuration is processed during CLP package startup

## Suggested warning message

\`\`\`
Neither 'endpoint_url' nor 'region_code' specified in S3 config. Using region-less S3 endpoint which may not work for all buckets.
\`\`\`

## References

- PR `#1917`: https://github.com/y-scope/clp/pull/1917
- Discussion: https://github.com/y-scope/clp/pull/1917#discussion_r2734254388
- Requested by: `@junhaoliao`" \
  --assignee junhaoliao

Length of output: 42


@junhaoliao I've created issue #1932 to track adding the warning messages for S3 configurations that lack both endpoint_url and region_code. The issue has been assigned to you and includes references back to this PR and discussion.

#1932


🧬 Code Graph Analysis Results

<no_relevant_code_snippets>

</no_relevant_code_snippets>


🧠 Learnings used
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR `#1767`, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.

Learnt from: haiqi96
Repo: y-scope/clp PR: 852
File: components/clp-package-utils/clp_package_utils/scripts/native/compress.py:151-160
Timestamp: 2025-04-25T20:46:20.140Z
Learning: For S3 URLs without region specifications (legacy global endpoints), either assign a default region (us-east-1) or throw a clear error message requiring region specification in the URL. This addresses validation issues in components like S3InputConfig that require a non-nullable region string.

Learnt from: junhaoliao
Repo: y-scope/clp PR: 1501
File: tools/deployment/presto-clp/scripts/init.py:10-13
Timestamp: 2025-10-27T07:07:37.901Z
Learning: In `tools/deployment/presto-clp/scripts/init.py`, the `DATABASE_COMPONENT_NAME` and `DATABASE_DEFAULT_PORT` constants are intentionally duplicated from `clp_py_utils.clp_config` because `clp_py_utils` is not installed in the Presto init script's runtime environment. The two flows are separate and this duplication is documented. There are plans to merge these flows after a future release.

Comment thread tools/deployment/presto-clp/scripts/init.py Outdated
Comment thread tools/deployment/presto-clp/scripts/init.py Outdated
Comment thread tools/deployment/presto-clp/scripts/init.py Outdated
Comment thread tools/deployment/presto-clp/scripts/init.py Outdated
Comment thread tools/deployment/presto-clp/scripts/init.py Outdated
Co-authored-by: Junhao Liao <junhao@junhao.ca>
Co-authored-by: Junhao Liao <junhao@junhao.ca>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tools/deployment/presto-clp/scripts/init.py`:
- Around line 311-312: The current check returns endpoint_url when it's not
None, which still allows an empty string; update the guard to reject
empty/whitespace values—e.g., replace "if endpoint_url is not None: return
endpoint_url" with a truthiness check like "if endpoint_url and
endpoint_url.strip(): return endpoint_url" (or validate URL format) so the
function/logic using the endpoint_url variable does not return an invalid empty
endpoint.

Comment thread tools/deployment/presto-clp/scripts/init.py Outdated
Comment thread tools/deployment/presto-clp/scripts/init.py Outdated
junhaoliao
junhaoliao previously approved these changes Jan 28, 2026
Co-authored-by: Junhao Liao <junhao@junhao.ca>

@junhaoliao junhaoliao left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's wait for @gibber9809 's response before merging: #1917 (comment)

@junhaoliao junhaoliao requested a review from gibber9809 January 28, 2026 08:48
@20001020ycx

Copy link
Copy Markdown
Contributor Author

Let's merge this PR until this PR is merged. It is pending on Velox CI reviewing for publishing the right image to CLP package.

@20001020ycx 20001020ycx merged commit f9acdba into y-scope:main Feb 20, 2026
22 checks passed
junhaoliao added a commit to junhaoliao/clp that referenced this pull request May 17, 2026
y-scope#1917)

Co-authored-by: Junhao Liao <junhao@junhao.ca>
Co-authored-by: Junhao Liao <junhao.liao@yscope.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants