Bug
The _resolve_s3_endpoint_url() function in tools/deployment/presto-clp/scripts/init.py
constructs S3 endpoint URLs using virtual-hosted style (e.g.,
https://<bucket>.s3.<region>.amazonaws.com), but when clp.s3-bucket is also set as a separate
property, the Prestissimo worker's CLP connector doubles the bucket name in the final URL.
In y-scope/velox's ClpPackageS3AuthProvider::constructS3Url(), the
connector constructs URLs as follows:
    if (bucket_.empty()) {
        return fmt::format("{}/{}", endPoint_, splitPath);
    }
    return fmt::format("{}/{}/{}", endPoint_, bucket_, splitPath);
When clp.s3-bucket is set (non-empty), the connector takes the second branch and prepends the
bucket to the path. With a virtual-hosted endpoint, this produces a malformed URL like:
https://<bucket>.s3.<region>.amazonaws.com/<bucket>/archives/default/<archive-id> — the bucket
name appears twice, causing silent 404s from S3. Queries complete with no error but return zero
rows.
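The doubling can be reproduced outside the worker with a short Python mirror of the two branches quoted above. This is only an illustrative sketch: `construct_s3_url` is a hypothetical stand-in for the C++ method, and the bucket, region, and archive ID are placeholder values.

```python
def construct_s3_url(endpoint: str, bucket: str, split_path: str) -> str:
    """Python mirror of the two branches in the connector snippet quoted above."""
    if not bucket:
        # Bucket unset: the endpoint is expected to identify the bucket already.
        return f"{endpoint}/{split_path}"
    # Bucket set: the connector prepends it to the path.
    return f"{endpoint}/{bucket}/{split_path}"


# Placeholder values, not taken from any real deployment.
bucket = "my-bucket"
region = "us-east-1"
split_path = "archives/default/archive-0001"

virtual_hosted = f"https://{bucket}.s3.{region}.amazonaws.com"
path_style = f"https://s3.{region}.amazonaws.com"

# Current behavior: virtual-hosted endpoint plus clp.s3-bucket doubles the bucket.
doubled_url = construct_s3_url(virtual_hosted, bucket, split_path)
# Expected behavior: path-style endpoint plus clp.s3-bucket.
correct_url = construct_s3_url(path_style, bucket, split_path)
# Behavior before the two PRs: virtual-hosted endpoint, no bucket property.
legacy_url = construct_s3_url(virtual_hosted, "", split_path)

print(doubled_url)
print(correct_url)
print(legacy_url)
```

Only the first URL contains the bucket twice; the other two address the same object in virtual-hosted and path-style form respectively.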
This bug was introduced by two companion PRs that added clp.s3-bucket support for
S3-compatible storage (e.g., MinIO):
- y-scope/velox#48 (6fabc5e7) added the clp.s3-bucket config property to the Prestissimo CLP
connector and changed constructS3Url() from {endpoint}/{splitPath} to
{endpoint}/{bucket}/{splitPath} when the bucket is set.
- y-scope/clp#1917 (f9acdbaa) was the companion change on the CLP side: it added clp.s3-bucket
to the generated worker properties and introduced the _resolve_s3_endpoint_url() helper.
However, the helper's fallback (used when no custom endpoint_url is configured) kept the
original virtual-hosted format (https://{bucket}.s3.{region}.amazonaws.com).
Before these PRs, there was no clp.s3-bucket property. The original code
(de378af0) only set clp.s3-end-point to a
virtual-hosted URL (https://{bucket}.s3.{region}.amazonaws.com/), and the connector's
constructS3Url() was simply {endpoint}/{splitPath} — the bucket was part of the hostname, so
the URL resolved correctly.
After the two PRs, clp.s3-bucket is set, so the connector takes the new
{endpoint}/{bucket}/{splitPath} branch. But because the endpoint still contains the bucket in
the hostname, the final URL has the bucket name twice. Both PRs were validated only against MinIO
(which uses a custom endpoint_url, bypassing the buggy fallback path in
_resolve_s3_endpoint_url()), so the regression went undetected for AWS S3.
Expected: When clp.s3-bucket is set, _resolve_s3_endpoint_url() should produce path-style
URLs (https://s3.<region>.amazonaws.com) so the connector can correctly construct
{endpoint}/{bucket}/{path}.
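One possible shape for the corrected fallback is sketched below. The function name, parameters, and return type are assumptions made for illustration; the real `_resolve_s3_endpoint_url()` in init.py may take different arguments (for example, a parsed config object).

```python
from typing import Optional


def resolve_s3_endpoint_url(endpoint_url: Optional[str], region_code: str) -> str:
    """
    Hypothetical stand-in for _resolve_s3_endpoint_url(); the real signature may differ.

    Returns an endpoint that does not embed the bucket, so a connector that appends
    /{bucket}/{path} produces a valid path-style URL.
    """
    if endpoint_url:
        # Custom endpoint (e.g., MinIO): pass it through unchanged.
        return endpoint_url
    # AWS fallback: path-style host, with no bucket in the hostname.
    return f"https://s3.{region_code}.amazonaws.com"


aws_endpoint = resolve_s3_endpoint_url(None, "us-east-1")
custom_endpoint = resolve_s3_endpoint_url("http://minio:9000", "us-east-1")
print(aws_endpoint)
print(custom_endpoint)
```

With this fallback the bucket reaches the connector only through clp.s3-bucket, which matches how the MinIO path already works.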
CLP version
v0.10.0 (and the current main branch as of commit f9acdbaa)
Environment
Any environment using S3-backed storage for CLP archives with the Docker Compose Presto deployment
(tools/deployment/presto-clp/). Reproducible on Linux (Ubuntu 22.04, Docker v28).
Reproduction steps
- Configure CLP with S3-backed archive storage in etc/clp-config.yaml (using
archive_output.storage.type: "s3" with an AWS region code and no custom endpoint_url).
- Start CLP and compress logs so archives are stored on S3.
- Run scripts/set-up-config.sh <clp-package-dir> to generate the Presto config.
- Inspect the generated .env file:
  grep S3_END_POINT tools/deployment/presto-clp/.env
  Actual: PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT=https://<bucket>.s3.<region>.amazonaws.com
  Expected: PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT=https://s3.<region>.amazonaws.com
- Start Presto with docker compose up and run a query; the query returns zero rows.
- Manually correct the endpoint in .env to path-style and restart Presto; the query now
returns the expected rows.
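The .env inspection step can also be automated. The sketch below is a hypothetical helper, not part of the CLP repository: it extracts the endpoint from .env-style text and flags AWS virtual-hosted hostnames. The regular expression is an assumption about what such hostnames look like and does not cover every AWS endpoint variant.

```python
import re
from typing import Optional

ENDPOINT_KEY = "PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT"

# Assumed pattern for AWS virtual-hosted endpoints: <bucket>.s3.<region>.amazonaws.com
VIRTUAL_HOSTED = re.compile(r"^https://[^/]+\.s3\.[^/.]+\.amazonaws\.com/?$")


def find_endpoint(env_text: str) -> Optional[str]:
    """Return the value of the S3 endpoint variable, or None if it is absent."""
    for line in env_text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key == ENDPOINT_KEY:
            return value
    return None


def is_virtual_hosted(endpoint: str) -> bool:
    """True if the endpoint hostname embeds a bucket ahead of the 's3.<region>' part."""
    return VIRTUAL_HOSTED.match(endpoint) is not None


# Sample .env contents with placeholder bucket and region values.
buggy_env = f"{ENDPOINT_KEY}=https://my-bucket.s3.us-east-1.amazonaws.com\n"
fixed_env = f"{ENDPOINT_KEY}=https://s3.us-east-1.amazonaws.com\n"

for name, text in (("buggy", buggy_env), ("fixed", fixed_env)):
    endpoint = find_endpoint(text)
    status = "virtual-hosted (bug)" if is_virtual_hosted(endpoint) else "path-style (ok)"
    print(f"{name}: {endpoint} -> {status}")
```

To check a real deployment, the text of tools/deployment/presto-clp/.env would be passed to `find_endpoint` in place of the sample strings.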