
S3 queries return zero rows in Docker Compose Presto deployment due to doubled bucket name in endpoint URL #2108

@junhaoliao

Bug

The _resolve_s3_endpoint_url() function in tools/deployment/presto-clp/scripts/init.py
constructs S3 endpoint URLs using virtual-hosted style (e.g.,
https://<bucket>.s3.<region>.amazonaws.com), but when clp.s3-bucket is also set as a separate
property, the Prestissimo worker's CLP connector doubles the bucket name in the final URL.

In y-scope/velox's ClpPackageS3AuthProvider::constructS3Url(), the
connector constructs URLs as follows:

if (bucket_.empty()) {
  return fmt::format("{}/{}", endPoint_, splitPath);
}
return fmt::format("{}/{}/{}", endPoint_, bucket_, splitPath);

When clp.s3-bucket is set (non-empty), the connector takes the second branch and prepends the
bucket to the path. With a virtual-hosted endpoint, this produces a malformed URL like:
https://<bucket>.s3.<region>.amazonaws.com/<bucket>/archives/default/<archive-id> — the bucket
name appears twice, causing silent 404s from S3. Queries complete with no error but return zero
rows.
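The doubling can be reproduced outside the connector. The following is a minimal Python sketch that mirrors the two branches of the C++ snippet quoted above; it is not the connector's code, and the bucket name, region, and archive ID are placeholder values chosen only for illustration.

```python
def construct_s3_url(endpoint: str, bucket: str, split_path: str) -> str:
    """Python mirror of the branch logic quoted from constructS3Url()."""
    if not bucket:
        return f"{endpoint}/{split_path}"
    return f"{endpoint}/{bucket}/{split_path}"


# Placeholder values (assumptions for illustration only).
bucket = "my-bucket"
region = "us-east-1"
split_path = "archives/default/archive-0"

virtual_hosted = f"https://{bucket}.s3.{region}.amazonaws.com"
path_style = f"https://s3.{region}.amazonaws.com"

# Virtual-hosted endpoint plus a non-empty bucket: the bucket appears twice.
doubled_url = construct_s3_url(virtual_hosted, bucket, split_path)
# Path-style endpoint plus a non-empty bucket: the bucket appears once.
correct_url = construct_s3_url(path_style, bucket, split_path)
# Virtual-hosted endpoint with an empty bucket (the pre-regression behaviour).
legacy_url = construct_s3_url(virtual_hosted, "", split_path)

print(doubled_url)
print(correct_url)
print(legacy_url)
```

Only the first combination yields a URL with the bucket in both the hostname and the path.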

This bug was introduced by two companion PRs that added clp.s3-bucket support for
S3-compatible storage (e.g., MinIO):

  • y-scope/velox#48 (6fabc5e7) added the
    clp.s3-bucket config property to the Prestissimo CLP connector and changed
    constructS3Url() from {endpoint}/{splitPath} to {endpoint}/{bucket}/{splitPath} when the
    bucket is set.
  • y-scope/clp#1917 (f9acdbaa) was the companion
    change on the CLP side — it added clp.s3-bucket to the generated worker properties and
    introduced the _resolve_s3_endpoint_url() helper. However, the helper's fallback (used when
    no custom endpoint_url is configured) kept the original virtual-hosted format
    (https://{bucket}.s3.{region}.amazonaws.com).

Before these PRs, there was no clp.s3-bucket property. The original code
(de378af0) only set clp.s3-end-point to a
virtual-hosted URL (https://{bucket}.s3.{region}.amazonaws.com/), and the connector's
constructS3Url() was simply {endpoint}/{splitPath} — the bucket was part of the hostname, so
the URL resolved correctly.

After the two PRs, clp.s3-bucket is set, so the connector takes the new
{endpoint}/{bucket}/{splitPath} branch. But because the endpoint still contains the bucket in
the hostname, the final URL has the bucket name twice. Both PRs were validated only against MinIO
(which uses a custom endpoint_url, bypassing the buggy fallback path in
_resolve_s3_endpoint_url()), so the regression went undetected for AWS S3.

Expected: When clp.s3-bucket is set, _resolve_s3_endpoint_url() should produce path-style
URLs (https://s3.<region>.amazonaws.com) so the connector can correctly construct
{endpoint}/{bucket}/{path}.
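One possible shape for the expected behaviour is sketched below. This is not the actual helper from init.py: the function name, parameters, and return type are hypothetical, since the issue does not show the real signature. The only behaviour taken from the description is that a custom endpoint_url takes precedence and that the fallback should be path-style.

```python
from typing import Optional


def resolve_s3_endpoint_url(region_code: str, endpoint_url: Optional[str] = None) -> str:
    """
    Hypothetical stand-in for _resolve_s3_endpoint_url().

    :param region_code: AWS region code, e.g. "us-east-1" (placeholder).
    :param endpoint_url: Custom endpoint for S3-compatible storage such as MinIO.
    :return: An endpoint without the bucket in the hostname, so the connector
        can append "/{bucket}/{path}" itself.
    """
    if endpoint_url:
        # Custom endpoints (e.g. MinIO) are passed through unchanged.
        return endpoint_url
    # Path-style fallback: the bucket is deliberately left out of the hostname.
    return f"https://s3.{region_code}.amazonaws.com"


aws_endpoint = resolve_s3_endpoint_url("us-east-1")
minio_endpoint = resolve_s3_endpoint_url("us-east-1", "http://minio:9000")
print(aws_endpoint)
print(minio_endpoint)
```

The MinIO address above is an illustrative value, not one taken from the deployment.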

CLP version

v0.10.0 (and the current main branch as of commit f9acdbaa)

Environment

Any environment using S3-backed storage for CLP archives with the Docker Compose Presto deployment
(tools/deployment/presto-clp/). Reproducible on Linux (Ubuntu 22.04, Docker v28).

Reproduction steps

  1. Configure CLP with S3-backed archive storage in etc/clp-config.yaml (using
    archive_output.storage.type: "s3" with an AWS region code and no custom endpoint_url).
  2. Start CLP and compress logs so archives are stored on S3.
  3. Run scripts/set-up-config.sh <clp-package-dir> to generate the Presto config.
  4. Inspect the generated .env file:
    grep S3_END_POINT tools/deployment/presto-clp/.env
    Actual: PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT=https://<bucket>.s3.<region>.amazonaws.com
    Expected: PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT=https://s3.<region>.amazonaws.com
  5. Start Presto with docker compose up and run a query — the query returns zero rows.
  6. Manually correct the endpoint in .env to path-style and restart Presto — the query now
    returns the expected rows.
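The manual inspection in step 4 could also be scripted. The sketch below is a hypothetical check, not part of the CLP tooling: it scans the text of a .env file for the PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT entry and reports the bucket name if the endpoint hostname has the virtual-hosted form <bucket>.s3.<region>.amazonaws.com. The function name and the sample values are assumptions.

```python
import re
from typing import Optional
from urllib.parse import urlparse

ENDPOINT_KEY = "PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT"

# Matches hostnames of the form <bucket>.s3.<region>.amazonaws.com.
_VIRTUAL_HOSTED_HOST = re.compile(
    r"^(?P<bucket>.+)\.s3\.(?P<region>[a-z0-9-]+)\.amazonaws\.com$"
)


def find_virtual_hosted_bucket(env_text: str) -> Optional[str]:
    """
    Returns the bucket embedded in the endpoint hostname, or None if the
    endpoint is absent or is not a virtual-hosted AWS URL.
    """
    for line in env_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == ENDPOINT_KEY:
            host = urlparse(value.strip()).hostname or ""
            match = _VIRTUAL_HOSTED_HOST.match(host)
            return match.group("bucket") if match else None
    return None


# Sample .env contents with placeholder bucket and region values.
buggy_env = f"{ENDPOINT_KEY}=https://my-bucket.s3.us-east-1.amazonaws.com\n"
fixed_env = f"{ENDPOINT_KEY}=https://s3.us-east-1.amazonaws.com\n"

print(find_virtual_hosted_bucket(buggy_env))
print(find_virtual_hosted_bucket(fixed_env))
```

A non-None result indicates the endpoint still carries the bucket in its hostname and will be doubled once clp.s3-bucket is also set.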


Labels: bug
