Skip to content

fix(presto-clp): Fix Presto S3 queries returning zero rows by updating worker image and correcting S3 endpoint URL format (fixes #2107, fixes #2108).#2109

Merged
junhaoliao merged 7 commits into
y-scope:mainfrom
junhaoliao:presto-s3
Mar 23, 2026

Conversation

@junhaoliao

@junhaoliao junhaoliao commented Mar 19, 2026

Copy link
Copy Markdown
Member

Description

Presto queries against S3-backed CLP archives silently return zero rows due to two issues:

  1. Missing CA certificates in the worker image (#2107): The Prestissimo worker
    image clp-v0.10.0 was built without ca-certificates and tzdata after an upstream merge
    (y-scope/presto#147) accidentally dropped the
    packages originally added in
    y-scope/presto#55. This causes silent TLS
    verification failures against presigned S3 URLs. The Dockerfile fix has been merged upstream
    (y-scope/presto#154) and a corrected image
    published as clp-v0.10.0-fix.1.

  2. Virtual-hosted style S3 endpoint URL: _resolve_s3_endpoint_url() in init.py constructs
    URLs like https://<bucket>.s3.<region>.amazonaws.com, but the Prestissimo worker expects
    path-style URLs (https://s3.<region>.amazonaws.com). Since the bucket name is passed
    separately via clp.s3-bucket, the worker ends up doubling the bucket name in the request
    path, causing silent 404s. This bug has been present since S3 support was first added in
    de378af0 and carried through the refactor in feat(presto-clp): Add support for custom S3 endpoint URLs and buckets. #1917.

This PR:

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Scenario 1: Verify CA certificates in new vs old worker image

Task: Confirm that the clp-v0.10.0 image is missing CA certificates and that
clp-v0.10.0-fix.1 includes them.

Command:

echo "=== NEW IMAGE (clp-v0.10.0-fix.1) ==="
docker run --rm --entrypoint="" \
  ghcr.io/y-scope/presto/prestissimo-worker:clp-v0.10.0-fix.1 \
  ls /etc/ssl/certs/ca-certificates.crt

echo ""
echo "=== OLD IMAGE (clp-v0.10.0) ==="
docker run --rm --entrypoint="" \
  ghcr.io/y-scope/presto/prestissimo-worker:clp-v0.10.0 \
  ls /etc/ssl/certs/ca-certificates.crt

Output:

=== NEW IMAGE (clp-v0.10.0-fix.1) ===
/etc/ssl/certs/ca-certificates.crt

=== OLD IMAGE (clp-v0.10.0) ===
ls: cannot access '/etc/ssl/certs/ca-certificates.crt': No such file or directory

Explanation: The old image is missing /etc/ssl/certs/ca-certificates.crt, which causes the
Prestissimo worker's bundled libcurl to silently fail TLS verification when accessing S3 presigned
URLs. The new image includes the certificate bundle.

Scenario 2: Verify S3 endpoint URL fix

Task: Confirm that init.py now generates path-style S3 endpoint URLs instead of
virtual-hosted style.

Command:

cd tools/deployment/presto-clp
scripts/set-up-config.sh /path/to/clp-package
grep S3_END_POINT .env

Output:

PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT=https://s3.us-east-2.amazonaws.com

Explanation: The endpoint is now path-style (https://s3.<region>.amazonaws.com) instead of
virtual-hosted style (https://<bucket>.s3.<region>.amazonaws.com). This matches the Helm chart's
configmap.yaml and prevents the Prestissimo worker from doubling the bucket name in S3 requests.

Scenario 3: End-to-end query via Presto CLI

Task: Verify that Presto queries return data when archives are stored on S3.

Command:

cd build/clp-package
./sbin/start-clp.sh
./sbin/compress.sh --timestamp-key timestamp ~/samples/postgresql.jsonl

Output:

2026-03-19T05:31:54.037 INFO [compress] Compression job 3 submitted.
2026-03-19T05:31:58.043 INFO [compress] Compression finished.
2026-03-19T05:31:58.043 INFO [compress] Compressed 385.21MB into 10.06MB (38.31x). Speed: 101.42MB/s.

Then start Presto and query:

Command:

cd tools/deployment/presto-clp
docker compose up -d
# Wait for worker to register with coordinator (~20s)
docker exec presto-clp-presto-coordinator-1 \
  presto-cli --catalog clp --schema default \
  --execute 'SELECT count(*) FROM "default"'

Output:

"2000000"

Explanation: The query returns 2,000,000 rows from the compressed PostgreSQL sample dataset,
confirming that the worker can successfully read archives from S3 with both the CA certificate fix
and the path-style endpoint fix applied.

Scenario 4: End-to-end query via CLP Web UI

Task: Verify that the CLP Web UI can execute Presto SQL queries and display results.

Steps:

  1. Open http://localhost:4000 in a browser.
  2. Click Search.
  3. Leave the default query (SELECT * FROM default) and click Run.

The Web UI shows "Success - search job ... found 2000000 results" with a timestamp histogram
and a data table displaying the PostgreSQL log fields (backend_type, error_severity, ps,
query_id, etc.), confirming the full pipeline works end-to-end.

Scenario 5: Custom S3 endpoint (MinIO) still works

Task: Verify that the endpoint URL fix does not break the custom endpoint_url path used for
S3-compatible storage like MinIO.

Setup:

  1. Start a local MinIO instance:
    docker run -d --name minio-test -p 9000:9000 \
      -e MINIO_ROOT_USER=minioadmin -e MINIO_ROOT_PASSWORD=minioadmin \
      minio/minio server /data
  2. Create a bucket:
    docker run --rm --network host --entrypoint sh minio/mc -c \
      "mc alias set local http://localhost:9000 minioadmin minioadmin && mc mb local/test-logs"
  3. Configure clp-config.yaml with MinIO as the archive storage:
    archive_output:
      storage:
        type: "s3"
        s3_config:
          aws_authentication:
            type: "credentials"
            credentials:
              access_key_id: "minioadmin"
              secret_access_key: "minioadmin"
          endpoint_url: "http://172.17.0.1:9000"
          bucket: "test-logs"
          key_prefix: "archives/"
  4. Start CLP, compress data, regenerate Presto config, and start Presto.

Command:

grep "S3_END_POINT\|S3_BUCKET" tools/deployment/presto-clp/.env

Output:

PRESTO_WORKER_CLPPROPERTIES_S3_BUCKET=test-logs
PRESTO_WORKER_CLPPROPERTIES_S3_END_POINT=http://172.17.0.1:9000

Explanation: When a custom endpoint_url is configured, _resolve_s3_endpoint_url() returns
it directly (the early-return on line 312), so the fix to the fallback path does not affect this
code path.

Command:

docker exec presto-clp-presto-coordinator-1 \
  presto-cli --catalog clp --schema default \
  --execute 'SELECT count(*) FROM "default"'

Output:

"1000000"

Command:

docker exec presto-clp-presto-coordinator-1 \
  presto-cli --catalog clp --schema default \
  --execute 'SELECT * FROM "default" LIMIT 3'

Output:

"client backend","DEBUG","startup","0","7","64211afd.1e8d","0","7821","3/1","CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 0/1/0","2023-03-27 00:26:37","[local]","psql","template1","","","","","","","postgres","2023-03-27 00:26:37.134"
"client backend","LOG","idle","0","9","64211afd.1e8d","0","7821","3/2","statement: ","2023-03-27 00:26:37","[local]","psql","template1","","","","","","","postgres","2023-03-27 00:26:37.134"
"client backend","LOG","idle","0","11","64211afd.1e8d","0","7821","3/0","duration: 0.066 ms","2023-03-27 00:26:37","[local]","psql","template1","","","","","","","postgres","2023-03-27 00:26:37.134"

Explanation: Queries against MinIO-backed archives return 1,000,000 rows with correct data,
confirming that the custom endpoint path is unaffected by the fix.

Summary by CodeRabbit

  • Bug Fixes

    • Updated Presto worker container image to a patched tag.
    • Changed S3 endpoint URL fallback behavior so generated endpoints no longer include the bucket name and follow region/no-region AWS patterns.
  • Chores

    • Adjusted deployment configuration and initialization scripts to reflect the updated image tag and endpoint resolution logic.

…g worker image and correcting S3 endpoint URL format (fixes y-scope#2107, fixes y-scope#2108).
@coderabbitai

coderabbitai Bot commented Mar 19, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9930759c-3714-471e-99d8-a3b137482385

📥 Commits

Reviewing files that changed from the base of the PR and between d28a6bc and fecb6d3.

📒 Files selected for processing (1)
  • tools/deployment/presto-clp/docker-compose.yaml

Walkthrough

Presto worker image default tag updated in the CLP docker-compose file. S3 endpoint resolution logic in the CLP init script was changed to remove bucket-based hostname derivation and the helper signature was adjusted accordingly.

Changes

Cohort / File(s) Summary
Container Image Update
tools/deployment/presto-clp/docker-compose.yaml
Default presto-worker image tag changed from clp-v0.10.0 to clp-v0.10.0-fix.1; image string formatting changed to a folded multi-line YAML value.
S3 Endpoint Resolution Refactor
tools/deployment/presto-clp/scripts/init.py
_resolve_s3_endpoint_url signature updated to remove s3_bucket parameter; endpoint construction no longer uses bucket name and now returns either the provided custom endpoint (trimmed) or https://s3.<AWS_S3_DOMAIN> / https://s3.<region_code>.<AWS_S3_DOMAIN> depending on region presence.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and accurately summarizes the two main changes: updating the Presto worker image tag and fixing the S3 endpoint URL format to resolve the issue of queries returning zero rows.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@junhaoliao junhaoliao marked this pull request as ready for review March 19, 2026 06:05
@junhaoliao junhaoliao requested a review from a team as a code owner March 19, 2026 06:05

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tools/deployment/presto-clp/docker-compose.yaml`:
- Line 27: The image line exceeds the 100-column lint limit; update the image
declaration for prestissimo-worker by wrapping the long string so it doesn't
exceed 100 columns (modify the line containing image:
"ghcr.io/y-scope/presto/prestissimo-worker:${CLP_PRESTO_WORKER_IMAGE_TAG:-clp-v0.10.0-fix.1}").
Use a YAML-safe multiline approach (e.g., a folded block scalar or split the
value across lines) or shorten the default tag via the
CLP_PRESTO_WORKER_IMAGE_TAG default to keep the final rendered string within 100
columns while preserving the same image and variable substitution behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: af500ec1-6c39-4b88-80d3-c1916752b98b

📥 Commits

Reviewing files that changed from the base of the PR and between a43dbd1 and d28a6bc.

📒 Files selected for processing (2)
  • tools/deployment/presto-clp/docker-compose.yaml
  • tools/deployment/presto-clp/scripts/init.py

Comment thread tools/deployment/presto-clp/docker-compose.yaml Outdated
@junhaoliao junhaoliao added this to the March 2026 milestone Mar 20, 2026
@junhaoliao junhaoliao merged commit 0fbca5b into y-scope:main Mar 23, 2026
21 checks passed
@junhaoliao junhaoliao deleted the presto-s3 branch March 23, 2026 14:44
junhaoliao added a commit to junhaoliao/clp that referenced this pull request May 17, 2026
…g worker image and correcting S3 endpoint URL format (fixes y-scope#2107, fixes y-scope#2108). (y-scope#2109)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants