fix(plugin-delta): Fix delta file path resolution by preserving full S3 URI#26826

Closed
ShahimSharafudeen wants to merge 1 commit into prestodb:master from ShahimSharafudeen:delta_s3_file_path_fix

@ShahimSharafudeen ShahimSharafudeen commented Dec 18, 2025

Contributor

Description

Problem
Presto Delta queries were failing with "File does not exist" exceptions even though the Parquet files were present in S3. The cause was incorrect URI handling: URI.create(...).getPath() stripped the scheme (s3a) and the bucket name, so each split carried a bare path with no scheme or bucket, as the log below shows.

2025-12-15T23:17:00.845Z	INFO	Query-20251215_231653_00890_tacwd-102561	io.delta.kernel.internal.snapshot.SnapshotManager	s3a://xpeng-ibm/admin_system/admin_system_city: Took 0ms to construct the snapshot (loading protocol and metadata) for 0 .
2025-12-15T23:17:00.948Z	ERROR	SplitRunner-1-134	com.facebook.presto.execution.executor.TaskExecutor	Error processing Split 20251215_231653_00890_tacwd.1.0.0.0-0 com.facebook.presto.delta.DeltaSplit@29bec417 (start = 1.701514336299152E9, wall = 1 ms, cpu = 0 ms, wait = 0 ms, calls = 1): DELTA_CANNOT_OPEN_SPLIT: File /admin_system/admin_system_city/part-00000-49dc75f4-b65f-4959-8b10-a97e594d5a42-c000.snappy.parquet does not exist
2025-12-15T23:17:00.950Z	ERROR	remote-task-callback-1410	com.facebook.presto.execution.StageExecutionStateMachine	Stage execution 20251215_231653_00890_tacwd.1.0 failed
com.facebook.presto.spi.PrestoException: File /admin_system/admin_system_city/part-00000-49dc75f4-b65f-4959-8b10-a97e594d5a42-c000.snappy.parquet does not exist
	at com.facebook.presto.delta.DeltaPageSourceProvider.createParquetPageSource(DeltaPageSourceProvider.java:352)
	at com.facebook.presto.delta.DeltaPageSourceProvider.createPageSource(DeltaPageSourceProvider.java:164)
	at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:65)
	at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:81)
	at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:263)
	at com.facebook.presto.operator.Driver.processInternal(Driver.java:440)
	at com.facebook.presto.operator.Driver.lambda$processFor$10(Driver.java:323)
	at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:749)
	at com.facebook.presto.operator.Driver.processFor(Driver.java:316)
	at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1078)
	at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:165)
	at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:619)
	at com.facebook.presto.$gen.Presto_0_296_SNAPSHOT_19bfd80__0_296_SNAPSHOT____20251202_065933_1.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.io.FileNotFoundException: File /admin_system/admin_system_city/part-00000-49dc75f4-b65f-4959-8b10-a97e594d5a42-c000.snappy.parquet does not exist
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:917)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1238)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:907)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462)
	at org.apache.hadoop.fs.HadoopExtendedFileSystem.getFileStatus(HadoopExtendedFileSystem.java:388)
	at com.facebook.presto.delta.DeltaPageSourceProvider.createParquetPageSource(DeltaPageSourceProvider.java:226)
	... 15 more

Root Cause
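Object-store URIs (S3) must retain their scheme and authority. URI.getPath() returns only the path component, so a fully qualified URI such as s3a://bucket/path/file.parquet becomes the scheme-less path /path/file.parquet. Without a scheme, the path no longer resolves to Presto's S3 filesystem; in the stack trace above it is handled by Hadoop's RawLocalFileSystem, which reports the file as missing.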

OSS PR that introduced the issue: #26397
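
The loss of scheme and bucket can be reproduced with java.net.URI alone. The following is a minimal sketch, not Presto code: the class name, method names, and the s3a://bucket/path/file.parquet location are placeholders chosen for illustration.

```java
import java.net.URI;

// Illustrative only: hypothetical class and method names, not part of presto-delta.
final class UriPathDemo {
    // Mirrors the old behavior: keep only the path component of the URI.
    static String pathOnly(String location) {
        return URI.create(location).getPath();
    }

    // Mirrors the new behavior: keep scheme, authority (bucket), and path.
    static String fullUri(String location) {
        return URI.create(location).toString();
    }

    public static void main(String[] args) {
        String location = "s3a://bucket/path/file.parquet"; // placeholder location
        System.out.println(pathOnly(location)); // /path/file.parquet
        System.out.println(fullUri(location));  // s3a://bucket/path/file.parquet
    }
}
```

The first result has neither scheme nor bucket, which matches the shape of the path reported in the FileNotFoundException above.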

Fix
The code now preserves the full URI by using URI.toString() (or by passing the original path directly), so the correct s3a://bucket/... path is passed to the filesystem layer. This also enables reading tables with spaces in S3 locations or partition values.
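
How java.net.URI treats spaces matters for the last point. The sketch below assumes the location string arrives percent-encoded; whether Delta Kernel encodes add-file paths this way is not stated in this PR, and the class name, bucket, and file names are placeholders.

```java
import java.net.URI;

// Illustrative only: hypothetical class, placeholder bucket and file names.
final class EncodedSpaceDemo {
    // Returns true when java.net.URI rejects the string as malformed.
    static boolean isRejected(String location) {
        try {
            URI.create(location);
            return false;
        }
        catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        URI encoded = URI.create("s3a://bucket/my%20table/part-0.parquet");
        System.out.println(encoded.getPath());    // decoded: /my table/part-0.parquet
        System.out.println(encoded.getRawPath()); // still encoded: /my%20table/part-0.parquet
        System.out.println(encoded);              // s3a://bucket/my%20table/part-0.parquet
        // A raw (unencoded) space is not a legal URI character.
        System.out.println(isRejected("s3a://bucket/my table/part-0.parquet")); // true
    }
}
```

getPath() decodes %20 as well as dropping the scheme and bucket, while toString() returns the location unchanged. Whichever form the filesystem layer expects, the choice between the two affects both the bucket and the encoding of spaces.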

Motivation and Context

Impact

Test Plan

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.
  • If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== NO RELEASE NOTE ==

@prestodb-ci prestodb-ci added the from:IBM PR from IBM label Dec 18, 2025
sourcery-ai Bot commented Dec 18, 2025

Contributor
Reviewer's guide (collapsed on small PRs)

Reviewer's Guide

This PR fixes incorrect handling of Delta Lake file paths for S3-backed tables by preserving the full S3 URI instead of converting it to a path, ensuring Presto passes valid s3a:// URIs through partition pruning and split generation.

Sequence diagram for Delta split generation with preserved S3 URI

sequenceDiagram
    actor UserQuery
    participant PrestoCoordinator
    participant DeltaSplitManager
    participant DeltaExpressionUtils
    participant HadoopFileSystem

    UserQuery->>PrestoCoordinator: submit SQL query on Delta table
    PrestoCoordinator->>DeltaSplitManager: getNextBatch(partitionHandle, table, splitCount)
    DeltaSplitManager->>DeltaSplitManager: read addFileStatus.getPath()
    DeltaSplitManager->>DeltaSplitManager: filePath = URI.create(path).toString()
    DeltaSplitManager->>HadoopFileSystem: open(filePath s3a_uri)
    HadoopFileSystem-->>DeltaSplitManager: input stream to Parquet file

    PrestoCoordinator->>DeltaExpressionUtils: evaluatePartitionPredicate(row, partitionPredicate)
    DeltaExpressionUtils->>DeltaExpressionUtils: addFileStatus = InternalScanFileUtils.getAddFileStatus(row)
    DeltaExpressionUtils->>DeltaExpressionUtils: filePath = URI.create(addFileStatus.getPath()).toString()
    DeltaExpressionUtils-->>PrestoCoordinator: domain for partition pruning using full s3a_uri

    PrestoCoordinator-->>UserQuery: return query results without FileNotFoundException

Updated class diagram for Delta path handling in DeltaSplitManager and DeltaExpressionUtils

classDiagram
    class DeltaSplitManager {
        +CompletableFuture~ConnectorSplitBatch~ getNextBatch(ConnectorPartitionHandle partitionHandle, ConnectorTableLayoutHandle layoutHandle, List~ConnectorSplit~ splits, int maxSize)
        -ConnectorId connectorId
        -DeltaTable deltaTable
        -DeltaMetadata deltaMetadata
    }

    class DeltaExpressionUtils {
        <<utility>>
        -static boolean evaluatePartitionPredicate(DeltaColumnHandle partitionColumn, TupleDomain~DeltaColumnHandle~ partitionPredicate, TypeManager typeManager, Object row)
    }

    class InternalScanFileUtils {
        <<utility>>
        +Map~String,String~ getPartitionValues(Object row)
        +AddFileStatus getAddFileStatus(Object row)
    }

    class AddFileStatus {
        +String getPath()
        +long getSize()
    }

    class Domain {
        +static Domain getDomain(DeltaColumnHandle column, String partitionValue, TypeManager typeManager, String filePath)
    }

    class DeltaColumnHandle {
        +String getName()
    }

    class TypeManager
    class ConnectorSplitBatch
    class ConnectorSplit
    class ConnectorPartitionHandle
    class ConnectorTableLayoutHandle
    class ConnectorId
    class DeltaTable {
        +String getSchemaName()
        +String getTableName()
    }
    class DeltaMetadata
    class String

    DeltaSplitManager --> DeltaTable : uses
    DeltaSplitManager --> DeltaMetadata : uses
    DeltaSplitManager --> AddFileStatus : uses getPath and getSize
    DeltaSplitManager --> ConnectorSplitBatch : returns
    DeltaSplitManager --> ConnectorSplit : creates
    DeltaSplitManager --> ConnectorId : uses

    DeltaExpressionUtils --> InternalScanFileUtils : uses
    DeltaExpressionUtils --> AddFileStatus : uses getPath
    DeltaExpressionUtils --> DeltaColumnHandle : uses
    DeltaExpressionUtils --> Domain : computes
    DeltaExpressionUtils --> TypeManager : uses

    InternalScanFileUtils --> AddFileStatus : returns

    AddFileStatus --> String : path preserved as full s3a_uri
    Domain --> String : filePath parameter is full s3a_uri
    DeltaSplitManager --> String : filePath parameter is full s3a_uri for splits
    DeltaExpressionUtils --> String : filePath parameter is full s3a_uri for domains

Flow diagram for Delta file path handling before and after fix

flowchart LR
    A["Delta addFileStatus.getPath() returns s3a://bucket/path/file.parquet"] --> B{Old vs new handling}

    B -->|Old behavior| C["URI.create(path).getPath()"]
    C --> D["Produces /path/file.parquet (scheme and bucket stripped)"]
    D --> E["Hadoop LocalFileSystem attempts to open /path/file.parquet"]
    E --> F["FileNotFoundException: File does not exist"]

    B -->|New behavior| G["URI.create(path).toString()"]
    G --> H["Preserves s3a://bucket/path/file.parquet"]
    H --> I["S3 filesystem opens s3a://bucket/path/file.parquet"]
    I --> J["Presto reads Parquet file successfully"]

File-Level Changes

Change: Preserve full S3 URIs when deriving file paths for partition predicate evaluation to avoid stripping scheme and bucket.
Details:
  • Replace use of URI.getPath() with URI.toString() when converting Delta addFileStatus paths into filePath strings used in logging and domain calculation for partition pruning.
Files: presto-delta/src/main/java/com/facebook/presto/delta/DeltaExpressionUtils.java

Change: Preserve full S3 URIs when constructing Delta splits so the filesystem sees a proper s3a://bucket/... path.
Details:
  • Replace use of URI.getPath() with URI.toString() when constructing the file path passed into DeltaSplit for each addFileStatus, ensuring splits reference fully qualified object-store URIs.
Files: presto-delta/src/main/java/com/facebook/presto/delta/DeltaSplitManager.java

@agrawalreetika

Member

I don't think this will be needed anymore, as the changes from #26397 are being reverted in this upgrade PR: #26814

@ShahimSharafudeen

Contributor Author

I don't think this will be needed anymore, as the changes from #26397 are being reverted in this upgrade PR: #26814

Thanks @agrawalreetika for the information. Closing this PR.
