[Python] [AWS] Fail to open partitioned parquet with s3fs + pyarrow due to s3 prefix #38794

@yf-yang

Describe the bug, including details regarding any error messages, version, and platform.

pyarrow == 14.0.1

How the parquet is created:

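import polars as pl
import pyarrow as pa  # required for pa.schema / pa.string below
import pyarrow.dataset as ds
import s3fs

s3fs = s3fs.S3FileSystem()
df = pl.DataFrame()  # placeholder for the actual data (not shown here)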
ds.write_dataset(
    df.to_arrow(),
    "s3://bucket/parquet_root",
    format='parquet', 
    filesystem=s3fs,
    partitioning=ds.partitioning(pa.schema([("set", pa.string()), ("subset", pa.string())])),
    existing_data_behavior='delete_matching'
)

After the write, the object parquet_root/abc/def/part-0.parquet exists in the bucket.
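For reference, the layout above is what directory partitioning is expected to produce: each directory level below the dataset root corresponds to one schema field, in order, so abc/def reads as set=abc, subset=def. A minimal pure-Python sketch of that mapping follows. The helper name partition_values is hypothetical and is not a pyarrow API, and requiring exactly one segment per field is a simplification for illustration.

```python
from pathlib import PurePosixPath


def partition_values(root, file_path, field_names):
    """Map the directory segments between root and the file to field names, in order.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(root))
    segments = relative.parts[:-1]  # drop the file name itself
    if len(segments) != len(field_names):
        raise ValueError(
            f"expected {len(field_names)} directory levels, got {len(segments)}"
        )
    return dict(zip(field_names, segments))


values = partition_values(
    "parquet_root",
    "parquet_root/abc/def/part-0.parquet",
    ["set", "subset"],
)
print(values)  # {'set': 'abc', 'subset': 'def'}
```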

Try to access the parquet

NOTE: the same API call behaves differently after I call s3fs.isdir in between, which is also odd.

import polars as pl
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import s3fs

s3fs = s3fs.S3FileSystem()
pq.read_table('bucket/parquet_root/abc/def/part-0.parquet',filesystem=s3fs) # ok
pq.read_table('s3://bucket/parquet_root/abc/def/part-0.parquet',filesystem=s3fs) # ok

pq.ParquetDataset('s3://bucket/parquet_root/', filesystem=s3fs, 
  partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())])))
# pyarrow.lib.ArrowInvalid: Could not open Parquet input source 's3://bucket/parquet_root/': Parquet file size is 0 bytes

# after I manually call s3fs.isdir, the behavior changes; I suspect this is a separate bug
s3fs.isdir('s3://bucket/parquet_root/') # True

# repeat the call
pq.ParquetDataset('s3://bucket/parquet_root/', filesystem=s3fs, 
  partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())])))
# pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'bucket/parquet_root/abc/def/part-0.parquet', which is outside base dir 's3://bucket/parquet_root/'

# another try, the same error
dataset = ds.dataset(
  's3://bucket/parquet_root/',
  format='parquet', 
  filesystem=s3fs, 
  partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())]))
)
# pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'bucket/parquet_root/abc/def/part-0.parquet', which is outside base dir 's3://bucket/parquet_root/'
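In the last two errors, the listed path 'bucket/parquet_root/abc/def/part-0.parquet' carries no scheme while the base dir 's3://bucket/parquet_root/' does, so the two never match. A possible workaround, which I have not verified for the dataset calls, is to strip the s3:// prefix before passing the path together with filesystem=s3fs (the scheme-less form already works for pq.read_table above). The helper name strip_s3_scheme is hypothetical, and the commented-out call is an assumption rather than a confirmed fix.

```python
def strip_s3_scheme(path: str) -> str:
    """Return the path without a leading 's3://' (hypothetical helper)."""
    prefix = "s3://"
    return path[len(prefix):] if path.startswith(prefix) else path


root = strip_s3_scheme("s3://bucket/parquet_root/")
print(root)  # bucket/parquet_root/

# Assumed usage (needs S3 access, so left commented out):
# dataset = ds.dataset(
#     root,
#     format="parquet",
#     filesystem=s3fs,
#     partitioning=ds.DirectoryPartitioning(
#         pa.schema([("set", pa.string()), ("subset", pa.string())])
#     ),
# )
```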

Component(s)

Python
