[Python] [AWS] Fail to open partitioned parquet with s3fs + pyarrow due to s3 prefix #38794
Closed
Description
Describe the bug, including details regarding any error messages, version, and platform.
pyarrow == 14.0.1
How the parquet is created:
import polars as pl
import pyarrow as pa  # needed for pa.schema / pa.string below
import pyarrow.dataset as ds
import s3fs
s3fs = s3fs.S3FileSystem()
df = pl.DataFrame()
ds.write_dataset(
    df.to_arrow(),
    "s3://bucket/parquet_root",
    format='parquet',
    filesystem=s3fs,
    partitioning=ds.partitioning(pa.schema([("set", pa.string()), ("subset", pa.string())])),
    existing_data_behavior='delete_matching'
)
After this write, the path parquet_root/abc/def/part-0.parquet exists in the bucket.
Try to access the parquet
NOTE: the same API call behaves differently after I call s3fs.isdir in between, which is also odd.
import polars as pl
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import s3fs
s3fs = s3fs.S3FileSystem()
pq.read_table('bucket/parquet_root/abc/def/part-0.parquet',filesystem=s3fs) # ok
pq.read_table('s3://bucket/parquet_root/abc/def/part-0.parquet',filesystem=s3fs) # ok
pq.ParquetDataset('s3://bucket/parquet_root/', filesystem=s3fs,
    partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())])))
# pyarrow.lib.ArrowInvalid: Could not open Parquet input source 's3://bucket/parquet_root/': Parquet file size is 0 bytes
# after I manually call s3fs.isdir, the behavior changes; I suspect this is another bug
s3fs.isdir('s3://bucket/parquet_root/') # True
# repeat the call
pq.ParquetDataset('s3://bucket/parquet_root/', filesystem=s3fs,
    partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())])))
# pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'bucket/parquet_root/abc/def/part-0.parquet', which is outside base dir 's3://bucket/parquet_root/'
# another try, the same error
dataset = ds.dataset(
    's3://bucket/parquet_root/',
    format='parquet',
    filesystem=s3fs,
    partitioning=ds.DirectoryPartitioning(pa.schema([("set", pa.string()), ("subset", pa.string())]))
)
# pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'bucket/parquet_root/abc/def/part-0.parquet', which is outside base dir 's3://bucket/parquet_root/'
Component(s)
Python
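Possible workaround (a sketch, not a confirmed fix): the second error message shows the filesystem yielding 'bucket/parquet_root/...' while the base dir still carries the 's3://' prefix, so the two strings cannot match. Assuming that mismatch is the cause, stripping the scheme before handing the base dir to pyarrow may avoid it. `strip_s3_scheme` below is a hypothetical helper written for illustration; it is not part of pyarrow or s3fs.

```python
def strip_s3_scheme(path: str) -> str:
    """Hypothetical helper: drop a leading 's3://' so the base dir
    has the same form as the paths the filesystem reports."""
    prefix = "s3://"
    return path[len(prefix):] if path.startswith(prefix) else path


# Values taken from the error message in this report.
base_dir = strip_s3_scheme("s3://bucket/parquet_root/")
yielded = "bucket/parquet_root/abc/def/part-0.parquet"

# With the scheme removed, the yielded path falls under the base dir.
print(base_dir, yielded.startswith(base_dir))

# Intended use (not executed here; requires S3 access and the data above):
# dataset = ds.dataset(base_dir, format='parquet', filesystem=s3fs, partitioning=...)
```

This matches the observation above that pq.read_table accepts the unprefixed 'bucket/...' path with the same filesystem object; whether the dataset APIs then succeed has not been verified here.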