Fix reading Parquet files from S3 #562

Merged
auxten merged 4 commits into chdb-io:main from mneedham:fix/preserves-row-order-none-format
May 11, 2026

@mneedham mneedham commented Apr 17, 2026

Contributor

Summary

Two fixes to enable reading Parquet files from S3.

1. read_parquet doesn't work with S3 URLs

In [2]: pd.read_parquet('s3://datasets-documentation/amazon_reviews/amazon_reviews_2015.snappy.parquet')
Out[2]: E [chDB] Query failed: Code: 400. DB::ErrnoException: Cannot stat file /Users/markhneedham/projects/videos/20241029-chdbPandas/s3:/datasets-documentation/amazon_reviews/amazon_reviews_2015.snappy.parquet: , errno: 2, strerror: No such file or directory: The table structure cannot be extracted from a Parquet format file. You can specify the structure manually: (in file/uri /Users/markhneedham/projects/videos/20241029-chdbPandas/s3:/datasets-documentation/amazon_reviews/amazon_reviews_2015.snappy.parquet). (CANNOT_STAT)
...
SQL: DESCRIBE file('s3://datasets-documentation/amazon_reviews/amazon_reviews_2015.snappy.parquet', 'Parquet')

s3:// paths were being routed to from_file(), which treats them as local paths. Fixed by routing s3:// paths in read_parquet() to DataStore.from_s3() instead.
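
The routing described above can be sketched roughly as follows. This is a minimal illustration, not the actual chDB code: `pick_parquet_source` is a hypothetical helper, and the returned labels only stand in for calling `DataStore.from_s3()` or `from_file()`. It also assumes S3 URLs are recognised by their scheme prefix alone.

```python
def pick_parquet_source(path: str) -> str:
    """Hypothetical helper: name the constructor a Parquet path should use."""
    # Assumption: an S3 URL is identified purely by its "s3://" prefix.
    if path.lower().startswith("s3://"):
        return "from_s3"
    # Anything else is treated as a local file path.
    return "from_file"


print(pick_parquet_source("s3://datasets-documentation/amazon_reviews/amazon_reviews_2015.snappy.parquet"))
print(pick_parquet_source("data.parquet"))
```

Checking the prefix before any local-path handling is what keeps the URL from being resolved against the current working directory, as seen in the error above.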

2. DataStore.from_s3() crashes without an explicit format

In [3]: pd.DataStore.from_s3('s3://datasets-documentation/amazon_reviews/amazon_reviews_2015.snappy.parquet', nosign=True)
Out[3]: DataStore(execution failed: 'NoneType' object has no attribute 'lower')

Root cause: from_s3() stores {"format": None} in table function params when no format is specified. dict.get("format", "") returns None (not "") when the key exists with value None — the default only applies when the key is absent entirely. None.lower() then crashes in preserves_row_order().

Fixed by using (self.params.get("format") or "").lower() in both FileTableFunction and S3TableFunction.
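
The `dict.get` behaviour behind this crash can be reproduced in isolation. The snippet below is a standalone sketch: `normalized_format` is a hypothetical helper, not a function from `datastore/table_functions.py`, and the `params` dictionaries only mimic what the description says `from_s3()` stores.

```python
def normalized_format(params: dict) -> str:
    """Hypothetical helper: lower-cased format, or "" when unset or None."""
    # `or ""` covers three cases: key missing, value None, value "".
    return (params.get("format") or "").lower()


params = {"format": None}  # mimics from_s3() called without a format

# The default is ignored because the key exists, so the stored None comes back.
assert params.get("format", "") is None

# With the `or ""` guard, .lower() is always called on a string.
assert normalized_format(params) == ""
assert normalized_format({}) == ""
assert normalized_format({"format": "Parquet"}) == "parquet"
```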

Changes

  • datastore/pandas_api.py: Route s3:// paths in read_parquet() to DataStore.from_s3()
  • datastore/table_functions.py: Fix None.lower() crash in FileTableFunction.preserves_row_order() and S3TableFunction.preserves_row_order()

Test plan

  • read_parquet("s3://...") routes to S3 table function instead of crashing
  • DataStore.from_s3("s3://...", nosign=True) no longer raises 'NoneType' object has no attribute 'lower'
  • DataStore.from_file("data.parquet") with no explicit format still works

🤖 Generated with Claude Code

…functions

dict.get("key", default) returns None when the key exists with value None,
not the default. Using `or ""` handles the None case correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@CLAassistant

CLAassistant commented Apr 17, 2026


CLA assistant check
All committers have signed the CLA.

mneedham and others added 2 commits April 17, 2026 17:07
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…support

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@mneedham mneedham changed the title from "fix: handle None format in preserves_row_order" to "Fix reading Parquet files from S3" Apr 17, 2026
@wudidapaopao

Contributor

Thanks for your contribution and the fix!

@mneedham

Contributor Author

I'm not sure how to sign the CLA?

[Screenshot: 2026-04-20_11-50-03]

@auxten

auxten commented Apr 20, 2026

Member

> I'm not sure how to sign the CLA?
>
> [Screenshot: 2026-04-20_11-50-03]

I just made chDB's CLA identical to ClickHouse's, so everyone needs to re-sign the CLA.

It might just be loading very slowly. How about waiting a bit longer?

@mneedham

Contributor Author

I tried again now and it came up with the form!

@auxten auxten merged commit 38a55c9 into chdb-io:main May 11, 2026
6 checks passed