Closed
Description
Apache Iceberg version
0.8.0 (latest release)
Please describe the bug 🐞
Using the NYC taxi data set found here, if I follow the standard way of creating a catalog and a table, but call add_files instead of append:
from pyiceberg.catalog.sql import SqlCatalog
import pyarrow.parquet as pq
warehouse_path = "/tmp/warehouse"
data_file_path = "/tmp/test-data"
catalog = SqlCatalog(
    "default",
    **{
        "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
        "warehouse": f"file://{warehouse_path}",
    }
)
df = pq.read_table(f"{data_file_path}/yellow_tripdata_2024-01.parquet")
catalog.create_namespace("default")
table = catalog.create_table(
    "default.taxi_dataset",
    schema=df.schema,
)
table.add_files([f"{data_file_path}/yellow_tripdata_2024-01.parquet"])
I get a KeyError:
Traceback (most recent call last):
  File "/home/binayak/Dropbox/dev/tests/iceberg-test/main.py", line 42, in <module>
    main()
  File "/home/binayak/Dropbox/dev/tests/iceberg-test/main.py", line 29, in main
    table.add_files([f"{data_file_path}/yellow_tripdata_2024-01.parquet"])
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/table/__init__.py", line 1036, in add_files
    tx.add_files(
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/table/__init__.py", line 594, in add_files
    for data_file in data_files:
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/table/__init__.py", line 1537, in _parquet_files_to_data_files
    yield from parquet_files_to_data_files(io=io, table_metadata=table_metadata, file_paths=iter(file_paths))
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/io/pyarrow.py", line 2535, in parquet_files_to_data_files
    statistics = data_file_statistics_from_parquet_metadata(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/binayak/Dropbox/dev/my-github/iceberg-python/pyiceberg/io/pyarrow.py", line 2400, in data_file_statistics_from_parquet_metadata
    del col_aggs[field_id]
        ~~~~~~~~^^^^^^^^^^
KeyError: 1
This happens because this Parquet file does not have column-level statistics set, so the source code goes into the else block here.
As a result, col_aggs and null_value_counts are not updated, but invalidate_col is. When the del statement runs here, it raises the KeyError.
As discussed on Slack, @kevinjqliu proposed replacing del col_aggs[field_id] with col_aggs.pop(field_id, None).
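To illustrate why that swap avoids the error, here is a minimal sketch using plain dicts and sets. The names col_aggs and invalidate_col mirror the ones mentioned above, but the data and the two helper functions are made up for illustration and are not pyiceberg's actual code.

```python
# Simplified stand-ins for the names in this issue; NOT pyiceberg's real code.
# Field 1 has no entry in col_aggs (no Parquet stats were available for it),
# yet it was still recorded in invalidate_col.
col_aggs = {2: "stats for field 2"}
invalidate_col = {1}


def drop_with_del(aggs, invalid):
    """Hypothetical helper: the failing pattern, del raises on a missing key."""
    for field_id in invalid:
        del aggs[field_id]


def drop_with_pop(aggs, invalid):
    """Hypothetical helper: the proposed pattern, pop with a default is a no-op
    when the key is missing."""
    for field_id in invalid:
        aggs.pop(field_id, None)


# The del variant fails on the missing field id.
try:
    drop_with_del(dict(col_aggs), invalidate_col)
    del_error = None
except KeyError as exc:
    del_error = exc

# The pop variant leaves the remaining entries untouched.
safe_aggs = dict(col_aggs)
drop_with_pop(safe_aggs, invalidate_col)

print(repr(del_error))
print(safe_aggs)
```

The pop form only changes behaviour for missing keys; entries that do exist are still removed as before.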
I will be raising a PR soon.