Python: Integration tests #6398
Conversation
python-version: '3.9'
cache: poetry
cache-dependency-path: |
  ./python/poetry.lock
Should there be more than just the lock file?
If you change the dependencies, you need to regenerate the lock file, so that should be enough.
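For context, a minimal sketch of the `actions/setup-python` step under discussion (the step layout is an assumption, not the PR's exact workflow file):

```yaml
- uses: actions/setup-python@v4
  with:
    python-version: '3.9'
    # setup-python can cache Poetry's virtualenv; the cache key is
    # derived from a hash of the files listed below.
    cache: poetry
    cache-dependency-path: |
      ./python/poetry.lock
```

Because Poetry pins every transitive dependency in `poetry.lock`, hashing the lock file alone is sufficient to invalidate the cache whenever dependencies change.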
python/dev/spark-defaults.conf
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.demo org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.catalog-impl org.apache.iceberg.rest.RESTCatalog
type=rest?
Less is more, thanks!
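The suggested simplification, sketched against the config above (the `demo` catalog name comes from the snippet; property names follow the Iceberg Spark runtime conventions):

```
spark.sql.extensions            org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.demo          org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.type     rest
```

With `type rest`, the catalog loader resolves `RESTCatalog` itself, so the explicit `catalog-impl` line can be dropped.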
  - minio:minio
rest:
  image: tabulario/iceberg-rest:0.2.0
  container_name: pyiceberg-rest
Where does this store the underlying catalog metadata?
An in-memory SQLite database.
def visit_is_nan(self, term: BoundTerm[Any]) -> pc.Expression:
    ref = pc.field(term.ref().field.name)
    return ref.is_null(nan_is_null=True) & ref.is_valid()  # replaces: return pc.is_nan(ref)
This probably shouldn't be in this PR right? Seems like an update with a new version of pyarrow?
This is actually to make the CI pass. I've created a PR to allow ref.is_nan() as well, but this is not released yet.
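To illustrate why the workaround is equivalent to `is_nan`: in Arrow semantics, null and NaN are distinct, and `is_null(nan_is_null=True)` matches both, so intersecting with `is_valid()` (non-null) leaves exactly the NaNs. A plain-Python sketch of that set logic (the helper name is made up for illustration):

```python
import math

# The pyarrow workaround is:
#   ref.is_null(nan_is_null=True) & ref.is_valid()
# is_null(nan_is_null=True) marks both nulls and NaNs; is_valid() keeps
# only non-null slots; their conjunction keeps only the NaNs.
values = [1.0, None, float("nan"), 2.5]

def is_nan_workaround(v):
    null_or_nan = v is None or (isinstance(v, float) and math.isnan(v))
    valid = v is not None
    return null_or_nan and valid

assert [is_nan_workaround(v) for v in values] == [False, False, True, False]
```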
table: Table,
row_filter: Union[str, BooleanExpression] = ALWAYS_TRUE,
selected_fields: Tuple[str] = ("*",),        # before
selected_fields: Tuple[str, ...] = ("*",),   # after
This also seems like a separate PR change, but good cleanup.
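The distinction the cleanup fixes: `Tuple[str]` annotates a tuple of exactly one string, while `Tuple[str, ...]` annotates a variable-length homogeneous tuple, which is what `selected_fields` actually is. A small sketch:

```python
from typing import Tuple

# Tuple[str] means a tuple of exactly one string; a static type checker
# would reject assigning ("idx", "col_numeric") to this annotation.
exactly_one: Tuple[str] = ("*",)

# Tuple[str, ...] means a tuple of strings of any length, so multi-field
# selections type-check correctly.
any_length: Tuple[str, ...] = ("idx", "col_numeric")

assert exactly_one == ("*",)
assert any_length == ("idx", "col_numeric")
```

Note the annotations are not enforced at runtime; the difference only shows up under a checker such as mypy.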
python/tests/test_integration.py
arrow_table = table_test_null_nan.scan(row_filter=IsNaN("col_numeric"), selected_fields=("idx", "col_numeric")).to_arrow()
assert len(arrow_table) == 1
assert arrow_table[0][0].as_py() == 1
assert math.isnan(arrow_table[1][0].as_py())
I think it would be easier to read these tests if you called as_py() to produce rows and validated the rows. It looks like there's just one row, but the row/column indexes are backward because this is columnar?
Let me rewrite those tests a bit
I've changed it to assert math.isnan(arrow_table["col_numeric"][0].as_py())
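The original confusion comes from Arrow tables being columnar: the first index selects a column, the second a row, the reverse of row-oriented indexing. A dict-of-lists stand-in for the table in the test (column names from the test; the single row is assumed for illustration):

```python
import math

# Columns first, rows second, mirroring pyarrow.Table indexing:
# arrow_like["col_numeric"][0] is row 0 of the col_numeric column.
arrow_like = {"idx": [1], "col_numeric": [float("nan")]}

assert arrow_like["idx"][0] == 1
assert math.isnan(arrow_like["col_numeric"][0])
```

Indexing by column name, as in the rewritten assertion, avoids having to remember which positional axis is which.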
python/tests/test_integration.py
def test_duckdb_nan(table_test_null_nan_rewritten: Table) -> None:
    con = table_test_null_nan_rewritten.scan().to_duckdb("table_test_null_nan")
    result = con.query("SELECT idx FROM table_test_null_nan WHERE isnan(col_numeric)").fetchone()
    assert result == (1,)
It doesn't return NaN?
Now it does :)
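Making the query return the NaN itself means projecting the column as well as filtering on it; a sketch of the revised query (table and column names from the test above):

```sql
SELECT idx, col_numeric
FROM table_test_null_nan
WHERE isnan(col_numeric);
```

DuckDB's `isnan` predicate selects the matching rows, while including `col_numeric` in the SELECT list is what actually surfaces the NaN value in the result.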
rdblue left a comment:
Overall the changes look like a good start.
Thanks for the review @rdblue, we can add more tests later on.
Squashed commits:
- Integration tests
- First version
- Add caching
- Add caching
- Restore pyproject
- WIP
- NaN seems to be broken
- WIP
- Coming along
- Cleanup
- Install duckdb
- Cleanup
- Revert changes to poetry
- Make it even nicer
- Revert unneeded change
- Update Spark version
- Make test passing
- comments
This is the first version of a framework to read Iceberg tables, produced by Spark, using PyIceberg. This makes it easier to run end-to-end tests and also validate the behavior of PyArrow and DuckDB.