Python: Integration tests #6398
Conversation
python-version: '3.9'
cache: poetry
cache-dependency-path: |
  ./python/poetry.lock
Should there be more than just the lock file?
If you change the dependencies, you need to regenerate the lock file, so that should be enough.
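For context, a minimal sketch of the `actions/setup-python` step under discussion (the step layout is an assumption, not the PR's exact workflow file):

```yaml
- uses: actions/setup-python@v4
  with:
    python-version: '3.9'
    # setup-python can cache Poetry's virtualenv; the cache key is
    # derived from a hash of the files listed below.
    cache: poetry
    cache-dependency-path: |
      ./python/poetry.lock
```

Because Poetry pins every transitive dependency in `poetry.lock`, hashing the lock file alone is sufficient to invalidate the cache whenever dependencies change.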
python/dev/spark-defaults.conf
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.demo org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.catalog-impl org.apache.iceberg.rest.RESTCatalog
type=rest?
Less is more, thanks!
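The suggested simplification, sketched against the config above (the `demo` catalog name comes from the snippet; property names follow the Iceberg Spark runtime conventions):

```
spark.sql.extensions            org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.demo          org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.demo.type     rest
```

With `type rest`, the catalog loader resolves `RESTCatalog` itself, so the explicit `catalog-impl` line can be dropped.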
  - minio:minio
rest:
  image: tabulario/iceberg-rest:0.2.0
  container_name: pyiceberg-rest
Where does this store the underlying catalog metadata?
An in-memory SQLite database.
def visit_is_nan(self, term: BoundTerm[Any]) -> pc.Expression:
    ref = pc.field(term.ref().field.name)
    return ref.is_null(nan_is_null=True) & ref.is_valid()  # replaces: return pc.is_nan(ref)
This probably shouldn't be in this PR right? Seems like an update with a new version of pyarrow?
This is actually to make the CI pass. I've created a PR to allow ref.is_nan() as well, but this is not released yet.
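To illustrate why the workaround is equivalent to `is_nan`: in Arrow semantics, null and NaN are distinct, and `is_null(nan_is_null=True)` matches both, so intersecting with `is_valid()` (non-null) leaves exactly the NaNs. A plain-Python sketch of that set logic (the helper name is made up for illustration):

```python
import math

# The pyarrow workaround is:
#   ref.is_null(nan_is_null=True) & ref.is_valid()
# is_null(nan_is_null=True) marks both nulls and NaNs; is_valid() keeps
# only non-null slots; their conjunction keeps only the NaNs.
values = [1.0, None, float("nan"), 2.5]

def is_nan_workaround(v):
    null_or_nan = v is None or (isinstance(v, float) and math.isnan(v))
    valid = v is not None
    return null_or_nan and valid

assert [is_nan_workaround(v) for v in values] == [False, False, True, False]
```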
table: Table,
row_filter: Union[str, BooleanExpression] = ALWAYS_TRUE,
selected_fields: Tuple[str] = ("*",),        # before
selected_fields: Tuple[str, ...] = ("*",),   # after
This also seems like a separate PR change, but good cleanup.
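The distinction the cleanup fixes: `Tuple[str]` annotates a tuple of exactly one string, while `Tuple[str, ...]` annotates a variable-length homogeneous tuple, which is what `selected_fields` actually is. A small sketch:

```python
from typing import Tuple

# Tuple[str] means a tuple of exactly one string; a static type checker
# would reject assigning ("idx", "col_numeric") to this annotation.
exactly_one: Tuple[str] = ("*",)

# Tuple[str, ...] means a tuple of strings of any length, so multi-field
# selections type-check correctly.
any_length: Tuple[str, ...] = ("idx", "col_numeric")

assert exactly_one == ("*",)
assert any_length == ("idx", "col_numeric")
```

Note the annotations are not enforced at runtime; the difference only shows up under a checker such as mypy.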
python/tests/test_integration.py
arrow_table = table_test_null_nan.scan(row_filter=IsNaN("col_numeric"), selected_fields=("idx", "col_numeric")).to_arrow()
assert len(arrow_table) == 1
assert arrow_table[0][0].as_py() == 1
assert math.isnan(arrow_table[1][0].as_py())
I think it would be easier to read these tests if you called as_py() to produce rows and validated the rows. It looks like there's just one row, but the row/column indexes are backward because this is columnar?
Let me rewrite those tests a bit
I've changed it to assert math.isnan(arrow_table["col_numeric"][0].as_py())
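The original confusion comes from Arrow tables being columnar: the first index selects a column, the second a row, the reverse of row-oriented indexing. A dict-of-lists stand-in for the table in the test (column names from the test; the single row is assumed for illustration):

```python
import math

# Columns first, rows second, mirroring pyarrow.Table indexing:
# arrow_like["col_numeric"][0] is row 0 of the col_numeric column.
arrow_like = {"idx": [1], "col_numeric": [float("nan")]}

assert arrow_like["idx"][0] == 1
assert math.isnan(arrow_like["col_numeric"][0])
```

Indexing by column name, as in the rewritten assertion, avoids having to remember which positional axis is which.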
python/tests/test_integration.py
def test_duckdb_nan(table_test_null_nan_rewritten: Table) -> None:
    con = table_test_null_nan_rewritten.scan().to_duckdb("table_test_null_nan")
    result = con.query("SELECT idx FROM table_test_null_nan WHERE isnan(col_numeric)").fetchone()
    assert result == (1,)
It doesn't return NaN?
Now it does :)
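Making the query return the NaN itself means projecting the column as well as filtering on it; a sketch of the revised query (table and column names from the test above):

```sql
SELECT idx, col_numeric
FROM table_test_null_nan
WHERE isnan(col_numeric);
```

DuckDB's `isnan` predicate selects the matching rows, while including `col_numeric` in the SELECT list is what actually surfaces the NaN value in the result.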
rdblue left a comment:
Overall the changes look like a good start.
Thanks for the review @rdblue, we can add more tests later on.
Squashed commits:
- Integration tests
- First version
- Add caching
- Add caching
- Restore pyproject
- WIP
- NaN seems to be broken
- WIP
- Coming along
- Cleanup
- Install duckdb
- Cleanup
- Revert changes to poetry
- Make it even nicer
- Revert unneeded change
- Update Spark version
- Make test passing
- comments
This is the first version of a framework to read Iceberg tables, produced by Spark, using PyIceberg. This makes it easier to run end-to-end tests and also validate the behavior of PyArrow and DuckDB.