
[Python] Add support for Pandas 2.0.0#7005

Merged
Mytherin merged 76 commits into duckdb:master from Tishj:python_pandas_pyarrow on Apr 24, 2023
Tishj commented Apr 8, 2023

This PR fixes #6954 and #6695.

Reorganizing python package source code

I've reorganized the files and folder structure within the tools/pythonpkg/src folder a bit.

Pybind11 Wrapper

I've decided to stop using namespace py = pybind11 and instead define our own namespace py that pulls in namespace pybind11.
This should have no effect on existing functionality, but it lets us add functionality and override or shadow behavior that we previously had to patch around.
One such example is py::isinstance. Our import cache leaves the py::object pointer as nullptr when an optional module could not be imported.
When such an object is passed to pybind11's isinstance, it segfaults because pybind11 does no nullptr check.
In the past we patched around this by adding an IsInstance method to our PythonImportCacheItem. I've now adapted py::isinstance to return false when the type object is null, which removes the need for IsInstance.
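
The null-check idea can be sketched in a few lines. The real change is in C++ on top of pybind11; the Python below only mirrors the logic, and `ImportCacheItem` and `safe_isinstance` are hypothetical names chosen for illustration, not DuckDB's API. In Python the unguarded call would raise a TypeError rather than segfault, but the guard has the same shape.

```python
import importlib


class ImportCacheItem:
    """Hypothetical stand-in for an optional, lazily imported type."""

    def __init__(self, module_name, attr_name):
        try:
            module = importlib.import_module(module_name)
            self.obj = getattr(module, attr_name, None)
        except ImportError:
            # Optional dependency is missing: keep a null entry instead of raising.
            self.obj = None


def safe_isinstance(value, type_obj):
    """isinstance that treats a missing (None) type as 'no match'."""
    if type_obj is None:
        return False
    return isinstance(value, type_obj)


decimal_type = ImportCacheItem("decimal", "Decimal")
missing_type = ImportCacheItem("module_that_is_not_installed", "Thing")

print(safe_isinstance(3, decimal_type.obj))  # False: an int is not a Decimal
print(safe_isinstance(3, missing_type.obj))  # False: the type was never imported
```

With the guard inside the shared helper, callers no longer need a separate method on the cache item to stay safe.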

Split Numpy logic from Pandas logic

I've split some of the NumPy scanning code from the pandas code and introduced a PandasColumn class and a PandasColumnBackend enum, as a first step toward supporting multiple backends in the future.

PyArrow Pandas DataFrames

PyArrow-backed dataframes are now converted to a pyarrow.lib.Table early on and handled through that path.
For now, mixed dataframes (some columns NumPy-backed, others PyArrow-backed) are converted to Tables too.
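
As a rough illustration of that routing rule, here is a minimal sketch. It is not the C++ implementation: `ColumnBackend`, its members, `choose_scan_path`, and the returned labels are all hypothetical, loosely modeled on the PandasColumnBackend idea described above.

```python
from enum import Enum, auto


class ColumnBackend(Enum):
    """Hypothetical Python mirror of a per-column backend tag."""

    NUMPY = auto()
    PYARROW = auto()


def choose_scan_path(column_backends):
    """Return which path a dataframe takes, given one backend per column."""
    if any(backend is ColumnBackend.PYARROW for backend in column_backends):
        # Fully or partially pyarrow-backed: convert the whole frame to a Table.
        return "pyarrow_table"
    return "numpy_scan"


print(choose_scan_path([ColumnBackend.NUMPY, ColumnBackend.NUMPY]))    # numpy_scan
print(choose_scan_path([ColumnBackend.NUMPY, ColumnBackend.PYARROW]))  # pyarrow_table
```

Routing mixed frames through the Table path keeps a single scan implementation per dataframe, at the cost of converting the NumPy-backed columns as well.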

Adapting python tests

I've reworked our python tests to receive pandas through dependency injection instead of importing the base module directly.
I've created two proxies of the pandas module:

  • NumpyPandas
    Regular pandas behavior.
  • ArrowPandas
    The pandas.DataFrame constructor produces a dataframe whose columns have been converted to pyarrow-backed columns.
    pandas.testing.assert_frame_equal is intercepted to first convert any pyarrow-backed column to a numpy-backed one. This is done because DuckDB can't produce pyarrow-backed dataframes yet; support for comparing them directly should be easier to add once it can.
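
The two proxies could look roughly like the sketch below. This is not the code from the PR: the constructor argument, the `_pd` attribute, and the use of `convert_dtypes(dtype_backend="pyarrow")` (available in pandas 2.0 and later) are assumptions made for illustration, and the assert_frame_equal interception is left out.

```python
class NumpyPandas:
    """Proxy that forwards everything to the wrapped pandas module unchanged."""

    def __init__(self, pandas_module):
        self._pd = pandas_module

    def __getattr__(self, name):
        # Only called for attributes that are not defined on the proxy itself.
        return getattr(self._pd, name)


class ArrowPandas(NumpyPandas):
    """Proxy whose DataFrame constructor returns pyarrow-backed columns."""

    def DataFrame(self, *args, **kwargs):
        df = self._pd.DataFrame(*args, **kwargs)
        # Assumes pandas >= 2.0, where convert_dtypes accepts dtype_backend.
        return df.convert_dtypes(dtype_backend="pyarrow")
```

A test can then take the proxy as a parameter, for example through `pytest.mark.parametrize` over `NumpyPandas(pandas)` and `ArrowPandas(pandas)`, and call `pandas.DataFrame(...)` without knowing which backend it gets.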

Verified that our entire test suite works on pandas versions: 2.0.0, 1.5.3, 1.3.3

Tishj added 30 commits March 31, 2023 17:03
…' it's implicitly converted from to py::array
pdet left a comment

Just have some small comments. In general, I think the logic there seems pretty solid. Excellent work!

I do have some suggestions for further testing. These may be already covered, so ignore them if that's the case.

Could we test Pandas with the Arrow representation on queries with filter and projection pushdown?
Is it possible to create an empty data frame, where only the columns are defined, and try to scan that?
Could we also add tests with Null values? Including completely null columns?
Do we have tests with mixed-type columns (e.g., some columns NumPy, others Python objects such as ints, strings, and maps, possibly all mixed within one column, and others Arrow)?


Mytherin left a comment

LGTM! @pdet can you have another look?

Mytherin commented

@Tishj could you merge this with master?


pdet left a comment

Thijs, thanks for adjusting the tests. I think reusing the existing tests is an excellent idea. I've just added two small comments!

Mytherin commented

@pdet can you do another pass over this?

pdet commented Apr 24, 2023

LGTM! Thanks for all the hard work on this @Tishj! It looks great!

Mytherin merged commit c5737e4 into duckdb:master on Apr 24, 2023
Tishj deleted the python_pandas_pyarrow branch on November 7, 2025 16:15

Development

Successfully merging this pull request may close these issues.

[Python] Pandas 2.0.0 datetime with timezone consumption issue
