[Python] Add support for Pandas 2.0.0#7005
…hod to get the datetimes as int64_t epochs
…e should do so as well
…' it's implicitly converted from to py::array
…e if it's pyarrow backed
…arrow" This reverts commit 75ba77a.
pdet
left a comment
Just have some small comments. In general, I think the logic there seems pretty solid. Excellent work!
I do have some suggestions for further testing. These may be already covered, so ignore them if that's the case.
Could we test Pandas with the Arrow representation on queries with filter and projection pushdown?
Is it possible to create an empty data frame, where only the columns are defined, and try to scan that?
Could we also add tests with Null values? Including completely null columns?
Do we have tests with mixed-type columns, i.e., some columns numpy-backed, others Python objects (ints, strings, maps, all mixed within one column), and others Arrow-backed?
tools/pythonpkg/src/include/duckdb_python/pandas/pandas_bind.hpp
…in an earlier test
… whether it's supported or not
@Tishj could you merge this with master?
pdet
left a comment
Thijs, thanks for adjusting the tests, I think reusing the existing tests is an excellent idea. I've just added two small comments!
…ould be int32 or int64
@pdet can you do another pass over this?
LGTM! Thanks for all the hard work on this @Tishj! It looks great!
This PR fixes #6954, #6695
Reorganizing python package source code
I've reorganized the files and folder structure within the `tools/pythonpkg/src` folder a bit.
Pybind11 Wrapper
I've decided to no longer use `namespace py = pybind11` and instead create `namespace py` that uses `namespace pybind11`.
It should have no effect on existing functionality, but allows us to add to the functionality and override/shadow some of the behavior that we've had to patch around before.
One such example is `py::isinstance`: our import cache keeps the `py::object` ptr as nullptr when the module could not be imported and is optional.
When these objects are used with `pybind11`'s `isinstance` method, this will segfault because it does no nullptr check.
In the past we've patched around this by adding an `IsInstance` method to our `PythonImportCacheItem`, but I've adapted `py::isinstance` to just return false if the type object is null, removing the need for this `IsInstance` method.
Split Numpy logic from Pandas logic
Split some of the Numpy scanning code from Pandas, and introduced a `PandasColumn` class + `PandasColumnBackend` enum to take some steps towards supporting multiple backends in the future.
PyArrow Pandas DataFrames
PyArrow backed dataframes are transformed into `pyarrow.lib.Table` instead and dealt with earlier on.
For now mixed dataframes are converted into Tables too.
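To make the routing rule above concrete, here is a minimal Python sketch of the decision only. It is not the PR's C++ code: `ColumnBackend`, `classify_backend` and `should_scan_as_arrow_table` are hypothetical names, and the check relies on the assumption that pyarrow backed pandas dtypes render with a `[pyarrow]` suffix (for example `int64[pyarrow]`).

```python
from enum import Enum, auto
from typing import Iterable


class ColumnBackend(Enum):
    """Hypothetical Python mirror of the PandasColumnBackend idea."""
    NUMPY = auto()
    ARROW = auto()


def classify_backend(dtype_name: str) -> ColumnBackend:
    # Assumption: pyarrow backed dtypes are named "<type>[pyarrow]";
    # everything else is treated as numpy backed in this sketch.
    if dtype_name.endswith("[pyarrow]"):
        return ColumnBackend.ARROW
    return ColumnBackend.NUMPY


def should_scan_as_arrow_table(dtype_names: Iterable[str]) -> bool:
    # Mixed frames also take the Table path, so a single arrow backed
    # column is enough to route the whole frame there.
    return any(classify_backend(n) is ColumnBackend.ARROW for n in dtype_names)
```

With real pandas, the names could be obtained from something like `[str(t) for t in df.dtypes]`, though the PR itself may detect the backend differently.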
Adapting python tests
I've reworked our python tests to use dependency injection for `pandas` instead of the base module.
I've created two proxies of the pandas module:
NumpyPandas: regular behavior of `pandas`.
ArrowPandas: the `pandas.DataFrame` constructor produces a dataframe that has been converted to using pyarrow backed columns. `pandas.testing.assert_frame_equal` has been intercepted to first convert any pyarrow backed column to numpy backed instead. This is done because we can't produce pyarrow backed dataframes from DuckDB yet; this support should be easier to add when we can produce pyarrow backed dataframes.
Verified that our entire test suite works on `pandas` versions: 2.0.0, 1.5.3, 1.3.3
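The proxy idea described for the tests can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not the PR's test code: `ModuleProxy`, `make_arrow_proxy` and the stand-in `fake_pandas` module are hypothetical, and the stand-in is used so the sketch runs without pandas installed.

```python
import types


class ModuleProxy:
    """Forwards attribute access to a wrapped module, except for overrides."""

    def __init__(self, module, overrides=None):
        self._module = module
        self._overrides = dict(overrides or {})

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, so the
        # instance attributes set in __init__ are not routed through here.
        if name in self._overrides:
            return self._overrides[name]
        return getattr(self._module, name)


def make_arrow_proxy(module, to_arrow):
    """Build a proxy whose DataFrame constructor converts its result."""

    def data_frame(*args, **kwargs):
        return to_arrow(module.DataFrame(*args, **kwargs))

    return ModuleProxy(module, {"DataFrame": data_frame})


# Stand-in for the pandas module (hypothetical, for illustration only).
fake_pandas = types.SimpleNamespace(
    DataFrame=lambda data: {"backend": "numpy", "data": data},
    __version__="0.0-fake",
)

numpy_pandas = ModuleProxy(fake_pandas)  # regular behavior
arrow_pandas = make_arrow_proxy(
    fake_pandas, lambda df: {**df, "backend": "pyarrow"}
)
```

With real pandas 2.x, `to_arrow` could plausibly be `lambda df: df.convert_dtypes(dtype_backend="pyarrow")`; whether the PR uses that exact call is an assumption. A test would then receive either proxy as its `pandas` argument, for example through a parametrized pytest fixture.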