Skip to content

FIX Make get_namespace handle pandas dataframe input#32838

Merged
lesteve merged 7 commits intoscikit-learn:mainfrom
betatim:fix-get_namespace-for-df
Dec 5, 2025
Merged

FIX Make get_namespace handle pandas dataframe input#32838
lesteve merged 7 commits intoscikit-learn:mainfrom
betatim:fix-get_namespace-for-df

Conversation

@betatim
Copy link
Copy Markdown
Member

@betatim betatim commented Dec 4, 2025

Reference Issues/PRs

Closes #32836

What does this implement/fix? Explain your changes.

A pandas DataFrame or Series does not have a namespace associated to it in the array API. As such we should filter it out, like we do for sparse arrays, before trying to determine the namespace. This means that for get_namespace(a_df) (with array API enabled) we get the NumPy compat namespace back as xp. I think this makes sense.

AI usage disclosure

I used AI assistance for:

  • Code generation (e.g., when writing an implementation or fixing a bug)
    • Cursor's tab complete helped adapt the test cases from test_get_namespace_ndarray_with_dispatch which I used as starting point

Any other comments?

How do we solve the problem that we can't import _is_pandas_df_or_series from sklearn.utils.validation as it causes a circular import. Duplicating it feels a bit weird, but I can't think of a great alternative. We could use hasattr(array, "iloc") to detect "pandas-ness"?

Before discussing how we best detect "pandas-ness" we should agree that filtering DataFrame and Seriesout (like sparse arrays) is the right approach.

cc @ogrisel

A pandas DataFrame or Series does not have a namespace associated to it
in the array API. As such we should filter it out, like sparse arrays,
before trying to determine the namespace.
@betatim betatim added this to the 1.8 milestone Dec 4, 2025
@lesteve
Copy link
Copy Markdown
Member

lesteve commented Dec 4, 2025

A pandas DataFrame or Series does not have a namespace associated to it in the array API. As such we should filter it out, like we do for sparse arrays, before trying to determine the namespace. This means that for get_namespace(a_df) (with array API enabled) we get the NumPy compat namespace back as xp. I think this makes sense.

Before discussing how we best detect "pandas-ness" we should agree that filtering DataFrame and Seriesout (like sparse arrays) is the right approach.

Treating pandas dataframes the same as sparse arrays seems to make sense to me. ccing the other members of the Array API special team to have a second opinion @OmarManzoor @lucyleeow

Copy link
Copy Markdown
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I think this is the correct fix but it needs to be generalized, see below.

@lucyleeow
Copy link
Copy Markdown
Member

Treating pandas dataframes the same as sparse arrays seems to make sense to me.

+1 from me too

@betatim betatim force-pushed the fix-get_namespace-for-df branch from 9fa59a3 to ae852b8 Compare December 5, 2025 08:51
@betatim
Copy link
Copy Markdown
Member Author

betatim commented Dec 5, 2025

I've moved the is_* functions to their own module, moved the tests to sklear/utils/tests/test_dataframe.py and added a is_df_or_series(). The functions have also changed from _is_pandas to is_pandas as they are now in a private module.

Copy link
Copy Markdown
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as it is, but feel free to to implement https://github.com/scikit-learn/scikit-learn/pull/32838/files#r2591717992 as well.

@lesteve
Copy link
Copy Markdown
Member

lesteve commented Dec 5, 2025

Let's merge this one!

@lesteve
Copy link
Copy Markdown
Member

lesteve commented Dec 5, 2025

Actually @betatim pushed a commit at about the same time I was about to merge, oh well @betatim let me know when you think this is ready!

@betatim
Copy link
Copy Markdown
Member Author

betatim commented Dec 5, 2025

If the tests pass this is ready. Automerge?

@lesteve lesteve enabled auto-merge (squash) December 5, 2025 16:19
@lesteve lesteve merged commit 47621fe into scikit-learn:main Dec 5, 2025
39 checks passed
@betatim betatim deleted the fix-get_namespace-for-df branch December 5, 2025 17:07
lesteve pushed a commit to lesteve/scikit-learn that referenced this pull request Dec 9, 2025
@lesteve lesteve mentioned this pull request Dec 9, 2025
14 tasks
lesteve pushed a commit to lesteve/scikit-learn that referenced this pull request Dec 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enabling array API dispatch causes pipelines to reject dataframe inputs

4 participants