[MRG] EHN add support for scalar, slice and mask in safe_indexing axis=0#14475
[MRG] EHN add support for scalar, slice and mask in safe_indexing axis=0#14475jnothman merged 55 commits intoscikit-learn:masterfrom
Conversation
|
@jnothman @NicolasHug As promised with a bit of delay, I draft something to make |
…into is/consistent_safe_indexing
NicolasHug
left a comment
There was a problem hiding this comment.
Dumb question, but is there a real use-case for this?
Passing a mask array to select sample does not seem inappropriate. |
jnothman
left a comment
There was a problem hiding this comment.
Does this mean that previously unsupported code will now work, such as cv splitters returning masks or feature selectors...? Are we being too lenient in our APIs then?
sklearn/utils/__init__.py
Outdated
| - To select multiple elements (i.e. rows or columns), `indices` can be | ||
| one of the following: `list`, `array`, `slice`. The type used in | ||
| these containers can be one of the following: `int`, `bool`, and | ||
| `str`. `str` is only supported when `X` is a dataframe. |
There was a problem hiding this comment.
what are slices of bool and str?
Would this work?
To select multiple elements (i.e. rows or columns),
indicescan be
a slice, or a container (listorarray). The type used in
these containers can be one of the following:int,bool, and
str.stris only supported whenXis a dataframe.
There was a problem hiding this comment.
slice of bool will give you the first line (it casting stuff into int). We should raise an error for such case I think.
slice of string is useful with dataframe:
df.loc[slice('xxx', 'yyy')]I see that we have an issue because we only use iloc and then it would not work. So the docstring is wrong
There was a problem hiding this comment.
do we really want to support slices of strings? I'd say we don't for now to keep it simple?
There was a problem hiding this comment.
It is already supported for the column with dataframe
There was a problem hiding this comment.
What I mean is:
from sklearn.datasets import fetch_openml
iris = fetch_openml('iris', as_frame=True)
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import make_column_transformer
transformer = make_column_transformer(
(FunctionTransformer(), slice('sepallength', 'sepalwidth'))
)
transformer.fit_transform(iris.data, iris.target)|
I refactor the code which should make it more understandable but less easy to review by looking at the diff (sorry for that). I think that I will need to do the same regarding the tests if we want to keep this indexing manageable. So before to go further, I think that we need to answer a couple of questions:
I think that this is the main blocker here. |
amueller
left a comment
There was a problem hiding this comment.
I'm not sure if I understand what's going on. Are we considering safe_indexing not public and so we can break backward-compatibility?
…into is/consistent_safe_indexing
As mentioned in one of my comment, I reintroduced support for boolean array-like and integer slice meaning since we were supporting it. |
|
Seems that masks were used for indexing in several safe_indexing uses.
Sorry for sending you off the path. And CV is the only place where the mask
is provided effectively by users, so we should be safe to support more
formats if all formats are acceptable there.
|
jnothman
left a comment
There was a problem hiding this comment.
I'm somewhat confused by this _check_key_type(key, superclass) design. It repeats very similar checks for each of three classes, but also has specialised code paths depending on superclass. Why not _determine_key_type(key) -> {'s', 'i', 'b', 'none-slice'} called exactly once per safe_indexing?
sklearn/utils/__init__.py
Outdated
| """Index a pandas dataframe or a series.""" | ||
| # check whether we should index with loc or iloc | ||
| if _check_key_type(key, int): | ||
| by_name = False |
There was a problem hiding this comment.
should we be raising a warning/error if X.index.dtype.kind in 'iu' (and similar on the other axis)? This won't handle the case where there are ints of object dtype in the index.
|
|
||
| if hasattr(key, 'shape'): | ||
| # Work-around for indexing with read-only key in pandas | ||
| # FIXME: solved in pandas 0.25 |
There was a problem hiding this comment.
should we use an explicit version comparison to take this codepath?
There was a problem hiding this comment.
then we need to make an explicit pandas import
sklearn/utils/__init__.py
Outdated
| if np.isscalar(key) or isinstance(key, slice): | ||
| # key is a slice or a scalar | ||
| return X[key] | ||
| key_set = set(key) |
There was a problem hiding this comment.
key will usually be an array, won't it? Should we cast into an array at some point?
There was a problem hiding this comment.
I don't know if we need to do so. We can always postpone until required.
| return None | ||
| if isinstance(key, tuple(dtype_to_str.keys())): | ||
| try: | ||
| return dtype_to_str[type(key)] |
There was a problem hiding this comment.
Not sure this is the right behaviour for a scalar bool
There was a problem hiding this comment.
What do you mean? bool and np.bool_ will return the string 'bool'. What would you expect instead?
Note that in safe indexing, we are not using scalar bool to index.
There was a problem hiding this comment.
Note that in safe indexing, we are not using scalar bool to index.
Okay ;)
| return set_type | ||
| if hasattr(key, 'dtype'): | ||
| try: | ||
| return array_dtype_to_str[key.dtype.kind] |
There was a problem hiding this comment.
Is behaviour with dtype='O' where the array contains only ints or only bools correct? (How does numpy handle this?)
There was a problem hiding this comment.
NumPy will return 'O' and nothing about 'i' or 'b' so it will fail. I am not really sure how to solve this issue without adding so much complexity.
One would need to try converting the array to bool and int first. However, I thought that we are usually dtype=object to handle string.
There was a problem hiding this comment.
This maintains the behavior on master where we use ('O', 'U', 'S') to mean string:
scikit-learn/sklearn/utils/__init__.py
Lines 311 to 318 in 68044b0
jnothman
left a comment
There was a problem hiding this comment.
I've not reviewed tests thoroughly, but I'm pretty happy with this!
Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>
|
Looks like there are two approvals. Anything blocking the merge? |
|
I think this is ready to be merged. I address all comments. |
|
Thanks @glemaitre |
This is the follow-up of #14035.
It is a refactoring of
safe_indexingsplitting theXindexing depending on the type and reorganizing the tests as well.