[MRG] EHN add support for scalar, slice and mask in safe_indexing axis=0 by glemaitre · Pull Request #14475 · scikit-learn/scikit-learn

glemaitre · 2019-07-25T16:29:13Z

This is the follow-up of #14035.

It refactors safe_indexing, splitting the indexing of X by container type, and reorganizes the tests as well.

glemaitre · 2019-07-25T16:31:10Z

@jnothman @NicolasHug As promised, with a bit of delay, I drafted something to make safe_indexing consistent. Happy to have feedback.

thomasjpfan

Nice! :)

sklearn/utils/__init__.py

sklearn/utils/tests/test_utils.py

NicolasHug

Dumb question, but is there a real use-case for this?

doc/whats_new/v0.22.rst

sklearn/utils/__init__.py

glemaitre · 2019-07-30T12:55:55Z

Dumb question, but is there a real use-case for this?

Passing a mask array to select samples does not seem inappropriate.
This feature is mainly meant to make safe_indexing consistent, after @jnothman raised concerns about it. I agree with him that having similar behaviour and support for axis=0 and axis=1 would make sense.

sklearn/utils/__init__.py

jnothman

Does this mean that previously unsupported code will now work, such as cv splitters returning masks or feature selectors...? Are we being too lenient in our APIs then?

sklearn/utils/__init__.py

NicolasHug · 2019-08-01T15:05:34Z

sklearn/utils/__init__.py

+        - To select multiple elements (i.e. rows or columns), `indices` can be
+          one of the following: `list`, `array`, `slice`. The type used in
+          these containers can be one of the following: `int`, `bool`, and
+          `str`. `str` is only supported when `X` is a dataframe.


what are slices of bool and str?

Would this work?

To select multiple elements (i.e. rows or columns), indices can be
a slice, or a container (list or array). The type used in
these containers can be one of the following: int, bool, and
str. str is only supported when X is a dataframe.

A slice of bool will give you the first line (it casts the bounds to int). We should raise an error in such a case, I think.
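To illustrate the casting mentioned here: in plain Python, `bool` is a subclass of `int`, so a slice whose bounds are booleans is silently read as an integer slice. The sketch below shows this on a list and adds a guard that raises instead. The name `reject_bool_slice` is hypothetical and is not part of scikit-learn.

```python
rows = ["row0", "row1", "row2"]

# bool is a subclass of int: False == 0 and True == 1,
# so this slice is interpreted as slice(0, 1).
first_only = rows[slice(False, True)]


def reject_bool_slice(key):
    """Hypothetical guard: refuse slices whose bounds are booleans."""
    if isinstance(key, slice):
        for bound in (key.start, key.stop, key.step):
            if isinstance(bound, bool):
                raise TypeError(
                    "Boolean slice bounds are ambiguous: %r" % (key,)
                )
    return key
```

With such a guard, `slice(0, 2)` passes through unchanged while `slice(False, True)` raises a `TypeError`.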

slice of string is useful with dataframe:

df.loc[slice('xxx', 'yyy')]

I see that we have an issue because we only use iloc and then it would not work. So the docstring is wrong

do we really want to support slices of strings? I'd say we don't for now to keep it simple?

It is already supported for columns when X is a dataframe.

What I mean is:

from sklearn.datasets import fetch_openml
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import make_column_transformer

iris = fetch_openml('iris', as_frame=True)
transformer = make_column_transformer(
    (FunctionTransformer(), slice('sepallength', 'sepalwidth'))
)
transformer.fit_transform(iris.data, iris.target)

glemaitre · 2019-08-01T17:02:59Z

I refactored the code, which should make it more understandable but harder to review by looking at the diff (sorry for that). I think that I will need to do the same for the tests if we want to keep this indexing manageable.

So before going further, I think that we need to answer a couple of questions:

for axis=1, X cannot be a list?
do we want to support slicing with string for both axis=0 and axis=1?
we can now index using boolean array-likes or arrays. I first thought that @jnothman wanted this (#14035 (comment)), but his last comment (#14475 (review)) questions this additional support.

I think that this is the main blocker here.

amueller

I'm not sure if I understand what's going on. Are we considering safe_indexing not public and so we can break backward-compatibility?

sklearn/impute/_iterative.py

sklearn/metrics/cluster/unsupervised.py

sklearn/model_selection/tests/test_search.py

glemaitre · 2019-08-19T12:50:23Z

I'm not sure if I understand what's going on. Are we considering safe_indexing not public and so we can break backward-compatibility?

As mentioned in one of my comments, I reintroduced support for boolean array-likes and integer slices, since we were already supporting them.

jnothman · 2019-08-20T08:26:02Z

Seems that masks were used for indexing in several safe_indexing uses. Sorry for sending you off the path. And CV is the only place where the mask is provided effectively by users, so we should be safe to support more formats if all formats are acceptable there.

jnothman

I'm somewhat confused by this _check_key_type(key, superclass) design. It repeats very similar checks for each of three classes, but also has specialised code paths depending on superclass. Why not _determine_key_type(key) -> {'s', 'i', 'b', 'none-slice'} called exactly once per safe_indexing?
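A minimal sketch of the single-pass design suggested in this comment, restricted to plain Python keys (scalars, slices, lists and tuples). The function names and the returned labels (`'str'`, `'int'`, `'bool'`, or `None` for a "select everything" key) are assumptions made for illustration. They are not necessarily what the PR ended up implementing.

```python
def determine_key_type(key):
    """Hypothetical sketch: return 'str', 'int', 'bool', or None.

    None stands for the "select everything" cases: key is None or an
    unbounded slice such as slice(None).
    """
    if key is None:
        return None
    if isinstance(key, slice):
        bounds = [b for b in (key.start, key.stop) if b is not None]
        if not bounds:
            return None
        kinds = {_scalar_kind(b) for b in bounds}
    elif isinstance(key, (list, tuple)):
        kinds = {_scalar_kind(k) for k in key}
    else:
        kinds = {_scalar_kind(key)}
    if len(kinds) != 1:
        raise ValueError("Mixed or empty key: %r" % (key,))
    return kinds.pop()


def _scalar_kind(value):
    # bool must be tested before int because bool subclasses int.
    for kind, cls in (("bool", bool), ("int", int), ("str", str)):
        if isinstance(value, cls):
            return kind
    raise ValueError("Unsupported key element: %r" % (value,))
```

The key is inspected once, and the caller can then branch on the returned label instead of repeating one check per candidate type.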

jnothman · 2019-08-20T08:30:02Z

sklearn/utils/__init__.py

+    """Index a pandas dataframe or a series."""
+    # check whether we should index with loc or iloc
+    if _check_key_type(key, int):
+        by_name = False


should we be raising a warning/error if X.index.dtype.kind in 'iu' (and similar on the other axis)? This won't handle the case where there are ints of object dtype in the index.

jnothman · 2019-08-20T08:30:41Z

sklearn/utils/__init__.py

+
+    if hasattr(key, 'shape'):
+        # Work-around for indexing with read-only key in pandas
+        # FIXME: solved in pandas 0.25


should we use an explicit version comparison to take this codepath?

then we need to make an explicit pandas import
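One way such an explicit version gate could look is sketched below. The version string is passed in as an argument so that the sketch itself needs no pandas import. Both helper names are hypothetical, and the 0.25 threshold is taken from the FIXME comment in the hunk above.

```python
import re


def version_tuple(version):
    """Turn a version string like '0.24.2' into (0, 24, 2).

    Parsing stops at the first component that does not start with digits,
    so '0.25.0rc1' yields (0, 25, 0).
    """
    parts = []
    for chunk in version.split("."):
        match = re.match(r"\d+", chunk)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def needs_readonly_workaround(pandas_version):
    # Assumption taken from the FIXME: the issue is solved from 0.25 on.
    return version_tuple(pandas_version) < (0, 25)
```

In real code the argument would come from `pandas.__version__`, which is the explicit import mentioned in the reply.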

sklearn/utils/__init__.py

jnothman · 2019-08-20T08:34:42Z

sklearn/utils/__init__.py

+    if np.isscalar(key) or isinstance(key, slice):
+        # key is a slice or a scalar
+        return X[key]
+    key_set = set(key)


key will usually be an array, won't it? Should we cast into an array at some point?

I don't know if we need to do so. We can always postpone until required.
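For context on the list code path in the hunk above, here is a minimal sketch of indexing a plain Python list along axis=0 with a scalar, a slice, a boolean mask, or a sequence of integers, without casting the key to an array. `index_list` is a hypothetical name and the logic is a simplification, not the code under review.

```python
def index_list(X, key):
    """Hypothetical sketch of indexing a Python list along axis=0."""
    if isinstance(key, (int, slice)) and not isinstance(key, bool):
        # Scalar integer or slice: plain list indexing is enough.
        return X[key]
    key = list(key)
    if key and all(isinstance(k, bool) for k in key):
        # Boolean mask: keep the rows whose flag is True.
        if len(key) != len(X):
            raise ValueError("Mask length must match the number of rows.")
        return [row for row, keep in zip(X, key) if keep]
    # Otherwise treat key as a sequence of integer positions.
    return [X[i] for i in key]
```

A scalar `bool` is deliberately not accepted as a key here: it fails at `list(key)` with a `TypeError`.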

sklearn/utils/__init__.py

jnothman · 2019-08-20T21:05:51Z

sklearn/utils/__init__.py

+        return None
+    if isinstance(key, tuple(dtype_to_str.keys())):
+        try:
+            return dtype_to_str[type(key)]


Not sure this is the right behaviour for a scalar bool

What do you mean? bool and np.bool_ will return the string 'bool'. What would you expect instead?

Note that in safe indexing, we are not using scalar bool to index.

Note that in safe indexing, we are not using scalar bool to index.

Okay ;)

sklearn/utils/__init__.py

jnothman · 2019-08-20T21:08:28Z

sklearn/utils/__init__.py

+        return set_type
+    if hasattr(key, 'dtype'):
+        try:
+            return array_dtype_to_str[key.dtype.kind]


Is behaviour with dtype='O' where the array contains only ints or only bools correct? (How does numpy handle this?)

NumPy will return 'O' and nothing about 'i' or 'b', so it will fail. I am not really sure how to solve this issue without adding a lot of complexity.

One would need to try converting the array to bool and int first. However, I thought that we usually use dtype=object to handle strings.

This maintains the behavior on master where we use ('O', 'U', 'S') to mean string:

scikit-learn/sklearn/utils/__init__.py

Lines 311 to 318 in 68044b0

if hasattr(key, 'dtype'):
    if superclass is int:
        return key.dtype.kind == 'i'
    elif superclass is bool:
        return key.dtype.kind == 'b'
    else:
        # superclass = str
        return key.dtype.kind in ('O', 'U', 'S')
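The corner case raised in this thread can be reproduced with NumPy alone. The sketch below uses a hypothetical mapping, `KIND_TO_TYPE`, that follows the convention quoted from master: kinds 'O', 'U' and 'S' are all read as strings. It shows that an object array holding only integers is therefore classified as a string key.

```python
import numpy as np

# Hypothetical mapping from NumPy dtype kinds to key types.
# 'O', 'U' and 'S' are all read as strings, mirroring the snippet above.
KIND_TO_TYPE = {"i": "int", "b": "bool", "O": "str", "U": "str", "S": "str"}


def array_key_type(key):
    """Classify an array key by looking only at its dtype kind."""
    return KIND_TO_TYPE[np.asarray(key).dtype.kind]


int_key = np.array([0, 2])
object_key = np.array([0, 2], dtype=object)  # same values, object dtype
```

Here `array_key_type(int_key)` gives `'int'` while `array_key_type(object_key)` gives `'str'`, because only the dtype kind is inspected, not the values.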

jnothman

I've not reviewed tests thoroughly, but I'm pretty happy with this!

sklearn/utils/__init__.py

Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>

rth · 2019-09-04T08:47:13Z

Looks like there are two approvals. Anything blocking the merge?

glemaitre · 2019-09-04T09:42:01Z

I think this is ready to be merged. I addressed all comments.

jnothman · 2019-09-04T09:45:38Z

Thanks @glemaitre

glemaitre added 2 commits July 25, 2019 18:20

EHN add support for scalar, slice and mask in safe_indexing axis=0

c01385c

DOC

0e5c037

glemaitre changed the title ~~EHN add support for scalar, slice and mask in safe_indexing axis=0~~ [MRG] EHN add support for scalar, slice and mask in safe_indexing axis=0 Jul 25, 2019

glemaitre added 2 commits July 25, 2019 22:20

FIX behaviour when passing None

f5e08c4

PEP8

bb4db91

thomasjpfan reviewed Jul 27, 2019

sklearn/utils/__init__.py

sklearn/utils/__init__.py

sklearn/utils/tests/test_utils.py

glemaitre added 3 commits July 29, 2019 12:05

address thomas comments

8cd74db

Merge remote-tracking branch 'glemaitre/is/consistent_safe_indexing' into is/consistent_safe_indexing

9878ef1

Merge remote-tracking branch 'origin/master' into is/consistent_safe_indexing

2f6a0bd

NicolasHug reviewed Jul 29, 2019

doc/whats_new/v0.22.rst

sklearn/utils/__init__.py

sklearn/utils/__init__.py

glemaitre added 5 commits July 29, 2019 18:31

FIX change boolean array-likes indexing in old NumPy version

d0f8d60

change indexing

f95a228

add regression test in utils

1c81803

fix

c8009a2

add test in column transformer

a80b33d

jnothman reviewed Jul 30, 2019

sklearn/utils/__init__.py

jnothman reviewed Jul 30, 2019

sklearn/utils/__init__.py

sklearn/utils/__init__.py

sklearn/utils/__init__.py

glemaitre added 5 commits August 1, 2019 14:12

Merge remote-tracking branch 'origin/master' into is/mask_indexing

56a6759

raise error if axis not 0 or 1

9fb045d

Merge branch 'is/mask_indexing' into is/consistent_safe_indexing

0d46f7f

itert

5dcf34f

iter

70f0e02

NicolasHug reviewed Aug 1, 2019

glemaitre added 4 commits August 1, 2019 18:31

refactor

7127b5a

PEP8 comments

2f96882

iter

619fb05

style

b7539bd

glemaitre added 2 commits August 14, 2019 18:37

Merge remote-tracking branch 'origin/master' into is/consistent_safe_indexing

395803a

be explicit

a7d29f6

thomasjpfan approved these changes Aug 14, 2019

amueller reviewed Aug 14, 2019

sklearn/impute/_iterative.py

sklearn/metrics/cluster/unsupervised.py

iter

3624cc5

amueller reviewed Aug 15, 2019

sklearn/model_selection/tests/test_search.py

glemaitre added 4 commits August 19, 2019 14:43

add back support for mask and slice

557aa43

Merge remote-tracking branch 'glemaitre/is/consistent_safe_indexing' into is/consistent_safe_indexing

7d5404a

PEP8

698aef2

typo

4f2fd8f

fix corner case

abc90d7

jnothman reviewed Aug 20, 2019

glemaitre added 2 commits August 20, 2019 15:57

determine the dtype instead of checking it several times

b99b5a2

add for coverage

3bf36b0

thomasjpfan reviewed Aug 20, 2019

sklearn/utils/__init__.py

sklearn/utils/__init__.py

jnothman reviewed Aug 20, 2019

glemaitre added 3 commits August 21, 2019 10:54

iter

bd86ccd

itert

bfb9fa2

bool type

c433494

jnothman reviewed Aug 21, 2019

sklearn/utils/__init__.py

sklearn/utils/__init__.py

glemaitre and others added 2 commits August 21, 2019 13:47

joel comments

7994cc9

Update sklearn/utils/__init__.py

5e29cc6

Co-Authored-By: Joel Nothman <joel.nothman@gmail.com>

jnothman approved these changes Aug 25, 2019

jnothman merged commit ae5f558 into scikit-learn:master Sep 4, 2019

BenjaminBossan mentioned this pull request Dec 5, 2019

Bugfix: sklearn 0.22 skorch-dev/skorch#571

Merged
