[WIP] MNT Use isinstance instead of dtype.kind check for scalar validation. #10017
massich wants to merge 6 commits into scikit-learn:master
Conversation
|
For completeness, here are some similar tuples:
Shall we agree on lower/upper case? |
|
Hijacking since it's also relevant to an open PR I have, #10042, and @jnothman sent me to #7394 (which this is the successor of, right?) |
Uppercase was used since it was a global at that time. If we define something, I would use lower case. |
You also have FLOAT_DTYPE in validation.py |
|
yes validation.py please
|
|
Also important for the matter: should 0d arrays be considered as ints for all practical purposes (and be included in the int types)? It seems scikit-learn is still inconsistent in this aspect. |
|
I don't think we should differentiate between scalar arrays and Python scalars... But I also don't think we should promise that we support scalar arrays. It's tempting to raise an error when they are passed as parameters... But I'm sure we'll break someone's code if we do so...
|
|
@glemaitre regarding maybe an option is to set |
|
Actually there are not many places where |
Actually it should be equivalent to |
|
Actually, |
|
So at the moment this doesn't handle 0d arrays. Are we making that a policy? |
|
It doesn't. I kept the behavior that was in the codebase. IMHO if a function needs to accept 0d arrays it should be handled there directly, in the same manner as:

def my_func(param=None):
    if not (isinstance(param, integer_types) or param is None):
        if param.ndim == 0 and isinstance(param[()], integer_types):
            param = param[()]
        else:
            raise ValueError("n_estimators must be an integer, got {0}.".format(type(param)))
    ... |
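The same unwrap-then-check pattern could also be factored into a standalone helper instead of being repeated per function. The sketch below is purely illustrative (`check_integer`, its name and signature are hypothetical, not part of scikit-learn):

```python
import numbers

import numpy as np


def check_integer(param, name="param", allow_none=False):
    """Hypothetical validator: accept Python/numpy integer scalars,
    unwrap 0d integer arrays, and optionally allow None."""
    if allow_none and param is None:
        return None
    # A 0d numpy array is unwrapped into the scalar it holds.
    if isinstance(param, np.ndarray) and param.ndim == 0:
        param = param[()]
    if not isinstance(param, (numbers.Integral, np.integer)):
        raise ValueError(
            "{0} must be an integer, got {1}.".format(name, type(param)))
    return param
```

A caller would then write `n_estimators = check_integer(n_estimators, "n_estimators")` once, instead of open-coding the `ndim == 0` special case at every call site.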
|
I think we have previously had issues where 0d arrays were passed in unknowingly, but it may have been due to bugs in numpy/scipy/etc that previously existed (e.g. iteration over memmapped array or something obscure like that). |
|
And answering @jnothman, I'm +1 on not explicitly supporting 0d arrays. @glemaitre any thoughts? |
|
0d arrays were not supported; I don't think we should try to add support. It would make the code overly complex for no real benefit. |
|
I don't think the code becomes overly complex if we have an is_integer or a check_integer helper instead of using isinstance. Yes, 0d arrays were not supported, but this could also be inducing unintentional behaviour. Note that, for instance, 0d arrays are output when iterating over a numpy array. Thus for instance |
|
If that's the case, I think we should implement our own checker (or validation unwrapping the 0d array).

>>> foo = np.array(42)
>>> isinstance(foo, (numbers.Integral, np.integer))
False
>>> isinstance(foo[()], (numbers.Integral, np.integer))
True

and adding |
|
@jnothman
>>> import numbers
>>> import numpy as np
>>> from sklearn.model_selection import ParameterGrid
>>> [isinstance(d['a'], (numbers.Integral, np.integer))
... for d in ParameterGrid(dict(a=np.arange(3)))]
...
[True, True, True]
>>> [isinstance(d['a'], np.ndarray) for d in ParameterGrid(dict(a=np.arange(3)))]
[False, False, False] |
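For reference, the same observation can be reproduced without ParameterGrid at all: iterating over a 1d array yields numpy scalars, not 0d arrays. A minimal check, assuming only numpy:

```python
import numpy as np

# Iterating over a 1d integer array yields numpy integer scalars,
# not 0d arrays, so isinstance checks against np.integer accept them.
items = list(np.arange(3))
assert all(isinstance(x, np.integer) for x in items)
assert not any(isinstance(x, np.ndarray) for x in items)
```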
|
Oh dear... am I getting multiple things confused? They do have attributes like shape and ndim. Well, whatever we get out of ParameterGrid should be handled and tested!
|
|
I had not realized that instances of |
I agree with that. 0d arrays are a weird quirk which, to be honest, I don't really understand. They should not be confused with numpy scalars:

arr_0d = np.array(0)   # 0d array
scalar = np.int64(0)   # numpy scalar

We should definitely support numpy scalars because they are very easy to get, e.g. by indexing, or by doing a reduction like |
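To make the distinction concrete, here is a small self-contained sketch of how numpy scalars typically arise (indexing, reductions) versus what a 0d array actually is:

```python
import numpy as np

arr = np.arange(3)

# These everyday operations return numpy scalars (np.int64 here):
by_indexing = arr[0]
by_reduction = arr.sum()
assert isinstance(by_indexing, np.integer)
assert isinstance(by_reduction, np.integer)

# A 0d array is something else entirely: an ndarray with ndim == 0.
arr_0d = np.array(0)
assert isinstance(arr_0d, np.ndarray) and arr_0d.ndim == 0
assert not isinstance(arr_0d, np.integer)
# Indexing a 0d array with an empty tuple recovers the wrapped scalar.
assert isinstance(arr_0d[()], np.integer)
```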
|
Also note that in recent versions of numpy: |
sklearn/utils/validation.py
Outdated
integer_types = (numbers.Integral, np.integer)
floating_types = (float, np.floating)
Those two constants should be moved to sklearn.utils.fixes and renamed to:

# In Python 1.8.2 np.integer is not yet a subclass of numbers.Integral
SCALAR_INTEGRAL_TYPES = (numbers.Integral, np.integer)
SCALAR_REAL_TYPES = (numbers.Real, np.floating)
Ideally, it would be nice to find out the version of numpy that made its scalar types subclasses of the corresponding Python scalar base types from numbers.
In Python 1.8.2 np.integer is not yet a subclass of numbers.Integral
I think that you mean Numpy, I could not find Python 1.* in conda :)
it would be nice to find out the version of numpy that made it such that is scalar types are subclasses of the corresponding Python scalar
it is 1.9
@ogrisel I disagree with the scalar_real_types, since integers derive from Real whereas they do not derive from float:
>>> import numbers
>>> isinstance(int(42), numbers.Real)
True
>>> isinstance(int(42), float)
False
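The practical consequence for validation is that a check built on numbers.Real will accept Python ints, while one built on float will not. A small sketch of the difference:

```python
import numbers

import numpy as np

real_types = (numbers.Real, np.floating)
floating_types = (float, np.floating)

# Python ints sit in the Real numeric tower (Integral < Rational < Real)...
assert isinstance(42, real_types)
# ...but they are not floats, so a float-based check rejects them.
assert not isinstance(42, floating_types)
assert isinstance(3.14, floating_types)
assert isinstance(np.float64(3.14), floating_types)
```

So the choice between the two tuples decides whether a parameter documented as "float" silently accepts integer values.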
|
There's also another one in scikit-learn/sklearn/externals/six.py Lines 35 to 46 in 3fa7a06, which is only used in these two tests, but I'm not sure that it has the same functionality: scikit-learn/sklearn/model_selection/tests/test_split.py Lines 549 to 550 in 3fa7a06 and scikit-learn/sklearn/tests/test_cross_validation.py Lines 446 to 447 in 3fa7a06 |
|
Do not modify the vendored six module, but feel free to replace usage of |
Indeed, good catch. |
sklearn/utils/validation.py
Outdated
from ..externals.joblib import Memory

floating_types = (float, np.floating)
I think we should rename this to scalar_floating_types to make it explicit that this should not be used to validate array dtypes.
Also, because this tuple of types is not provided as a backport for old versions of numpy but is going to be used to validate scalar parameters in similar ways to scalar_integer_types, I think we should keep both constants here instead of in sklearn.utils.fixes. Sorry for the back and forth comments.
|
There are places where only numbers.Integral is used, like: scikit-learn/sklearn/tree/export.py Line 14 in 3fa7a06 and scikit-learn/sklearn/tree/export.py Lines 413 to 420 in 3fa7a06, which can raise an error if numpy is 1.8.2. But there's no open issue about that. Shall we make it homogeneous and add a test for it? Another disturbing thought is this comment: scikit-learn/sklearn/feature_extraction/hashing.py Lines 103 to 105 in 3fa7a06

>>> np.__version__
'1.8.2'
>>> types = (np.int0, np.int8, np.int16, np.int32, np.int64)
>>> [issubclass(t, numbers.Integral) for t in types]
[False, False, False, False, False]
....
>>> np.__version__
'1.9.3'
>>> types = (np.int0, np.int8, np.int16, np.int32, np.int64)
>>> [issubclass(t, numbers.Integral) for t in types]
[True, True, True, True, True]
...
>>> np.__version__
'1.10.4'
>>> types = (np.int0, np.int8, np.int16, np.int32, np.int64)
>>> [issubclass(t, numbers.Integral) for t in types]
[True, True, True, True, True]
...
>>> np.__version__
'1.11.3'
>>> types = (np.int0, np.int8, np.int16, np.int32, np.int64)
>>> [issubclass(t, numbers.Integral) for t in types]
[True, True, True, True, True]
...
>>> np.__version__
'1.12.1'
>>> types = (np.int0, np.int8, np.int16, np.int32, np.int64)
>>> [issubclass(t, numbers.Integral) for t in types]
[True, True, True, True, True] |
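Given the version behaviour listed above, a tuple that includes np.integer alongside numbers.Integral behaves the same on old and new numpy: the explicit np.integer entry covers versions where numpy's scalar types are not registered with the numbers ABCs. A quick sanity check (np.int0 is omitted here since it was deprecated in later numpy releases):

```python
import numbers

import numpy as np

# The explicit np.integer entry keeps the check working even on
# numpy < 1.9, where np.int64 etc. were not subclasses of
# numbers.Integral.
SCALAR_INTEGER_TYPES = (numbers.Integral, np.integer)

for t in (np.int8, np.int16, np.int32, np.int64):
    assert isinstance(t(1), SCALAR_INTEGER_TYPES)
assert isinstance(1, SCALAR_INTEGER_TYPES)
```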
|
And IIRC there has also been some variation across Python versions. It's unlikely that an int16 scalar (or a uint) is going to be used for a parameter, but there's no harm in supporting it. Yes, we should be replacing uses of numbers.Integral, and of dtype.kind checks for scalars, in this PR.
|
Since the minimal numpy version is currently 1.11.0, a significant part of this PR is no longer needed (and was fixed in #14004). |
|
This was taken care of by #14004, but some were left out:

~/code/scikit-learn master
(mne) ❯ git grep -i 'np.asarray(test_size)'
sklearn/model_selection/_split.py:    test_size_type = np.asarray(test_size).dtype.kind

I'll close this one and maybe open a new one. |
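As a sketch of what an isinstance-based replacement for that remaining dtype.kind check could look like (`scalar_kind` is a hypothetical illustration, not an actual scikit-learn helper):

```python
import numbers

import numpy as np


def scalar_kind(value):
    """Hypothetical stand-in for np.asarray(value).dtype.kind when
    classifying a scalar parameter such as test_size.  Unlike
    dtype.kind, this maps booleans to 'i' (bool subclasses int)."""
    # Check Integral first: ints are also instances of numbers.Real.
    if isinstance(value, (numbers.Integral, np.integer)):
        return "i"
    if isinstance(value, (numbers.Real, np.floating)):
        return "f"
    return "other"


assert scalar_kind(3) == "i"
assert scalar_kind(0.25) == "f"
assert scalar_kind(np.float32(0.25)) == "f"
assert scalar_kind("0.25") == "other"
```

The ordering of the two checks matters: since Python's numeric tower makes every Integral also a Real, testing Real first would classify ints as floats.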
Reference Issues/PRs
This PR takes over the stalled #7394.
What does this implement/fix? Explain your changes.
Any other comments?
Based on the comments, @lesteve and @jnothman were in favor of using a helper function. I propose to use a set of types to mimic this:
My only question is: where should I define such tuples? Any ideas?