[MRG+1] fused types in isotonic_regression #9106
agramfort merged 11 commits into scikit-learn:master
Conversation
I implemented and tested dtype preservation for IsotonicRegression as well, following the convention that X should drive the dtype. Ping @Henley13 for comments :)
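For context, a minimal sketch of what "X drives the dtype" means in practice (hedged: the PR's actual tests may be structured differently):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

X = np.arange(10, dtype=np.float32)
y = np.arange(10, dtype=np.float32)

ir = IsotonicRegression().fit(X, y)
# With this change, float32 inputs should come back as float32
# instead of being silently upcast to float64.
assert ir.predict(X).dtype == np.float32
```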
further ping @lesteve, @glemaitre, @raghavrv, @TomDLT just in case someone bites :P I am pretty sure I forgot things here, because I failed to participate in all of the discussions on 32-bit support over the week.
```python
copy.copy(ir)


def test_isotonic_dtype():
```
should this kind of test be here or in common tests / estimator tags?
Probably there, but that's not possible without estimator tags, right? Something like preserves_precision? Have you run it on all_estimators to see the status? We could post an issue with a checklist.
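A rough sketch of that kind of survey (hedged: the import location of all_estimators and the set of estimators that accept this toy data vary across versions):

```python
import numpy as np
from sklearn.utils import all_estimators  # lived in sklearn.utils.testing in older versions

rng = np.random.RandomState(0)
X = rng.rand(20, 3).astype(np.float32)
y = rng.rand(20).astype(np.float32)

for name, Est in all_estimators(type_filter='regressor'):
    try:
        out = Est().fit(X, y).predict(X)
    except Exception:
        continue  # many estimators need extra parameters or different data
    # anything printing float64 here does not preserve 32-bit precision
    print(name, out.dtype)
```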
Test failures were genuine: one of the tests calls predict without calling fit. Instead of changing the test, I made the model support this behavior (so it works the same way as on master).
sklearn/isotonic.py (outdated):

```python
        The transformed data
        """
        T = as_float_array(T)
        if hasattr(self, '_dtype'):
```
Is this required? Is this faster?
Basically it's a way to check whether there is a saved X. I could replace it with:

```python
if hasattr(self, '_necessary_X_'):
    T = T.astype(self._necessary_X_.dtype, copy=False)
```
I could check if it makes a difference in terms of speed but it's probably minimal (just attribute accesses, right?)
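As a small aside on the copy question (a standalone NumPy example, not the PR's code): astype(..., copy=False) is a no-op when the dtype already matches, so the overhead really is just the attribute access and the dispatch:

```python
import numpy as np

T = np.ones(3, dtype=np.float32)
# No conversion needed: astype with copy=False returns the same object.
assert T.astype(np.float32, copy=False) is T
# Conversion needed: a new float64 array is allocated.
assert T.astype(np.float64, copy=False) is not T
```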
sklearn/isotonic.py (outdated):

```diff
         sample_weight = np.ones(len(y), dtype=y.dtype)
     else:
-        sample_weight = np.array(sample_weight[order], dtype=np.float64)
+        sample_weight = np.array(sample_weight[order])
```
should we make sure that sample_weight has the same dtype as y?
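A minimal sketch of that suggestion (the helper name is hypothetical, not the PR's final code):

```python
import numpy as np

def _weights_like(sample_weight, y):
    # Hypothetical helper: default to ones of y's dtype, otherwise
    # cast the user-provided weights so they match y.
    if sample_weight is None:
        return np.ones(len(y), dtype=y.dtype)
    return np.asarray(sample_weight, dtype=y.dtype)
```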
```python
check_consistent_length(X, y, sample_weight)
X, y = [check_array(x, ensure_2d=False) for x in [X, y]]

X = as_float_array(X)
```
This will trigger a copy, and is rather redundant with check_array. I would prefer:

```python
X = check_array(X, dtype=[np.float64, np.float32])
y = check_array(y, dtype=X.dtype, ensure_2d=False)
```

@TomDLT this has different behavior than as_float_array if the input is int32 (and maybe other dtypes too?):
```python
import numpy as np
from sklearn.utils import as_float_array, check_array

for dtype in (np.int32, np.int64, np.uint32, np.uint64, np.float32, np.float64):
    print('input dtype\t', dtype.__name__)
    x = np.arange(5).astype(dtype)
    x = as_float_array(x)
    print('as_float_array\t', x.dtype)
    x = np.arange(5).astype(dtype)
    x = check_array(x, dtype=[np.float64, np.float32], ensure_2d=False)
    print('check_array\t', x.dtype)
    print()
```

outputs:
```
input dtype      int32
as_float_array   float32
check_array      float64

input dtype      int64
as_float_array   float64
check_array      float64

input dtype      uint32
as_float_array   float64
check_array      float64

input dtype      uint64
as_float_array   float64
check_array      float64

input dtype      float32
as_float_array   float32
check_array      float32

input dtype      float64
as_float_array   float64
check_array      float64
```
Has there been a decision in the other 32-bit PRs about what to do in this situation?
in logistic regression it seems like the check_array way is used.
> this has different behavior than as_float_array if the input is int32

Fair enough, just make sure that it does not copy more than necessary.
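On the copy question, a standalone check (assuming NumPy >= 1.11 for np.shares_memory): with its default copy=False, check_array should hand back the same buffer whenever the input dtype is already in the accepted list:

```python
import numpy as np
from sklearn.utils import check_array

x = np.arange(5, dtype=np.float32)
out = check_array(x, dtype=[np.float64, np.float32], ensure_2d=False)
# float32 is in the accepted list, so no conversion and no copy occurs.
print(np.shares_memory(out, x))  # expected: True
```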
I removed the redundant call to check_array on X and am now just calling it on y. Still, I think the behavior on int inputs should be made consistent across the codebase where possible.
@Henley13 have you already discussed this?
Failing tests on travis seem to suggest that my assumptions here fail on some configs; in particular this:

```
input dtype      int32
as_float_array   float32
check_array      float64
```
thanks @agramfort! my memory might fail, but let me know if I can help
@vene can you replicate the travis failure? It seems I can't.
I was able to reproduce in a clean python 3.5.6 installed with pyenv. The problem is caused by the call to scipy.interpolate.interp1d: on old scipy versions it upcasts float32 inputs to float64, so the interpolated output comes back as float64 even though X, y and T are all float32. A minimal reproduction:
```python
# file check.py
import numpy as np
import scipy
from scipy import interpolate

print(np.__version__)
print(scipy.__version__)

X = np.random.randn(5).astype(np.float32)
y = np.random.randn(5).astype(np.float32)
T = np.random.randn(5).astype(np.float32)

f_ = scipy.interpolate.interp1d(X, y, kind='linear', bounds_error=False)
out = f_(T)
print(out.dtype)
```

On the failing configuration the last line of the output is float64.
I think there are two solutions.

EDIT: pushed a quick fix for solution (1).
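A sketch of the kind of workaround this implies (hedged: the actual fix, per the commit "fix for interp1d upcast on old scipy", may be structured differently):

```python
import numpy as np
from scipy import interpolate

def _interp1d_preserving_dtype(X, y, T):
    # Hypothetical helper: old scipy upcasts float32 to float64 inside
    # interp1d, so cast the result back to T's dtype (a no-op otherwise).
    f_ = interpolate.interp1d(X, y, kind='linear', bounds_error=False)
    return f_(T).astype(T.dtype, copy=False)
```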
sorry, i'm rusty
This is good to go from my end. Should I be worried that azure and appveyor failed? It seems unrelated...
The failure happens when calling _make_unique:

```cython
def _make_unique(np.ndarray[dtype=floating] X,
                 np.ndarray[dtype=floating] y,
                 np.ndarray[dtype=floating] sample_weights):
```

Could this be because the fused type specializations are tied, i.e., either all three arrays are float32 or all three are float64?
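For illustration, a standalone Cython sketch (not the PR's code) of how tied specializations behave: a single fused type is resolved once for the whole signature, so mixed dtypes fail at dispatch:

```cython
# demo.pyx
cimport numpy as np
from cython cimport floating

def tied(np.ndarray[floating] a, np.ndarray[floating] b):
    # 'floating' is specialized to float32 or float64 once per call;
    # tied(float32_array, float64_array) fails with "No matching signature found".
    return a[0] + b[0]
```

So the caller has to make sure X, y and sample_weights all arrive with the same floating dtype.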
yes, it's very likely. Can you fix?

Hopefully! I'm trying to reproduce on Windows, but it might take a bit longer for technical reasons.

I was able to reproduce. Will find a fix tomorrow.

FWIW I can also reproduce on my laptop.

Thanks @albertcthomas. It was a legitimate bug, as shown by the new failing test added in commit 6f2b6d1. Should be fixed now; let's see what CI says.

thanks @vene

thanks @agramfort !!
* ENH fused types in isotonic_regression
* make X drive computation dtype in IsotonicRegression
* preserve current behaviour if transform w/o fit
* thoroughly check sample weights; avoid storing dtype explicitly
* consistent testing and behavior
* misc
* update what's new
* fix for interp1d upcast on old scipy
* flake8 remove blank line (sorry, i'm rusty)
* add failing test
* FIX dtype bug in _make_unique
This reverts commit a29db54.
Reference Issue
#8769 -ish?

What does this implement/fix? Explain your changes.
Isotonic regression should preserve 32-bit dtypes when possible.

Any other comments?