-
-
Notifications
You must be signed in to change notification settings - Fork 26.9k
SkLearn .score() method generating error with Dask DataFrames #12461
Description
When using Dask Dataframes with SkLearn, I used to be able to just ask SkLearn for the score of any given algorithm. It would spit out a nice answer and I'd move on. After updating to the newest versions, all metrics that compute based on (y_true, y_predicted) are failing. I've tested accuracy_score, precision_score, r2_score, and mean_squared_error. Work-around shown below, but it's not ideal because it requires me to cast from Dask Arrays to numpy arrays which won't work if the data is huge.
I've asked Dask about it here: dask/dask#4137 and they've said it's an issue with the SkLearn shape check, and that they won't be addressing it. It seems like it should be not super complicated to add a try-except that says "if shape doesn't return a tuple revert to pretending shape didn't exist". If others think that sounds correct, I can attempt a pull-request, but I don't want to attempt to solve it on my own only to find out others don't deem that an acceptable solutions.
Trace, MWE, versions, and workaround all in-line.
MWE:
import dask.dataframe as dd
from sklearn.linear_model import LinearRegression, SGDRegressor
df = dd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep=';')
lr = LinearRegression()
X = df.drop('quality', axis=1)
y = df['quality']
lr.fit(X,y)
lr.score(X,y)
Output of error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-4eafa0e7fc85> in <module>
8
9 lr.fit(X,y)
---> 10 lr.score(X,y)
~/anaconda3/lib/python3.6/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
327 from .metrics import r2_score
328 return r2_score(y, self.predict(X), sample_weight=sample_weight,
--> 329 multioutput='variance_weighted')
330
331
~/anaconda3/lib/python3.6/site-packages/sklearn/metrics/regression.py in r2_score(y_true, y_pred, sample_weight, multioutput)
532 """
533 y_type, y_true, y_pred, multioutput = _check_reg_targets(
--> 534 y_true, y_pred, multioutput)
535 check_consistent_length(y_true, y_pred, sample_weight)
536
~/anaconda3/lib/python3.6/site-packages/sklearn/metrics/regression.py in _check_reg_targets(y_true, y_pred, multioutput)
73
74 """
---> 75 check_consistent_length(y_true, y_pred)
76 y_true = check_array(y_true, ensure_2d=False)
77 y_pred = check_array(y_pred, ensure_2d=False)
~/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
225
226 lengths = [_num_samples(X) for X in arrays if X is not None]
--> 227 uniques = np.unique(lengths)
228 if len(uniques) > 1:
229 raise ValueError("Found input variables with inconsistent numbers of"
~/anaconda3/lib/python3.6/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
229
230 """
--> 231 ar = np.asanyarray(ar)
232 if axis is None:
233 ret = _unique1d(ar, return_index, return_inverse, return_counts)
~/anaconda3/lib/python3.6/site-packages/numpy/core/numeric.py in asanyarray(a, dtype, order)
551
552 """
--> 553 return array(a, dtype, copy=False, order=order, subok=True)
554
555
TypeError: int() argument must be a string, a bytes-like object or a number, not 'Scalar'
Problem occurs after upgrading as follows:
Before bug:
for lib in (sklearn, dask):
print(f'{lib.__name__} Version: {lib.__version__}')
> sklearn Version: 0.19.1
> dask Version: 0.18.2
Update from conda, then bug starts:
for lib in (sklearn, dask):
print(f'{lib.__name__} Version: {lib.__version__}')
> sklearn Version: 0.20.0
> dask Version: 0.19.4
Work around:
from sklearn.metrics import r2_score
preds = lr.predict(X_test)
r2_score(np.array(y_test), np.array(preds))