SkLearn .score() method generating error with Dask DataFrames #4137
Closed
Description
When using Dask DataFrames with SkLearn, I used to be able to ask SkLearn for the score of any fitted estimator. It would return an answer and I'd move on. After updating to the newest versions, all metrics computed from (y_true, y_predicted) are failing. I've tested accuracy_score, precision_score, r2_score, and mean_squared_error. A workaround is shown below, but it's not ideal because it requires casting from Dask Arrays to NumPy arrays, which won't work if the data is huge.
Trace, MWE, workaround all in-line.
MWE:
import dask.dataframe as dd
from sklearn.linear_model import LinearRegression, SGDRegressor
df = dd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep=';')
lr = LinearRegression()
X = df.drop('quality', axis=1)
y = df['quality']
lr.fit(X,y)
lr.score(X,y)
Output of error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-4eafa0e7fc85> in <module>
8
9 lr.fit(X,y)
---> 10 lr.score(X,y)
~/anaconda3/lib/python3.6/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
327 from .metrics import r2_score
328 return r2_score(y, self.predict(X), sample_weight=sample_weight,
--> 329 multioutput='variance_weighted')
330
331
~/anaconda3/lib/python3.6/site-packages/sklearn/metrics/regression.py in r2_score(y_true, y_pred, sample_weight, multioutput)
532 """
533 y_type, y_true, y_pred, multioutput = _check_reg_targets(
--> 534 y_true, y_pred, multioutput)
535 check_consistent_length(y_true, y_pred, sample_weight)
536
~/anaconda3/lib/python3.6/site-packages/sklearn/metrics/regression.py in _check_reg_targets(y_true, y_pred, multioutput)
73
74 """
---> 75 check_consistent_length(y_true, y_pred)
76 y_true = check_array(y_true, ensure_2d=False)
77 y_pred = check_array(y_pred, ensure_2d=False)
~/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
225
226 lengths = [_num_samples(X) for X in arrays if X is not None]
--> 227 uniques = np.unique(lengths)
228 if len(uniques) > 1:
229 raise ValueError("Found input variables with inconsistent numbers of"
~/anaconda3/lib/python3.6/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
229
230 """
--> 231 ar = np.asanyarray(ar)
232 if axis is None:
233 ret = _unique1d(ar, return_index, return_inverse, return_counts)
~/anaconda3/lib/python3.6/site-packages/numpy/core/numeric.py in asanyarray(a, dtype, order)
551
552 """
--> 553 return array(a, dtype, copy=False, order=order, subok=True)
554
555
TypeError: int() argument must be a string, a bytes-like object or a number, not 'Scalar'
Problem occurs after upgrading as follows:
Before bug:
import sklearn, dask
for lib in (sklearn, dask):
    print(f'{lib.__name__} Version: {lib.__version__}')
> sklearn Version: 0.19.1
> dask Version: 0.18.2
Update from conda, then bug starts:
for lib in (sklearn, dask):
    print(f'{lib.__name__} Version: {lib.__version__}')
> sklearn Version: 0.20.0
> dask Version: 0.19.4
Work around:
import numpy as np
from sklearn.metrics import r2_score
preds = lr.predict(X_test)
# Casting to NumPy arrays loads everything into memory before scoring
r2_score(np.array(y_test), np.array(preds))
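Since the cast to NumPy is the part that does not scale, one possible direction is to compute R^2 from running sums over chunks, so that no full array is ever materialized. The following is only a minimal sketch in plain Python, not SkLearn's or Dask's implementation: `streaming_r2` and its `chunk_pairs` argument are hypothetical names, and it assumes the caller can produce matching (y_true, y_pred) chunks, for example by evaluating one partition at a time.

```python
def streaming_r2(chunk_pairs):
    """Compute R^2 from an iterable of (y_true_chunk, y_pred_chunk) pairs.

    Hypothetical helper: only running sums are kept, so each chunk can be
    discarded as soon as it has been consumed.
    """
    n = 0
    sum_y = 0.0    # running sum of y_true
    sum_y2 = 0.0   # running sum of y_true squared
    ss_res = 0.0   # running sum of squared residuals
    for y_true, y_pred in chunk_pairs:
        for t, p in zip(y_true, y_pred):
            n += 1
            sum_y += t
            sum_y2 += t * t
            ss_res += (t - p) ** 2
    if n == 0:
        raise ValueError("no samples")
    # Total sum of squares about the mean: sum(y^2) - (sum(y))^2 / n
    ss_tot = sum_y2 - sum_y * sum_y / n
    if ss_tot == 0:
        # This sketch raises here instead of choosing a value.
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Two small chunks standing in for partitions of a larger dataset
chunks = [([3.0, -0.5], [2.5, 0.0]), ([2.0, 7.0], [2.0, 8.0])]
print(streaming_r2(chunks))  # about 0.9486
```

The one-pass formula for the total sum of squares can lose precision when the target values are large relative to their spread, so a production version would likely want a more numerically stable accumulation.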