SkLearn .score() method generating error with Dask DataFrames #12461

@ZWMiller

When using Dask DataFrames with SkLearn, I used to be able to call .score() on any fitted estimator and get a result. After updating to the newest versions (listed below), every metric that is computed from (y_true, y_pred) fails. I've tested accuracy_score, precision_score, r2_score, and mean_squared_error. A workaround is shown below, but it isn't ideal: it requires casting Dask collections to NumPy arrays, which won't work if the data doesn't fit in memory.

I've asked Dask about it here: dask/dask#4137. They've said it's an issue with the SkLearn shape check and that they won't be addressing it. It seems like it shouldn't be too complicated to add a try-except that says "if shape doesn't return a tuple, fall back to behaving as if shape didn't exist". If others think that sounds correct, I can attempt a pull request, but I don't want to solve it on my own only to find out that others don't consider it an acceptable solution.
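A rough sketch of the kind of fallback described above, to make the proposal concrete. The names `num_samples_lenient` and `LazyRows` are hypothetical and this is not SkLearn's actual validation code. It assumes the failing case is a shape whose first entry is a lazy placeholder rather than a concrete integer, which is what the 'Scalar' in the error message suggests.

```python
import numbers


def num_samples_lenient(x):
    """Hypothetical helper: row count of x, ignoring an unusable shape."""
    shape = getattr(x, "shape", None)
    # Trust shape only when it is a non-empty tuple with a concrete integer first entry.
    if isinstance(shape, tuple) and shape and isinstance(shape[0], numbers.Integral):
        return int(shape[0])
    # shape is missing, not a tuple, or lazy: behave as if it did not exist.
    return len(x)


class LazyRows:
    """Stand-in (not a Dask class) for a collection with a lazy row count."""

    def __init__(self, values):
        self._values = list(values)

    @property
    def shape(self):
        # Placeholder object instead of an int, mimicking a not-yet-computed size.
        return (object(),)

    def __len__(self):
        return len(self._values)


print(num_samples_lenient([1, 2, 3]))
print(num_samples_lenient(LazyRows(range(5))))
```

One trade-off to keep in mind: for a lazy collection, falling back to len() would presumably force the row count to be computed, so this trades the error for an extra pass over the data.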

The MWE, traceback, versions, and workaround are all inline below.

MWE:

import dask.dataframe as dd
from sklearn.linear_model import LinearRegression

df = dd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep=';')
lr = LinearRegression()
X = df.drop('quality', axis=1)
y = df['quality']

lr.fit(X,y)
lr.score(X,y)

Output of error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-4eafa0e7fc85> in <module>
      8 
      9 lr.fit(X,y)
---> 10 lr.score(X,y)

~/anaconda3/lib/python3.6/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
    327         from .metrics import r2_score
    328         return r2_score(y, self.predict(X), sample_weight=sample_weight,
--> 329                         multioutput='variance_weighted')
    330 
    331 

~/anaconda3/lib/python3.6/site-packages/sklearn/metrics/regression.py in r2_score(y_true, y_pred, sample_weight, multioutput)
    532     """
    533     y_type, y_true, y_pred, multioutput = _check_reg_targets(
--> 534         y_true, y_pred, multioutput)
    535     check_consistent_length(y_true, y_pred, sample_weight)
    536 

~/anaconda3/lib/python3.6/site-packages/sklearn/metrics/regression.py in _check_reg_targets(y_true, y_pred, multioutput)
     73 
     74     """
---> 75     check_consistent_length(y_true, y_pred)
     76     y_true = check_array(y_true, ensure_2d=False)
     77     y_pred = check_array(y_pred, ensure_2d=False)

~/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    225 
    226     lengths = [_num_samples(X) for X in arrays if X is not None]
--> 227     uniques = np.unique(lengths)
    228     if len(uniques) > 1:
    229         raise ValueError("Found input variables with inconsistent numbers of"

~/anaconda3/lib/python3.6/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
    229 
    230     """
--> 231     ar = np.asanyarray(ar)
    232     if axis is None:
    233         ret = _unique1d(ar, return_index, return_inverse, return_counts)

~/anaconda3/lib/python3.6/site-packages/numpy/core/numeric.py in asanyarray(a, dtype, order)
    551 
    552     """
--> 553     return array(a, dtype, copy=False, order=order, subok=True)
    554 
    555 

TypeError: int() argument must be a string, a bytes-like object or a number, not 'Scalar'

Problem occurs after upgrading as follows:

Before bug:

import dask, sklearn
for lib in (sklearn, dask):
    print(f'{lib.__name__} Version: {lib.__version__}')
> sklearn Version: 0.19.1
> dask Version: 0.18.2

Update from conda, then bug starts:

import dask, sklearn
for lib in (sklearn, dask):
    print(f'{lib.__name__} Version: {lib.__version__}')
> sklearn Version: 0.20.0
> dask Version: 0.19.4

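Workaround:

import numpy as np
from sklearn.metrics import r2_score

# X_test and y_test are Dask collections; np.array() loads them fully into memory
preds = lr.predict(X_test)
r2_score(np.array(y_test), np.array(preds))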
