[WIP] Common test for equivalence between sparse and dense matrices #13246
wdevazelhes wants to merge 127 commits into scikit-learn:main from
Conversation
# Conflicts: # sklearn/utils/estimator_checks.py
I tried to integrate the dictionary of values that overwrite the defaults, as in this comment from @lesteve: #7590 (comment)
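The per-estimator override dictionary can be sketched as below. The `dense_vs_sparse_additional_params` name and the `Ridge` entry appear in the check's code later in this thread; the `params_for` helper is just an illustrative wrapper, not part of the PR:

```python
from collections import defaultdict

# Per-estimator parameter overrides for the sparse/dense comparison.
# Estimators not listed fall back to an empty dict, i.e. keep their
# default parameters.
dense_vs_sparse_additional_params = defaultdict(
    dict, {"Ridge": {"solver": "cholesky"}}
)

def params_for(estimator_name):
    """Extra params to set on both clones before comparing fits
    (hypothetical helper, for illustration only)."""
    return dense_vs_sparse_additional_params[estimator_name]

print(params_for("Ridge"))   # {'solver': 'cholesky'}
print(params_for("KMeans"))  # {}
```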
Right now, here are the tests that fail (without checking yet whether each is a real failure, or whether the input simply takes different code paths between sparse and dense):
For linear models, it's likely that sparse data are not centered, so the outputs will differ.
Only coordinate_descent will implicitly center sparse data.
…
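A numpy-only sketch of why skipping centering changes the fit (illustrative only, not scikit-learn's actual code path): the dense solver centers X and y to handle the intercept, while a naive sparse path solves on the raw data with no intercept handling, so the intercept leaks into the coefficients.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(40, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 5.0  # true coefs [1, 2, 3], intercept 5

# "Dense" path: center X and y, then solve least squares; the true
# coefficients are recovered exactly (the intercept comes from the means).
Xc, yc = X - X.mean(axis=0), y - y.mean()
coef_dense, *_ = np.linalg.lstsq(Xc, yc, rcond=None)

# Naive "sparse" path: no centering and no intercept term at all,
# so the intercept of 5 gets absorbed into the coefficients.
coef_sparse, *_ = np.linalg.lstsq(X, y, rcond=None)

print(coef_dense)                            # ~[1, 2, 3]
print(np.allclose(coef_dense, coef_sparse))  # False: the two fits differ
```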
Setting
…learn into tests_sparse
sklearn/neighbors/regression.py
Outdated
    """
    if issparse(X) and self.metric == 'precomputed':
-       raise ValueError(
+       raise TypeError(
@agramfort should I leave this for another PR, or is it OK to change it here?
@agramfort related to our discussion: in fact here they do call check_array, but the sparse error is raised by this if statement, which used a ValueError, so I guess this is the right fix (and I'll leave it in this PR, as you said).
sklearn/utils/estimator_checks.py
Outdated
    assert_equal(probs.shape, (X.shape[0], 4))
except TypeError as e:
-   if 'sparse' not in repr(e):
+   if 'sparse' not in str.lower(repr(e)):
I put this because the estimator that used kernel='precomputed' raised something like "Sparse ...", which is an OK message.
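The lower-casing in the diff above makes the substring check case-insensitive, so it matches both "sparse ..." and messages starting with "Sparse ...". A minimal sketch (the `is_sparse_related` helper is illustrative, not the check's actual code):

```python
def is_sparse_related(exc):
    """True if the exception's repr mentions sparse input, ignoring case.
    str.lower(repr(exc)) in the diff is equivalent to repr(exc).lower()."""
    return 'sparse' in repr(exc).lower()

print(is_sparse_related(TypeError("Sparse matrices not supported")))  # True
print(is_sparse_related(TypeError("sparse input is invalid")))        # True
print(is_sparse_related(TypeError("bad dtype")))                      # False
```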
sklearn/utils/estimator_checks.py
Outdated
@ignore_warnings(category=(DeprecationWarning, FutureWarning))
def check_estimator_sparse_dense(name, estimator_orig):
-   rng = np.random.RandomState(42)
+   rng = np.random.RandomState(52)
@agramfort I had the same random-seed problem here for AdaBoostClassifier (tested at the end of test_check_estimator); this random seed fixed it.
…se same n_samples as n_features created other bugs, and fix pep8 errors
The new seed (52) introduced a weird bug for AffinityPropagation, which predicts two really different arrays (one with zeros everywhere, one with -1 everywhere); putting the seed back to 0 fixes the problem. I'll try to investigate. Error message:

Testing started at 09:57 ...
/home/will/anaconda3/envs/sprint/bin/python /snap/pycharm-community/112/helpers/pycharm/_jb_pytest_runner.py --target test_common.py::test_estimators -- -k test_estimators[AffinityPropagation-check_estimator_sparse_dense]
Launching pytest with arguments -k test_estimators[AffinityPropagation-check_estimator_sparse_dense] test_common.py::test_estimators in /home/will/Code/sklearn-forks/wdevazelhes/scikit-learn/sklearn/tests
============================= test session starts ==============================
platform linux -- Python 3.7.2, pytest-4.3.0, py-1.7.0, pluggy-0.8.1
rootdir: /home/will/Code/sklearn-forks/wdevazelhes/scikit-learn, inifile: setup.cfg
plugins: sugar-0.9.2
collected 5385 items / 5384 deselected / 1 selected
test_common.py F
Estimator AffinityPropagation doesn't seem to fail gracefully on sparse data: it should raise a TypeError if sparse input is explicitly not supported.
sklearn/tests/test_common.py:101 (test_estimators[AffinityPropagation-check_estimator_sparse_dense])
estimator = AffinityPropagation(affinity='euclidean', convergence_iter=15, copy=True,
damping=0.5, max_iter=5, preference=None, verbose=False)
check = <function check_estimator_sparse_dense at 0x7fa5ee03e048>
@pytest.mark.parametrize(
"estimator, check",
_generate_checks_per_estimator(_yield_all_checks,
_tested_estimators()),
ids=_rename_partial
)
def test_estimators(estimator, check):
# Common tests for estimator instances
with ignore_warnings(category=(DeprecationWarning, ConvergenceWarning,
UserWarning, FutureWarning)):
set_checking_parameters(estimator)
name = estimator.__class__.__name__
> check(name, estimator)
test_common.py:114:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../utils/testing.py:350: in wrapper
return fn(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
name = 'AffinityPropagation'
estimator_orig = AffinityPropagation(affinity='euclidean', convergence_iter=15, copy=True,
damping=0.5, max_iter=5, preference=None, verbose=False)
@ignore_warnings(category=(DeprecationWarning, FutureWarning))
def check_estimator_sparse_dense(name, estimator_orig):
rng = np.random.RandomState(52)
X = rng.rand(40, 10)
X[X < .8] = 0
X_csr = sparse.csr_matrix(X)
y = (4 * rng.rand(40)).astype(np.int)
estimator = clone(estimator_orig)
estimator_sp = clone(estimator_orig)
for sparse_format in ['csr', 'csc', 'dok', 'lil', 'coo', 'dia', 'bsr']:
X_sp = X_csr.asformat(sparse_format)
# catch deprecation warnings
with ignore_warnings(category=DeprecationWarning):
if name in ['Scaler', 'StandardScaler']:
estimator.set_params(with_mean=False)
estimator_sp.set_params(with_mean=False)
dense_vs_sparse_additional_params = defaultdict(dict,
{'Ridge': {'solver': 'cholesky'}})
params = dense_vs_sparse_additional_params[
estimator.__class__.__name__]
estimator.set_params(**params)
estimator_sp.set_params(**params)
set_random_state(estimator)
set_random_state(estimator_sp)
try:
with ignore_warnings(category=DeprecationWarning):
estimator_sp.fit(X_sp, y)
estimator.fit(X, y)
if hasattr(estimator, "predict"):
pred = estimator.predict(X)
pred_sp = estimator_sp.predict(X_sp)
> assert_array_almost_equal(pred, pred_sp, 2)
E AssertionError:
E Arrays are not almost equal to 2 decimals
E
E (mismatch 100.0%)
E x: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
E 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
E y: array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
E -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
E -1, -1, -1, -1, -1, -1])
../utils/estimator_checks.py:2367: AssertionError
[100%]
=================================== FAILURES ===================================
______ test_estimators[AffinityPropagation-check_estimator_sparse_dense] _______
This issue, #17554, is causing the following test to fail:
sklearn/utils/estimator_checks.py
Outdated
    tags = estimator_orig._get_tags()
    centers = 2 if tags["binary_only"] else None
-   X, y = make_blobs(random_state=rng, cluster_std=0.5, centers=centers)
+   X, y = make_blobs(n_samples=10, n_features=2, random_state=rng,
I simplified the data here, similarly to before, because there were some numerical uncertainties with the Nystrom method, particularly with almost-zero elements: the relative error between two numbers can be quite high even though both are of order 1e-15, so they are clearly supposed to be zeros. With a simpler dataset that leads to fewer operations (fewer samples, fewer features, fewer nonzero decimals), the uncertainties disappeared on my machine.
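The near-zero issue described above can be reproduced in isolation: two values of order 1e-15 can have a huge relative error while being equal for any practical purpose, which is why a comparison with an absolute tolerance behaves differently from a purely relative one. A small numpy illustration (values chosen for the example, not taken from the test):

```python
import numpy as np

a, b = 1e-15, 3e-15  # both are "numerically zero"
rel_err = abs(a - b) / abs(b)
print(rel_err)  # ~0.67: a huge relative error between negligible values

# An absolute tolerance treats both as equal to zero:
print(np.allclose(a, b, rtol=0, atol=1e-12))  # True
# A purely relative comparison flags them as different:
print(np.allclose(a, b, rtol=1e-2, atol=0))   # False
```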
Hi @wdevazelhes, do you mind synchronizing with upstream? The checks are failing while looking for the master branch. Thanks!
Hi @cmarmo, thanks, I just merged with master.
Removing the milestone for now.
Is this PR ready for review?
@haiatn this one would deserve a big refresh to be considered for merge now. I don't know how much we should invest in it at this point.
I do like the idea of testing this, but if you think so, maybe we should close this and leave the issue open for whoever would like to start fresh.
You can see how difficult it would be to rebase this PR on the current main branch. Feel free to push directly to this PR or open a new one.
I think this can be closed; the original issue is still open and relevant.
Fixes #1572
Follow-up of #7590
TODO:
Try to be more robust, by maybe testing only the score (if possible), with a certain tolerance and for certain random seeds (or, if also testing predict, maybe check that around 90% of the predictions are the same?). After discussion with @agramfort, it would be better to test on a realistic dataset, which would avoid weird unstable behaviours.
- [MRG] Implement fitting intercept with sparse_cg solver in Ridge regression #13336
- Cython code for PolynomialFeatures should use int64s for indices. #17554
- Replace check_estimator_sparse_dense by the one (more exhaustive) in check_estimator_sparse_data
- Check that it tests multiclass and multioutput estimators well -> see if it's necessary here to check with multioutput sparse datasets
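The "90% of predictions are the same" idea from the TODO could be sketched as a looser agreement check instead of element-wise equality. This is a hypothetical helper, not code from the PR:

```python
import numpy as np

def predictions_mostly_agree(pred_dense, pred_sparse, min_agreement=0.9):
    """Hypothetical looser check: accept the sparse/dense pair if at
    least `min_agreement` of the predictions match, instead of
    requiring assert_array_almost_equal on every element."""
    pred_dense = np.asarray(pred_dense)
    pred_sparse = np.asarray(pred_sparse)
    return float(np.mean(pred_dense == pred_sparse)) >= min_agreement

a = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 1])
b = a.copy()
b[0] = 1 - b[0]  # one disagreement out of ten -> 90% agreement
print(predictions_mostly_agree(a, b))  # True

b[1] = 1 - b[1]  # two disagreements -> 80% agreement
print(predictions_mostly_agree(a, b))  # False
```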