TST Extend tests for scipy.sparse.*array in sklearn/tests/test_random_projection.py (#27314)
Conversation
The CI is failing because of my attempt to fix the independent function call like this: I explain this to myself as the CI testing this file against a scipy version older than 1.8 (where sparse arrays were introduced). Please let me know if I should call this function with a …
```python
data, data_csr = make_sparse_random_data(
    sp.coo_array, n_samples, n_features, n_nonzeros
)
```
Let's remove this data generation. It means that we need to modify the following tests:
- test_try_to_transform_before_fit
- test_SparseRandomProj_output_representation
- test_correct_RandomProjection_dimensions_embedding
- test_random_projection_feature_names_out
You can call the make_sparse_random_data in the test itself and we will parametrize outside. For the first test (i.e., test_try_to_transform_before_fit), we don't have the parametrize. I think we can also avoid the parametrization on feature_names_out.
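The suggested pattern can be sketched as follows (a simplified illustration, not the PR's actual code: the real helper draws unique non-zero positions, and in scikit-learn the COO containers come from a compatibility fixture rather than a hand-written list):

```python
import numpy as np
import pytest
import scipy.sparse as sp

# Assumption: scikit-learn supplies these via a compatibility helper;
# sp.coo_array would be appended when scipy >= 1.8 is available.
COO_CONTAINERS = [sp.coo_matrix]


def make_sparse_random_data(coo_container, n_samples, n_features, n_nonzeros, random_state=0):
    """Simplified sketch: uniformly located non-zero entries (duplicate
    positions possible here, unlike the real helper) with Gaussian values,
    returned as (dense ndarray, CSR matrix)."""
    rng = np.random.RandomState(random_state)
    data_coo = coo_container(
        (
            rng.randn(n_nonzeros),
            (
                rng.randint(n_samples, size=n_nonzeros),
                rng.randint(n_features, size=n_nonzeros),
            ),
        ),
        shape=(n_samples, n_features),
    )
    return data_coo.toarray(), data_coo.tocsr()


# Data generation moves inside the test body; the container is parametrized
# at the test level instead of being fixed at module import time.
@pytest.mark.parametrize("coo_container", COO_CONTAINERS)
def test_output_types(coo_container):
    data, data_csr = make_sparse_random_data(coo_container, 10, 100, 50)
    assert isinstance(data, np.ndarray)
    assert sp.issparse(data_csr)
```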
Yes, now I see that the generated data is used in some tests, which I had missed before.
I have implemented it as you suggested, @glemaitre, not parametrizing the first and the last test, but I couldn't figure out why testing the sparse matrix on them is unnecessary.
glemaitre
left a comment
Otherwise, the rest is fine.
@KartikeyBartwal This is already a pull request and not an issue. Please refer to the original issue #27090 for further information and check the remaining work to be done.
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Our CI always tests the development code (that is, the …). The oldest supported versions of our dependencies are defined in …
```diff
  # Make some random data with uniformly located non zero entries with
  # Gaussian distributed values
+ @pytest.mark.parametrize("coo_container", COO_CONTAINERS)
  def make_sparse_random_data(n_samples, n_features, n_nonzeros, random_state=0):
```
This is not a test function: it does not start with test_, so pytest.mark.* annotations do nothing on this kind of helper function.
You have to instead parametrize the test functions that call this make_sparse_random_data and pass the coo_container as an argument to this function.
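The collection point can be sketched with a toy helper (hypothetical names, placeholder string values):

```python
import pytest

CONTAINERS = ["coo_matrix", "coo_array"]  # placeholder values for illustration


# Pytest only collects functions whose names start with "test_", so a
# @pytest.mark.parametrize placed on this helper would be silently ignored:
# the helper only ever runs through plain function calls.
def make_data(container):
    return ("made with", container)


# Effective placement: parametrize the test and forward the argument.
@pytest.mark.parametrize("container", CONTAINERS)
def test_make_data(container):
    assert make_data(container) == ("made with", container)
```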
I see, thanks for the hint.
I believe I made a mistake in test_SparseRandomProj_output_representation before, which I have now fixed. This test makes data using the datatypes from the coo_container and then converts the data into the datatypes from the csr_container to also do some checks.
```diff
  # Make some random data with uniformly located non zero entries with
  # Gaussian distributed values
- @pytest.mark.parametrize("coo_container", COO_CONTAINERS)
  def make_sparse_random_data(n_samples, n_features, n_nonzeros, random_state=0):
```
```python
@pytest.mark.parametrize("random_projection_cls", all_RandomProjection)
def test_random_projection_feature_names_out(random_projection_cls):
    data, _ = make_sparse_random_data(sp.coo_array, n_samples, n_features, n_nonzeros)
```
In particular this call needs to be updated.
I did parametrize this test now. In this test, data is created with both datatypes from the COO container, a projection is fitted, and then the output of get_feature_names_out is compared to the expected feature names.
Though, @glemaitre was suggesting maybe not to parametrize here; did I get this right? If so, should I put make_sparse_random_data(sp.coo_matrix, n_samples, n_features, n_nonzeros) instead?
```python
assert johnson_lindenstrauss_min_dim(100, eps=1e-5) == 368416070986
```
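For context, the asserted value follows from the Johnson-Lindenstrauss bound that johnson_lindenstrauss_min_dim implements; a self-contained reimplementation of that formula (a sketch with a hypothetical name, not scikit-learn's actual code):

```python
import numpy as np


def jl_min_dim(n_samples, eps):
    """Minimal embedding dimension preserving pairwise distances within a
    (1 +/- eps) factor, per the Johnson-Lindenstrauss lemma:
    n_components >= 4 * ln(n_samples) / (eps**2 / 2 - eps**3 / 3)."""
    denominator = (eps ** 2 / 2) - (eps ** 3 / 3)
    return int(4 * np.log(n_samples) / denominator)


# A tiny eps forces an astronomically large embedding dimension, which is
# exactly what the quoted test asserts.
assert jl_min_dim(100, eps=1e-5) == 368416070986
```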
```python
@pytest.mark.parametrize("coo_container", COO_CONTAINERS)
```
Indeed, I was thinking not to parametrize, since the test is not really intended for that. But since not all the CIs support sparse arrays, we cannot use them by default, so it is best to parametrize as you did. Sorry for the confusion.
No worries, I'm grateful for your support.
```python
def test_try_to_transform_before_fit():
    data, _ = make_sparse_random_data(sp.coo_matrix, n_samples, n_features, n_nonzeros)
```
Here, this is a bit the same as for get_feature_names_out; we should probably parametrize even if it is not really needed.
Because there will be a time when sp.coo_matrix won't be supported by all the dependencies in the CI tests, even though sparse matrices are currently still supported by all of them. Got it.
```python
data, _ = make_sparse_random_data(coo_container, 5, n_features, int(n_features / 4))

for RandomProjection in all_RandomProjection:
    rp_dense = RandomProjection(n_components=3, random_state=1).fit(data)
```
This is weird here, since data is not dense. I would expect instead:

```python
rp_dense = RandomProjection(n_components=3, random_state=1).fit(data.toarray())
rp_sparse = RandomProjection(n_components=3, random_state=1).fit(data)
```

This would be in line with what the variable names mean. If this is the case, then we don't need both CSR and COO but only one of them.
data here is dense.
This is because make_sparse_random_data returns the data in two types (dense, and sparse in CSR format). And of all the 9 tests where it is used, only test_inverse_transform also uses the sparse data.
This part in make_sparse_random_data:

```python
data_coo = sp.coo_matrix((a, (b, c)), shape=(n_samples, n_features))
return data_coo.toarray(), data_coo.tocsr()
```

I wonder why it turns a coo_matrix into a csr_matrix. Is there a reason for this? If not, can I simplify the function?
I'd also like to rename data, _ = make_sparse_random_data(...)
into something like what was used in test_inverse_transform:
X_dense, X_csr = make_sparse_random_data(...)
It's clearer to read.
What do you think, @glemaitre ?
Indeed, the name is a bit ambiguous for this function.

> I wonder why it turns a coo_matrix into a csr_matrix. Is there a reason for this? If not, can I simplify the function?

Most of the algorithms in scikit-learn use CSR matrices and will convert COO -> CSR. So returning a CSR matrix avoids this conversion.

> X_dense, X_csr = make_sparse_random_data(...)

Indeed, this is already better because it informs the reader that the function returns both dense and sparse formats.
Another option (but I don't know how much refactoring it requires) is to always return only a single type. Instead of coo_container, we can have a sparse_format: if None, we return a dense matrix; otherwise we pass the type of container that we desire. A bit similar to the changes in #27438.
However, if it changes many lines, then your original proposal is already good enough, @StefanieSenger.
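The COO-to-CSR point can be seen directly with scipy (a standalone illustration, not code from the PR):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.RandomState(0)
rows = rng.randint(10, size=25)
cols = rng.randint(100, size=25)
vals = rng.randn(25)

# COO is the natural format for building a matrix from (value, (row, col))
# triplets...
data_coo = sp.coo_matrix((vals, (rows, cols)), shape=(10, 100))

# ...but most scikit-learn estimators operate on CSR, so passing COO data
# would trigger an implicit COO -> CSR conversion inside the estimator.
# Returning CSR from the test helper performs that conversion once, up front.
data_csr = data_coo.tocsr()
assert data_csr.format == "csr"
assert np.array_equal(data_coo.toarray(), data_csr.toarray())
```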
> Another option (but I don't know how much refactoring it requires) is to always return only a single type. Instead of coo_container, we can have a sparse_format: if None, we return a dense matrix, otherwise we pass the type of container that we desired. A bit similar to the changes in #27438.
@glemaitre
I will push a proposal in a minute, but it doesn't get rid of coo_container as a param. I have tried to substitute coo_container with sparse_format, but now I believe that we have to keep coo_container, because we need to parametrize those datatypes in all of the tests that use make_sparse_random_data. Please let me know if this makes sense.
At least, the function now is much more explicit about what type it returns.
Ah, and there's an identical make_sparse_random_data function in benchmarks/bench_random_projections.py. Should that one be changed the same way?
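A sketch of what the discussed signature could look like, keeping coo_container and adding sparse_format (the exact final signature is an assumption, and the unique placement of non-zero entries is omitted for brevity):

```python
import numpy as np
import scipy.sparse as sp


def make_sparse_random_data(
    coo_container, n_samples, n_features, n_nonzeros, random_state=0, sparse_format="csr"
):
    """Random data with uniformly located non-zero entries and Gaussian values.

    Returns a dense ndarray when sparse_format is None, otherwise the data
    converted to the requested scipy sparse format (e.g. "csr").
    """
    rng = np.random.RandomState(random_state)
    data_coo = coo_container(
        (
            rng.randn(n_nonzeros),
            (
                rng.randint(n_samples, size=n_nonzeros),
                rng.randint(n_features, size=n_nonzeros),
            ),
        ),
        shape=(n_samples, n_features),
    )
    if sparse_format is None:
        return data_coo.toarray()
    return data_coo.asformat(sparse_format)
```

With this shape, a test that needs matching dense and sparse inputs can call the helper twice with the same random_state, once with sparse_format=None and once with sparse_format="csr".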
> Ah, and there's a function make_sparse_random_data that is exactly equal in benchmarks/bench_random_projections.py. Should this one be changed the same way?
You can leave the benchmark as it is.
glemaitre
left a comment
Thanks. I have only some last changes following your improvements. After that, it looks fine to me.
```python
n_samples,
n_features,
n_nonzeros,
random_state=0,
```
This is also unusual in our tests:

```diff
- random_state=0,
+ random_state=None,
```

Could you then pass random_state=0 in the call from each test function?
I've tried to use global_random_seed for this, but this led to failing tests for test_random_projection_embedding_quality. Does this mean this test is sensitive to the random seed it had used before?
```diff
  assert isinstance(rp.transform(data), np.ndarray)

- sparse_data = sp.csr_matrix(data)
+ sparse_data = csr_container(data)
```
Instead of calling the csr_container here, could you create sparse_data at the beginning by calling make_sparse_random_data twice, once with sparse_format=None and once with sparse_format="csr"? We don't need to parametrize for both COO and CSR containers.
Using the same random_state=0 in both calls should lead to the same data.
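The same-seed point is just determinism of the seeded generator; a minimal illustration:

```python
import numpy as np

# Two generators seeded identically produce identical draws, so calling the
# data helper twice with random_state=0 (once for a dense return, once for
# CSR) yields two representations of the same underlying matrix.
a = np.random.RandomState(0).randn(5, 3)
b = np.random.RandomState(0).randn(5, 3)
assert np.array_equal(a, b)
```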
Thanks for the suggestion, I did it. I have used global_random_seed here, because I believe that assert isinstance(...) or sp.issparse would pass even if the data was different?
Should I change that to use random_state=0?
Yes, let's keep the random_state at 0. We can revisit this test in a later PR to see how flaky it is by changing the random_state.
Thank you, @glemaitre, I have changed these two data generations to use random_state=0.
```diff
  rp_dense = RandomProjection(n_components=3, random_state=1).fit(data)
  rp_sparse = RandomProjection(n_components=3, random_state=1).fit(
-     sp.csr_matrix(data)
+     csr_container(data)
```
Here as well, I would call make_sparse_random_data twice and create the same data and data_sparse from the beginning. No need for the product of all array/matrix combinations.
Good that you found that, thanks for pointing it out. Fewer test cases again. :)
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
glemaitre
left a comment
LGTM. Thanks @StefanieSenger
jjerphan
left a comment
LGTM!
Thank you, @StefanieSenger!
Just minor nitpicks for getting to self-documenting code (this modifies the state of the file before your contribution).
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Thank you, @jjerphan, it looks much cleaner now.
Hi Stefanie, one last thing: could you add an entry for this in the 1.4 changelog? :) See scikit-learn/doc/whats_new/v1.4.rst, lines 132 to 182 at 8db3aac.
Hey @jjerphan, just did it. Thanks for your support. :)
…dom_projection.py` (scikit-learn#27314) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Reference Issues/PRs
Towards #27090.
What does this implement/fix? Explain your changes.
This PR substitutes scipy sparse matrices with the scipy containers introduced in #27095 in the sklearn/tests/test_random_projection.py test file.
It was a bit tricky, because one of the parametrized functions was re-used outside of test functions (see below). I am not sure why, since it doesn't seem to be re-used afterwards. I came up with a solution.
```python
n_samples, n_features = (10, 1000)
n_nonzeros = int(n_samples * n_features / 100.0)
data, data_csr = make_sparse_random_data(n_samples, n_features, n_nonzeros)
```