TST Extend tests for scipy.sparse.*array in sklearn/tests/test_random_projection.py (#27314)
Conversation
The CI is failing because of my attempt to fix the independent function call like this: I explain this to myself as the CI testing this file against a scipy version older than 1.8 (where sparse arrays were introduced). Please let me know if I should call this function with a …
```python
data, data_csr = make_sparse_random_data(
    sp.coo_array, n_samples, n_features, n_nonzeros
)
```
Let's remove this data generation. It means that we need to modify the following tests:
- test_try_to_transform_before_fit
- test_SparseRandomProj_output_representation
- test_correct_RandomProjection_dimensions_embedding
- test_random_projection_feature_names_out
You can call the make_sparse_random_data in the test itself and we will parametrize outside. For the first test (i.e., test_try_to_transform_before_fit), we don't have the parametrize. I think we can also avoid the parametrization on feature_names_out.
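The suggested pattern can be sketched as follows (a simplified illustration, not the PR's actual code: the real helper draws unique non-zero positions, and in scikit-learn the COO containers come from a compatibility fixture rather than a hand-written list):

```python
import numpy as np
import pytest
import scipy.sparse as sp

# Assumption: scikit-learn supplies these via a compatibility helper;
# sp.coo_array would be appended when scipy >= 1.8 is available.
COO_CONTAINERS = [sp.coo_matrix]


def make_sparse_random_data(coo_container, n_samples, n_features, n_nonzeros, random_state=0):
    """Simplified sketch: uniformly located non-zero entries (duplicate
    positions possible here, unlike the real helper) with Gaussian values,
    returned as (dense ndarray, CSR matrix)."""
    rng = np.random.RandomState(random_state)
    data_coo = coo_container(
        (
            rng.randn(n_nonzeros),
            (
                rng.randint(n_samples, size=n_nonzeros),
                rng.randint(n_features, size=n_nonzeros),
            ),
        ),
        shape=(n_samples, n_features),
    )
    return data_coo.toarray(), data_coo.tocsr()


# Data generation moves inside the test body; the container is parametrized
# at the test level instead of being fixed at module import time.
@pytest.mark.parametrize("coo_container", COO_CONTAINERS)
def test_output_types(coo_container):
    data, data_csr = make_sparse_random_data(coo_container, 10, 100, 50)
    assert isinstance(data, np.ndarray)
    assert sp.issparse(data_csr)
```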
Yes, now I see that the generated data is used in some tests, which I had missed before.
I have implemented it as you suggested, @glemaitre, not parametrizing the first and the last test, but I couldn't figure out why testing the sparse matrix on them is unnecessary.
glemaitre
left a comment
Otherwise, the rest is fine.
@KartikeyBartwal This is already a pull request and not an issue. Please refer to the original issue #27090 for further information and check the remaining work to be done.
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Our CI always tests the development code (that is, the …). The oldest supported versions of our dependencies are defined in …
```diff
  # Make some random data with uniformly located non zero entries with
  # Gaussian distributed values
+ @pytest.mark.parametrize("coo_container", COO_CONTAINERS)
  def make_sparse_random_data(n_samples, n_features, n_nonzeros, random_state=0):
```
This is not a test function: it does not start with test_, so pytest.mark.* annotations do nothing on this kind of helper function.
You have to instead parametrize the test functions that call this make_sparse_random_data and pass the coo_container as an argument to this function.
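The collection point can be sketched with a toy helper (hypothetical names, placeholder string values):

```python
import pytest

CONTAINERS = ["coo_matrix", "coo_array"]  # placeholder values for illustration


# Pytest only collects functions whose names start with "test_", so a
# @pytest.mark.parametrize placed on this helper would be silently ignored:
# the helper only ever runs through plain function calls.
def make_data(container):
    return ("made with", container)


# Effective placement: parametrize the test and forward the argument.
@pytest.mark.parametrize("container", CONTAINERS)
def test_make_data(container):
    assert make_data(container) == ("made with", container)
```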
I see, thanks for the hint.
I believe I made a mistake in test_SparseRandomProj_output_representation before, which I have now fixed. This test makes data using the datatypes from the coo_container and then converts the data into the datatypes from the csr_container to also do some checks.
```diff
  # Make some random data with uniformly located non zero entries with
  # Gaussian distributed values
- @pytest.mark.parametrize("coo_container", COO_CONTAINERS)
  def make_sparse_random_data(n_samples, n_features, n_nonzeros, random_state=0):
```
```python
@pytest.mark.parametrize("random_projection_cls", all_RandomProjection)
def test_random_projection_feature_names_out(random_projection_cls):
    data, _ = make_sparse_random_data(sp.coo_array, n_samples, n_features, n_nonzeros)
```
In particular this call needs to be updated.
I did parametrize this test now. In this test, data is created with both datatypes from the COO container, a projection is fitted, and then the output of get_feature_names_out is compared to the expected feature names.
Though, @glemaitre was suggesting maybe not to parametrize here; did I get this right? If so, should I put make_sparse_random_data(sp.coo_matrix, n_samples, n_features, n_nonzeros) instead?
```python
assert johnson_lindenstrauss_min_dim(100, eps=1e-5) == 368416070986
```
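For context, the asserted value follows from the Johnson-Lindenstrauss bound that johnson_lindenstrauss_min_dim implements; a self-contained reimplementation of that formula (a sketch with a hypothetical name, not scikit-learn's actual code):

```python
import numpy as np


def jl_min_dim(n_samples, eps):
    """Minimal embedding dimension preserving pairwise distances within a
    (1 +/- eps) factor, per the Johnson-Lindenstrauss lemma:
    n_components >= 4 * ln(n_samples) / (eps**2 / 2 - eps**3 / 3)."""
    denominator = (eps ** 2 / 2) - (eps ** 3 / 3)
    return int(4 * np.log(n_samples) / denominator)


# A tiny eps forces an astronomically large embedding dimension, which is
# exactly what the quoted test asserts.
assert jl_min_dim(100, eps=1e-5) == 368416070986
```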
```python
@pytest.mark.parametrize("coo_container", COO_CONTAINERS)
```
Indeed, I was thinking not to parametrize, since the test is not really intended for that. But since not all the CIs support sparse arrays, we cannot use them by default, so it is best to parametrize as you did. Sorry for the confusion.
No worries, I'm grateful for your support.
```python
def test_try_to_transform_before_fit():
    data, _ = make_sparse_random_data(sp.coo_matrix, n_samples, n_features, n_nonzeros)
```
Here, this is a bit the same as for get_feature_names_out; we should probably parametrize even if it is not really needed.
Because there will be a time when sp.coo_matrix won't be supported by all the dependencies in the CI tests, even though sparse matrices are currently still supported by all of them. Got it.
```python
data, _ = make_sparse_random_data(coo_container, 5, n_features, int(n_features / 4))

for RandomProjection in all_RandomProjection:
    rp_dense = RandomProjection(n_components=3, random_state=1).fit(data)
```
This is weird here, since data is not dense. I would expect instead:

```python
rp_dense = RandomProjection(n_components=3, random_state=1).fit(data.toarray())
rp_sparse = RandomProjection(n_components=3, random_state=1).fit(data)
```

This would be in line with what the variable names mean. If this is the case, then we don't need both CSR and COO but only one of them.
data here is dense.
This is because make_sparse_random_data returns the data in two types (dense, and sparse in CSR format). And of all the 9 tests where it is used, only test_inverse_transform also uses the sparse data.
This part in make_sparse_random_data:

```python
data_coo = sp.coo_matrix((a, (b, c)), shape=(n_samples, n_features))
return data_coo.toarray(), data_coo.tocsr()
```

I wonder why it turns a coo_matrix into a csr_matrix. Is there a reason for this? If not, can I simplify the function?
I'd also like to rename data, _ = make_sparse_random_data(...)
into something like what was used in test_inverse_transform:
X_dense, X_csr = make_sparse_random_data(...)
It's clearer to read.
What do you think, @glemaitre ?
Indeed, the name is a bit ambiguous for this function.

> I wonder why it turns a coo_matrix into a csr_matrix. Is there a reason for this? If not, can I simplify the function?

Most of the algorithms in scikit-learn use CSR matrices and will convert COO -> CSR. So returning a CSR matrix avoids this conversion.

> X_dense, X_csr = make_sparse_random_data(...)

Indeed, this is already better because it informs the reader that the function returns both dense and sparse formats.
Another option (but I don't know how much refactoring it requires) is to always return only a single type. Instead of coo_container, we can have a sparse_format: if None, we return a dense matrix; otherwise we pass the type of container that we desire. A bit similar to the changes in #27438.
However, if it changes many lines, then your original proposal is already good enough, @StefanieSenger.
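The COO-to-CSR point can be seen directly with scipy (a standalone illustration, not code from the PR):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.RandomState(0)
rows = rng.randint(10, size=25)
cols = rng.randint(100, size=25)
vals = rng.randn(25)

# COO is the natural format for building a matrix from (value, (row, col))
# triplets...
data_coo = sp.coo_matrix((vals, (rows, cols)), shape=(10, 100))

# ...but most scikit-learn estimators operate on CSR, so passing COO data
# would trigger an implicit COO -> CSR conversion inside the estimator.
# Returning CSR from the test helper performs that conversion once, up front.
data_csr = data_coo.tocsr()
assert data_csr.format == "csr"
assert np.array_equal(data_coo.toarray(), data_csr.toarray())
```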
> Another option (but I don't know how much refactoring it requires) is to always return only a single type. Instead of coo_container, we can have a sparse_format: if None, we return a dense matrix, otherwise we pass the type of container that we desired. A bit similar to the changes in #27438.
@glemaitre
I will push a proposal in a minute, but it doesn't get rid of coo_container as a param. I have tried to substitute coo_container with sparse_format, but now I believe that we have to keep coo_container, because we need to parametrize those datatypes in all of the tests that use make_sparse_random_data. Please let me know if this makes sense.
At least, the function now is much more explicit about what type it returns.
Ah, and there's an identical make_sparse_random_data function in benchmarks/bench_random_projections.py. Should that one be changed the same way?
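A sketch of what the discussed signature could look like, keeping coo_container and adding sparse_format (the exact final signature is an assumption, and the unique placement of non-zero entries is omitted for brevity):

```python
import numpy as np
import scipy.sparse as sp


def make_sparse_random_data(
    coo_container, n_samples, n_features, n_nonzeros, random_state=0, sparse_format="csr"
):
    """Random data with uniformly located non-zero entries and Gaussian values.

    Returns a dense ndarray when sparse_format is None, otherwise the data
    converted to the requested scipy sparse format (e.g. "csr").
    """
    rng = np.random.RandomState(random_state)
    data_coo = coo_container(
        (
            rng.randn(n_nonzeros),
            (
                rng.randint(n_samples, size=n_nonzeros),
                rng.randint(n_features, size=n_nonzeros),
            ),
        ),
        shape=(n_samples, n_features),
    )
    if sparse_format is None:
        return data_coo.toarray()
    return data_coo.asformat(sparse_format)
```

With this shape, a test that needs matching dense and sparse inputs can call the helper twice with the same random_state, once with sparse_format=None and once with sparse_format="csr".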
> Ah, and there's a function make_sparse_random_data that is exactly equal in benchmarks/bench_random_projections.py. Should this one be changed the same way?
You can leave the benchmark as it is.
glemaitre
left a comment
Thanks. I have only some last changes following your improvements. After that, it looks fine to me.
```python
n_samples,
n_features,
n_nonzeros,
random_state=0,
```
This is also unusual in our tests:

```diff
- random_state=0,
+ random_state=None,
```

Could you then pass random_state=0 in the call from each test function?
I've tried to use global_random_seed for this, but this led to failing tests for test_random_projection_embedding_quality. Does this mean this test is sensitive to the random seed it had used before?
```diff
  assert isinstance(rp.transform(data), np.ndarray)

- sparse_data = sp.csr_matrix(data)
+ sparse_data = csr_container(data)
```
Instead of calling the csr_container here, could you create sparse_data at the beginning by calling make_sparse_random_data twice, once with sparse_format=None and once with sparse_format="csr"? We don't need to parametrize for both COO and CSR containers.
Using the same random_state=0 in both calls should lead to the same data.
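The same-seed point is just determinism of the seeded generator; a minimal illustration:

```python
import numpy as np

# Two generators seeded identically produce identical draws, so calling the
# data helper twice with random_state=0 (once for a dense return, once for
# CSR) yields two representations of the same underlying matrix.
a = np.random.RandomState(0).randn(5, 3)
b = np.random.RandomState(0).randn(5, 3)
assert np.array_equal(a, b)
```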
Thanks for the suggestion, I did it. I have used global_random_seed here, because I believe that assert isinstance(...) or sp.issparse would pass even if the data was different?
Should I change that to use random_state=0?
Yes, let's keep the random_state at 0. We can revisit this test in a later PR to see how flaky it is by changing the random_state.
Thank you, @glemaitre, I have changed these two data generations to use random_state=0.
```diff
  rp_dense = RandomProjection(n_components=3, random_state=1).fit(data)
  rp_sparse = RandomProjection(n_components=3, random_state=1).fit(
-     sp.csr_matrix(data)
+     csr_container(data)
```
Here as well, I would call make_sparse_random_data twice and create the same data and data_sparse from the beginning. No need for the product of all array/matrix combinations.
Good that you found that, thanks for pointing it out. Fewer test cases again. :)
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
glemaitre
left a comment
LGTM. Thanks @StefanieSenger
jjerphan
left a comment
LGTM!
Thank you, @StefanieSenger!
Just minor nitpicks for getting to self-documenting code (this modifies the state of the file before your contribution).
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Thank you, @jjerphan, it looks much cleaner now.
Hi Stefanie, one last thing: could you add an entry for this in the 1.4 changelog? :) See scikit-learn/doc/whats_new/v1.4.rst, lines 132 to 182 at 8db3aac.
Hey @jjerphan, just did it. Thanks for your support. :)
…dom_projection.py` (scikit-learn#27314) Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Reference Issues/PRs
Towards #27090.
What does this implement/fix? Explain your changes.
This PR substitutes scipy sparse matrices with the scipy containers introduced in #27095 in the sklearn/tests/test_random_projection.py test file.
It was a bit tricky, because one of the parametrized functions was re-used outside of test functions (see below). I am not sure why, since it doesn't seem to be re-used afterwards. I came up with a solution.
```python
n_samples, n_features = (10, 1000)
n_nonzeros = int(n_samples * n_features / 100.0)
data, data_csr = make_sparse_random_data(n_samples, n_features, n_nonzeros)
```