TST Extend tests for scipy.sparse.*array in sklearn/utils/tests/test_param_validation.py#27950
TST Extend tests for scipy.sparse.*array in sklearn/utils/tests/test_param_validation.py#27950Charlie-XIAO wants to merge 4 commits intoscikit-learn:mainfrom
scipy.sparse.*array in sklearn/utils/tests/test_param_validation.py#27950Conversation
scipy.sparse.*array in sklearn/utils/tests/test_param_validation.py
| @validate_params( | ||
| { | ||
| "X": [np.ndarray, "sparse matrix"], | ||
| "X": [np.ndarray, "sparse container"], |
There was a problem hiding this comment.
I think I would keep such strings unchanged and consider their change in another PR since they touch the public API. I would wait for Jeremy, because he is the one who has the authority on it.
This applies here, and in other places as well.
There was a problem hiding this comment.
@jjerphan In fact tests would fail without these changes, unless I do not change the validation for "sparse matrix". This involves some confusing issues and let me make things a bit clearer. I think there are two possibilities for this PR:
-
Let "sparse matrix" and "sparse container" have the same validation, i.e.,
sparse.issparse()instead of changing the validation of "sparse matrix" tosparse.isspmatrix(). In this case I can leave the parameter validations unchanged. -
Change the validation of "sparse matrix" to
sparse.isspmatrix()and add the option "sparse container". This is, however, much more complicated, because we would have to touch the public API to make tests pass. As a matter of fact, I would expect much more tests to fail with this change (meaning that more parameter validations need to be changed to make things pass). However, only a few tests failed for the following reason: "array-like" is covering sparse matrices and sparse arrays, so that if a function/class accepts both "array-like" and "sparse matrix" then there will be no error. In other words, with the current validation system, ["array-like", "sparse matrix"] is the same as ["array-like"], which I do not think is reasonable. To take this option, we may need to: (1) change "array-like" to not accept sparse stuff; (2) with the updated "sparse matrix" validation, find out all tests that are failing (I would expect a lot), and from those failing tests I think we can know which functions/classes have been tests on sparse arrays, and we can change the constraints from "sparse matrix" to "sparse container".
I'm actually not sure which option to take. Option 1 is somehow doing nothing, in which case I think we can just discard this PR. Option 2 is more reasonable, i.e., this PR is not only to extend tests but also to change the public API and fix parameter validations, but perhaps it is too much. Well, perhaps we can first make a PR to fix the problem of "array-like" covering sparse matrices and sparse arrays (if this is indeed not the desired behavior), then do the rest of Option 2 in this PR with the title changed. I do want to know what maintainers think, also @jeremiedbb.
There was a problem hiding this comment.
I think sparse containers are considered to be array-like. So option 2 works for me, but I would wait for the opinion of another maintainer.
|
I am trying to draw attention from maintainers and move the discussion to a more visible place, so I opened #28099. |
|
As explained in #28101 (comment), I think we should not introduce a separate I extracted the addition of test for sparse containers and added it to #28101 so I think we can close this PR. cc/ @glemaitre |
|
Just to add some thoughts: I think it's reasonable to close, in the sense that: though having two constraints can help us find which functions/classes have been tested against sparse arrays and which are not, it may convey a wrong message to users that "sparse array" is intentionally not supported (while the fact is that it is simply not tested). |
That's interesting. I think one day we'll probably want to deprecate sparse matrices and keep only sparse arrays. I expect that it will reveal everything that was not properly tested against sparse arrays. And the things we'll miss will probably be the functions that we already don't test against sparse matrices although we should, but having 2 constraints can't help us there. |
|
I agree that we can close this PR and revisit the concept of sparse array when they actually behave as numpy array. |
Reference Issues/PRs
Towards #27090.
What does this implement/fix? Explain your changes.
Extend validation for sparse containers. This would require changing "sparse matrix" to "sparse container" when necessary. Previous discussion at #27317.
The following are the changes I have made (because tests are failing):
Details
The following are the places where "sparse matrix" option is used in parameter validation:
Details
cluster.ward_treecluster.kmeans_pluspluscluster.k_meanscluster.spectral_clusteringdatasets.dump_svmlight_filedecomposition.non_negative_factorizationfeature_selection.mutual_info_regressionfeature_selection.mutual_info_classiffeature_selection.f_classiffeature_selection.chi2feature_selection.r_regressionfeature_selection.f_regressioninspection.partial_dependencelinear_model.lasso_pathlinear_model.enet_pathlinear_model.ridge_regressionmanifold.trustworthinessmetrics.accuracy_scoremetrics.multilabel_confusion_matrixmetrics.jaccard_scoremetrics.zero_one_lossmetrics.f1_scoremetrics.fbeta_scoremetrics.precision_recall_fscore_supportmetrics.class_likelihood_ratiosmetrics.precision_scoremetrics.recall_scoremetrics.classification_reportmetrics.hamming_lossmetrics.label_ranking_average_precision_scoremetrics.label_ranking_lossmetrics.pairwise.euclidean_distancesmetrics.pairwise_distances_argmin_minmetrics.pairwise_distances_argminmetrics.pairwise.haversine_distancesmetrics.pairwise.manhattan_distancesmetrics.pairwise.cosine_distancesmetrics.pairwise.paired_euclidean_distancesmetrics.pairwise.paired_manhattan_distancesmetrics.pairwise.paired_cosine_distancesmetrics.pairwise.linear_kernelmetrics.pairwise.polynomial_kernelmetrics.pairwise.sigmoid_kernelmetrics.pairwise.rbf_kernelmetrics.pairwise.laplacian_kernelmetrics.pairwise.cosine_similaritymetrics.pairwise_distances_chunkedmetrics.pairwise_distancesmetrics.pairwise.pairwise_kernelsmetrics.mutual_info_scoremetrics.silhouette_scoremetrics.silhouette_samplesmodel_selection.cross_validatemodel_selection.cross_val_scoremodel_selection.cross_val_predictmodel_selection.permutation_test_scoremodel_selection.learning_curvemodel_selection.validation_curvepreprocessing.scalepreprocessing.maxabs_scalepreprocessing.robust_scalepreprocessing.normalizepreprocessing.binarizepreprocessing.add_dummy_featurepreprocessing.quantile_transformsvm.l1_min_cutils.safe_maskutils.class_weight.compute_sample_weightutils.graph.single_source_shortest_path_lengthOne major reason why so many things are tested yet so few errors are raised without changing to "sparse array" is that, "array-like" seems to be covering sparse matrices and sparse arrays since they have the
shapeand__len__attributes. I'm thinking... should "array-like" not include these sparse containers (for instance, by addingand not sparse.issparse(val))? @jjerphanAlso @glemaitre who posted a comment #27317 (comment) in the original PR and @jeremiedbb who is the main author of
_param_validation.py.