[MRG+1] Add check for sparse prediction in cross_val_predict (fixes #5132) by dubstack · Pull Request #5161 · scikit-learn/scikit-learn

dubstack · 2015-08-26T02:11:46Z

This pull request is for #5132.
cross_val_predict should work for sparse 'y'

jnothman · 2015-08-26T02:19:17Z

please add a test

jnothman · 2015-08-26T02:19:42Z

sklearn/cross_validation.py

It should be possible to do with only one vstack call.

Thanks for your suggestion. I have made the appropriate changes.
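For readers following the review, the "one vstack call" idea can be sketched as below. This is a minimal illustration rather than the code from the PR: the helper name `stack_fold_predictions` and the toy fold blocks are assumptions made for the example.

```python
import numpy as np
import scipy.sparse as sp


def stack_fold_predictions(blocks):
    """Concatenate per-fold prediction blocks row-wise.

    Hypothetical helper: `blocks` is a list with one prediction block
    per cross-validation fold, either all sparse or all dense.
    """
    if sp.issparse(blocks[0]):
        # A single sp.vstack over the whole list, instead of growing
        # the result fold by fold inside a loop.
        return sp.vstack(blocks, format=blocks[0].format)
    return np.concatenate(blocks)


# Toy per-fold predictions (made up for the example).
fold_a = sp.csr_matrix([[0, 1], [1, 0]])
fold_b = sp.csr_matrix([[1, 1]])
stacked = stack_fold_predictions([fold_a, fold_b])
```

Collecting the blocks in a list and stacking once avoids building an intermediate matrix for every fold.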

jnothman · 2015-08-26T02:50:41Z

sklearn/cross_validation.py

Don't you still need this in the sparse case? Please write a test so we actually have some idea whether this is operating correctly!

Sorry for my naivety; I now realize that it would be needed in the sparse case as well.

However, I run into an efficiency issue when doing the same in the sparse case.
Running the following code snippet produces this warning:

import numpy
import scipy.sparse as sp

a = sp.csr_matrix([[0, 0], [0, 1], [1, 1]])
id = numpy.array([2, 0, 1])
b = a.copy()
b[id] = a

/Library/Python/2.7/site-packages/scipy/sparse/compressed.py:739: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient. SparseEfficiencyWarning)

Also, printing b gives the following:

print b

(0, 0) 0
(0, 1) 1
(1, 0) 1
(1, 1) 1
(2, 0) 0
(2, 1) 0

which is obviously inefficient, since the zero entries are now stored explicitly.
@jnothman
Can you please suggest a more efficient way of achieving the same result, or point out where I am going wrong?

I think for both the numpy array and sparse matrix cases, we're better off inverting locs so that we can use return preds[inv_locs]. You can build it with: inv_locs = np.empty(len(locs), dtype=int); inv_locs[locs] = np.arange(len(locs))
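The inversion suggested above can be sketched as follows, reusing the toy matrix and index array from the earlier snippet. The function name `invert_permutation` is an assumption for the example; the two lines that build `inv_locs` are the ones quoted in the comment.

```python
import numpy as np
import scipy.sparse as sp


def invert_permutation(locs):
    """Return inv_locs such that inv_locs[locs[i]] == i."""
    locs = np.asarray(locs)
    inv_locs = np.empty(len(locs), dtype=int)
    inv_locs[locs] = np.arange(len(locs))
    return inv_locs


# Row i of preds holds the prediction for original sample locs[i].
locs = np.array([2, 0, 1])
preds = sp.csr_matrix([[0, 0], [0, 1], [1, 1]])

inv_locs = invert_permutation(locs)
# Row indexing only: nothing is assigned into the sparse matrix, so its
# sparsity structure is never modified in place.
ordered = preds[inv_locs]
```

Here `ordered` holds the same values as `b` in the snippet above, but it is produced by indexing rather than by assignment, so the `SparseEfficiencyWarning` path is not taken and no explicit zeros are stored. The same indexing works unchanged when `preds` is a dense numpy array.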

@jnothman
Thanks a lot for the suggestion. I now use the inverted locs to reorder the predictions array.

jnothman · 2015-08-26T09:01:52Z

sklearn/cross_validation.py

This is no longer needed

dubstack · 2015-08-26T12:00:34Z

@jnothman

I have added a test comparing the results of dense vs. sparse cross_val_predict.
Is this a good test to cover the changes?

Currently I use make_multilabel_classification from sklearn.datasets for the input data.
Should I use some other dataset?

What n_samples should we have for the test data?
The default n_samples of make_multilabel_classification is 100.
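A dense-vs-sparse comparison of the kind described above might look roughly like this. It is a sketch under stated assumptions, not the test that was merged. It imports `cross_val_predict` from `sklearn.model_selection` (the PR itself changed `sklearn/cross_validation.py`), and the estimator, `cv=10`, `random_state=1`, and the function name are choices made for the example.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import cross_val_predict
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC


def compare_dense_and_sparse_predictions():
    """Run cross_val_predict on dense and on sparse copies of the same data."""
    # Default n_samples is 100; y is a dense multilabel indicator matrix.
    X, y = make_multilabel_classification(
        n_classes=2, n_labels=1, allow_unlabeled=False, random_state=1
    )
    X_sparse = sp.csr_matrix(X)
    y_sparse = sp.csr_matrix(y)

    # Assumed estimator: one linear SVC per label.
    clf = OneVsRestClassifier(SVC(kernel="linear"))
    preds = cross_val_predict(clf, X, y, cv=10)
    preds_sparse = cross_val_predict(clf, X_sparse, y_sparse, cv=10)

    # With sparse y the predictions come back sparse; densify to compare.
    return preds, preds_sparse.toarray()


preds, preds_from_sparse = compare_dense_and_sparse_predictions()
```

If sparse predictions are reassembled in the right order, the two arrays should match row for row.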

jnothman · 2015-08-26T12:10:11Z

Apart from minor cleanups, this looks good. Thanks!

dubstack · 2015-08-26T12:58:45Z

@jnothman Fixed the indentation issue you raised.

Thanks a lot for all your help. :)

jnothman · 2015-08-26T13:00:31Z

LGTM

dubstack · 2015-08-26T14:07:03Z

I can't seem to understand why the AppVeyor build fails on the latest commit.

ogrisel · 2015-08-27T13:17:37Z

The appveyor failure is unrelated to this PR: it looks like a segfault in sparse PCA...

ogrisel · 2015-08-27T13:22:20Z

LGTM as well. Thanks for the fix.

ogrisel · 2015-08-27T13:23:26Z

I will add a new whats_new.rst entry + fix a pep8 issue in master directly.

ogrisel · 2015-08-27T13:35:28Z

changelog: e62e9e1
pep8 fixes: ea42c55

dubstack · 2015-08-27T19:50:06Z

Thanks @ogrisel for the fixes and the merge. :)

amueller · 2015-08-27T20:17:26Z

@ogrisel did you see the segfault https://ci.appveyor.com/project/sklearn-ci/scikit-learn/build/1.0.1763/job/jmthejc7o35mtb6g before? Looks very suspicious.

-------------------- * ENH Reogranize classes/fn from grid_search into search.py * ENH Reogranize classes/fn from cross_validation into split.py * ENH Reogranize cls/fn from cross_validation/learning_curve into validate.py * MAINT Merge _check_cv into check_cv inside the model_selection module * MAINT Update all the imports to point to the model_selection module * FIX use iter_cv to iterate throught the new style/old style cv objs * TST Add tests for the new model_selection members * ENH Wrap the old-style cv obj/iterables instead of using iter_cv * ENH Use scipy's binomial coefficient function comb for calucation of nCk * ENH Few enhancements to the split module * ENH Improve check_cv input validation and docstring * MAINT _get_test_folds(X, y, labels) --> _get_test_folds(labels) * TST if 1d arrays for X introduce any errors * ENH use 1d X arrays for all tests; * ENH X_10 --> X (global var) Minor ----- * ENH _PartitionIterator --> _BaseCrossValidator; * ENH CVIterator --> CVIterableWrapper * TST Import the old SKF locally * FIX/TST Clean up the split module's tests. * DOC Improve documentation of the cv parameter * COSMIT consistently hyphenate cross-validation/cross-validator * TST Calculate n_samples from X * COSMIT Use separate lines for each import. 
* COSMIT cross_validation_generator --> cross_validator Commits merged manually ----------------------- * FIX Document the random_state attribute in RandomSearchCV * MAINT Use check_cv instead of _check_cv * ENH refactor OVO decision function, use it in SVC for sklearn-like decision_function shape * FIX avoid memory cost when sampling from large parameter grids ENH Major to Minor incremental enhancements to the model_selection Squashed commit messages - (For reference) Major ----- * ENH p --> n_labels * FIX *ShuffleSplit: all float/invalid type errors at init and int error at split * FIX make PredefinedSplit accept test_folds in constructor; Cleanup docstrings * ENH+TST KFold: make rng to be generated at every split call for reproducibility * FIX/MAINT KFold: make shuffle a public attr * FIX Make CVIterableWrapper private. * FIX reuse len_cv instead of recalculating it * FIX Prevent adding *SearchCV estimators from the old grid_search module * re-FIX In all_estimators: the sorting to use only the 1st item (name) To avoid collision between the old and the new GridSearch classes. * FIX test_validate.py: Use 2D X (1D X is being detected as a single sample) * MAINT validate.py --> validation.py * MAINT make the submodules private * MAINT Support old cv/gs/lc until 0.19 * FIX/MAINT n_splits --> get_n_splits * FIX/TST test_logistic.py/test_ovr_multinomial_iris: pass predefined folds as an iterable * MAINT expose BaseCrossValidator * Update the model_selection module with changes from master - From #5161 - - MAINT remove redundant p variable - - Add check for sparse prediction in cross_val_predict - From #5201 - DOC improve random_state param doc - From #5190 - LabelKFold and test - From #4583 - LabelShuffleSplit and tests - From #5300 - shuffle the `labels` not the `indxs` in LabelKFold + tests - From #5378 - Make the GridSearchCV docs more accurate. 
- From #5458 - Remove shuffle from LabelKFold - From #5466(#4270) - Gaussian Process by Jan Metzen - From #4826 - Move custom error / warnings into sklearn.exception Minor ----- * ENH Make the KFold shuffling test stronger * FIX/DOC Use the higher level model_selection module as ref * DOC in check_cv "y : array-like, optional" * DOC a supervised learning problem --> supervised learning problems * DOC cross-validators --> cross-validation strategies * DOC Correct Olivier Grisel's name ;) * MINOR/FIX cv_indices --> kfold * FIX/DOC Align the 'See also' section of the new KFold, LeaveOneOut * TST/FIX imports on separate lines * FIX use __class__ instead of classmethod * TST/FIX import directly from model_selection * COSMIT Relocate the random_state documentation * COSMIT remove pass * MAINT Remove deprecation warnings from old tests * FIX correct import at test_split * FIX/MAINT Move P_sparse, X, y defns to top; rm unused W_sparse, X_sparse * FIX random state to avoid doctest failure * TST n_splits and split wrapping of _CVIterableWrapper * FIX/MAINT Use multilabel indicator matrix directly * TST/DOC clarify why we conflate classes 0 and 1 * DOC add comment that this was taken from BaseEstimator * FIX use of labels is not needed in stratified k fold * Fix cross_validation reference * Fix the labels param doc FIX/DOC/MAINT Addressing the review comments by Arnaud and Andy COSMIT Sort the members alphabetically COSMIT len_cv --> n_splits COSMIT Merge 2 if; FIX Use kwargs DOC Add my name to the authors :D DOC make labels parameter consistent FIX Remove hack for boolean indices; + COSMIT idx --> indices; DOC Add Returns COSMIT preds --> predictions DOC Add Returns and neatly arrange X, y, labels FIX idx(s)/ind(s)--> indice(s) COSMIT Merge if and else to elif COSMIT n --> n_samples COSMIT Use bincount only once COSMIT cls --> class_i / class_i (ith class indices) --> perm_indices_class_i FIX/ENH/TST Addressing the final reviews COSMIT c --> count FIX/TST make check_cv raise 
ValueError for string cv value TST nested cv (gs inside cross_val_score) works for diff cvs FIX/ENH Raise ValueError when labels is None for label based cvs; TST if labels is being passed correctly to the cv and that the ValueError is being propagated to the cross_val_score/predict and grid search FIX pass labels to cross_val_score FIX use make_classification DOC Add Returns; COSMIT Remove scaffolding TST add a test to check the _build_repr helper REVERT the old GS/RS should also be tested by the common tests. ENH Add a tuple of all/label based CVS FIX raise VE even at get_n_splits if labels is None FIX Fabian's comments PEP8

Add check for sparse prediction in cross_val_predict

aa2d6bf

dubstack mentioned this pull request Aug 26, 2015

cross_val_predict should work for sparse y #5132

Closed

jnothman reviewed Aug 26, 2015

Use a single vstack for concatenating all blocks in prediction matrix

48adf8b

jnothman reviewed Aug 26, 2015

Use Inverted locations to reorder the predictions

e35c90c

jnothman reviewed Aug 26, 2015

dubstack added 2 commits August 26, 2015 16:15

Remove redundant p variable

26d3323

Add test to check sparse predictions in cross_val_predict

0c6b173

Fix minor indentation issues

b68f922

jnothman changed the title ~~[Issue 5132] Add check for sparse prediction in cross_val_predict~~ [MRG+1] Add check for sparse prediction in cross_val_predict (fixes #5132) Aug 26, 2015

ogrisel added a commit that referenced this pull request Aug 27, 2015

Merge pull request #5161 from beepee14/sparse_prediction_check

9486a7a

[MRG+1] Add check for sparse prediction in cross_val_predict (fixes #5132)

ogrisel merged commit 9486a7a into scikit-learn:master Aug 27, 2015

raghavrv mentioned this pull request Sep 10, 2015

[MRG+1] Make cross-validators data independent + Reorganize grid_search, cross_validation and learning_curve into model_selection #4294

Merged

24 tasks

raghavrv added a commit to raghavrv/scikit-learn that referenced this pull request Sep 11, 2015

From scikit-learn#5161

f004394

- MAINT remove redundant p variable - Add check for sparse prediction in cross_val_predict

raghavrv added a commit to raghavrv/scikit-learn that referenced this pull request Sep 14, 2015

From scikit-learn#5161

8ccbc47

- MAINT remove redundant p variable - Add check for sparse prediction in cross_val_predict

raghavrv added a commit to raghavrv/scikit-learn that referenced this pull request Oct 5, 2015

From scikit-learn#5161

47867ce

- MAINT remove redundant p variable - Add check for sparse prediction in cross_val_predict
