[MRG+1] Make cross-validators data independent + Reorganize grid_search, cross_validation and learning_curve into model_selection #4294
Conversation
I'm -1 for moving cross_validation generators if they're getting a rewrite and we want to be able to use the same names.
(force-pushed c652e0f to 758cdc2)
It's fine to do so as long as you ensure that they're released at the same time.

On 26 February 2015 at 00:10, ragv notifications@github.com wrote:
Sure! I'll make sure all these go into the
I agree, but that would make reviewing a tad easier, with fewer diffs to look at, I feel :)

Or wait... like @amueller said in an earlier comment, we could put this in a branch and send the next two PRs to that... would that be better?
Really, as I said in other places, I thought the whole point of doing the move now was having a deprecation path for the cross-validation objects.
Agreed.

Sent from my phone. Please forgive brevity and misspelling.

On Feb 25, 2015, at 19:34, Andreas Mueller notifications@github.com wrote:
+1
(force-pushed 758cdc2 to 5d28b86)
As said earlier, I am converting this PR into a full fix for #2904 alone... Though I am not even 20% done with that goal, it would be helpful to know early on if there are any objections :) Will this fix (without bundling in the fix for #1848) be considered for merge? Andreas has expressed his +1 towards it... Any more suggestions / critiques?
I think it is pretty safe to go ahead with that. Bundling multiple changes in a PR is rarely a good idea, as long as they can be separated in a sensible way.

Thanks for the comment :)
(force-pushed 6c4edb4 to e42cc57)
sklearn/cross_validation.py
Outdated
@amueller unique_labels and n_unique_labels were previously defined in the __init__ and hence were publicly available as attributes... Since we deprecate initializing data in the __init__, I've deprecated them this way... Could you kindly let me know if this seems sane? (Or should we just remove those attributes without any deprecation?)
You should just use the @deprecated decorator on the properties and not define a new function.
Otherwise this is the right way.
This is not so much about initializing in __init__ (which wouldn't be a problem here since these are not estimators) as about warning only if someone actually uses them. But you did the right thing.
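To illustrate the reviewer's suggestion, here is a minimal sketch of deprecating an attribute via a decorated property. The `deprecated` helper below is a simplified stand-in for `sklearn.utils.deprecated`, and the class and values are illustrative, not the PR's actual code:

```python
import warnings

def deprecated(message):
    # Simplified stand-in for sklearn.utils.deprecated, applied to a
    # property: warn on attribute access, then delegate to the original getter.
    def decorate(prop):
        def wrapper(self):
            warnings.warn(message, DeprecationWarning)
            return prop.fget(self)
        return property(wrapper, doc=prop.__doc__)
    return decorate

class StratifiedKFoldLike:
    # Illustrative container; only the deprecation mechanics matter here.
    def __init__(self, y):
        self._unique_labels = sorted(set(y))
        self._n_unique_labels = len(self._unique_labels)

    @deprecated("unique_labels is deprecated and will be removed")
    @property
    def unique_labels(self):
        return self._unique_labels

    @deprecated("n_unique_labels is deprecated and will be removed")
    @property
    def n_unique_labels(self):
        return self._n_unique_labels
```

Reading `cv.unique_labels` still returns the old value but now emits a DeprecationWarning, so existing user code keeps working during the deprecation period.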
Does this seem like a good docstring for all the split methods? [Sorry for commenting again]
@ogrisel Will this be included in 0.16? (if not, is it safe to remove
(force-pushed d7ea617 to 322ee69)
Sorry I am just piling one comment upon another :/ How would you feel if
(force-pushed 322ee69 to 37dd8f5)
sklearn/cross_validation.py
Outdated
Why is this called y? Maybe "array" or something would be better?
Data is X. y isn't data, is it? But this would also be applied to y, right? Or either? Sorry, I haven't read enough of the rest of the PR.
This will not be included in 0.16, but bootstrap will only be removed in 0.17.
You are right, bootstrap can safely be removed, and will be in #4370 (I'll fix that in a second).
@amueller Thanks a lot for the comments!! :) That cleared up a few things... :) Just one additional issue - How would you feel if
Needs fit_params as we discussed. Maybe we want to do the test for all CVs and use a mock estimator?
I went through the code and tried it on my laptop. All works as expected. All I could come up with are those very minor comments. So +1 from me.
@fabianp Thanks for the reviews and the +1, Fabian :)
sklearn/model_selection/__init__.py
Outdated
Those constants shouldn't be defined here.
Can we have a separate file, _constants.py, inside model_selection? (This will also hold the best-param dict that will be added by @zermelozf soon?)
Also @amueller
Have moved this to _validation.py as advised by Arnaud and Andy IRL.
Main Commits

Major
-----
* ENH Reorganize classes/fn from grid_search into search.py
* ENH Reorganize classes/fn from cross_validation into split.py
* ENH Reorganize cls/fn from cross_validation/learning_curve into validate.py
* MAINT Merge _check_cv into check_cv inside the model_selection module
* MAINT Update all the imports to point to the model_selection module
* FIX use iter_cv to iterate through the new style/old style cv objs
* TST Add tests for the new model_selection members
* ENH Wrap the old-style cv obj/iterables instead of using iter_cv
* ENH Use scipy's binomial coefficient function comb for calculation of nCk
* ENH Few enhancements to the split module
* ENH Improve check_cv input validation and docstring
* MAINT _get_test_folds(X, y, labels) --> _get_test_folds(labels)
* TST if 1d arrays for X introduce any errors
* ENH use 1d X arrays for all tests
* ENH X_10 --> X (global var)

Minor
-----
* ENH _PartitionIterator --> _BaseCrossValidator
* ENH CVIterator --> CVIterableWrapper
* TST Import the old SKF locally
* FIX/TST Clean up the split module's tests.
* DOC Improve documentation of the cv parameter
* COSMIT consistently hyphenate cross-validation/cross-validator
* TST Calculate n_samples from X
* COSMIT Use separate lines for each import.
* COSMIT cross_validation_generator --> cross_validator

Commits merged manually
-----------------------
* FIX Document the random_state attribute in RandomSearchCV
* MAINT Use check_cv instead of _check_cv
* ENH refactor OVO decision function, use it in SVC for sklearn-like decision_function shape
* FIX avoid memory cost when sampling from large parameter grids
Squashed commit messages - (For reference)
Major
-----
* ENH p --> n_labels
* FIX *ShuffleSplit: all float/invalid type errors at init and int error at split
* FIX make PredefinedSplit accept test_folds in constructor; Cleanup docstrings
* ENH+TST KFold: make rng to be generated at every split call for reproducibility
* FIX/MAINT KFold: make shuffle a public attr
* FIX Make CVIterableWrapper private.
* FIX reuse len_cv instead of recalculating it
* FIX Prevent adding *SearchCV estimators from the old grid_search module
* re-FIX In all_estimators: the sorting to use only the 1st item (name)
To avoid collision between the old and the new GridSearch classes.
* FIX test_validate.py: Use 2D X (1D X is being detected as a single sample)
* MAINT validate.py --> validation.py
* MAINT make the submodules private
* MAINT Support old cv/gs/lc until 0.19
* FIX/MAINT n_splits --> get_n_splits
* FIX/TST test_logistic.py/test_ovr_multinomial_iris:
pass predefined folds as an iterable
* MAINT expose BaseCrossValidator
* Update the model_selection module with changes from master
- From scikit-learn#5161
- - MAINT remove redundant p variable
- - Add check for sparse prediction in cross_val_predict
- From scikit-learn#5201 - DOC improve random_state param doc
- From scikit-learn#5190 - LabelKFold and test
- From scikit-learn#4583 - LabelShuffleSplit and tests
- From scikit-learn#5300 - shuffle the `labels` not the `indxs` in LabelKFold + tests
- From scikit-learn#5378 - Make the GridSearchCV docs more accurate.
- From scikit-learn#5458 - Remove shuffle from LabelKFold
- From scikit-learn#5466(scikit-learn#4270) - Gaussian Process by Jan Metzen
- From scikit-learn#4826 - Move custom error / warnings into sklearn.exception
Minor
-----
* ENH Make the KFold shuffling test stronger
* FIX/DOC Use the higher level model_selection module as ref
* DOC in check_cv "y : array-like, optional"
* DOC a supervised learning problem --> supervised learning problems
* DOC cross-validators --> cross-validation strategies
* DOC Correct Olivier Grisel's name ;)
* MINOR/FIX cv_indices --> kfold
* FIX/DOC Align the 'See also' section of the new KFold, LeaveOneOut
* TST/FIX imports on separate lines
* FIX use __class__ instead of classmethod
* TST/FIX import directly from model_selection
* COSMIT Relocate the random_state documentation
* COSMIT remove pass
* MAINT Remove deprecation warnings from old tests
* FIX correct import at test_split
* FIX/MAINT Move P_sparse, X, y defns to top; rm unused W_sparse, X_sparse
* FIX random state to avoid doctest failure
* TST n_splits and split wrapping of _CVIterableWrapper
* FIX/MAINT Use multilabel indicator matrix directly
* TST/DOC clarify why we conflate classes 0 and 1
* DOC add comment that this was taken from BaseEstimator
* FIX use of labels is not needed in stratified k fold
* Fix cross_validation reference
* Fix the labels param doc
* COSMIT Sort the members alphabetically
* COSMIT len_cv --> n_splits
* COSMIT Merge 2 if; FIX Use kwargs
* DOC Add my name to the authors :D
* DOC make labels parameter consistent
* FIX Remove hack for boolean indices; COSMIT idx --> indices
* DOC Add Returns
* COSMIT preds --> predictions
* DOC Add Returns and neatly arrange X, y, labels
* FIX idx(s)/ind(s) --> indice(s)
* COSMIT Merge if and else to elif
* COSMIT n --> n_samples
* COSMIT Use bincount only once
* COSMIT cls --> class_i / class_i (ith class indices) --> perm_indices_class_i
* COSMIT c --> count
* FIX/TST make check_cv raise ValueError for string cv value
* TST nested cv (gs inside cross_val_score) works for diff cvs
* FIX/ENH Raise ValueError when labels is None for label based cvs; TST if labels is being passed correctly to the cv and that the ValueError is being propagated to the cross_val_score/predict and grid search
* FIX pass labels to cross_val_score
* FIX use make_classification
* DOC Add Returns; COSMIT Remove scaffolding
* TST add a test to check the _build_repr helper
* REVERT the old GS/RS should also be tested by the common tests.
* ENH Add a tuple of all/label based CVs
* FIX raise VE even at get_n_splits if labels is None
* FIX Fabian's comments
* PEP8
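One of the behaviors covered by the squashed commits above is nested cross-validation: a grid search run inside cross_val_score. A hedged sketch of that setup follows; the dataset, estimator, and parameter grid are illustrative, not taken from the PR's tests:

```python
# Nested CV sketch: GridSearchCV (inner loop) evaluated by
# cross_val_score (outer loop). Names follow the model_selection module
# this PR introduces; the concrete data and grid are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=60, random_state=0)

inner_cv = KFold(n_splits=3)   # used by the grid search
outer_cv = KFold(n_splits=4)   # used by cross_val_score

gs = GridSearchCV(SVC(), param_grid={"C": [0.1, 1.0]}, cv=inner_cv)
scores = cross_val_score(gs, X, y, cv=outer_cv)  # one score per outer fold
```

Because both loops accept any cross-validator, the tests can swap in different cv objects (the "diff cvs" in the commit message above).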
Blame me if anything breaks.
🍻
Thanks a lot Andy, Vlad, Joel and Arnaud for the reviews and merge 🍻 :)
So how to use it now, starting from 0?

On Fri, Oct 23, 2015 at 11:50 AM, Raghav R V notifications@github.com
fixes #1848, #2904
Things to be done after this PR - Issue at #5053 ( PR at #5569 )
TODO
- model_selection module
- check_cv (so both old style and new style classes can be used)
- model_selection tests pass
- [MRG+1] FIX all the examples to use the new cv classes raghavrv/scikit-learn#3 - merged into [MRG] Model selection documentation raghavrv/scikit-learn#4!

MINOR

- Rename p to a better name. - Moved to [RFC] Changes to model_selection? #5053
- _check_is_partition --> _check_is_permutation?
- _empty_mask

Open Discussions

- labels arg - Refer discussion here (#4294 (comment)) and here (#4294 (comment)) - labels added to the last.
- __init__... - Now successive split calls return similar results (when random_state is set)
- labels to the inner cv in permutation_test_score - Refer #4294 (comment) - ping @agramfort - For now yes. (#4294 (comment))
- Make CVIteratorWrapper private? - Refer #4294 (comment) - Made private

NOTE:
The current implementation will still not support nesting EstimatorCV inside GridSearchCV... This will become possible only after allowing sample_properties to pass from GridSearchCV to EstimatorCV...

PRs whose changes to g_s / c_v / l_c have been manually incorporated into this:
#4714 - Svm decision function shape - 1 commit
#4829 - merge _check_cv into check_cv ... - 1 commit
#4857 - Document missing attribute random_state in RandomizedSearchCV - 1 commit
#4840 - FIX avoid memory cost when sampling from large parameter grids - 1 commit
#5194 (Refer #5238) - Consistent CV docstring
#5161 check for sparse pred in cross_val_predict
#5201 clarify random state in KFold
#5190 LabelKFold
#4583 LabelShuffleSplit
#5283 Remove some warnings in grid search tests
#5300 shuffle labels not idxs and tests to ensure it.

This PR is slightly based upon @pignacio's work in #3340.
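For readers landing here from the issue links: the headline change of this PR is that cross-validators become data independent. A minimal sketch of the before/after, with illustrative data (the old constructor signature is shown only in comments):

```python
# Sketch of the API change: cross-validators no longer take the data
# (or the sample count n) in __init__; the data goes to split() instead.
import numpy as np
from sklearn.model_selection import KFold  # new, data-independent class

X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2

# Old style (sklearn.cross_validation) needed the sample count up front:
#     cv = KFold(n=10, n_folds=5)
# New style is constructed without data and can be reused across datasets:
cv = KFold(n_splits=5, shuffle=True, random_state=0)

n_splits = cv.get_n_splits(X, y)   # 5
folds = list(cv.split(X, y))       # 5 (train_indices, test_indices) pairs
```

Because the object carries no data, the same `cv` instance can be handed to GridSearchCV, cross_val_score, etc. for different datasets, which is what makes the deprecation path for the old classes workable.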
@amueller's hack:
if you want to align diffs you can do this (in ipython notebook)