
DOC: clarify model selection issues #432

Merged
TomAugspurger merged 68 commits into dask:master from stsievert:model-selection-docs
May 6, 2020

Conversation

@stsievert
Member

@stsievert stsievert commented Nov 30, 2018

What does this PR implement?
This PR makes the following changes:

  • Clarifies in the documentation how different model selection problems can be constrained.
  • Changes the IncrementalSearchCV default decay_rate to decay_rate=0 (but recommends decay_rate=1 in the lead section of the docs)
  • Adds documentation on determining Hyperband's inputs, plus a performance comparison
  • Wording change: *SearchCVs are "estimators" and they receive "models"
  • Small documentation fixes (clearing warnings, style fixes)
  • Makes the test_same_models_with_random_state test more stable. It only passed intermittently on my machine; now it passes every time.

The biggest change is clearly explaining the issues in model selection. It reorganizes into this:

Two scenarios can occur during hyper-parameter optimization. The hyper-parameter search can be both

  1. compute constrained
  2. memory constrained

[examples of problems]

[sections mentioning classes best suited for each problem]

...

[section on adaptive hyperparameter optimization]

With this documentation, changing the default decay_rate in IncrementalSearchCV makes sense (see #432 (comment)). With that default, it's the only class that fills the "memory constrained but not compute constrained" class of problems (almost; some of dask-searchcv spills over into this category too).
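The taxonomy above can be paraphrased as a small decision sketch. The class names are the ones discussed in this PR, but the exact mapping below is my reading of the thread, not official dask-ml guidance:

```python
# Rough paraphrase of the constraint taxonomy, mapping a problem's
# constraint profile to the dask-ml class discussed in this thread.
# Illustrative only -- not official dask-ml guidance.

def suggest_search_class(compute_constrained: bool, memory_constrained: bool) -> str:
    if memory_constrained and not compute_constrained:
        # Passive search with full passes over larger-than-memory data:
        # IncrementalSearchCV with the new default decay_rate=0.
        return "IncrementalSearchCV(decay_rate=0)"
    if compute_constrained:
        # Adaptive searches spend the training budget on promising models.
        return "HyperbandSearchCV"
    # Neither constraint binds: plain exhaustive/random search is fine.
    return "GridSearchCV / RandomizedSearchCV"

print(suggest_search_class(compute_constrained=False, memory_constrained=True))
```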

Other issues/PRs
This PR closes #388.

The documentation refactor has introduced the ability to add new classes, now that it's clear what the limitations of each approach are.

This fits well with the "computationally constrained but not memory constrained" class of model selection problems. I'm leaving it as a separate branch right now because I think #221 is too bloated; it needs a lot more work.

@stsievert
Member Author

stsievert commented Nov 30, 2018

My criterion for changing the default: where would IncrementalSearchCV provide the most value?

I think it provides the most value in "memory- but not compute-constrained" problems. This means that decay_rate should default to 0, because these problems aren't compute constrained.

It fits best here because IncrementalSearchCV works better for large datasets: it can make a full pass over the dataset because it provides access to partial_fit. dask-searchcv can't do this because it only has access to fit, so it fits each chunk of the data independently (_search.py#L416-L422). I should update dask-searchcv's docs to reflect this.

Without changing the default, there isn't a class that fits into "memory but not compute constrained". I don't think DaskGridSearchCV(Incremental(model), params) would work, especially if more than one pass through the data is wanted.

This idea matches the documentation:

dask_ml.wrappers.Incremental currently does not work well with hyper-parameter optimization like sklearn.model_selection.GridSearchCV. If you need to do hyper-parameter optimization on larger-than-memory datasets, we recommend dask_ml.model_selection.IncrementalSearch

http://ml.dask.org/incremental.html
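The partial_fit-over-chunks behavior described above can be mimicked with a tiny pure-Python sketch. ToyMean is a stand-in estimator and a plain list of lists stands in for a chunked Dask array; neither is a dask-ml class:

```python
# Sketch of why partial_fit lets a search see the whole dataset one chunk
# at a time, while a fit-only estimator needs all the data at once.
# ToyMean is illustrative only, not a dask-ml estimator.

class ToyMean:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def partial_fit(self, chunk):
        # Incrementally update sufficient statistics, one chunk at a time,
        # so the full dataset never has to be in memory at once.
        self.count += len(chunk)
        self.total += sum(chunk)
        return self

    @property
    def mean_(self):
        return self.total / self.count

chunks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # stand-in for a chunked array
model = ToyMean()
for chunk in chunks:                             # one full pass over all chunks
    model.partial_fit(chunk)
print(model.mean_)                               # → 3.5
```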

TODO:

  • add to dask-searchcv documentation to clarify how Dask arrays are handled
  • change default decay_rate of IncrementalSearchCV
  • take dask-searchcv off "memory but not compute constrained" (but add a note below).

@stsievert stsievert changed the title DOC: clarify model selection issues WIP: DOC: clarify model selection issues Nov 30, 2018
@stsievert stsievert changed the title WIP: DOC: clarify model selection issues DOC: clarify model selection issues Nov 30, 2018
@stsievert stsievert force-pushed the model-selection-docs branch from 23f36e8 to 91feb57 Compare December 1, 2018 03:36
@stsievert stsievert mentioned this pull request Dec 10, 2018
7 tasks
@TomAugspurger
Member

it fits to each chunk of the data independently (_search.py#L416-L422). I should add to dask-searchcv's docs to reflect this.

I don't think that's correct. The algorithms in _search.py mirror scikit-learn. I believe the Xs and ys you linked to are replicas of the training dataset (e.g. the block under the comment `# Fit the estimator on the training data`). The actual .fit calls will take place on CV splits of the training dataset.

@stsievert
Member Author

stsievert commented Dec 29, 2018

Thanks for the review @TomAugspurger! I found it useful, especially the wording choices.

The actual fit calls will take place on CV splits of the training dataset.

Right. I've written the docs to mention that each CV split is gathered to one machine for most estimators. Can you review this?

I know dask_ml.Incremental aliases fit to call partial_fit on each chunk of the Dask array. Would that work with dask-searchcv's searches? I believe dask-searchcv computes each CV split regardless of what fit can take, but I'm having a hard time seeing that.
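My understanding of that aliasing, as a toy sketch. ToyIncremental and CountingEstimator are illustrative stand-ins, not the dask_ml.wrappers.Incremental implementation (which handles real Dask collections, scheduling, and bookkeeping):

```python
# Toy sketch of a wrapper that aliases .fit to repeated .partial_fit calls,
# one per chunk. Illustrative only; not the dask-ml implementation.

class CountingEstimator:
    """Stand-in estimator that just counts partial_fit calls and rows seen."""
    def __init__(self):
        self.n_calls = 0
        self.n_rows = 0

    def partial_fit(self, chunk):
        self.n_calls += 1
        self.n_rows += len(chunk)
        return self

class ToyIncremental:
    """fit(chunks) == one partial_fit per chunk, so the wrapped model sees
    the whole dataset without it ever being in memory at once."""
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, chunks):
        for chunk in chunks:
            self.estimator.partial_fit(chunk)
        return self

wrapped = ToyIncremental(CountingEstimator())
wrapped.fit([[1, 2], [3, 4], [5]])        # three chunks, five rows total
print(wrapped.estimator.n_calls, wrapped.estimator.n_rows)   # → 3 5
```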

@stsievert
Member Author

This PR now depends on #221 (but implicitly, which is why the Travis CI tests are failing).

@stsievert
Member Author

I'm drafting a paper for the SciPy 2019 proceedings. In this, I explicitly mention that decay_rate=0 by default. I use this to show that IncrementalSearchCV is best suited for memory-constrained problems.

@stsievert stsievert force-pushed the model-selection-docs branch from 3443ec1 to 1415ca5 Compare April 9, 2020 19:12
@TomAugspurger
Member

To the extent possible, I'd recommend keeping documentation changes, additional tests, and behavior changes separate. That makes things much easier for me to review.

This PR is solely devoted to documentation improvements
and changing the default decay_rate
@stsievert
Member Author

I've tried to do that with two PRs:

  • This PR is focused around documentation improvements to docs/source/hyperparameter.rst. In addition to making these improvements, this PR also...
    • Deprecates decay_rate. This makes IncrementalSearchCV better fit in "memory but not compute constrained".
    • Implements a new class InverseDecaySearchCV. This class is a wrapper for the previous implementation of IncrementalSearchCV._adapt.
    • Makes other small documentation improvements (adding to documentation of dask-searchcv, linking in Hyperband docs, etc)
  • MAINT: Clean InverseDecaySearchCV #651: cleaning InverseDecaySearchCV. This PR implements and tests improvements to InverseDecaySearchCV._adapt.

Will that be sufficient, or should I remove the IncrementalSearchCV deprecation as well? That would mean this PR only modifies *.rst files/docstrings and #651 only modifies *.py files.

Member

@TomAugspurger TomAugspurger left a comment


I'm still not 100% sure on the reasons for deprecating decay_rate and the inclusion of InverseDecaySearchCV. Can you try re-explaining it to me?

@stsievert
Member Author

stsievert commented Apr 22, 2020

Here are some arguments for deprecating decay_rate:

  1. (documentation) IncrementalSearchCV is the only class that scales to "memory- but not compute-constrained" searches. I think these searches are important and common enough to warrant an implementation that works by default.

    I would imagine this is the most prominent usage of Dask-ML's hyperparameter optimization: "memory- but not compute-constrained" includes training linear models on large datasets.
  2. (API) decay_rate is an algorithm-specific parameter. Other adaptive algorithms won't need decay_rate. Case in point: SuccessiveHalvingSearchCV doesn't take decay_rate, and other adaptive algorithms like (the popular) median elimination won't either. This leads to a violation of the Liskov substitution principle.
  3. (weaker API) Without deprecating decay_rate, IncrementalSearchCV will have two useful implementations.

I don't know how decay_rate performs if decay_rate is not in [0, 1]. I don't even really know how decay_rate=1 performs, but I at least know the motivation and have performance results for a related implementation, SuccessiveHalvingSearchCV.

(1) is the motivation for including the deprecation in this PR.
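To make the passive-vs-adaptive distinction concrete, here is a loose paraphrase of an inverse-decay keep-schedule. The formula below is my illustration of the idea, not the exact dask-ml implementation:

```python
# Loose paraphrase (not the exact dask-ml formula): after time step t, an
# inverse-decay schedule keeps roughly
#     n_keep(t) = max(1, ceil(n0 / (t + 1) ** decay_rate))
# models. decay_rate=0 keeps everything (a passive, non-adaptive search);
# decay_rate=1 culls the population as training progresses.

import math

def n_keep(n0: int, t: int, decay_rate: float) -> int:
    return max(1, math.ceil(n0 / (t + 1) ** decay_rate))

print([n_keep(10, t, 0) for t in range(4)])   # → [10, 10, 10, 10]  passive
print([n_keep(10, t, 1) for t in range(4)])   # → [10, 5, 4, 3]     adaptive
```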

Member Author

@stsievert stsievert left a comment


I've changed some tests for various tangential reasons (faster, silencing warnings, etc). These comments address these changes, or ask for clarifying comments.

@TomAugspurger
Member

FYI, going to ignore the sklearn dev failures for now. We'll need to add n_features_in_ to each estimator, which will take a bit of work.

@stsievert
Member Author

stsievert commented May 6, 2020

It looks like all the CI failures are unrelated. There are a couple of other warnings besides the one about n_features_in_:

n_features_in_:

# data = dask.array<array, shape=(10, 4), dtype=float64, chunksize=(5, 4), chunktype=numpy.ndarray>

@pytest.mark.parametrize("data", [X, dX, df, ddf])
def test_fit(data):
    a = sklearn.impute.SimpleImputer()
    b = dask_ml.impute.SimpleImputer()
    
    a.fit(X)
    b.fit(data)
    
    assert_estimator_equal(a, b)

This raises AssertionError: {'n_features_in_'}. The accompanying FutureWarning is:

As of scikit-learn 0.23, estimators should expose a n_features_in_ attribute, unless the 'no_validation' tag is True. This attribute should be equal to the number of features passed to the fit method. An error will be raised from version 0.25 when calling check_estimator(). See SLEP010: https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep010/proposal.html
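The SLEP010 fix is mechanical: each estimator's fit records how many features it saw. A minimal sketch of the convention (ToyEstimator is illustrative, not a dask-ml class):

```python
# Minimal sketch of the SLEP010 convention the warning asks for: fit sets
# n_features_in_ to the number of columns seen. Illustrative only.

class ToyEstimator:
    def fit(self, X, y=None):
        # X is a 2-D, row-major dataset; record its column count.
        self.n_features_in_ = len(X[0])
        return self

est = ToyEstimator().fit([[0.0] * 4] * 10)   # 10 rows, 4 features
print(est.n_features_in_)                    # → 4
```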

PartialPassiveAggressiveClassifier.average_intercept_:

# file: TestPassiveAggressiveClassifier.test_basic

def test_basic(self, single_chunk_classification):
    X, y = single_chunk_classification
    a = lm.PartialPassiveAggressiveClassifier(classes=[0, 1], random_state=0, max_iter=100, tol=1e-3)
    b = lm_.PassiveAggressiveClassifier(random_state=0, max_iter=100, tol=1e-3)
    a.fit(X, y)
    b.partial_fit(*dask.compute(X, y), classes=[0, 1])
    assert_estimator_equal(a, b, exclude=["loss_function_"])

This also raises a FutureWarning:

FutureWarning: Attribute average_intercept_ was deprecated in version 0.23 and will be removed in 0.25.

Passing keyword arguments:

def test_check_cv():
    # No y, classifier=False
    cv = check_cv(3, classifier=False)

FutureWarning: Pass classifier=False as keyword args. From version 0.25 passing these as positional arguments will result in an error

This also shows up in TestPolynomialFeatures:

# ______________________ TestPolynomialFeatures.test_basic _______________________

# self = <tests.preprocessing.test_data.TestPolynomialFeatures object at 0x7ff8848d8a90>

def test_basic(self):
    a = dpp.PolynomialFeatures()

FutureWarning: Pass interaction_only=False, include_bias=True as keyword args. From version 0.25 passing these as positional arguments will result in an error

There are lots of warnings about keyword arguments and n_features_in_.
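The keyword-argument warnings all point at the same migration: parameters after the deprecation window become keyword-only, so the fix is to pass them by keyword now. A pure-Python illustration (check_cv_sketch is a hypothetical stand-in, not scikit-learn's check_cv):

```python
# Illustration of the direction scikit-learn is heading: "*" in the
# signature makes trailing parameters keyword-only, so positional calls
# fail loudly instead of warning. Hypothetical signature, not sklearn's.

def check_cv_sketch(cv=5, *, classifier=False):
    return ("stratified" if classifier else "kfold", cv)

print(check_cv_sketch(3, classifier=False))   # → ('kfold', 3)

try:
    check_cv_sketch(3, False)                 # positional: now a hard error
except TypeError as exc:
    print("TypeError:", exc)
```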

@TomAugspurger
Member

I fixed the PartialPassiveAggressiveClassifier.average_intercept_ warning in another branch.

I think this is ready to go. All good on your end?

@TomAugspurger
Member

Going to merge so we can include this in the 1.4.0 release. I can make another release anytime if needed.

@TomAugspurger TomAugspurger merged commit c535a51 into dask:master May 6, 2020
@stsievert
Member Author

I think this is ready to go. All good on your end?

Yup! I'm glad to see this merged.



Development

Successfully merging this pull request may close these issues.

IncrementalSearchCV runs an adaptive algorithm by default
