
DOC: clarify model selection issues #432

Merged
TomAugspurger merged 68 commits into dask:master from stsievert:model-selection-docs
May 6, 2020

Conversation

@stsievert
Member

@stsievert stsievert commented Nov 30, 2018

What does this PR implement?
This PR makes the following changes:

  • Clarifies in the documentation how different model selection problems can be constrained.
  • Changes the IncrementalSearchCV default decay_rate to decay_rate=0 (but recommends decay_rate=1 in the lead section of the docs)
  • Adds documentation on determining Hyperband's inputs, plus a performance comparison
  • Wording change: *SearchCVs are "estimators" and they receive "models"
  • Small documentation fixes (clearing warnings, style fixes)
  • Makes the test_same_models_with_random_state test more stable. It only passed intermittently on my machine; now it passes every time.

The biggest change is clearly explaining the issues in model selection. It reorganizes into this:

Two scenarios can occur during hyper-parameter optimization. The hyper-parameter search can be both

  1. compute constrained
  2. memory constrained

[examples of problems]

[sections mentioning classes best suited for each problem]

...

[section on adaptive hyperparameter optimization]

With this documentation, changing the default decay_rate in IncrementalSearchCV makes sense (see #432 (comment)). With that default, it's the only class that fills the "memory constrained but not compute constrained" class of problems (almost; some of dask-searchcv spills over into this category too).
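The taxonomy above can be paraphrased as a small decision sketch. The class names are the ones discussed in this PR, but the exact mapping below is my reading of the thread, not official dask-ml guidance:

```python
# Rough paraphrase of the constraint taxonomy, mapping a problem's
# constraint profile to the dask-ml class discussed in this thread.
# Illustrative only -- not official dask-ml guidance.

def suggest_search_class(compute_constrained: bool, memory_constrained: bool) -> str:
    if memory_constrained and not compute_constrained:
        # Passive search with full passes over larger-than-memory data:
        # IncrementalSearchCV with the new default decay_rate=0.
        return "IncrementalSearchCV(decay_rate=0)"
    if compute_constrained:
        # Adaptive searches spend the training budget on promising models.
        return "HyperbandSearchCV"
    # Neither constraint binds: plain exhaustive/random search is fine.
    return "GridSearchCV / RandomizedSearchCV"

print(suggest_search_class(compute_constrained=False, memory_constrained=True))
```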

Other issues/PRs
This PR closes #388.

The documentation refactor has introduced the ability to add new classes, now that it's clear what the limitations of each approach are.

This fits well with the "computationally constrained but not memory constrained" class of model selection problems. I'm leaving it as a separate branch right now because I think #221 is too bloated; it needs a lot more work.

@stsievert
Member Author

stsievert commented Nov 30, 2018

My criterion for changing the default: where would IncrementalSearchCV provide the most value?

I think it provides the most value in "memory- but not compute-constrained" problems. This means that decay_rate should default to 0, because these problems aren't compute constrained.

It fits best here because IncrementalSearchCV works better for large datasets: it can make a full pass over the dataset because it provides access to partial_fit. dask-searchcv can't do this because it only has access to fit, so it fits each chunk of the data independently (_search.py#L416-L422). I should update dask-searchcv's docs to reflect this.

Without changing the default, there isn't a class that fits into "memory but not compute constrained". I don't think DaskGridSearchCV(Incremental(model), params) would work, especially if more than one pass through the data is wanted.

This idea matches the documentation:

dask_ml.wrappers.Incremental currently does not work well with hyper-parameter optimization like sklearn.model_selection.GridSearchCV. If you need to do hyper-parameter optimization on larger-than-memory datasets, we recommend dask_ml.model_selection.IncrementalSearch

http://ml.dask.org/incremental.html
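The partial_fit-over-chunks behavior described above can be mimicked with a tiny pure-Python sketch. ToyMean is a stand-in estimator and a plain list of lists stands in for a chunked Dask array; neither is a dask-ml class:

```python
# Sketch of why partial_fit lets a search see the whole dataset one chunk
# at a time, while a fit-only estimator needs all the data at once.
# ToyMean is illustrative only, not a dask-ml estimator.

class ToyMean:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def partial_fit(self, chunk):
        # Incrementally update sufficient statistics, one chunk at a time,
        # so the full dataset never has to be in memory at once.
        self.count += len(chunk)
        self.total += sum(chunk)
        return self

    @property
    def mean_(self):
        return self.total / self.count

chunks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # stand-in for a chunked array
model = ToyMean()
for chunk in chunks:                             # one full pass over all chunks
    model.partial_fit(chunk)
print(model.mean_)                               # → 3.5
```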

TODO:

  • add to dask-searchcv documentation to clarify how Dask arrays are handled
  • change default decay_rate of IncrementalSearchCV
  • take dask-searchcv off "memory but not compute constrained" (but add a note below).

@stsievert stsievert changed the title DOC: clarify model selection issues WIP: DOC: clarify model selection issues Nov 30, 2018
@stsievert stsievert changed the title WIP: DOC: clarify model selection issues DOC: clarify model selection issues Nov 30, 2018
@stsievert stsievert force-pushed the model-selection-docs branch from 23f36e8 to 91feb57 Compare December 1, 2018 03:36
@stsievert stsievert mentioned this pull request Dec 10, 2018
7 tasks
@TomAugspurger
Member

it fits to each chunk of the data independently (_search.py#L416-L422). I should add to dask-searchcv's docs to reflect this.

I don't think that's correct. The algorithms in _search.py mirror scikit-learn. I believe the Xs and ys you linked to are replicas of the training dataset (e.g. the block under the comment `# Fit the estimator on the training data`). The actual .fit calls will take place on CV splits of the training dataset.

@stsievert
Member Author

stsievert commented Dec 29, 2018

Thanks for the review @TomAugspurger! I found it useful, especially the wording choices.

The actual fit calls will take place on CV splits of the training dataset.

Right. I've written the docs to mention that each CV split is gathered to one machine for most estimators. Can you review this?

I know dask_ml.Incremental aliases fit to call partial_fit on each chunk of the Dask array. Would that work with dask-searchcv's searches? I believe dask-searchcv computes each CV split regardless of what fit can take, but I'm having a hard time seeing that.
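My understanding of that aliasing, as a toy sketch. ToyIncremental and CountingEstimator are illustrative stand-ins, not the dask_ml.wrappers.Incremental implementation (which handles real Dask collections, scheduling, and bookkeeping):

```python
# Toy sketch of a wrapper that aliases .fit to repeated .partial_fit calls,
# one per chunk. Illustrative only; not the dask-ml implementation.

class CountingEstimator:
    """Stand-in estimator that just counts partial_fit calls and rows seen."""
    def __init__(self):
        self.n_calls = 0
        self.n_rows = 0

    def partial_fit(self, chunk):
        self.n_calls += 1
        self.n_rows += len(chunk)
        return self

class ToyIncremental:
    """fit(chunks) == one partial_fit per chunk, so the wrapped model sees
    the whole dataset without it ever being in memory at once."""
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, chunks):
        for chunk in chunks:
            self.estimator.partial_fit(chunk)
        return self

wrapped = ToyIncremental(CountingEstimator())
wrapped.fit([[1, 2], [3, 4], [5]])        # three chunks, five rows total
print(wrapped.estimator.n_calls, wrapped.estimator.n_rows)   # → 3 5
```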

@stsievert
Member Author

This PR now depends on #221 (but implicitly, which is why the Travis CI tests are failing).

@stsievert
Member Author

I'm drafting a paper for the SciPy 2019 proceedings. In this, I explicitly mention that decay_rate=0 by default. I use this to show that IncrementalSearchCV is best suited for memory-constrained problems.

@stsievert stsievert force-pushed the model-selection-docs branch from 3443ec1 to 1415ca5 Compare April 9, 2020 19:12
@TomAugspurger
Member

To the extent possible, I'd recommend keeping documentation changes, additional tests, and behavior changes separate. That makes things much easier for me to review.

This PR is solely devoted to documentation improvements
and changing the default decay_rate
@stsievert
Member Author

I've tried to do that with two PRs:

  • This PR is focused around documentation improvements to docs/source/hyperparameter.rst. In addition to making these improvements, this PR also...
    • Deprecates decay_rate. This makes IncrementalSearchCV better fit in "memory but not compute constrained".
    • Implements a new class InverseDecaySearchCV. This class is a wrapper for the previous implementation of IncrementalSearchCV._adapt.
    • Makes other small documentation improvements (adding to documentation of dask-searchcv, linking in Hyperband docs, etc)
  • MAINT: Clean InverseDecaySearchCV #651: cleaning InverseDecaySearchCV. This PR implements and tests improvements to InverseDecaySearchCV._adapt.

Will that be sufficient, or should I remove the IncrementalSearchCV deprecation as well? That would mean this PR only modifies *.rst files/docstrings and #651 only modifies *.py files.

Member

@TomAugspurger TomAugspurger left a comment


I'm still not 100% sure on the reasons for deprecating decay_rate and the inclusion of InverseDecaySearchCV. Can you try re-explaining it to me?

@stsievert
Member Author

stsievert commented Apr 22, 2020

Here are some arguments for deprecating decay_rate:

  1. (documentation) IncrementalSearchCV is the only class that scales to "memory- but not compute-constrained" searches. I think these searches are important and common enough to warrant an implementation that works by default.

    I would imagine this is the most prominent usage of Dask-ML's hyperparameter optimization: "memory- but not compute-constrained" includes training linear models on large datasets.
  2. (API) decay_rate is an algorithm-specific parameter. Other adaptive algorithms won't need decay_rate. Case in point: SuccessiveHalvingSearchCV doesn't take decay_rate, and other adaptive algorithms like (the popular) median elimination won't either. This leads to a violation of the Liskov substitution principle.
  3. (weaker API) Without deprecating decay_rate, IncrementalSearchCV will have two useful implementations.

I don't know how decay_rate performs if decay_rate is not in [0, 1]. I don't even really know how decay_rate=1 performs, but I at least know the motivation and have performance results for a related implementation, SuccessiveHalvingSearchCV.

(1) is the motivation for including the deprecation in this PR.
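To make the passive-vs-adaptive distinction concrete, here is a loose paraphrase of an inverse-decay keep-schedule. The formula below is my illustration of the idea, not the exact dask-ml implementation:

```python
# Loose paraphrase (not the exact dask-ml formula): after time step t, an
# inverse-decay schedule keeps roughly
#     n_keep(t) = max(1, ceil(n0 / (t + 1) ** decay_rate))
# models. decay_rate=0 keeps everything (a passive, non-adaptive search);
# decay_rate=1 culls the population as training progresses.

import math

def n_keep(n0: int, t: int, decay_rate: float) -> int:
    return max(1, math.ceil(n0 / (t + 1) ** decay_rate))

print([n_keep(10, t, 0) for t in range(4)])   # → [10, 10, 10, 10]  passive
print([n_keep(10, t, 1) for t in range(4)])   # → [10, 5, 4, 3]     adaptive
```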

Member Author

@stsievert stsievert left a comment


I've changed some tests for various tangential reasons (faster, silencing warnings, etc). These comments address these changes, or ask for clarifying comments.

@TomAugspurger
Member

FYI, going to ignore the sklearn dev failures for now. We'll need to add n_features_in_ to each estimator, which will take a bit of work.

@stsievert
Member Author

stsievert commented May 6, 2020

It looks like all the CI failures are unrelated. There are a couple of other warnings besides the one about n_features_in_:

n_features_in_:

# data = dask.array<array, shape=(10, 4), dtype=float64, chunksize=(5, 4), chunktype=numpy.ndarray>

@pytest.mark.parametrize("data", [X, dX, df, ddf])
def test_fit(data):
    a = sklearn.impute.SimpleImputer()
    b = dask_ml.impute.SimpleImputer()
    
    a.fit(X)
    b.fit(data)
    
    assert_estimator_equal(a, b)

This raises AssertionError: {'n_features_in_'}. The accompanying FutureWarning is:

As of scikit-learn 0.23, estimators should expose a n_features_in_ attribute, unless the 'no_validation' tag is True. This attribute should be equal to the number of features passed to the fit method. An error will be raised from version 0.25 when calling check_estimator(). See SLEP010: https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep010/proposal.html
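The SLEP010 fix is mechanical: each estimator's fit records how many features it saw. A minimal sketch of the convention (ToyEstimator is illustrative, not a dask-ml class):

```python
# Minimal sketch of the SLEP010 convention the warning asks for: fit sets
# n_features_in_ to the number of columns seen. Illustrative only.

class ToyEstimator:
    def fit(self, X, y=None):
        # X is a 2-D, row-major dataset; record its column count.
        self.n_features_in_ = len(X[0])
        return self

est = ToyEstimator().fit([[0.0] * 4] * 10)   # 10 rows, 4 features
print(est.n_features_in_)                    # → 4
```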

PartialPassiveAggressiveClassifier.average_intercept_:

# file: TestPassiveAggressiveClassifier.test_basic

def test_basic(self, single_chunk_classification):
    X, y = single_chunk_classification
    a = lm.PartialPassiveAggressiveClassifier(classes=[0, 1], random_state=0, max_iter=100, tol=1e-3)
    b = lm_.PassiveAggressiveClassifier(random_state=0, max_iter=100, tol=1e-3)
    a.fit(X, y)
    b.partial_fit(*dask.compute(X, y), classes=[0, 1])
    assert_estimator_equal(a, b, exclude=["loss_function_"])

This also raises a FutureWarning:

FutureWarning: Attribute average_intercept_ was deprecated in version 0.23 and will be removed in 0.25.

Passing keyword arguments:

def test_check_cv():
    # No y, classifier=False
    cv = check_cv(3, classifier=False)

FutureWarning: Pass classifier=False as keyword args. From version 0.25 passing these as positional arguments will result in an error

This also shows up in TestPolynomialFeatures:

# ______________________ TestPolynomialFeatures.test_basic _______________________

# self = <tests.preprocessing.test_data.TestPolynomialFeatures object at 0x7ff8848d8a90>

def test_basic(self):
    a = dpp.PolynomialFeatures()

FutureWarning: Pass interaction_only=False, include_bias=True as keyword args. From version 0.25 passing these as positional arguments will result in an error

There are lots of warnings about keyword arguments and n_features_in_.
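The keyword-argument warnings all point at the same migration: parameters after the deprecation window become keyword-only, so the fix is to pass them by keyword now. A pure-Python illustration (check_cv_sketch is a hypothetical stand-in, not scikit-learn's check_cv):

```python
# Illustration of the direction scikit-learn is heading: "*" in the
# signature makes trailing parameters keyword-only, so positional calls
# fail loudly instead of warning. Hypothetical signature, not sklearn's.

def check_cv_sketch(cv=5, *, classifier=False):
    return ("stratified" if classifier else "kfold", cv)

print(check_cv_sketch(3, classifier=False))   # → ('kfold', 3)

try:
    check_cv_sketch(3, False)                 # positional: now a hard error
except TypeError as exc:
    print("TypeError:", exc)
```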

@TomAugspurger
Member

I fixed the PartialPassiveAggressiveClassifier.average_intercept_ warning in another branch.

I think this is ready to go. All good on your end?

@TomAugspurger
Member

Going to merge so we can include this in the 1.4.0 release. I can make another release anytime if needed.

@TomAugspurger TomAugspurger merged commit c535a51 into dask:master May 6, 2020
@stsievert
Member Author

I think this is ready to go. All good on your end?

Yup! I'm glad to see this merged.



Development

Successfully merging this pull request may close these issues.

IncrementalSearchCV runs an adaptive algorithm by default
