DOC: clarify model selection issues #432
Conversation
My criterion for changing the default: where do I think it provides the most value? In "memory but not compute constrained" problems. The reason I think it fits best here: without changing the default, there isn't a class that fits into "memory but not compute constrained". This idea matches the documentation:
TODO:
Force-pushed 23f36e8 to 91feb57
I don't think that's correct. The algorithms in dask-ml (`dask_ml/model_selection/_search.py`, line 321 in cfec774): `.fit` calls will take place on CV splits of the training dataset.
Thanks for the review @TomAugspurger! I found it useful, especially the wording choices.
Right. I've written the docs to mention that each CV split is gathered to one machine for most estimators. Can you review this? I know
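A minimal sketch (plain NumPy, not dask-ml's actual code) of why each CV split ends up on one machine: an estimator's `.fit` consumes the whole training split as a single in-memory array, so on a distributed scheduler each `.fit` call is one task and the split must fit on whichever worker runs it.

```python
# Illustrative only: manual K-fold CV where every .fit sees a full split.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((90, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.01 * rng.standard_normal(90)

def fit(X_train, y_train):
    # Ordinary least squares on the full split -- it all must be in memory.
    coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return coef

n_splits = 3
fold = len(X) // n_splits
scores = []
for k in range(n_splits):
    test = slice(k * fold, (k + 1) * fold)
    train_idx = np.r_[0:k * fold, (k + 1) * fold:len(X)]
    # On a cluster this .fit would run inside a single worker process,
    # so (X[train_idx], y[train_idx]) is gathered to that one machine.
    coef = fit(X[train_idx], y[train_idx])
    resid = y[test] - X[test] @ coef
    scores.append(float(np.mean(resid ** 2)))
```

This is why "memory constrained" here means each split, not the full dataset, has to fit on one worker.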
This PR now depends on #221 (though only implicitly, which is why the Travis CI tests are failing).
I'm drafting a paper for the SciPy 2019 proceedings. In this, I explicitly mention that
Force-pushed 3443ec1 to 1415ca5
To the extent possible, I'd recommend keeping documentation changes, additional tests, and behavior changes separate. That makes things much easier for me to review.
This PR is solely devoted to documentation improvements and changing the default `decay_rate`.
I've tried to do that with two PRs:
Will that be sufficient, or should I remove the
TomAugspurger left a comment
I'm still not 100% sure on the reasons for deprecating `decay_rate` and the inclusion of `InverseDecaySearchCV`. Can you try re-explaining it to me?
Here are some arguments for deprecating
I don't know how (1) is the motivation for including the deprecation in this PR.
stsievert left a comment
I've changed some tests for various tangential reasons (faster, silencing warnings, etc). These comments address these changes, or ask for clarifying comments.
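A small standard-library sketch of the kind of warning-silencing used in these test changes: record an expected `FutureWarning` so it doesn't pollute the test output (the function name here is illustrative, not from dask-ml).

```python
# Capture an expected FutureWarning instead of letting it reach the console.
import warnings

def deprecated_call():
    # Stand-in for a library call that emits a deprecation warning.
    warnings.warn("positional arguments are deprecated", FutureWarning)
    return 42

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is recorded
    result = deprecated_call()

assert result == 42
assert any(issubclass(w.category, FutureWarning) for w in caught)
```

In pytest, the equivalent is usually `pytest.warns(FutureWarning)` or a `filterwarnings` mark.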
FYI, going to ignore the sklearn dev failures for now. We'll need to add
It looks like all the CI failures are unrelated. There are a couple of other warnings besides the one with
```python
# data = dask.array<array, shape=(10, 4), dtype=float64, chunksize=(5, 4), chunktype=numpy.ndarray>
@pytest.mark.parametrize("data", [X, dX, df, ddf])
def test_fit(data):
    a = sklearn.impute.SimpleImputer()
    b = dask_ml.impute.SimpleImputer()
    a.fit(X)
    b.fit(data)
    assert_estimator_equal(a, b)
```

This raises an
```python
# file: TestPassiveAggressiveClassifier.test_basic
def test_basic(self, single_chunk_classification):
    X, y = single_chunk_classification
    a = lm.PartialPassiveAggressiveClassifier(classes=[0, 1], random_state=0, max_iter=100, tol=1e-3)
    b = lm_.PassiveAggressiveClassifier(random_state=0, max_iter=100, tol=1e-3)
    a.fit(X, y)
    b.partial_fit(*dask.compute(X, y), classes=[0, 1])
    assert_estimator_equal(a, b, exclude=["loss_function_"])
```

This also raises a FutureWarning:
Passing keyword arguments:

```python
def test_check_cv():
    # No y, classifier=False
    cv = check_cv(3, classifier=False)
```
This also shows up in `TestPolynomialFeatures.test_basic`:

```python
# ______________________ TestPolynomialFeatures.test_basic _______________________
# self = <tests.preprocessing.test_data.TestPolynomialFeatures object at 0x7ff8848d8a90>
def test_basic(self):
    a = dpp.PolynomialFeatures()
```
There are lots of warnings about keyword arguments and
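A minimal sketch of the keyword-argument deprecation pattern behind these FutureWarnings (the decorator and `check_cv` signature here are illustrative, not scikit-learn's actual helper):

```python
# Warn when arguments that should be keyword-only are passed positionally.
import functools
import warnings

def warn_on_positional(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if len(args) > 1:  # allow only the first argument positionally
            warnings.warn(
                "Passing these arguments positionally is deprecated; "
                "use keyword arguments instead",
                FutureWarning,
            )
        return func(*args, **kwargs)
    return wrapper

@warn_on_positional
def check_cv(cv=5, y=None, classifier=False):
    return cv

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_cv(3, None, True)       # positional -> emits a FutureWarning
    check_cv(3, classifier=True)  # keyword -> no warning
```

Tests that call the old positional form trigger one warning per call, which is why they show up all over the suite.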
I fixed the
I think this is ready to go. All good on your end?
Going to merge so we can include this in the 1.4.0 release. I can make another release anytime if needed.
Yup! I'm glad to see this merged.
What does this PR implement?
This PR makes the following changes:
- Changes the `IncrementalSearchCV` default `decay_rate` to `decay_rate=0` (but recommends `decay_rate=1` in the lead part of the docs)
- `*SearchCV`s are "estimators" and they receive "models"
- Makes the `test_same_models_with_random_state` test more stable. It passed with low probability on my machine; now it passes all the time.

The biggest change is clearly explaining the issues in model selection. It reorganizes into this:
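A hedged sketch of what the `decay_rate` change means (the exact dask-ml schedule may differ; this assumes an inverse-decay form `n_initial * (step + 1) ** -decay_rate`): with `decay_rate=1` the number of models kept alive shrinks as more `partial_fit` calls happen, while `decay_rate=0` keeps training every model with no passive elimination.

```python
# Assumed inverse-decay schedule, for illustration only.
def models_alive(n_initial, step, decay_rate):
    # Keep at least one model alive at every step.
    return max(1, int(n_initial * (step + 1) ** -decay_rate))

n_initial = 16
with_decay = [models_alive(n_initial, k, decay_rate=1.0) for k in range(5)]
no_decay = [models_alive(n_initial, k, decay_rate=0.0) for k in range(5)]

print(with_decay)  # shrinks: [16, 8, 5, 4, 3]
print(no_decay)    # constant: [16, 16, 16, 16, 16]
```

Under this reading, `decay_rate=0` spends more compute (all models keep training) but never gives up on a model early, which is why it fits the "memory but not compute constrained" class.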
With this documentation, changing the default `decay_rate` in `IncrementalSearchCV` makes sense (see #432 (comment)). If that's the default, it's the only class that fills the "memory constrained but not compute constrained" class of problems (almost; some of dask-searchcv spills in here too).

Other issues/PRs
This PR closes #388.
The documentation refactor makes it possible to add new classes, now that it's clear what the limitations of each approach are.
This fits well with the "computationally constrained but not memory constrained" class of model selection problems. I'm leaving it as a separate branch right now because I think #221 is too bloated; it needs a lot more work.