ENH Add memory=joblib.Memory param to OPTICS#19024
Merged
rth merged 7 commits intoscikit-learn:mainfrom Aug 6, 2021
Merged
Conversation
jnothman
approved these changes
Mar 17, 2021
Member
jnothman
left a comment
There was a problem hiding this comment.
Looks good!
Please report benchmarks just to show it has the desired effect.
Contributor
Author
|
I've added a benchmark now but I'm having trouble running it. But this works: Any tips? |
Member
|
I wasn't necessarily expecting benchmarks to be committed here, but at least to be presented. Thanks. I'd need to look into the asv issues with time I don't have right now! |
Member
|
Quick benchmark adapted from the asv example, from sklearn.datasets import make_blobs
from sklearn.cluster import OPTICS
from sklearn.model_selection import GridSearchCV
X, y = make_blobs(n_samples=500)
optics = OPTICS(
metric="euclidean",
cluster_method="dbscan",
min_samples=10,
)Then, In [20]: %timeit GridSearchCV(optics, {"eps": [0.1, 0.2, 0.3, 0.4, 0.5]}, scoring=lambda x, y: 1.0).fit(X)
3.11 s ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [21]: %timeit GridSearchCV(optics, {"eps": [0.1, 0.2, 0.3, 0.4, 0.5], "memory": ['/tmp/cache2']}, scoring=lambda x, y: 1.0).fit(X)
22.1 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)So this works as expected. I removed the asv benchmark as I don't think it would provide useful information for the future. All we want to know is that caching works as expected. |
Member
|
Nice! |
samronsin
pushed a commit
to samronsin/scikit-learn
that referenced
this pull request
Nov 30, 2021
Co-authored-by: Roman Yurchak <rth.yurchak@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Reference Issues/PRs
Addresses, but doesn't not entirely fix #12044
What does this implement/fix? Explain your changes.
This adds a memory parameter to OPTICS. The justification for this is same as for hierarchical clustering. If we want to cut out dendrogram at different points, then we can reuse an intermediate result. The same is true for DBSCAN.
Practically speaking this avoids recomputation of the most expensive step of OPTICS during grid search as long as it is performed in serial.
As discussed in #12044 it would be good to eventually implement
warm_start(...)and then users could run that in parallel as an initial step, before running grid search. A further embellishment would be grid searchers that understood the expensive and non expensive step computed things in the right order.However, that issue has been open for a while now, and this at least solves the serial grid search use-case.
Any other comments?
Let me know if anything is missing.