MNT speed up example plot_digits_pipe.py#21728
MNT speed up example plot_digits_pipe.py#21728adrinjalali merged 2 commits intoscikit-learn:mainfrom
Conversation
examples/compose/plot_digits_pipe.py
Outdated
| # set the tolerance to a large value to make the example faster | ||
| logistic = LogisticRegression(max_iter=10000, tol=0.1) | ||
| logistic = LogisticRegression(max_iter=1000, tol=0.2) | ||
| pipe = Pipeline(steps=[("pca", pca), ("logistic", logistic)]) |
There was a problem hiding this comment.
Actually, we make a 101 rookie mistake here. The data are not scaled. We should not modify anything apart of adding a StandardScaler as a preprocessing stage:
| pipe = Pipeline(steps=[("pca", pca), ("logistic", logistic)]) | |
| from sklearn.preprocessing import StandardScaler | |
| scaler = StandardScaler() | |
| pipe = Pipeline(steps=[("scaler", scaler), ("pca", pca), ("logistic", logistic)]) |
I have a x5 speed up just by scaling the data because the LogisticRegression is converging faster.
There was a problem hiding this comment.
I tried a few different versions, here are the timing results:
- Original:
Best parameter (CV score=0.920):
{'logistic__C': 0.046415888336127774, 'pca__n_components': 45}
real 7.33
user 58.78
sys 6.04
- Using Standard Scaler:
Best parameter (CV score=0.924):
{'logistic__C': 0.046415888336127774, 'pca__n_components': 60}
real 3.46
user 24.25
sys 3.13
- Using Subset + higher tolerance:
Best parameter (CV score=0.942):
{'logistic__C': 1.0, 'pca__n_components': 45}
real 2.74
user 15.70
sys 2.22
Do you think we should proceed with the Standard Scalar?
Also, what does MNT in the title imply(Apologies for the incorrect title, I couldn't find MNT in the contributing docs)?
There was a problem hiding this comment.
Do you think we should proceed with the Standard Scalar?
Yes, when I tried it was working quite well. We might need to check the description if the number of hyperparameter changes.
MNT -> Maintenance
There was a problem hiding this comment.
I tend to use DOC on these PRs though, but no strong feelings.
* Updated plot_digits_pipe * Updated plot_digits_pipe with StandardScaler preprocessing
* Updated plot_digits_pipe * Updated plot_digits_pipe with StandardScaler preprocessing
* Updated plot_digits_pipe * Updated plot_digits_pipe with StandardScaler preprocessing
* Updated plot_digits_pipe * Updated plot_digits_pipe with StandardScaler preprocessing

Reference Issues/PRs
#21598
What does this implement/fix? Explain your changes.
Timed using:
time -p python plot_digits_pipe.pyTime taken by original code (without the plotting part) -
Output:
After updating code [Reduced max iterations, increased tolerance, working with a subset] (without plotting) -
Output:
Any other comments?