
MNT Speedup example plot_select_from_model_diabetes.py #21738

Merged
glemaitre merged 5 commits into scikit-learn:main from ArthDh:plot_select_new
Nov 25, 2021
Conversation

@ArthDh
Contributor

@ArthDh ArthDh commented Nov 21, 2021

Reference Issues/PRs

#21598

What does this implement/fix? Explain your changes.

Speeds up plot_select_from_model_diabetes.py by using a subset of features while preserving the original message of the example. The major bottleneck in this case was the backward sequential feature selection (SFS), which could be sped up by using fewer features.
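For context, the steps being timed can be reproduced with a minimal sketch along these lines (not the exact example code; the threshold choice here is an assumption made for this sketch so that two features are kept):

```python
from time import time

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel, SequentialFeatureSelector
from sklearn.linear_model import LassoCV

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
feature_names = np.array(diabetes.feature_names)

# Rank features by the absolute value of the cross-validated Lasso coefficients
lasso = LassoCV().fit(X, y)
importance = np.abs(lasso.coef_)

# Threshold just above the third-largest coefficient keeps the top two features
tic = time()
threshold = np.sort(importance)[-3] + 0.01
sfm = SelectFromModel(lasso, threshold=threshold).fit(X, y)
print(f"Features selected by SelectFromModel: {feature_names[sfm.get_support()]}")
print(f"Done in {time() - tic:.3f}s")

# Backward SFS starts from all 10 features and greedily drops them one at a
# time, refitting a cross-validated LassoCV at every step -- the bottleneck
tic = time()
sfs_backward = SequentialFeatureSelector(
    lasso, n_features_to_select=2, direction="backward"
).fit(X, y)
print(
    "Features selected by backward sequential selection: "
    f"{feature_names[sfs_backward.get_support()]}"
)
print(f"Done in {time() - tic:.3f}s")
```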

Features selected by SelectFromModel: ['s1' 's5']
Done in 0.046s
Features selected by forward sequential selection: ['bmi' 's5']
Done in 2.907s
Features selected by backward sequential selection: ['bmi' 's5']
Done in 8.544s
real 15.43
user 13.27
sys 0.96

(figure: original_diabetes)


Features selected by SelectFromModel: ['bmi' 's5']
Done in 0.033s
Features selected by forward sequential selection: ['bmi' 's5']
Done in 1.407s
Features selected by backward sequential selection: ['bmi' 's5']
Done in 1.848s
real 6.87
user 5.11
sys 0.87

(figure: augmented_diabetes)

Any other comments?

@ogrisel
Member

ogrisel commented Nov 22, 2021

From a pedagogical point of view, I find it weird to build an example about automated feature selection on a dataset that was prepared with manual feature selection.

However, since the example does not really rely on the choice of the Lasso penalty, maybe we could just use a faster base linear model, for instance RidgeCV(alphas=np.logspace(-6, 6, num=5)), on the original feature set.
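The suggested swap could look like the sketch below. Only the estimator changes; the importance-based threshold mirrors the approach in the example (the exact threshold value is an assumption of this sketch):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import RidgeCV

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
feature_names = np.array(diabetes.feature_names)

# RidgeCV over a small alpha grid uses an efficient closed-form
# leave-one-out CV by default, so it is much cheaper to fit than LassoCV
ridge = RidgeCV(alphas=np.logspace(-6, 6, num=5)).fit(X, y)
importance = np.abs(ridge.coef_)

# Threshold just above the third-largest coefficient keeps two features
threshold = np.sort(importance)[-3] + 0.01
sfm = SelectFromModel(ridge, threshold=threshold).fit(X, y)
print(f"Features selected by SelectFromModel: {feature_names[sfm.get_support()]}")
```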

@glemaitre
Member

However, since the example does not really rely on the choice of the Lasso penalty, maybe we could just use a faster base linear model, for instance RidgeCV(alphas=np.logspace(-6, 6, num=5)), on the original feature set.

I agree with @ogrisel. In the original example, the chosen penalty lets through only a single feature. So using RidgeCV will be much cheaper and keep the description more or less the same. I think that what is really expensive here is indeed the SequentialFeatureSelector. We could reduce to cv=3 and check whether that is enough (perhaps with an additional note mentioning that, in practice, one should consider increasing the number of folds).
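Combining both suggestions, a minimal sketch (the cv=3 value follows the comment above; which two features come out may depend on the folds):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import RidgeCV

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
feature_names = np.array(diabetes.feature_names)

ridge = RidgeCV(alphas=np.logspace(-6, 6, num=5))

# cv=3 keeps the example fast; in practice, consider more folds for a
# less noisy estimate of each candidate subset's score
sfs_forward = SequentialFeatureSelector(
    ridge, n_features_to_select=2, direction="forward", cv=3
).fit(X, y)
print(
    "Features selected by forward sequential selection: "
    f"{feature_names[sfs_forward.get_support()]}"
)
```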

@ArthDh
Contributor Author

ArthDh commented Nov 22, 2021

Using RidgeCV, the results on the original feature set are much faster:

Features selected by SelectFromModel: ['s1' 's5']
Done in 0.001s
Features selected by forward sequential selection: ['bmi' 's5']
Done in 0.116s
Features selected by backward sequential selection: ['bmi' 's5']
Done in 0.305s
real 6.60
user 2.20
sys 0.88

Graph generated: (figure: ridge)

Should I move forward and update the file?

@adrinjalali adrinjalali changed the title from "[MRG] Speedup examples/feature_selection/plot_select_model_diabetes.py" to "[MRG] Speedup example plot_select_from_model_diabetes.py" Nov 22, 2021
@adrinjalali adrinjalali mentioned this pull request Nov 22, 2021
@glemaitre
Member

Using RidgeCV, the results on the original feature set are much faster:

Yes, this is expected. Can you make the change?

@glemaitre glemaitre changed the title from "[MRG] Speedup example plot_select_from_model_diabetes.py" to "MNT Speedup example plot_select_from_model_diabetes.py" Nov 23, 2021
@ogrisel
Member

ogrisel commented Nov 24, 2021

The test failure is unrelated.

Member

@ogrisel ogrisel left a comment


Great, this is a very nice speed improvement with the same pedagogical message!

Just two things to fix:

@@ -46,10 +46,10 @@
# :ref:`sphx_glr_auto_examples_inspection_plot_linear_model_coefficient_interpretation.py`.
Member


The comment above needs to be updated to replace LassoCV with RidgeCV.

#
# We also note that the features selected by SFS differ from those selected by
# feature importance: SFS selects `bmi` instead of `s1`. This does sound
# feature importance: SFS selects `bmi` instead of `s1`. This does sounds
Member

@ogrisel ogrisel Nov 24, 2021


The original sentence was grammatically correct.

Suggested change
# feature importance: SFS selects `bmi` instead of `s1`. This does sounds
# feature importance: SFS selects `bmi` instead of `s1`. This does sound

Contributor Author


Oops, I missed 'does' in the sentence; I have updated the documentation!

@ArthDh ArthDh requested a review from ogrisel November 24, 2021 16:30
Member

@glemaitre glemaitre left a comment


LGTM

@glemaitre glemaitre merged commit 974cd77 into scikit-learn:main Nov 25, 2021
@glemaitre
Member

Thanks @ArthDh. Merging since the failure in the CI is unrelated to your changes.

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Nov 29, 2021
samronsin pushed a commit to samronsin/scikit-learn that referenced this pull request Nov 30, 2021
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Dec 24, 2021