RFE/RFECV step enhancements #12578

Closed
hermidalc wants to merge 6 commits into scikit-learn:master from hermidalc:rfe_step_enhancements

Conversation

@hermidalc
Contributor

Reference Issues/PRs

#10368

What does this implement/fix? Explain your changes.

Adds two new RFE/RFECV step parameters, step_one_threshold and step_decay.

step_one_threshold (int, default=None) sets a threshold number of remaining features at which step reverts to 1, for cases where the specified step int or float equates to more than one feature per iteration.

step_decay (boolean, default=False) modifies the behavior when a float step is specified. Instead of removing a fixed number of features at each iteration, calculated as a percentage of the starting number of features, it removes a decaying number of features, calculated as a percentage of the features remaining at each iteration.
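Roughly, the combined effect of step, step_one_threshold, and step_decay on the per-iteration step size could be sketched like this (the helper compute_step is illustrative, not the PR's actual code):

```python
def compute_step(step, n_remaining, n_initial, step_one_threshold=None,
                 step_decay=False):
    """Illustrative per-iteration step-size calculation (hypothetical helper)."""
    if step_one_threshold is not None and n_remaining <= step_one_threshold:
        return 1  # at or below the threshold, eliminate one feature at a time
    if isinstance(step, float):
        # step_decay: fraction of *remaining* features; else fraction of initial
        base = n_remaining if step_decay else n_initial
        n = max(1, int(step * base))
    else:
        n = step
    if step_one_threshold is not None:
        # never step past the threshold (the behavior described in this PR)
        n = min(n, n_remaining - step_one_threshold)
    return max(1, n)

print(compute_step(100, 3000, 3000, step_one_threshold=250))  # 100
print(compute_step(100, 300, 3000, step_one_threshold=250))   # 50
print(compute_step(100, 250, 3000, step_one_threshold=250))   # 1
```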

Comments

Still need to add tests. Need feedback on the parameter names and on whether the design is consistent with sklearn principles.

(rounded down) of features to remove at each iteration.

step_one_threshold : int or None (default=None)
If specified step int or float equates to > 1 feature then the
Member

This is not very clear.

Contributor Author

@hermidalc hermidalc Nov 14, 2018

Yes, I agree, that's why I need feedback!

I've had my own customized RFE with these features for over a year, and they are features described by Guyon et al. in the original SVM-RFE paper.

I am totally open to param name changes etc. The param here means that if you've set the step param to an int or float that corresponds to stepping multiple features at a time, then you can specify a threshold number of features at which it starts stepping by one feature at a time. Maybe threshold isn't the right word, not sure.

Member

I think the effect of this is that "Feature sets sized between n_features_to_select and step_one_threshold, inclusive, will all be considered regardless of the step parameter."

Possible names:

  • try_all_below
  • step_if_at_least

threshold number of remaining features when to revert to step = 1

step_decay : boolean, optional (default=False)
If step is a float whether to calculate the percentage of features
Member

How about "If step is a float, whether to remove a step fraction of the original number of features, or a step fraction of the current number of features, in each step". Still not very clear though :-/
"Whether to remove a fraction of the original number of features or of the current number of features at each step?"

Member

If true and step is a float, the number of features removed is calculated as a fraction of the remaining features in that iteration. If false, the number of features removed is constant (a fraction of the original number of features) across iterations.
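A small numeric illustration of the difference (the helper below is hypothetical, not code from the PR): with 1000 starting features, step=0.5, and a target of 100 features, the fixed mode removes 500 features every iteration, while the decaying mode removes half of whatever remains.

```python
def schedule(n_start, frac, n_select, decay):
    """Illustrative feature-set-size schedule for a float step (hypothetical)."""
    sizes, n = [n_start], n_start
    while n > n_select:
        base = n if decay else n_start  # decay: fraction of remaining features
        n = max(n_select, n - max(1, int(frac * base)))
        sizes.append(n)
    return sizes

print(schedule(1000, 0.5, 100, decay=False))  # [1000, 500, 100]
print(schedule(1000, 0.5, 100, decay=True))   # [1000, 500, 250, 125, 100]
```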


If specified step int or float equates to > 1 feature then the
threshold number of remaining features when to revert to step = 1

step_decay : boolean, optional (default=False)
Member

I would consider calling this reducing_step


self.scores_ = []

step_one_threshold = (self.step_one_threshold if
self.step_one_threshold is not None else 0)
Member

Should we be handling a float here too?

Contributor Author

Sure, why not? We could also make it an int. That would mean changing the step rate to a different float or int at the threshold, corresponding to fewer features being removed per step. This behavior wasn't mentioned in the Guyon SVM-RFE paper, but I guess it doesn't hurt (they only mentioned removing more features at the beginning with a reducing step and then, at some number of remaining features chosen for your particular problem and dataset, starting to step by 1).

Contributor Author

@hermidalc hermidalc Nov 16, 2018

@jnothman I updated the code and doc based on the suggestions above, except I still don't understand the suggestion here for try_all_below or step_if_at_least? I think there is maybe a slight misunderstanding.

Suppose you have a dataset with 3000 features. When specifying step=100 (or an equivalent float) and step_one_threshold=250, this is what will happen:

3000, 2900, 2800, ... , 400, 300, 250, 249, 248, 247, ... , n_features_to_select

The goal is to step by the specified larger number of features (step) until you reach the specified threshold number of features (step_one_threshold), where it starts stepping 1 at a time until it gets to n_features_to_select. You can also see that, the way I wrote the code, it won't ever step over the threshold no matter how large the step size.

Member

If step_one_threshold is clearer to you, fine. What I mean by try_all_below is that from 250 down, all feature set sizes will be tried. Similarly, step_if_at_least indicates that if the feature set size is smaller than that value, step is disregarded.

Contributor Author

@hermidalc hermidalc Nov 22, 2018

I'm not super happy with step_one_threshold either, but I also feel the two suggestions could be confusing to users. try_all_below makes it seem like it's going to try but could fail, when it will always do all; also, the word "below" implies it doesn't include the number of features passed in the param, i.e. "try all below 250" doesn't include 250, which it should.

step_if_at_least can be hard to understand because I ask myself "step what, if at least?"; it's not clear from the param name that it means the step param is active at at least that many remaining features.

But I guess people should read the doc explanations; I'm just thinking we could come up with a param name that is almost self-explanatory.

I lean towards the style of try_all_below over step_if_at_least, since it's easier to understand. Maybe try_all_from, so it sounds inclusive of the param number?

while np.sum(support_) > n_features_to_select:
# Remaining features
features = np.arange(n_features)[support_]
n_remaining_features = np.sum(support_)
Member

Is this ever not deterministic? I.e. can we determine the sequence of feature set sizes in advance, or in a separate generator function (given X.shape[1], step, n_features_to_select, step_one_threshold, step_decay)?

Contributor Author

RFE is deterministic and never changes based on intermediate results of an iteration. I guess it's somewhat separate from this pull request, but a great idea.

Member

The idea is that it belongs in this PR because it isolates the added complexity of step size / feature set size calculation from the selection of features. It would mean we don't need perplexingly redundant computation like np.sum(support_).

Contributor Author

@hermidalc hermidalc Nov 15, 2018

Sorry, what I meant (and you probably saw this) is that all of the code you are talking about, the loop and the redundant computation, is how RFE was before; I didn't add this code. That's why I thought it should be in a second PR, but it's fine.

Member

Yes, but until now, the loop looked relatively straightforward :)

Contributor Author

@hermidalc hermidalc Nov 21, 2018

The idea is that it belongs in this PR because it isolates the added complexity of step size / feature set size calculation from the selection of features. It would mean we don't need perplexingly redundant computation like np.sum(support_).

The latest commit slightly improves the preexisting num-remaining-features logic. I will refactor, though, to precompute all the steps in one iterable and then just loop over that in the elimination loop.
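Such a precomputed schedule could look roughly like the following sketch (a hypothetical generator; the name feature_set_sizes and its logic are illustrative, not code from this PR):

```python
def feature_set_sizes(n_features, step, n_features_to_select,
                      step_one_threshold=None, step_decay=False):
    """Yield the sequence of feature-set sizes an RFE run would visit
    (illustrative sketch of precomputing the elimination schedule)."""
    n = n_features
    yield n
    while n > n_features_to_select:
        if step_one_threshold is not None and n <= step_one_threshold:
            remove = 1  # one at a time once at/below the threshold
        else:
            # float step: fraction of remaining (decay) or original features
            base = n if step_decay else n_features
            remove = max(1, int(step * base)) if isinstance(step, float) else step
            if step_one_threshold is not None:
                # never step over the threshold
                remove = min(remove, n - step_one_threshold)
        n = max(n_features_to_select, n - remove)
        yield n

sizes = list(feature_set_sizes(3000, 100, 10, step_one_threshold=250))
print(sizes[:5])     # [3000, 2900, 2800, 2700, 2600]
print(sizes[27:31])  # [300, 250, 249, 248]
```

The elimination loop could then simply iterate over consecutive pairs of this sequence, removing the difference at each step, isolating the step-size logic from the feature-selection logic as suggested above.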

@hermidalc
Contributor Author

@amueller @jnothman quick question: in my other pull request, for univariate_score_params, you told me it's important to maintain positional order for backwards compat, always putting a new kwarg at the end. But here I see that in sklearn 0.20.0 someone added a new min_features_to_select param that is not at the end, which doesn't maintain positional backwards compat for people who specify their params positionally without keywords.

@hermidalc
Contributor Author

The latest commit changes the step_decay param name to reducing_step, updates the doc using #12578 (comment) above, and most importantly fixes the logic in the loop, as the previous logic didn't work as intended in all cases.

@jnothman
Member

in my other pull request for the univariate_score_params you told me it's important to maintain positional order for backwards compat and always putting a new kwarg at the end. But here I see in sklearn 0.20.0 someone added a new min_features_to_select param that is not at the end and doesn't maintain positional order for backwards compat for people who just specify their params positionally without keywords.

Indeed... others might not have done that. I generally go by the assumption that after the first two or so parameters we require users to specify by kwarg. Not that we say this loudly, although it is noted in the glossary.

fixed the logic in the loop as previous logic didn't work as intended in all cases.

This is one reason to pull the logic into a separate iterable.

self.scores_.append(step_score(estimator, features))
support_[features[ranks][:threshold]] = False
ranking_[np.logical_not(support_)] += 1
n_remaining_features = np.sum(support_)
Member

Shouldn't this just be n_remaining_features - threshold?

Contributor Author

Yes, that's right, sorry for missing that.

ranks = np.ravel(ranks)

# Adjust step using special parameters if specified
if self.step_one_threshold is not None:
Member

I still would appreciate making this less nested and having a clearer interface by pulling it out into a method or generator

Contributor Author

@hermidalc hermidalc Nov 22, 2018

After thinking about it for a bit and trying to make it more concise, I don't think it's possible, unless I've missed some way to code the logic. Performing the necessary logic actually takes those if statements, nested like that. Remember that it has to handle when only reducing_step is set, only step_one_threshold, or both parameters.

Contributor Author

I will move it out, like you said, into a separate generator function. Or would you prefer a private method that generates all the remaining feature sizes in one go and returns a list, with the main loop changed to iterate over it?

Member

I'm not too fussed about this. It is cosmetic. I'm just finding it harder to read than I think it should be.

@jnothman
Member

Please add tests

@hermidalc
Contributor Author

@jnothman I'll get back to this soon; it's the last couple of weeks of the semester for me, with projects and final exams. Once the semester is over we'll finish this.

@calemagruder

@hermidalc any updates? Would love to see this functionality added :-) Thanks!

@hermidalc
Contributor Author

@calemagruder I’ll get back to it in a couple days and will finish the implementation and pull request with @jnothman

@hermidalc
Contributor Author

hermidalc commented Mar 11, 2019

Hi all - sorry for the delay... grad school is consuming my time. I'm back working on this now; give me a couple days to make another commit.

I've made the step change more general, so that at the threshold it doesn't have to revert to 1 but to a user-specified float or int value. So you can, for example, set step=0.1 and then change step to 0.05 or 5 at a specific threshold number of features.

@hermidalc
Contributor Author

hermidalc commented Mar 17, 2019

I screwed up my forked repository so now the pull request is also messed up... apologies. I'm going to close this and open a new pull request! I will refer back to this in that request.

@hermidalc hermidalc closed this Mar 17, 2019
@jnothman
Copy link
Copy Markdown
Member

jnothman commented Mar 17, 2019 via email
