[MRG + 2] fix selectFdr bug #7490
Conversation
|
I've inserted "Fixes #7474" in the issue description above. |
|
Tests aren't passing. |
|
Can anyone comment on why this build failed? I am not familiar with this.
| sv = np.sort(self.pvalues_) | ||
| selected = sv[sv <= float(self.alpha) / n_features | ||
| * np.arange(n_features)] | ||
| * (np.arange(n_features) + 1)] |
Looks like a pep8 problem. Line break should be after *.
selected = sv[sv <= float(self.alpha) / n_features *
              (np.arange(n_features) + 1)]
fwiw, i think pep8 has changed their stance on this, see [1,2]. It does seem to be what's causing the travis failure though (and we should keep consistency with the rest of the codebase).
[1] https://www.python.org/dev/peps/pep-0008/#should-a-line-break-before-or-after-a-binary-operator
[2] http://bugs.python.org/issue26763
I'd be happy to disable W503. I tend to agree with the more recent convention in terms of readability.
Strange that https://www.python.org/dev/peps/pep-0008 doesn't mention when the document was updated.
@jnothman , Guido van Rossum commented in the bug tracker on 4/15 that he updated the PEP documentation. Interestingly, I thought that W503 is ignored by default in flake8.
conda flake8 might be outdated.
| false_discovery_rate = np.mean([single_fdr(alpha, n_informative, | ||
| random_state) for | ||
| random_state in range(30)]) | ||
| random_state in range(100)]) |
FDR = E(FP / (TP + FP)) <= alpha
The FDR is estimated by averaging FP / (TP + FP) over N repetitions, so the larger N is, the more accurate the estimate. Here, the larger N, the better.
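The Monte Carlo estimate being discussed can be sketched in plain NumPy. This is a hedged illustration, not the actual test code: the helper names and the uniform/transformed p-value model below are assumptions for demonstration.

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha):
    # Step-up BH: compare the k-th smallest p-value to alpha * k / n,
    # then select every p-value up to the largest one that passes.
    sv = np.sort(pvalues)
    n = len(pvalues)
    selected = sv[sv <= alpha / n * np.arange(1, n + 1)]
    if selected.size == 0:
        return np.zeros(n, dtype=bool)
    return pvalues <= selected.max()

def single_fdr(alpha, n_null, n_signal, rng):
    # Null p-values are uniform; "signal" p-values are pushed toward 0.
    pvals = np.concatenate([rng.uniform(size=n_null),
                            rng.uniform(size=n_signal) ** 8])
    support = benjamini_hochberg(pvals, alpha)
    fp = support[:n_null].sum()
    tp = support[n_null:].sum()
    return fp / max(fp + tp, 1)

rng = np.random.RandomState(0)
# Averaging over more repetitions (e.g. 100 instead of 30) reduces the
# Monte Carlo variance of the FDR estimate, at the cost of a slower test.
fdr = np.mean([single_fdr(0.1, n_null=15, n_signal=5, rng=rng)
               for _ in range(100)])
```

Under independence, BH controls the expected FDR at alpha times the null fraction, so the averaged estimate should land comfortably below alpha here.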
Did it not pass without that change?
| selected = sv[sv <= float(self.alpha) / n_features | ||
| * np.arange(n_features)] | ||
| selected = sv[sv <= float(self.alpha) / n_features * | ||
| (np.arange(n_features) + 1)] |
should be arange(1, n_features+1)
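For illustration, a minimal sketch of the corrected threshold computation; `alpha` and `sv` below are hypothetical values, not data from the PR.

```python
import numpy as np

# sv plays the role of the sorted self.pvalues_ in SelectFdr.
alpha = 0.05
sv = np.array([0.001, 0.02, 0.04, 0.5])
n_features = len(sv)

# BH compares the k-th smallest p-value to alpha * k / n for k = 1..n.
# arange(1, n_features + 1) yields k = 1..n directly; the buggy
# arange(n_features) started at k = 0, so its first threshold was 0.
selected = sv[sv <= float(alpha) / n_features *
              np.arange(1, n_features + 1)]
# thresholds are [0.0125, 0.025, 0.0375, 0.05], so 0.001 and 0.02 pass
```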
|
In any case, I don't see how this test ensures that your change works. Could you please add a test that clearly would fail at master? |
|
Such a test could be that, provided with only one p-value, the function returns the appropriate answer (i.e., in this case, if the p-value is small enough but not 0 it should be selected; in master it would only be selected if the p-value were equal to 0, but now it is selected if the p-value is smaller than alpha). |
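The one-p-value argument can be checked directly; `alpha = 0.05` and `p = 0.01` are hypothetical values chosen for illustration.

```python
import numpy as np

alpha = 0.05
sv = np.array([0.01])   # one small but non-zero p-value
n_features = 1

# master: threshold alpha / 1 * arange(1) == [0.], so nothing is selected
old = sv[sv <= float(alpha) / n_features * np.arange(n_features)]
# fixed: threshold alpha / 1 * arange(1, 2) == [0.05], so it is selected
new = sv[sv <= float(alpha) / n_features * np.arange(1, n_features + 1)]
```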
|
thanks @NelleV, I agree with you. This test is very simple and direct; is it necessary? |
|
yes it is necessary. |
|
hi @jnothman , sorry for the late reply because of a national holiday. I propose the following method to show our change works:

    def test_select_fdr_regression2():
        X, y = make_regression(n_samples=150, n_features=20,
                               n_informative=5, shuffle=False,
                               random_state=0, noise=10)
        with warnings.catch_warnings(record=True):
            # Warnings can be raised when no features are selected
            # (low alpha or very noisy data)
            univariate_filter = SelectFdr(f_regression, alpha=0.01)
            univariate_filter.fit(X, y)
            support = univariate_filter.get_support()
            num_false_positives = np.sum(support[n_informative:] == 1)
            num_true_positives = np.sum(support[:n_informative] == 1)
            assert_equal(num_false_positives, 5)
            assert_equal(num_true_positives, 7)

What do you think about it? |
|
@mpjlu please format your code using backticks (I did it for you this time). |
|
I don't understand the test. The FDR in this case is 5/12 which is about 0.5, while the test claims to control it at .01? Or am I misunderstanding something here? |
|
Hi @amueller , the code in the last comment is just a code template, not the real data. Sorry for the misunderstanding. |
|
ah... |
|
Hi @jnothman , do you think I should add a test case (by manually constructing the sample data) that clearly would fail at master? Is there any other work I need to do for this PR? |
|
You should construct a test case that ensures that the implementation does what it should. Ideally that would make the off-by-one distinction violated presently. |
|
hi @jnothman , I added a test case for this PR. With the test's sample data, the master code cannot select exactly one feature; it will select either two features or none. With this PR, 0, 1, or 2 features can be selected depending on alpha. |
| # by PR #7490, the result is array([True, False]) | ||
| X = np.array([[10, 20], [20, 20], [20, 30]]) | ||
| y = np.array([[1], [0], [0]]) | ||
| univariate_filter = SelectFdr(chi2, alpha=0.1) |
can you please assert that the p-values are the ones you specify above?
Yes, the p-values in comments are real p-values.
Do you mean I should assert the p-values in the test code?
|
sorry, what does "add an entry to what's new" mean? |
|
Hi, since it is a bug fix, I suppose you can add an entry in what's new under the bug fixes section. Hope it helps. |
|
thanks @maniteja123 |
doc/whats_new.rst
Outdated
| Bug fixes | ||
| ......... | ||
|
|
||
| - :class:`sklearn.feature_selection.SelectFdr` now correctly works |
A more descriptive message to tell people what was broken and is now fixed would be helpful. Often we use a message like "Fixed a bug where :class:feature_selection.SelectFdr did not ..."
how about "Fixed a bug where :class:`feature_selection.SelectFdr` did not accurately implement the Benjamini-Hochberg procedure"?
doc/whats_new.rst
Outdated
| ......... | ||
|
|
||
| - Fix a bug where :class:`sklearn.feature_selection.SelectFdr` did not | ||
| exactly implement Benjamini-Hochberg procedure |
That's better, but indicating that it formerly may have selected fewer features than it should (and now does) would be more helpful for former users of this estimator.
|
It seems there is a build environment problem. Any comments on this? Thanks. |
|
Hi, yeah you are right. It looks like that was some appveyor issue. I suppose it is fine since all the other builds succeed. |
|
thanks @maniteja123 |
amueller
left a comment
Looks good to me apart from adding the link.
| exactly implement Benjamini-Hochberg procedure. It formerly may have | ||
| selected fewer features than it should. | ||
| (`#7490 <https://github.com/scikit-learn/scikit-learn/pull/7490>`_) by | ||
| `Peng Meng`_. |
You need to add yourself at the bottom of the page if you want your name to be a link.
Reference Issue
Fixes #7474
What does this implement/fix? Explain your changes.
Description
selectFdr in scikit-learn/sklearn/feature_selection/univariate_selection.py:
Should Be:
Because np.arange starts from 0, here it should start from 1.
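A small sketch of the off-by-one effect on the Benjamini-Hochberg thresholds; `alpha` and `n_features` below are illustrative values.

```python
import numpy as np

alpha, n_features = 0.05, 4

# before the fix: arange starts at 0, so the smallest p-value was
# compared against a threshold of 0 and could only pass if it was 0
old_thresholds = float(alpha) / n_features * np.arange(n_features)
# after the fix: k runs from 1 to n, the Benjamini-Hochberg thresholds
new_thresholds = float(alpha) / n_features * np.arange(1, n_features + 1)
```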
Any other comments?