[WIP] Enforce positiveness in tests using enforce_estimator_tags_X #14705
wdevazelhes wants to merge 22 commits into scikit-learn:master from
Conversation
|
I'll also include a test for |
|
I think I'll need to change the message in the Chi2Samplers from "Entries of X must be non-negative." to "Negative values in data passed to", to pass |
|
I'm having trouble figuring out what to do to fix the tests: the problem comes from |
|
I guess a solution is to not flag them as |
|
Ah my bad, |
|
Ah my bad again, turns out |
|
I'm back to the following reformulated question then: #14705 (comment)
|
|
In the end, since the Chi squared kernel works for positive points (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.chi2_kernel.html#sklearn.metrics.pairwise.chi2_kernel), I guess it makes sense to forbid passing negative data to |
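For reference, a minimal numpy sketch of the exponential chi-squared kernel described in the linked docs (not the sklearn implementation itself); it shows why the kernel is only well-defined on non-negative data, since x + y appears in the denominator:

```python
import numpy as np

def chi2_kernel_value(x, y, gamma=1.0):
    # Exponential chi-squared kernel: k(x, y) = exp(-gamma * sum((x - y)^2 / (x + y))).
    # Only well-defined for non-negative inputs: with negative entries,
    # x + y can be zero or negative and the summand is no longer meaningful.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = x + y
    safe = denom > 0                 # convention: terms with x + y == 0 contribute 0
    terms = np.zeros_like(denom)
    terms[safe] = (x[safe] - y[safe]) ** 2 / denom[safe]
    return np.exp(-gamma * terms.sum())

print(chi2_kernel_value([1.0, 0.0, 2.0], [1.0, 0.0, 2.0]))  # identical inputs give 1.0
```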
|
I just added a modification to ensure positivity at fit time too |
|
Actually I saw that |
|
However tests now still fail for |
|
Hm I don't think it's worth creating a tag for that. And it's a kind of weird requirement. Not sure if anyone actually uses this and whether we should just deprecate it... |
|
Actually it's usually applied to positive data so maybe just add the tag asking for positivity, that should be fine. |
Alright, done (see commit 851887d). Otherwise, I was also thinking, since we're talking about |
assert np.isfinite(kernel_approx).all(), \
    'NaNs found in the approximate Gram matrix'
|
|
# test error is raised when input contains values smaller than -c
|
We should be able to remove this since we now test for positivity and it's already covered by check_fit_non_negative? Although, now that I think about it, check_fit_non_negative would only test the fit part, whereas the code below tests the transform part, so I'll leave it here but adapt it to check just positivity (and not > -c).
But then that reminds me: should we also have a check_transform_non_negative in the common tests?
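For the sake of discussion, a hypothetical check_transform_non_negative could mirror check_fit_non_negative but target transform. A rough sketch under that assumption (the helper name, signature, and the toy transformer are all invented for illustration, not existing sklearn API):

```python
import numpy as np

def check_transform_non_negative(name, transformer, X_pos, X_neg):
    # Hypothetical common test: fit on valid (positive) data, then
    # require that transform rejects negative data with a ValueError.
    transformer.fit(X_pos)
    try:
        transformer.transform(X_neg)
    except ValueError:
        return  # expected: transform refused negative input
    raise AssertionError(
        f"{name}.transform did not raise ValueError on negative X")

class DemoTransformer:
    # Toy transformer standing in for an estimator with a
    # requires_positive_X-style constraint at transform time.
    def fit(self, X):
        return self
    def transform(self, X):
        X = np.asarray(X)
        if (X < 0).any():
            raise ValueError("Negative values in data passed to DemoTransformer")
        return np.sqrt(X)

check_transform_non_negative("DemoTransformer", DemoTransformer(),
                             X_pos=[[1.0, 2.0]], X_neg=[[-1.0, 2.0]])
print("check passed")
```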
|
It's not supposed to raise an error on negative values, only on values smaller than -c.
raise ValueError("X may not contain entries smaller than"
                 " -skewedness.")
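The -skewedness bound comes from the feature map itself: the skewed chi-squared sampler starts from log(X + skewedness) before the random projection, so any entry at or below -skewedness falls outside the kernel's domain. A hedged sketch of just that first step (not the actual sklearn implementation):

```python
import numpy as np

def skewed_chi2_log_step(X, skewedness=1.0):
    # The feature map begins with log(X + skewedness); entries at or
    # below -skewedness would hit the log of a non-positive number.
    X = np.asarray(X, dtype=float)
    if (X <= -skewedness).any():
        raise ValueError("X may not contain entries smaller than -skewedness.")
    return np.log(X + skewedness)

c = 1.0
X_ok = np.array([[-c / 2.0, 0.0, 3.0]])   # entries strictly above -c: fine
print(skewed_chi2_log_step(X_ok, skewedness=c))
```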
|
|
check_non_negative(X, "SkewedChi2Sampler (input X)")
|
Why did you replace the test? Now it will fail with things that previously worked.
|
Sorry, I thought you meant to actually change the behavior of the estimator (and indeed make it more restrictive) here:
Actually it's usually applied to positive data so maybe just add the tag asking for positivity, that should be fine.
Because, AFAIU, I can't just add a requires_positive_X tag but still let the estimator accept negative values (> -c): check_fit_non_negative, a common test that checks that any estimator with the requires_positive_X tag raises an error when given negative X, would then fail. Strictly speaking it fails because of the error message, but even if I change the message to something like "Negative values in data passed to" (which I could tweak into "Negative values in data passed to SkewedChi2Sampler after adding the constant c", which would be more exact), the test could still fail in principle: it could happen to sample only negative values that are > -c. (It doesn't in practice, because the random values have high enough variance, but logically it could.)
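The potential flakiness described above can be made concrete: a check that feeds random negative values and expects a ValueError only works if at least one value falls at or below -c. A minimal sketch (the function names are invented stand-ins, not the actual common-test code):

```python
import numpy as np

C = 1.0  # stands in for the skewedness constant

def fit_like_skewed_chi2(X):
    # Accepts any X with entries strictly above -C, mirroring
    # SkewedChi2Sampler's actual domain.
    if (np.asarray(X) <= -C).any():
        raise ValueError("X may not contain entries smaller than -skewedness.")

def check_fit_raises_on_negative(fit, X_neg):
    # Sketch of a check_fit_non_negative-style test: expects fit to raise.
    try:
        fit(X_neg)
    except ValueError:
        return True   # the expected error was raised
    return False      # no error: the common test would fail here

# Negative values that stay above -C slip through...
print(check_fit_raises_on_negative(fit_like_skewed_chi2, [[-C / 2.0]]))  # False
# ...while values at or below -C trigger the error.
print(check_fit_raises_on_negative(fit_like_skewed_chi2, [[-2.0 * C]]))  # True
```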
|
I agree with @amueller, I would not change the behavior of the estimator just to make the common tests pass. It just means that requires_positive_X is not suitable here.
|
I agree, so should I create a new tag here? Something like requires_X_higher_than_constant, which I could enforce in enforce_estimator_tags_X? Because we need a way for tests that use negative values by default to shift them above -c for SkewedChi2Sampler.
|
Or maybe an X_types kind of tag ?
Y[0, 0] = -c / 2.

# approximate kernel mapping
transform = SkewedChi2Sampler(skewedness=c, n_components=1000,
|
It's a bad sign if you have to change any tests, because that probably means that you changed behavior and are breaking someone's code.
|
I agree, I forgot about those comments... I just put the PR into WIP until there's a better solution (see comment above)
sklearn/utils/estimator_checks.py
if _safe_tags(estimator, "requires_positive_X"):
    # Create strictly positive X. The minimal increment above 0 is 1, as
    # X could be of integer dtype.
    X += 1 + abs(X.min())
|
Why not X -= X.min() - 1e-5 or something like that? I don't see why you would use abs: if your data had a minimum of 2, after this it has a minimum of 5.
|
I agree, X -= X.min() - 1e-5 seems better; I had just taken this line from enforce_estimator_tags_y above.
Will do
|
Ahh, the comment was about dtype int? Hm, not sure how far that applies to X here. But if tests are passing then that's fine I guess ;) Is this ready? If so, please rename from WIP to MRG.
|
Ah yes, I didn't see that comment. I guess it means that if X is an integer array at some point, we want to keep it integer? I don't know to what extent that applies here, but maybe we can keep the int dtype just in case. What about X -= X.min() - 1? That would still be OK with respect to your comment https://github.com/scikit-learn/scikit-learn/pull/14705/files/c6751b454bdf84699954e3f6bfa7000db2d42d48#r316775147, but it keeps the integer dtype.
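The difference between the two shifts, and the dtype point, can be checked directly (a quick sketch of the behavior being discussed, not the test-suite code):

```python
import numpy as np

X = np.array([2, 3, 7])              # integer dtype, minimum already positive

# Current line: X += 1 + abs(X.min()) shifts even already-positive data
# (a minimum of 2 becomes 5, as noted above).
shifted_abs = X + 1 + abs(X.min())

# Proposed X -= X.min() - 1: pins the minimum at exactly 1 and, since the
# increment is the integer 1, keeps the integer dtype intact.
shifted_min = X - (X.min() - 1)

print(shifted_abs.min(), shifted_min.min(), shifted_min.dtype.kind)  # 5 1 i

# With negative data, both variants still yield strictly positive X.
Xn = np.array([-4, 0, 2])
print((Xn - (Xn.min() - 1)).min())   # minimum becomes 1
```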
# Estimators with a `requires_positive_X` tag only accept strictly positive
# inputs
if _safe_tags(estimator, "requires_positive_X"):
    # Create strictly positive X. The minimal increment above 0 is 1, as
|
The comment and the code don't really agree, right?
|
That's right, I forgot to update the comment :p (see #14705 (comment))
|
I think if tests pass we can merge |
|
@wdevazelhes it does not pass (yet) |
X, y = check_X_y(X, y, dtype='int')
if np.any(X < 0):
    raise ValueError("X must not contain negative values.")
check_non_negative(X, "CategoricalNB (input X)")
|
By the way, check_non_negative finds the min and checks whether it's negative; wouldn't it be faster to use np.any? (In theory it would just need to iterate until finding one negative element, but I don't know if that's what np.any actually does.)
|
That's right, with timeit np.min() is faster. I forgot that with np.any we still need to compute the X < 0 mask, which is itself a full pass over the data, so it won't be faster. I didn't find any numpy function that avoids this full pass, so I guess np.min is good.
In [1]: import numpy as np
In [2]: a = np.random.randn(10000, 10000)
In [3]: %timeit np.any(a)
124 ms ± 808 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit np.min(a)
56.5 ms ± 69.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
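For finite data the two checks give the same answer, so the choice really is only about cost; a small sketch of the equivalence (the timings above used a 10000x10000 array, this uses a small one):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 100))

# np.any(X < 0) must first materialize the full boolean mask X < 0,
# so a complete pass over the data happens before any() even starts.
via_any = bool(np.any(X < 0))

# X.min() < 0 is a single reduction pass with no temporary mask.
via_min = bool(X.min() < 0)

print(via_any, via_min)   # the two checks always agree
```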
Tests are green now (they were failing in |
|
Any update on #14705 (comment) ? |
|
@wdevazelhes you need to revert the API change in SkewedChi2Sampler. Also see how to fix the merge conflicts |
|
This is now made irrelevant by #16286 |
See #14680 (comment)