ENH: stats: add 'alternative' parameter to ansari#13650
mdhaber merged 22 commits into scipy:master
Conversation
Ansari-Bradley test was missing the `alternative` and `mode` parameters. As I don't have access to the book used for the implementation in SciPy, I have referred to another source (`[3]_`) and tried to port the logic to SciPy. To confirm that the new implementation is correct, a large and comprehensive test suite has been added with intuitive tests and tests comparing SciPy's results with those of R. Moreover, I have attempted to write a perspicuous explanation of the new parameters. An example elucidating the `alternative` parameter has been added to the documentation suite.
Looks good. The expansion of the test suite with ground truth from R is nice. The two most important comments are:
- Is the normal approximation really any more correct than the exact distribution when there are ties? If not, I would prefer simplifying the logic (and documentation and tests) so that if the user specifies 'exact', then exact is what they get - ties or not. We'd just have to document that it is exact under the assumption that there are no ties.
- Under what conditions is `ind` not an integer and the ceil has to be taken? Can I always think of `cind` as the index of `a1` associated with the observed value of the test statistic?
> are both less than 55 and there are no ties, otherwise a normal
> approximation for the p-value is used.
> - For ``mode='exact'``, the p-value given is exact when there are
> no ties, otherwise a normal approximation for the p-value is used.
Is the normal approximation any more correct when there are ties? If not, should exact always use the exact distribution? We should just warn users that it is exact under the assumption that there are no ties.
> Is the normal approximation any more correct when there are ties?
I have not checked, but the exact distribution may take a longer time to generate when ties are present? (I saw this for the Wilcoxon signed-rank test.) Also, R always falls back to the approximation method when ties are present. Not saying that we should blindly inherit that behavior though. I am OK with both:

- always using the exact distribution (irrespective of ties) when `mode='exact'`
- just giving a warning and using the approximation like R

What do you think will be better to do?
Also, for now, I am keeping the part where I throw an error when ties are present and `mode='exact'`. We need to change that once we decide which option is better. I will also update the documentation and tests accordingly.
> I have not checked, but the exact distribution may take a longer time to generate when ties are present?
I think the post from SO means that if ties are considered in the exact calculation, calculating the exact distribution takes longer. Our implementation of the exact distribution doesn't seem to account for ties, so my understanding is that execution time would not be affected. Would you agree?
I think what we should do here depends on whether we do some sort of tie correction in the asymptotic method. For instance, our `wilcoxon` does what you have here, but I see that the asymptotic method adjusts for ties (see `n_zero`). From that perspective, I can see why, in the presence of ties, the asymptotic method might be more correct than the exact method. Does the asymptotic method adjust for ties here?
> I think the post from SO means that if ties are considered in the exact calculation, calculating the exact distribution takes longer. Our implementation of the exact distribution doesn't seem to account for ties, so my understanding is that execution time would not be affected. Would you agree?
Yes, you are right.
> I think what we should do here depends on whether we do some sort of tie correction in the asymptotic method. For instance, our `wilcoxon` does what you have here, but I see that the asymptotic method adjusts for ties (see `n_zero`). From that perspective, I can see why, in the presence of ties, the asymptotic method might be more correct than the exact method. Does the asymptotic method adjust for ties here?
Ah, I get it now. Thanks! I think the current method does adjust for ties:
```python
if repeats:  # adjust variance estimates
    # compute np.sum(tj * rj**2, axis=0)
    fac = np.sum(symrank**2, axis=0)
    if N % 2:  # N odd
        varAB = m * n * (16*N*fac - (N+1)**4) / (16.0 * N**2 * (N-1))
    else:  # N even
        varAB = m * n * (16*fac - N*(N+2)**2) / (16.0 * N * (N-1))
```

In that case, a sensible thing to do is to raise a warning when ties are present with `mode='exact'` that says something like "The exact method doesn't correct for ties. Please use `mode='auto'` or `mode='approx'` instead if you want to account for ties." and use the exact method anyway. Or we could always fall back to the approximate method (like R) after giving a warning (like "The exact method doesn't account for ties. Using the approximate method instead."). IMO, the latter option is better, but the former also doesn't sound bad. What do you think would be better to do here?
I think it's preferable to stick with the exact method when the user requests the exact method. The question I would ask is whether to emit a warning or not.
I have the same question in gh-4933. I had assumed
- "auto" would check for ties. if ties are present or both sample sizes are large, the asymptotic method would be used
- neither "exact" nor "asymyptotic" would check for anything (or emit any warnings); they would just respect the user's decision.
But maybe it's worth a message to the mailing list. Would you be willing to send one?
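The "auto" policy sketched in the bullets above could look something like this. This is a hypothetical helper, not SciPy's actual code; the size threshold of 55 is taken from the docstring quoted earlier in this thread, and the function name is invented for illustration:

```python
import numpy as np

def choose_mode(x, y, mode="auto"):
    """Sketch: pick the p-value method per the policy discussed above.

    'exact' and 'approx' respect the user's decision with no checks;
    'auto' falls back to the normal approximation for ties or large samples.
    """
    if mode != "auto":
        return mode
    pooled = np.concatenate([x, y])
    has_ties = len(np.unique(pooled)) < len(pooled)
    both_small = len(x) < 55 and len(y) < 55
    return "exact" if (both_small and not has_ties) else "approx"

print(choose_mode([1, 2, 3], [4, 5, 6]))   # exact
print(choose_mode([1, 2, 2], [2, 5, 6]))   # approx: pooled sample has ties
```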
You can use either, depending on your preference. If you don't know either tool, Pythran is probably easier for this problem. We won't want custom parallelism here, which was the reason to go with Cython for stats.qmc.discrepancy.
> You can use either, depending on your preference. If you don't know either tool, Pythran is probably easier for this problem.
Ah, I see. Thanks!
> We won't want custom parallelism here, which was the reason to go with Cython for `stats.qmc.discrepancy`.
Got it.
@tirthasheshpatel What do you think of separating the addition of the alternative parameter from the addition of mode and making a separate PR from the latter? I think it would be good to get the alternative parameter done now, while the experience of implementing it in other tests is fresh. Improvements to the exact test will be mostly unrelated, and they will take more time to implement and review, so a separate PR would be good.
(Personally, I still don't think merging this PR as-is would be inherently wrong. This function - like many others in scipy.stats - has been providing asymptotic p-values that may not be appropriate for small samples for almost 20 years. R's equivalent does it, too. Addition of a 'mode' parameter is a fine time to assess whether that's OK or not, but it's not the cause of the problem. This PR just gives users the chance to pick their poison.
As for whether the exact parameter with our current implementation - ignoring ties - should raise an exception or a warning or nothing at all, there are valid points for all. I think the most important thing would be to clearly document the behavior.
But it is fine with me if we want to wait to add mode until we add an exact method that considers ties. Having an exact option considering ties would definitely be an improvement.)
> What do you think of separating the addition of the `alternative` parameter from the addition of `mode` and making a separate PR from the latter?
Agreed.
> Improvements to the exact test will be mostly unrelated, and they will take more time to implement and review, so a separate PR would be good.
Yeah, improving the exact method seems too much for this PR. I tried to find something for handling ties in the exact method for ansari in an efficient way but couldn't find anything other than a couple of papers to which I don't have access. Hopefully, this can be taken up in a future PR!
OK, so I will make some final tweaks to this PR:

- removing the `mode` parameter
- removing the relevant tests
- fixing up the null/alternate hypothesis in the documentation
- reverting my `cind` change
- using `_normtest_finish`
- changing `assert_almost_equal` to `assert_allclose`

Anything I missed?
@mdhaber I have made all the changes. How does everything look now?
```python
if ind == cind:
    pval = 2.0 * np.sum(a1[:cind+1], axis=0) / total
else:
    cind = cind + 1
```
Under what circumstances is ind not an integer?
The answer may have implications for the logic below. Just to be sure we agree - I think we want to make sure that the p-values always include the probability mass corresponding with the observed value of the test statistic, right? (If that doesn't make sense, let me know and I'll try to rephrase.)
> Under what circumstances is `ind` not an integer?
Oh wow, I didn't think of this before. Thank you very much for noticing this! I checked and I think it is always an integer, as `AB` will always be an integer (for `mode='exact'`) and `astart` is calculated as `ASTART = FPOINT((TEST + 1) / 2) * FPOINT(1 + TEST / 2)`, where `TEST` is an integer, so all the divisions will be floor divisions. If this is the case, we can get rid of `cind` and `find` and just convert `ind` itself to an integer: `ind = int(AB - astart)`. I tested that and it works. So, I will push that for review.
> Just to be sure we agree - I think we want to make sure that the p-values always include the probability mass corresponding with the observed value of the test statistic, right? (If that doesn't make sense, let me know and I'll try to rephrase.)
It makes sense and you are right.
Oops; I didn't mean to ask for a change to that. Presumably the original authors had a reason for what they had? How confident are you? Changing existing code like that changes the scope of review a bit.
What I was going to say is that if `cind` is the index corresponding with the observed value of the test statistic, I thought we should leave it alone rather than redefine it with `cind = cind + 1`. You had something like:

```python
cind = cind + 1
...
pval = 2.0 * np.sum(a1[:cind], axis=0) / total
```

It makes more sense to me to only add 1 when appropriate to the calculation:

```python
...
pval = 2.0 * np.sum(a1[:cind+1], axis=0) / total
```

Which is what you have now (good).
> Oops; I didn't mean to ask for a change to that. Presumably the original authors had a reason for what they had? How confident are you? Changing existing code like that changes the scope of review a bit.
Oh, sorry! I am not sure why the original authors did that; there may be a subtle case I have missed. In that case, I will revert to the original behavior while adding one to `cind` only when required by the calculation. Does that sound like the correct thing to do here?
Yes, I think so. Or at least no worse than it is now. And adding +1 when needed is the right way to write it, I think, because the value in `cind` is more central to the problem than `cind + 1`, so the value in `cind` is the one that should have the variable name attached to it.
* Make the documentation of `alternative` and `mode` more concise.
* Add a test checking that the sum of the p-values equals 1 + the probability mass at the calculated statistic value.
* Remove `cind` and `find` as `ind` will ALWAYS be an integer.
* Rephrase the example to not include the 'level of significance'.
* Add a sample R code used to get the ground truth values.
* removes the `mode` parameter
* removes all the relevant tests
* reverts the `cind` change
* `ansari` now uses the `_normtest_finish` function
I have changed the inequality null hypothesis to an equality null hypothesis. I have also shifted the description of the alternate hypothesis out of the notes section. I had forgotten to remove the `mode` parameter from the notes; that has been removed now.
> fligner : A non-parametric test for the equality of k variances
> mood : A non-parametric test for the equality of two scale parameters
>
> Notes
We shouldn't remove this now.
Ah, yes! Sorry, removed it by mistake.
mdhaber left a comment
Looks pretty good. I think we need to remove the part about 'mode' from the title. When I come back to this (soon), I need to take a look at what has changed about the tests, and I need to look at the exact statistic calculation more closely.
```python
if cind != ind:
    cind = cind - 1
```
I thought by "reverting my `cind` change" you meant getting rid of this?
But don't change anything yet; I am going to take a closer look since there has been so much confusion over this part.
Sorry, @tirthasheshpatel I understand what's going on better now. But this code could have used comments when it was initially written, so I think we need to add them in this PR.
Part of the confusion is that there are two distinct reasons for the + 1s here. It may not be immediately obvious why either one is needed - and much less so when there are two separate reasons and the original and new code achieve them in ways that look different.
This part:

```python
cind = int(ceil(ind))
if cind != ind:
    cind = cind - 1
```

and this part:

```python
find = int(floor(ind))
if find != ind:
    find = find + 1
```

are just rounding to the nearest int.
Update: no. The first one seems very similar to `int(floor(ind))` and the second seems very similar to `int(ceil(ind))`.
The purpose was not immediately obvious to me, as one would expect `ind` to be an int already, since `AB` is an integer (Update: not when there are ties) and `astart` should be an int (but apparently it's returned as a float). The purpose is also obscured by the fact that it is done a different way in each branch. Why floor and add in one branch and ceil and subtract in the other? (Update: the new question is why not just floor or ceil.)
Now I know there can be subtleties converting floats to ints, e.g.

```python
In [16]: np.equal(np.rint(2**54 + 1.0), np.rint(2**54))
Out[16]: True
```

but I have a few questions:

- Floats can represent all integers within certain bounds exactly. Is there any reason to suspect that `astart` won't be an exact integer returned as a `float` - provided that it's within bounds? If it's so big that not all integers can be represented exactly anymore, we're hosed anyway - precision is lost, and careful rounding can't save us.
- Do these two three-line rounding routines do something better than `int(np.round(ind))`?
- Just a thought: we could just calculate `astart` ourselves using Python arithmetic. It looks like if `m` is the length of `x`, then `astart = floor((m+1)/2)*(1+floor(m/2))`.
The second use of +1 is used to index the right part of the discrete distribution returned by `gscale`. Assuming `ind` is the index corresponding with the probability mass of the distribution at quantile `AB`, the "naive" thing to do would be to just calculate, e.g., `cdf = a1[:ind+1].sum() / total` and `sf = a1[ind:].sum() / total`. Instead, it determines which tail of the distribution the quantile is in and chooses to calculate either the CDF or SF, then potentially takes the complement. That's OK; a comment should help explain what's going on. And if I had written it originally, I would have written a class to represent the distribution (e.g. with methods `cdf` and `sf`). Not sure if we should do it now, but I think it would make the code easier to read.
Let me take a shot at this, actually. We'll see what you think.
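For what it's worth, a minimal sketch of such a class might look like the following. The class and method names are assumed for illustration; this is not SciPy code and not the code that ended up in tirthasheshpatel#4:

```python
import numpy as np
from math import floor, ceil

class DiscreteDist:
    """Distribution with integer support start, start+1, ... and masses a1."""
    def __init__(self, a1, start):
        self.a1 = np.asarray(a1, dtype=float)
        self.total = self.a1.sum()
        self.start = start

    def cdf(self, k):
        # P(X <= k): non-integer k carries no mass, so sum through floor(k)
        ind = floor(k) - self.start
        return self.a1[:ind + 1].sum() / self.total

    def sf(self, k):
        # P(X >= k): sum from ceil(k) upward
        ind = ceil(k) - self.start
        return self.a1[ind:].sum() / self.total

# toy distribution: uniform masses on {1, 2, 3, 4}
d = DiscreteDist([1, 1, 1, 1], start=1)
print(d.cdf(2))            # 0.5
print(d.sf(2))             # 0.75
print(d.cdf(2) + d.sf(2))  # 1.25 = 1 + pmf(2): mass at 2 is in both tails
```

Note how `cdf(k) + sf(k)` at an integer quantile equals 1 plus the probability mass at that quantile, which is the property the new test in this PR checks.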
> But this code could have used comments when it was initially written, so I think we need to add them in this PR.
I agree.
> Floats can represent all integers within certain bounds exactly. Is there any reason to suspect that `astart` won't be an exact integer returned as a `float` - provided that it's within bounds? If it's so big that not all integers can be represented exactly anymore, we're hosed anyway - precision is lost, and careful rounding can't save us.
I also think there is no reason for `astart` not to be an exact integer returned as float. If I understand the FORTRAN code correctly, `FPOINT` just takes the integer and converts it to float (all the calculations are done as integers, so we don't have to worry about getting a float during the calculations). So, there is no reason for it to be a float which is not an exact integer.
> Do these two three-line rounding routines do something better than `int(np.round(ind))`?
Let me think about this.
> Just a thought: we could just calculate `astart` ourselves using Python arithmetic. It looks like if `m` is the length of `x`, then `astart = floor((m+1)/2)*(1+floor(m/2))`.
I think we will have to use floor division, `astart = floor((m+1)//2)*(1+floor(m//2))`, because that's what FORTRAN does with integers. If we do that, we already know that `astart` will be an integer and there is no need to floor it.
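A quick sketch to check this (function names invented; it assumes, per the FORTRAN snippet quoted earlier, that `TEST` is the length `m` of the first sample). With `//` the result is already integral, so the outer `floor` is indeed redundant:

```python
def astart_fortran_style(m):
    # mirrors ASTART = FPOINT((TEST + 1) / 2) * FPOINT(1 + TEST / 2)
    # with integer (floor) division, converted to float at the end
    return float((m + 1) // 2) * float(1 + m // 2)

def astart_python(m):
    # pure integer arithmetic; // already returns an int, no floor() needed
    return ((m + 1) // 2) * (1 + m // 2)

# the float version is an exact integer for every modest sample size
assert all(astart_fortran_style(m) == astart_python(m) for m in range(1, 10_000))
print(astart_python(5))  # 9
```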
> And if I had written it originally, I would have written a class to represent the distribution (e.g. with methods `cdf` and `sf`). Not sure if we should do it now, but I think it would make the code easier to read.
I agree. This would definitely make it easier to read.
> Let me take a shot at this, actually. We'll see what you think.
Sure!
I wasn't thinking clearly before. They round in different directions unless `AB` is already an integer. But I still don't understand why they're different from floor or ceil.
Now I see that you refactored some of those parts I was confused about. Can you explain the intent? How is:

```python
cind = int(ceil(ind))
if cind != ind:
    cind = cind - 1
```

different from

```python
cind = int(floor(ind))
```

?
OK, I gave this a shot in tirthasheshpatel#4. I believe the conventions for cdf and sf in case of ties (non-integer ind) is the same as the original code, and I think it makes sense.
Of course, if we're not going to allow the user to use the exact method in case of ties, this is a moot point. `ind` will always be an integer (despite being stored as a float), so it doesn't matter which way we round. But for the record:
When calculating p-values, we want to sum the probability masses from the distribution of the test-statistic under the null hypothesis for all statistic values equally extreme or more extreme than the observed value of the statistic. Because the exact distribution implemented here does not consider ties, the probability mass at non-integer values of the statistic is 0. Therefore, when the observed value of the statistic is non-integer, we sum over statistic values strictly more extreme than the observed value. When summing over the left tail, this means we sum from the left up to and including floor(ind). When summing over the right tail, this means we sum from ceil(ind) to the right. That is effectively what the original code does, so I've preserved that behavior.
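In code, the convention just described amounts to something like the following sketch (with a toy null distribution, not the actual `gscale` output; the helper name is invented):

```python
import numpy as np
from math import floor, ceil

def tail_pvals(a1, ind):
    """Left/right tail sums per the convention above: mass at non-integer
    statistic values is zero, so a non-integer ind sums strictly beyond it."""
    total = a1.sum()
    cdf = a1[:floor(ind) + 1].sum() / total   # left tail through floor(ind)
    sf = a1[ceil(ind):].sum() / total         # right tail from ceil(ind)
    return cdf, sf

a1 = np.array([1., 2., 3., 2., 1.])   # toy symmetric null distribution
cdf, sf = tail_pvals(a1, 2)           # integer ind: both tails include pmf(2)
print(round(cdf, 3), round(sf, 3))    # 0.667 0.667
print(round(cdf + sf - 1, 3))         # 0.333: the pmf at statistic index 2
cdf, sf = tail_pvals(a1, 2.5)         # non-integer ind: the tails are disjoint
print(cdf + sf)                       # 1.0
```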
If you prefer, we can also leave this as-is, now that I understand what's going on. But if you like the structure of the PR to your branch and agree that it would help with the exact test, feel free to merge it.
> OK, I gave this a shot in tirthasheshpatel#4. I believe the conventions for `cdf` and `sf` in case of ties (non-integer `ind`) are the same as the original code, and I think it makes sense.
I looked at it and it's very helpful, thanks!
> When calculating p-values, we want to sum the probability masses from the distribution of the test statistic under the null hypothesis for all statistic values equally extreme or more extreme than the observed value of the statistic. Because the exact distribution implemented here does not consider ties, the probability mass at non-integer values of the statistic is 0. Therefore, when the observed value of the statistic is non-integer, we sum over statistic values strictly more extreme than the observed value. When summing over the left tail, this means we sum from the left up to and including `floor(ind)`. When summing over the right tail, this means we sum from `ceil(ind)` to the right. That is effectively what the original code does, so I've preserved that behavior.
All of this makes sense to me.
> If you prefer, we can also leave this as-is, now that I understand what's going on. But if you like the structure of the PR to your branch and agree that it would help with the exact test, feel free to merge it.
I think that tirthasheshpatel#4 is way more readable and the comments help a lot. So, I think it would be very nice to get it merged in this branch!
Add a comment explaining the definition of the AB statistic so that the reader doesn't confuse the way the statistic is defined with the way the ratio of scales is defined.
Make the documentation of null and alternative hypothesis clearer. Co-authored-by: Matt Haberland <mhaberla@calpoly.edu>
Merged master for the documentation build failures to go away.
Nit: if possible, prefer rebase+force-push to merging master (keeps git history cleaner for future bisections).
@ev-br on PRs reviewed by me, that is not done until the end because of the broken history. This will be squash-merged anyway.
Sure!
* Use double quotes in the documentation of `_ABW`.
* Use `# random` in examples for future deprecation of the seed from them.
* Use `with...` blocks for testing error messages.
* Use single backticks for parameters.

Co-authored-by: Pamphile ROY <roy.pamphile@gmail.com>
Thanks @tupui for the review. I have made all the changes you asked for. Looks good now?
tupui left a comment
LGTM, thanks @tirthasheshpatel 😃
I will let @mdhaber merge as he was more involved here and may have a final say.
Co-authored-by: Pamphile ROY <roy.pamphile@gmail.com>
OK, now I am definitive 😅 LGTM for real.
tirthasheshpatel left a comment
remove `# random` #13650 (comment)
Co-authored-by: Matt Haberland <mhaberla@calpoly.edu>
@mdhaber Yes 🙌 I've followed from further away (last changes) but LGTM too. Thanks.
That's fine, since it doesn't affect the behavior of the code right now. I would only ask you to get into the weeds with us if we want to add the +2 positive reviews, so merging. Thanks @tirthasheshpatel, @tupui! Update: oops - refguide failure. Hmmm... this wasn't supposed to affect anything... holding off on merge.
False alarm. Changing the rounding direction didn't affect anything, as expected. For the first example, we still don't match R down to the precision to which the R result is given (13 digits), but it does not seem to be a bug in the Python code. There's not an off-by-one error or anything. No matter how I take the sums, I get a number that is different from the R result by ~4e-9, so it just seems to be a slight difference in the calculation of the null distribution. I believe it's an inherent inaccuracy because the null distribution is calculated in float32, and the largest values involved are too big to be represented exactly. Changing that is outside the scope of this PR. I found this surprising when I was investigating:

```python
np.float32(616410350000000.0)              # 616410350000000.0
np.float64(np.float32(616410350000000.0))  # 616410350878720.0
np.float64(616410350000000.0)              # 616410350000000.0
```

I would have thought that since the number can be represented exactly in float32 and float64, the conversion would be exact, but it is not. I'll have to read up on that...
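The surprise has a simple explanation: float32 has a 24-bit significand, so it can represent all integers exactly only up to 2**24, and ~6.2e14 is far beyond that bound. A quick check (sketch):

```python
import numpy as np

x = 616410350000000  # the value from the snippet above (~6.2e14)

# float32: 24-bit significand -> all integers exact only up to 2**24
print(2**24)                    # 16777216
print(int(np.float32(x)) == x)  # False: the nearest float32 differs from x
print(int(np.float64(x)) == x)  # True: float64 is exact for integers < 2**53
```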
Thanks for addressing the documentation failures @mdhaber. All green now!
Thanks again @tirthasheshpatel @tupui!
```
* master: (164 commits)
  DOC: Add Karl Pearson's reference to chi-square test (scipy#13971)
  BLD: fix build warnings for causal/anticausal pointers in ndimage
  MAINT: stats: Fix unused imports and a few other issues related to imports.
  DOC: fix typo
  MAINT: Remove duplicate calculations in sokalmichener
  BUG: spatial: fix weight handling of `distance.sokalmichener`.
  DOC: update Readme (scipy#13910)
  MAINT: QMCEngine d input validation (scipy#13940)
  MAINT: forward port 1.6.3 relnotes
  REL: add PEP 621 (project metadata in pyproject.toml) support
  EHN: signal: make `get_window` supports `general_cosine` and `general_hamming` window functions. (scipy#13934)
  ENH/DOC: pydata sphinx theme polishing (scipy#13814)
  DOC/MAINT: Add copyright notice to qmc.primes_from_2_to (scipy#13927)
  BUG: DOC: signal: fix need argument config and add missing doc link for `signal.get_window`
  DOC: fix subsets docstring (scipy#13926)
  BUG: signal: fix get_window argument handling and add tests. (scipy#13879)
  ENH: stats: add 'alternative' parameter to ansari (scipy#13650)
  BUG: Reactivate conda environment in init
  MAINT: use dict built-in rather than OrderedDict
  Revert "CI: Add nightly release of NumPy in linux workflows (scipy#13876)" (scipy#13909)
  ...
```
Reference issue
Closes gh-688
Addresses gh-12506
Supersedes gh-5012
What does this implement/fix?
Ansari-Bradley test was missing the `alternative` parameter. As I don't have access to the book used for the implementation in SciPy, I have referred to another source (`[3]_`) and tried to port the logic to SciPy. To confirm that the new implementation is correct, a large and comprehensive test suite has been added with intuitive tests as well as tests comparing SciPy's results with those of R. Moreover, I have attempted to write a perspicuous explanation of the new parameter. An example elucidating the `alternative` parameter has also been added to the documentation suite.

Additional information
A (stalled) PR, gh-5012, was proposed a long time ago, but the implementation is wrong. It also doesn't document or test the new parameters. I hope this serves as an improvement over gh-5012.