MNT Refactor _average_weighted_percentile to avoid double sort #31775
ogrisel merged 23 commits into scikit-learn:main

Conversation
ogrisel
left a comment
Thanks very much @lucyleeow for diving into this. I pushed a commit to make the randomized NumPy equivalence tests stronger and checked locally that they pass with all random seeds:
```shell
SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" pytest -vl sklearn/utils/tests/test_stats.py
```
It seems that this PR fixes a rare edge case bug found in another PR (in addition to the CPU speed-up and memory improvements): #29641. We could write a changelog entry to document this. However, doing so would require crafting a minimal reproducer to precisely characterize the conditions under which this edge case can be triggered. Maybe we could have a generic changelog entry such as "improve CPU and memory usage in estimators and metric functions that rely on weighted percentiles and better handle edge cases" or something similar.
Please fix the conflicts and feel free to ping me again for the final review.
I looked into the failure from #29641 (comment) (I decided it was more relevant to detail here, and better that it does not get lost in that much bigger PR). tl;dr: the difference is due to floating point precision error, and the new implementation more closely matches how quantiles are implemented in NumPy.

Using the test from #29641 (comment), I made a 'minimal' (enough) reproducer. Note that the percentile indices [9, 26, 36] differ between the 2 methods, but the example below just shows `[9]`:

```python
import numpy as np

# Note: `average=True` is the new API added in this PR
from sklearn.utils.stats import _averaged_weighted_percentile, _weighted_percentile


def test_weighted_percentiles():
    rng = np.random.RandomState(63)
    n_samples = 200
    X = rng.randn(n_samples, 3)
    sw = rng.randint(0, 5, size=n_samples)

    # repeat and compute using numpy non-weighted
    X_repeated = np.repeat(X, sw, axis=0)
    col_data = X_repeated[:, 0]
    sort_idx = np.argsort(col_data)
    col_data = col_data[sort_idx]
    percentiles = np.linspace(0, 100, num=49 + 1)
    percentiles = percentiles[1:-1]
    np_percentile = np.percentile(
        col_data, percentiles[9], method="averaged_inverted_cdf"
    )
    print(f"{np_percentile=}")

    # Do not repeat and compute using our weighted_percentiles with weights
    col_data = X[:, 0]
    nnz_sw = sw != 0
    col_data = col_data[nnz_sw]
    sw = sw[nnz_sw]
    sort_idx = np.argsort(col_data)
    col_data = col_data[sort_idx]
    sw = sw[sort_idx]
    old = _averaged_weighted_percentile(col_data, sw, percentiles[9])
    print(f"{old=}")
    new = _weighted_percentile(col_data, sw, percentiles[9], average=True)
    print(f"{new=}")
```

In this case (`[9]`), with the 'new' method (this PR) the adjusted percentile rank (i.e. quantile * cumulative sum) ends up being 80.0 exactly (I checked and it matches to 15 decimal places). This matches a value in the weighted CDF (note sample weights are all integers), so the fraction above is 0, and we take the average of the 2 adjacent values. With the old method, which computes the symmetric percentile on the negated array, floating point error means the rank does not hit this value exactly, so no averaging is done. The other indices, [26, 36], have a similar problem where the adjusted percentile rank hits an exact match in one method but narrowly misses it in the other.

It's tricky to compare to numpy because they don't have a weighted implementation, and without weights you can just use `np.percentile`. For reference, numpy's quantile methods are defined as:

```python
_QuantileMethods = {
    # --- HYNDMAN and FAN METHODS
    # Discrete methods
    'inverted_cdf': {
        'get_virtual_index': lambda n, quantiles: _inverted_cdf(n, quantiles),  # noqa: PLW0108
        'fix_gamma': None,  # should never be called
    },
    'averaged_inverted_cdf': {
        'get_virtual_index': lambda n, quantiles: (n * quantiles) - 1,
        'fix_gamma': lambda gamma, _: _get_gamma_mask(
            shape=gamma.shape,
            default_value=1.,
            conditioned_value=0.5,
            where=gamma == 0),
    },
    # ...
}
```

But ultimately I think this `(n * quantiles) - 1` calculation is the one that matters for the comparison.

I'll fix the merge conflict and address review now, thanks for the review!
Force-pushed: 6e4a0b5 to bb9c800
Forgot to mention, but I did an overhaul of the tests.
I removed one test because it was duplicated (checking that percentile 50 gives the same result as the median) and one test that I added a while back that just did not make sense and was redundant.
Also added a general what's new entry. I think this is ready for review again @ogrisel
Thanks @ogrisel! The new commit only adds comments.
Thanks for the review again @ogrisel, it reads much better now.
virchan
left a comment
LGTM! Thanks @lucyleeow! Just one naive question.
```python
sw.fill(0.0)
# XXX: is this really what we want? Shouldn't we raise instead?
# https://github.com/scikit-learn/scikit-learn/issues/31032
def test_weighted_percentile_all_zero_weights():
```
Should we update this? It looks to me that returning NaN here was already decided in #31032.
Thanks for the review!
I think we should do that in a separate PR; it keeps the git history cleaner and makes PRs easier to review, and I have already made extensive changes to the test file in this PR.
virchan
left a comment
CI is green, and the PR looks good to me. I'm also fine with having a third review before we merge.
Let me trigger the CI with all random seeds on the new test and then merge if green.
@lucyleeow the new test fails for some random seeds. It might have discovered a real bug?
Good call! The test failures are in the new constant-multiplier test with float weights, due to cumulative sum error, which I did suspect. We've also seen the same problem here: #30787 (comment). I double checked one case. Looking at the failures, it does pass most of the time, which is nice. I will change the test to check 2 integer multipliers instead.
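For anyone wanting to see the cumulative sum error in isolation, here is a minimal standalone illustration (not the actual failing test):

```python
import numpy as np

# Ten float sample weights of 0.1 "should" sum to 1.0, but np.cumsum
# accumulates sequentially and rounds at every step.
sw = np.full(10, 0.1)
total = np.cumsum(sw)[-1]
print(total)  # 0.9999999999999999, not 1.0
assert total != 1.0
assert np.isclose(total, 1.0)

# An adjusted percentile rank derived from this total (quantile * total
# weight) can therefore narrowly miss an exact match against the weighted
# CDF, flipping which branch (take one value vs. average two adjacent
# values) is taken. Integer multipliers keep the CDF exact, which is why
# changing the test to integer multipliers makes it deterministic.
```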
I've re-triggered the CI with all random seeds.
OmarManzoor
left a comment
LGTM aside from the one point I mentioned.
OmarManzoor
left a comment
Thank you @lucyleeow
I suggested some improvements in the docstrings and comments.
@ogrisel are you happy for this to go in?
@ogrisel gentle ping
Sorry for the slow feedback, thanks very much!
Reference Issues/PRs
Supersedes #30945
What does this implement/fix? Explain your changes.
Refactor `_average_weighted_percentile` so we are not just performing `_weighted_percentile` twice, thus avoiding sorting and computing the cumulative sum twice.

#30945 essentially uses the sorted indices and calculates `_weighted_percentile(-array, 100 - percentile_rank)` - this was verbose and required computing the cumulative sum again on the negated array (you could have used symmetry to avoid computing the cumulative sum in cases when the fraction above is greater than 0 - i.e., `g > 0` from Hyndman and Fan).

I've followed the Hyndman and Fan computation more closely: calculate `g` and just use `j + 1` (since we already know `j`). This did make handling the case where `j + 1` has a sample weight of 0 (or when there is a sample weight of 0 at the end of the array) more complex.

Any other comments?