
MNT Refactor _average_weighted_percentile to avoid double sort#31775

Merged
ogrisel merged 23 commits into scikit-learn:main from lucyleeow:refactor_weighted_percentile
Sep 25, 2025

Conversation

@lucyleeow
Member

Reference Issues/PRs

Supersedes #30945

What does this implement/fix? Explain your changes.

Refactor _average_weighted_percentile so that it no longer simply calls _weighted_percentile twice, thus avoiding sorting and computing the cumulative sum twice.

#30945 essentially uses the sorted indices and calculates _weighted_percentile(-array, 100-percentile_rank). This was verbose and required computing the cumulative sum again on the negated array (symmetry could have been used to avoid recomputing the cumulative sum when the fraction above is greater than 0, i.e., g > 0 in Hyndman and Fan).

I've followed the Hyndman and Fan computation more closely: calculate g and just use j+1 (since we already know j). This did make handling the case where j+1 has a sample weight of 0 (or where there is a sample weight of 0 at the end of the array) more complex.
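
For reference, the single-pass computation described above can be sketched as follows. This is a minimal illustration assuming 1D numpy arrays and strictly positive weights; the helper name is hypothetical and this is not scikit-learn's actual implementation, which also handles zero weights, 2D input, and the array API:

```python
import numpy as np


def averaged_weighted_percentile_sketch(values, weights, percentile_rank):
    """Illustrative sketch of the averaged inverted CDF (Hyndman & Fan
    method 1 with averaging) on weighted data, using a single sort and a
    single cumulative sum. Hypothetical helper, not scikit-learn's code."""
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cdf = np.cumsum(weights)
    # Rank on the weighted CDF scale.
    adjusted_rank = percentile_rank / 100 * cdf[-1]
    # Index of the first CDF step >= adjusted_rank (this is j).
    j = np.searchsorted(cdf, adjusted_rank)
    # Landing exactly on a step means the fraction above (g) is zero:
    # average the two adjacent values, values[j] and values[j + 1].
    if cdf[j] == adjusted_rank and j + 1 < len(values):
        return (values[j] + values[j + 1]) / 2
    # Otherwise g > 0 and values[j] alone is the answer.
    return values[j]
```

With unit weights this reproduces numpy's `averaged_inverted_cdf` behavior, e.g. the median of `[1, 2, 3, 4]` comes out as 2.5.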

Any other comments?

@github-actions

github-actions bot commented Jul 17, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: f0e999e. Link to the linter CI: here

@lucyleeow
Member Author

I think this is ready for review, maybe @ogrisel @betatim ?


Member

@ogrisel ogrisel left a comment


Thanks very much @lucyleeow for diving into this. I pushed a commit to make the randomized NumPy equivalence tests stronger and checked locally that they pass with all random seeds:

SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" pytest -vl sklearn/utils/tests/test_stats.py 

It seems that this PR fixes a rare edge case bug found in another PR (in addition to the CPU speed-up and memory improvements): #29641. We could write a changelog entry to document this. However, doing so would require crafting a minimal reproducer to precisely characterize the conditions under which this edge case can be triggered. Maybe we could have a generic changelog entry such as "improve CPU and memory usage in estimators and metric functions that rely on weighted percentiles and better handle edge cases" or something similar.

Please fix the conflicts and feel free to ping me again for the final review.

@lucyleeow
Member Author

lucyleeow commented Aug 20, 2025

I looked into the failure from #29641 (comment) (I decided it was more relevant to detail it here, so it does not get lost in that much bigger PR).

tl;dr: the difference is due to floating point precision error, and the new implementation more closely matches how quantiles are implemented in numpy.

Using the test from #29641 (comment), I made a 'minimal' (enough) reproducer. Note that the percentile indices [9, 26, 36] differ between the 2 methods, but the example below only shows index 9.

import numpy as np

from sklearn.utils.stats import _averaged_weighted_percentile, _weighted_percentile


def test_weighted_percentiles():
    rng = np.random.RandomState(63)

    n_samples = 200
    X = rng.randn(n_samples, 3)
    sw = rng.randint(0, 5, size=n_samples)

    # repeat and compute using numpy non-weighted
    X_repeated = np.repeat(X, sw, axis=0)
    col_data = X_repeated[:, 0]
    sort_idx = np.argsort(col_data)
    col_data = col_data[sort_idx]
    percentiles = np.linspace(0, 100, num=49 + 1)
    percentiles = percentiles[1:-1]

    np_percentile = np.percentile(col_data, percentiles[9], method="averaged_inverted_cdf")
    print(f'{np_percentile=}')

    # Do not repeat and compute using our weighted_percentiles with weights
    col_data = X[:, 0]
    nnz_sw = sw != 0
    col_data = col_data[nnz_sw]
    sw = sw[nnz_sw]
    sort_idx = np.argsort(col_data)
    col_data = col_data[sort_idx]
    sw = sw[sort_idx]

    old = _averaged_weighted_percentile(col_data, sw, percentiles[9])
    print(f'{old=}')

    new = _weighted_percentile(col_data, sw, percentiles[9], average=True)
    print(f'{new=}')

In this case ([9]), with the 'new' method (this PR), the adjusted percentile rank (i.e. quantile * cumulative sum) ends up being exactly 80.0 (I checked that it matches to 15 decimal places). This matches a value in the weighted_cdf (note the sample weights are all integers), so the fraction above is 0, and we take the average of the 2 adjacent values.

With the old method, when we take _weighted_percentile of 100-percentile, the adjusted percentile rank ends up being 312.00000000000006. Technically it should be exactly 312 (i.e. the cumulative sum minus the adjusted percentile rank of the forward percentile, 80). This extra bit pushes us to the next index, so _weighted_percentile(percentile) and _weighted_percentile(100-percentile) end up returning the same value, whereas 100-percentile 'should' have given the adjacent value (assuming the forward adjusted rank is slightly more accurate).

The other indices, [26, 36], have a similar problem: the adjusted percentile ranks of percentile and 100-percentile don't add up exactly to the cumulative sum (weighted_cdf[-1]) due to floating point precision, and this happens to cause searchsorted to pick the adjacent index, given the particular values in the weighted_cdf.
I guess the extra calculations (100-percentile, then adjusted_percentile_rank = percentile_rank / 100 * weight_cdf[..., -1]) give more opportunity for error.
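
To illustrate the mechanism in isolation (the CDF step values below are hypothetical; only the 312 step and the perturbed rank are taken from the numbers above): landing exactly on a CDF step triggers the averaging branch, but an error of a few ulps pushes searchsorted past the step to the next index.

```python
import numpy as np

# Hypothetical weighted CDF containing the exact step value 312.0.
weight_cdf = np.array([80.0, 150.0, 312.0, 392.0])

exact = 312.0                   # forward-path adjusted rank
perturbed = 312.00000000000006  # reverse-path value reported above

# Landing exactly on the step keeps index 2 (fraction above == 0, so the
# two adjacent values would be averaged) ...
print(np.searchsorted(weight_cdf, exact))      # 2
# ... while a few ulps of error push searchsorted to the next index.
print(np.searchsorted(weight_cdf, perturbed))  # 3
```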

It's tricky to compare to numpy because they don't have a weighted implementation, and without weights you can just use n in the calculation, which is what they do (ref):

_QuantileMethods = {
    # --- HYNDMAN and FAN METHODS
    # Discrete methods
    'inverted_cdf': {
        'get_virtual_index': lambda n, quantiles: _inverted_cdf(n, quantiles),  # noqa: PLW0108
        'fix_gamma': None,  # should never be called
    },
    'averaged_inverted_cdf': {
        'get_virtual_index': lambda n, quantiles: (n * quantiles) - 1,
        'fix_gamma': lambda gamma, _: _get_gamma_mask(
            shape=gamma.shape,
            default_value=1.,
            conditioned_value=0.5,
            where=gamma == 0),
    },
    # ...
}

But ultimately I think this calculation, (n * quantiles) - 1, matches the new implementation more closely than reversing the array and calculating the 100-percentile.
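
For a concrete side-by-side on unweighted data (assuming only a numpy version recent enough to support these `method` values), the two discrete methods behave as follows:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# averaged_inverted_cdf: virtual index n * q - 1 = 4 * 0.5 - 1 = 1.0,
# so gamma == 0 and the method averages x[1] and x[2].
print(np.percentile(data, 50, method="averaged_inverted_cdf"))  # 2.5

# plain inverted_cdf returns a single order statistic instead.
print(np.percentile(data, 50, method="inverted_cdf"))  # 2.0
```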

I'll fix the merge conflict and address the review now, thanks for the review!

@lucyleeow lucyleeow force-pushed the refactor_weighted_percentile branch from 6e4a0b5 to bb9c800 Compare August 22, 2025 05:20
Member Author

@lucyleeow lucyleeow left a comment


Forgot to mention but I did an overhaul of the tests.

I removed one test because it was duplicated (checking that percentile 50 gives the same result as the median) and one test that I added a while back that did not make sense and was redundant.

Also added a general what's new entry. I think this is ready for review again @ogrisel

@ogrisel ogrisel added the CUDA CI label Sep 4, 2025
@github-actions github-actions bot removed the CUDA CI label Sep 4, 2025
@lucyleeow
Member Author

Thanks @ogrisel ! The new commit only adds comments.

@lucyleeow
Member Author

Thanks for the review again @ogrisel , it reads much better now.

Member

@virchan virchan left a comment


LGTM! Thanks @lucyleeow! Just one naive question.

sw.fill(0.0)
# XXX: is this really what we want? Shouldn't we raise instead?
# https://github.com/scikit-learn/scikit-learn/issues/31032
def test_weighted_percentile_all_zero_weights():
Member


Should we update this? It looks to me that returning NaN here was already decided in #31032.

Member Author


Thanks for the review!
I think we should do that in a separate PR: it keeps the git history cleaner and makes PRs easier to review, and I have already made extensive changes to the test file in this PR.

@lucyleeow
Member Author

@ogrisel @virchan I've added the constant multiplier test and re-ran the CUDA CI.

@ogrisel do you think we should ping someone else for review?

Member

@virchan virchan left a comment


CI is green, and the PR looks good to me. I'm also fine with having a third review before we merge.

@ogrisel
Member

ogrisel commented Sep 11, 2025

Let me trigger the CI with all random seeds on the new test and then merge if green.

@ogrisel
Member

ogrisel commented Sep 11, 2025

@lucyleeow the new test fails for some random seeds. It might have discovered a real bug?

@lucyleeow
Member Author

Let me trigger the CI with all random seeds on the new test

Good call!

The test failures are in the new constant multiplier test with a float multiplier, due to cumulative sum error, which I did suspect:

it certainly should be true for integer c but for float c, it is possible that floating point precision problems may cause inequality

We've also seen the same problem here: #30787 (comment)

I double-checked one case, [93-20-True-0.3], and can confirm.
For the raw percentile calculation, adjusted_percentile_rank (percentile_rank / 100 * weight_cdf[..., -1]) is 7.0, falling exactly on a data point.
For the 0.3-multiplied calculation, adjusted_percentile_rank is 2.0999999999999996, falling just below 2.1, which causes is_fraction_above to be True (instead of False, as in the first case).

Looking at the failures, it does pass most of the time, which is nice.

I will change the test to check 2 integer multipliers instead.
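
The underlying difference between the two kinds of multiplier can be shown with standalone floating point facts (independent of the actual test): small integer multipliers keep the scaled weights and cumulative sums exactly representable, while float arithmetic is generally inexact.

```python
import numpy as np

w = np.array([1.0, 2.0, 4.0])

# Integer multiplier: products and partial sums of these small integers
# are exactly representable, so comparisons against CDF steps survive.
print((7 * w).cumsum())  # [ 7. 21. 49.]

# Float arithmetic: the classic example of inexactness that can leave an
# adjusted rank an ulp below a CDF step it should hit exactly.
print(0.1 + 0.2 == 0.3)  # False
```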

@lucyleeow
Member Author

I've re-triggered the CI with all random seeds.

Contributor

@OmarManzoor OmarManzoor left a comment


LGTM aside from the one point I mentioned.

Contributor

@OmarManzoor OmarManzoor left a comment


Thank you @lucyleeow

I suggested some improvements in the docstrings and comments.

@lucyleeow
Member Author

@ogrisel are you happy for this to go in?

@lucyleeow
Member Author

@ogrisel gentle ping

@ogrisel ogrisel merged commit 59220e3 into scikit-learn:main Sep 25, 2025
36 checks passed
@ogrisel
Member

ogrisel commented Sep 25, 2025

Sorry for the slow feedback, thanks very much!
