
MNT Refactor _average_weighted_percentile to avoid double sort#31775

Merged
ogrisel merged 23 commits into scikit-learn:main from lucyleeow:refactor_weighted_percentile
Sep 25, 2025

Conversation

@lucyleeow
Member

Reference Issues/PRs

Supersedes #30945

What does this implement/fix? Explain your changes.

Refactor _average_weighted_percentile so that it no longer simply calls _weighted_percentile twice, thus avoiding sorting and computing the cumulative sum twice.

#30945 essentially uses the sorted indices and calculates _weighted_percentile(-array, 100-percentile_rank). This was verbose and required computing the cumulative sum again on the negated array (symmetry could have been used to avoid recomputing the cumulative sum when the fraction above is greater than 0, i.e., g > 0 in Hyndman and Fan).

I've followed the Hyndman and Fan computation more closely: calculate g and just use j+1 (since we already know j). This did make handling the case where j+1 has a sample weight of 0 (or where there is a sample weight of 0 at the end of the array) more complex.
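
For reference, the single-pass computation described above can be sketched as follows. This is a minimal illustration assuming 1D numpy arrays and strictly positive weights; the helper name is hypothetical and this is not scikit-learn's actual implementation, which also handles zero weights, 2D input, and the array API:

```python
import numpy as np


def averaged_weighted_percentile_sketch(values, weights, percentile_rank):
    """Illustrative sketch of the averaged inverted CDF (Hyndman & Fan
    method 1 with averaging) on weighted data, using a single sort and a
    single cumulative sum. Hypothetical helper, not scikit-learn's code."""
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cdf = np.cumsum(weights)
    # Rank on the weighted CDF scale.
    adjusted_rank = percentile_rank / 100 * cdf[-1]
    # Index of the first CDF step >= adjusted_rank (this is j).
    j = np.searchsorted(cdf, adjusted_rank)
    # Landing exactly on a step means the fraction above (g) is zero:
    # average the two adjacent values, values[j] and values[j + 1].
    if cdf[j] == adjusted_rank and j + 1 < len(values):
        return (values[j] + values[j + 1]) / 2
    # Otherwise g > 0 and values[j] alone is the answer.
    return values[j]
```

With unit weights this reproduces numpy's `averaged_inverted_cdf` behavior, e.g. the median of `[1, 2, 3, 4]` comes out as 2.5.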

Any other comments?

@github-actions

github-actions bot commented Jul 17, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: f0e999e. Link to the linter CI: here

@lucyleeow
Member Author

I think this is ready for review, maybe @ogrisel @betatim ?


Member

@ogrisel ogrisel left a comment


Thanks very much @lucyleeow for diving into this. I pushed a commit to make the randomized NumPy equivalence tests stronger and checked locally that they pass with all random seeds:

SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" pytest -vl sklearn/utils/tests/test_stats.py 

It seems that this PR fixes a rare edge case bug found in another PR (in addition to the CPU speed-up and memory improvements): #29641. We could write a changelog entry to document this. However, doing so would require crafting a minimal reproducer to precisely characterize the conditions under which this edge case can be triggered. Maybe we could have a generic changelog entry such as "improve CPU and memory usage in estimators and metric functions that rely on weighted percentiles and better handle edge cases" or something similar.

Please fix the conflicts and feel free to ping me again for the final review.

@lucyleeow
Member Author

lucyleeow commented Aug 20, 2025

I looked into the failure from #29641 (comment) (I decided it was more relevant to detail it here, so it does not get lost in that much bigger PR).

tl;dr: the difference is due to floating point precision error, and the new implementation more closely matches how quantiles are implemented in numpy.

Using the test from #29641 (comment), I made a 'minimal' (enough) reproducer. Note that the percentile indices [9, 26, 36] differ between the 2 methods, but the example below only shows index 9.

import numpy as np

from sklearn.utils.stats import _averaged_weighted_percentile, _weighted_percentile


def test_weighted_percentiles():
    rng = np.random.RandomState(63)

    n_samples = 200
    X = rng.randn(n_samples, 3)
    sw = rng.randint(0, 5, size=n_samples)

    # repeat and compute using numpy non-weighted
    X_repeated = np.repeat(X, sw, axis=0)
    col_data = X_repeated[:, 0]
    sort_idx = np.argsort(col_data)
    col_data = col_data[sort_idx]
    percentiles = np.linspace(0, 100, num=49 + 1)
    percentiles = percentiles[1:-1]

    np_percentile = np.percentile(col_data, percentiles[9], method="averaged_inverted_cdf")
    print(f'{np_percentile=}')

    # Do not repeat and compute using our weighted_percentiles with weights
    col_data = X[:, 0]
    nnz_sw = sw != 0
    col_data = col_data[nnz_sw]
    sw = sw[nnz_sw]
    sort_idx = np.argsort(col_data)
    col_data = col_data[sort_idx]
    sw = sw[sort_idx]

    old = _averaged_weighted_percentile(col_data, sw, percentiles[9])
    print(f'{old=}')

    new = _weighted_percentile(col_data, sw, percentiles[9], average=True)
    print(f'{new=}')

In this case ([9]), with the 'new' method (this PR), the adjusted percentile rank (i.e. quantile * cumulative sum) ends up being exactly 80.0 (I checked that it matches to 15 decimal places). This matches a value in the weighted_cdf (note the sample weights are all integers), so the fraction above is 0, and we take the average of the 2 adjacent values.

With the old method, when we take _weighted_percentile of 100-percentile, the adjusted percentile rank ends up being 312.00000000000006. Technically it should be exactly 312 (i.e. the cumulative sum minus the adjusted percentile rank of the forward percentile, 80). This extra bit pushes us to the next index, so _weighted_percentile(percentile) and _weighted_percentile(100-percentile) end up returning the same value, whereas 100-percentile 'should' have given the adjacent value (assuming the forward adjusted rank is slightly more accurate).

The other indices, [26, 36], have a similar problem: the adjusted percentile ranks of percentile and 100-percentile don't add up exactly to the cumulative sum (weighted_cdf[-1]) due to floating point precision, and this happens to cause searchsorted to pick the adjacent index, given the particular values in the weighted_cdf.
I guess the extra calculations (100-percentile, then adjusted_percentile_rank = percentile_rank / 100 * weight_cdf[..., -1]) give more opportunity for error.
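
To illustrate the mechanism in isolation (the CDF step values below are hypothetical; only the 312 step and the perturbed rank are taken from the numbers above): landing exactly on a CDF step triggers the averaging branch, but an error of a few ulps pushes searchsorted past the step to the next index.

```python
import numpy as np

# Hypothetical weighted CDF containing the exact step value 312.0.
weight_cdf = np.array([80.0, 150.0, 312.0, 392.0])

exact = 312.0                   # forward-path adjusted rank
perturbed = 312.00000000000006  # reverse-path value reported above

# Landing exactly on the step keeps index 2 (fraction above == 0, so the
# two adjacent values would be averaged) ...
print(np.searchsorted(weight_cdf, exact))      # 2
# ... while a few ulps of error push searchsorted to the next index.
print(np.searchsorted(weight_cdf, perturbed))  # 3
```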

It's tricky to compare to numpy because they don't have a weighted implementation, and without weights you can just use n in the calculation, which is what they do (ref):

_QuantileMethods = {
    # --- HYNDMAN and FAN METHODS
    # Discrete methods
    'inverted_cdf': {
        'get_virtual_index': lambda n, quantiles: _inverted_cdf(n, quantiles),  # noqa: PLW0108
        'fix_gamma': None,  # should never be called
    },
    'averaged_inverted_cdf': {
        'get_virtual_index': lambda n, quantiles: (n * quantiles) - 1,
        'fix_gamma': lambda gamma, _: _get_gamma_mask(
            shape=gamma.shape,
            default_value=1.,
            conditioned_value=0.5,
            where=gamma == 0),
    },
    # ...
}

But ultimately I think this calculation, (n * quantiles) - 1, matches the new implementation more closely than reversing the array and calculating the 100-percentile.
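
For a concrete side-by-side on unweighted data (assuming only a numpy version recent enough to support these `method` values), the two discrete methods behave as follows:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# averaged_inverted_cdf: virtual index n * q - 1 = 4 * 0.5 - 1 = 1.0,
# so gamma == 0 and the method averages x[1] and x[2].
print(np.percentile(data, 50, method="averaged_inverted_cdf"))  # 2.5

# plain inverted_cdf returns a single order statistic instead.
print(np.percentile(data, 50, method="inverted_cdf"))  # 2.0
```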

I'll fix the merge conflict and address the review now, thanks for the review!

@lucyleeow lucyleeow force-pushed the refactor_weighted_percentile branch from 6e4a0b5 to bb9c800 Compare August 22, 2025 05:20
Member Author

@lucyleeow lucyleeow left a comment


Forgot to mention but I did an overhaul of the tests.

I removed one test because it was duplicated (checking that percentile 50 gives the same result as the median) and one test that I added a while back that did not make sense and was redundant.

Also added a general what's new entry. I think this is ready for review again @ogrisel

@ogrisel ogrisel added the CUDA CI label Sep 4, 2025
@github-actions github-actions bot removed the CUDA CI label Sep 4, 2025
@lucyleeow
Member Author

Thanks @ogrisel ! The new commit only adds comments.

@lucyleeow
Member Author

Thanks for the review again @ogrisel , it reads much better now.

Member

@virchan virchan left a comment


LGTM! Thanks @lucyleeow! Just one naive question.

sw.fill(0.0)
# XXX: is this really what we want? Shouldn't we raise instead?
# https://github.com/scikit-learn/scikit-learn/issues/31032
def test_weighted_percentile_all_zero_weights():
Member


Should we update this? It looks to me that returning NaN here was already decided in #31032.

Member Author


Thanks for the review!
I think we should do that in a separate PR: it keeps the git history cleaner and makes PRs easier to review, and I have already made extensive changes to the test file in this PR.

@lucyleeow
Member Author

@ogrisel @virchan I've added the constant multiplier test and re-ran the CUDA CI.

@ogrisel do you think we should ping someone else for review?

Member

@virchan virchan left a comment


CI is green, and the PR looks good to me. I'm also fine with having a third review before we merge.

@ogrisel
Member

ogrisel commented Sep 11, 2025

Let me trigger the CI with all random seeds on the new test and then merge if green.

@ogrisel
Member

ogrisel commented Sep 11, 2025

@lucyleeow the new test fails for some random seeds. It might have discovered a real bug?

@lucyleeow
Member Author

Let me trigger the CI with all random seeds on the new test

Good call!

The test failures are in the new constant multiplier test with a float multiplier, due to cumulative sum error, which I did suspect:

it certainly should be true for integer c but for float c, it is possible that floating point precision problems may cause inequality

We've also seen the same problem here: #30787 (comment)

I double-checked one case, [93-20-True-0.3], and can confirm.
For the raw percentile calculation, adjusted_percentile_rank (percentile_rank / 100 * weight_cdf[..., -1]) is 7.0, falling exactly on a data point.
For the 0.3-multiplied calculation, adjusted_percentile_rank is 2.0999999999999996, falling just below 2.1, which causes is_fraction_above to be True (instead of False, as in the first case).

Looking at the failures, it does pass most of the time, which is nice.

I will change the test to check 2 integer multipliers instead.
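
The underlying difference between the two kinds of multiplier can be shown with standalone floating point facts (independent of the actual test): small integer multipliers keep the scaled weights and cumulative sums exactly representable, while float arithmetic is generally inexact.

```python
import numpy as np

w = np.array([1.0, 2.0, 4.0])

# Integer multiplier: products and partial sums of these small integers
# are exactly representable, so comparisons against CDF steps survive.
print((7 * w).cumsum())  # [ 7. 21. 49.]

# Float arithmetic: the classic example of inexactness that can leave an
# adjusted rank an ulp below a CDF step it should hit exactly.
print(0.1 + 0.2 == 0.3)  # False
```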

@lucyleeow
Member Author

I've re-triggered the CI with all random seeds.

Contributor

@OmarManzoor OmarManzoor left a comment


LGTM aside from the one point I mentioned.

Contributor

@OmarManzoor OmarManzoor left a comment


Thank you @lucyleeow

I suggested some improvements in the docstrings and comments.

@lucyleeow
Member Author

@ogrisel are you happy for this to go in?

@lucyleeow
Member Author

@ogrisel gentle ping

@ogrisel ogrisel merged commit 59220e3 into scikit-learn:main Sep 25, 2025
36 checks passed
@ogrisel
Member

ogrisel commented Sep 25, 2025

Sorry for the slow feedback, thanks very much!
