ENH Add sample_weight support to NearestCentroid #33477
Open
prem-479 wants to merge 5 commits into scikit-learn:main from
Reference Issues/PRs
Fixes #33457
What does this implement/fix? Explain your changes.
Proposed Changes
This PR introduces `sample_weight` support to the `NearestCentroid` classifier, allowing for weighted means (Euclidean) and weighted medians (Manhattan), with strict parity between dense and sparse inputs.
What was built:
- `NearestCentroid.fit()`: Now accepts `sample_weight` and applies it correctly through all four code paths: weighted Euclidean mean (dense and sparse) and weighted median (dense Manhattan via `_weighted_percentile` and sparse Manhattan via a new utility).
- m-parameter.
- `csc_weighted_median_axis_0` and `_get_weighted_median` to `sklearn/utils/sparsefuncs.py` to handle implicit zeros correctly without spurious zero insertion.
Key guarantees and edge cases handled:
- `sum(weights) > n_classes`.
- `ValueError` with a clear message to prevent downstream NaNs.
- `sum_w_zeros` to prevent tiny negative values caused by floating-point inaccuracies.
- `np.ptp` with `X.max(axis=0) - X.min(axis=0)`.
Files Changed:
- `sklearn/neighbors/_nearest_centroid.py`
- `sklearn/utils/sparsefuncs.py`
- `sklearn/neighbors/tests/test_nearest_centroid.py`
- `sklearn/utils/tests/test_sparsefuncs.py`
Test Coverage:
Added `test_csc_weighted_median_axis_0` and `test_nearest_centroid_sample_weight` to verify identical behavior between dense/sparse inputs, correct centroid shifting, and proper error handling. All 24 tests pass cleanly.
AI usage disclosure
I used AI assistance for:
Any other comments?
All 24 tests pass locally, including parity checks between sparse and dense formats and against zero-weight edge cases. Looking forward to any feedback!
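For readers who want the weighting semantics in concrete form, here is a minimal NumPy sketch of the idea. It is not the code in this PR. The names `weighted_median`, `sparse_column_weighted_median`, and `weighted_centroids` are hypothetical, the median is assumed to be the lower weighted median (the smallest value whose cumulative weight reaches half of the total), and the zero-weight check is only an assumed form of the `ValueError` guard mentioned above.

```python
import numpy as np


def weighted_median(values, weights):
    """Lower weighted median (assumed definition): the smallest value whose
    cumulative weight reaches half of the total weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values, kind="stable")
    cum = np.cumsum(weights[order])
    # First sorted position whose cumulative weight is >= half the total.
    idx = np.searchsorted(cum, 0.5 * cum[-1], side="left")
    return values[order][idx]


def sparse_column_weighted_median(nz_values, nz_weights, total_weight):
    """Weighted median of a column given only its nonzero entries.

    The implicit zeros are never materialized; they are represented by a
    single 0.0 entry that carries all of the remaining weight.
    """
    nz_values = np.asarray(nz_values, dtype=float)
    nz_weights = np.asarray(nz_weights, dtype=float)
    # Clip at zero: rounding can make this difference slightly negative.
    zero_weight = max(total_weight - nz_weights.sum(), 0.0)
    values = np.append(nz_values, 0.0)
    weights = np.append(nz_weights, zero_weight)
    return weighted_median(values, weights)


def weighted_centroids(X, y, sample_weight, metric="euclidean"):
    """Per-class weighted mean (euclidean) or weighted median (manhattan)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    w = np.asarray(sample_weight, dtype=float)
    classes = np.unique(y)
    centroids = np.empty((classes.size, X.shape[1]))
    for k, label in enumerate(classes):
        mask = y == label
        # Assumed guard: a class with no weight would yield a NaN centroid.
        if w[mask].sum() <= 0:
            raise ValueError(f"class {label!r} has zero total sample weight")
        if metric == "manhattan":
            centroids[k] = [
                weighted_median(X[mask, j], w[mask]) for j in range(X.shape[1])
            ]
        else:
            centroids[k] = np.average(X[mask], axis=0, weights=w[mask])
    return classes, centroids
```

Folding the implicit zeros into one weighted entry gives the same result as the dense computation, because tied values contribute only through their combined cumulative weight.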