Fix MinibatchKMeans minibatch_indices creation #30751
jeremiedbb merged 40 commits into scikit-learn:main
Conversation
I agree, let's not reuse the weights in the computation of the step, as they are already used to sample the minibatch. Let's remove this and simplify the code of
jeremiedbb
left a comment
This slightly changes the behavior of MiniBatchKMeans, even without using sample weights, so it needs two changelog entries: one for the bug fix and one for the change of behavior.
Some doctests fail because of the change. You just need to use the new result values that these snippets produce as expected results.
sklearn/cluster/_kmeans.py
Outdated
> # Note, I am not sure how sample weights are used here
> # So left it in, it seems like the weight sums are updated using
> # sample weights so need some help here to understand the
> # _minibatch_update_dense/sparse code
Like in KBinsDiscretizer, we should not use sample weights after using them for weighted sampling.
So let's pass an array of ones as sample weights here.
> I am not sure how sample weights are used here
> So left it in, it seems like the weight sums are updated using
> sample weights so need some help here to understand
`weight_sum` is the sum of the weights of all points belonging to each cluster. It's used to track clusters that contain very few points (more precisely, points that add up to a small weight) and reassign them to a different cluster.
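A minimal sketch of how such a per-cluster weight sum could flag low-weight clusters for reassignment. The variable name `reassignment_ratio` mirrors the estimator's parameter; the code itself is illustrative, not scikit-learn's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_clusters = 3
labels = rng.integers(0, n_clusters, size=100)  # cluster assignment of each point
sample_weight = np.ones(100)

# weight_sum[k]: accumulated weight of all points assigned to cluster k
weight_sum = np.bincount(labels, weights=sample_weight, minlength=n_clusters)

# Clusters whose accumulated weight is tiny are candidates for reassignment
reassignment_ratio = 0.01
to_reassign = weight_sum < reassignment_ratio * weight_sum.max()
```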
I don't think we have to modify _mini_batch_step. It's still useful that it handles sample weights because it's also used in partial_fit where there is no sampling so sample weights must be passed.
> I don't think we have to modify _mini_batch_step. It's still useful that it handles sample weights because it's also used in partial_fit where there is no sampling so sample weights must be passed.
That's a very good point indeed.
jeremiedbb
left a comment
I think that the convergence check is also broken (`_mini_batch_convergence`): it uses `n_samples` instead of `sample_weight.sum()`. Let's first make sure that the rest is fixed. You can disable the convergence check by setting `max_no_improvement=None` and `tol=0`.
Then, when we are confident that sample weights are correctly handled by the core of the algorithm, we'll enable the early convergence check again and fix it.
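A hedged sketch of the normalization being suggested (the function name is illustrative, not the actual `_mini_batch_convergence` code): divide the batch inertia by the batch's summed weight rather than by its point count, so batches with different total weights stay comparable:

```python
import numpy as np

def scaled_inertia(batch_inertia, batch_sample_weight):
    # Dividing by sample_weight.sum() instead of n_samples keeps the
    # convergence statistic consistent between weighted and repeated data.
    return batch_inertia / np.sum(batch_sample_weight)

# With unit weights, both normalizations agree:
w = np.ones(64)
assert scaled_inertia(128.0, w) == 128.0 / len(w)
```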
Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
Indeed, it depends both on
EDIT: I read your comment too quickly. I don't think so either: since we're doing weighted sampling, the sampled points have unit weights, so we need the same batch size to be equivalent to the repeated case.
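A toy check of that equivalence argument: weighted sampling with replacement over the original points draws from the same distribution as uniform sampling over the dataset with each row repeated `weight` times (a sketch, not the PR's code):

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.arange(5)                   # five distinct "points"
w = np.array([1, 2, 1, 3, 1])      # integer sample weights

n_draws = 200_000
# Weighted sampling (with replacement) over the original points...
batch_weighted = rng.choice(X, size=n_draws, p=w / w.sum())
# ...versus uniform sampling over the dataset with rows repeated w times
batch_repeated = rng.choice(np.repeat(X, w), size=n_draws)

freq_weighted = np.bincount(batch_weighted) / n_draws
freq_repeated = np.bincount(batch_repeated) / n_draws
# Both empirical frequencies approach w / w.sum()
```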
I tried to run our statistical testing notebook against this branch and the test passes (while it fails on
EDIT: using the following config:
But surprisingly, it fails with:
or with the default hparams (default convergence criterion and default
but it passes (barely) with:
so there might still be a problem only visible with lower values of
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
jeremiedbb
left a comment
I found another issue with the current implementation:
- We use `n_samples` to compute the number of steps to run, i.e. the number of minibatches to process.
- The `max_iter` parameter says that we run the necessary number of steps in order to loop through the whole dataset `max_iter` times.
- Using `n_samples` to compute the total number of steps leads to a smaller number of steps in the weighted case than in the repeated case.
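A toy illustration of that discrepancy, using made-up numbers and the step count computed from `n_samples`:

```python
# Made-up numbers: 1_000 points whose integer weights sum to 3_000,
# versus the same data materialized by repeating each row weight times.
max_iter, batch_size = 10, 100
n_samples_weighted = 1_000
n_samples_repeated = 3_000

# n_steps computed from n_samples, as in the current implementation
steps_weighted = (max_iter * n_samples_weighted) // batch_size
steps_repeated = (max_iter * n_samples_repeated) // batch_size
# The weighted case runs 3x fewer minibatch steps than the repeated case
```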
The suggestion below ensures that both run the same number of steps. It requires some adjustments to compute the fitted attributes at the end, `n_iter_` and `inertia_`.
One major drawback is that it breaks the equivalence with scaled sample weights, i.e. the equivalence between `fit(X, sample_weight=sw)` and `fit(X, sample_weight=2 * sw)`. I haven't been able to find a way to preserve both equivalence properties, unless changing the meaning of `max_iter` or something.
Another thing. With this modification and the following parameters: `max_no_improvement=None, tol=0, n_init=1, reassignment_ratio=0, init_size=100000` (to make sure to take all points), the statistical test passes, but with a quite small min p-value, around 0.10 to 0.30, so not that great. It may mean that we're still missing something, or it could be inherent to MiniBatchKMeans: it's not a convex problem, so small modifications in the input can easily end up in a different local minimum.
Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
Thank you @jeremiedbb for the insights
EDIT: yes, I see the tests failing. How would you suggest changing `max_iter`?
Hmmmm, interesting, it passes on my side as well. It may be what you say, although I am wondering how `random_reassign` interacts with sample weight, as we do `n_since_last_reassign += batch_size`. This should be OK since we resample with weights. I can't really think of anything else in the code that could be causing this discrepancy. EDIT: I did some tests; it is not a problem with `random_reassign`.
Oh no, sorry, I seem to have pushed and broken the linter as well.
Agh, so I added the sample weight scaling test to the estimator checks a while ago for this PR and it's failing on a lot of estimators; perhaps I should remove it for now. I think we had discussed adding the scaling relationship to the sample-weight-audit-nondet repo?
sklearn/cluster/_kmeans.py
Outdated
> # Rescaling step for sample weights otherwise does not pass test_scaled_weights
> n_steps = int((self.max_iter * n_effective_samples)) // (self._batch_size)
There's actually an easy way to make MiniBatchKMeans pass the scaled weights test: make `n_steps` independent of the weights, as it was before (since `max_iter` doesn't take weights into account). That is:

`n_steps = (self.max_iter * n_samples) // self._batch_size`

Note that this is the same as `max_iter * sum(sample_weights) / (batch_size * mean(sample_weights))`, which is a reasonable expectation. That way, the same number of batches are processed no matter the scaling of sample weights.
The counterpart is that the total weight seen during fit is scaled by the sample weight scaling, but I don't think that's an issue.
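The identity between the two expressions for `n_steps`, and its invariance under rescaling the weights, can be checked numerically (a sketch with arbitrary numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
max_iter, batch_size = 100, 256
sample_weight = rng.uniform(0.5, 2.0, size=10_000)
n_samples = sample_weight.shape[0]

lhs = max_iter * n_samples / batch_size
# sum(w) = n_samples * mean(w), so the weights cancel out...
rhs = max_iter * sample_weight.sum() / (batch_size * sample_weight.mean())
# ...and the expression is invariant under rescaling the weights
w2 = 2 * sample_weight
rhs_scaled = max_iter * w2.sum() / (batch_size * w2.mean())
```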
> The counterpart is that the total weight seen during fit is scaled by the sample weight scaling but I don't think that it's an issue.
Even that is expected to me, actually. If we scale weights by a factor of 2, the total weight of the full dataset is multiplied by 2, and so a full iteration should see twice the total weight.
So I actually think that there's no issue with defining `n_steps` independently of sample weights.
Thank you @jeremiedbb for that, glad to see that the weights scaling test can now pass with this.
I pushed a commit to implement what I tried to explain in my previous comment. I wanted to test https://github.com/snath-xoc/sample-weight-audit-nondet against this branch, but there's a bug for clusterers. We should merge snath-xoc/sample-weight-audit-nondet#36 first.
Here's the result of the sample weight test on the fixed branch (snath-xoc/sample-weight-audit-nondet#36)
I think this PR is good to go. It just needs a changelog entry.
Thank you @jeremiedbb, just merged the branch. This looks good to go now?
@ogrisel and @adrinjalali this should be good to merge now?
It just needs a changelog entry before merging.
@snath-xoc the CI is still red because of the missing changelog entry: https://github.com/scikit-learn/scikit-learn/blob/main/doc/whats_new/upcoming_changes/README.md |
Sorry, I had forgotten about that; added it in now.
sklearn/cluster/_kmeans.py
Outdated
> n_steps = (self.max_iter * n_samples) // self._batch_size
> n_effective_samples = np.sum(sample_weight)
I would rather rename this variable to `sum_of_weights`, as it's more descriptive.
`n_effective_samples` could be interpreted differently in different contexts, in particular in the presence of repeated data points.
jeremiedbb
left a comment
I just pushed a commit to clean up the formatting and remove a bit of implementation detail.
LGTM. Thanks!
Renaming all done, this should be good to merge @adrinjalali

Reference Issues/PRs
Tries (although not successfully) to fix #30750
What does this implement/fix? Explain your changes.
When creating `minibatch_indices` before the `_mini_batch_step`, we employ weighted resampling (with replacement).
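A sketch of that weighted resampling step (illustrative, not the exact PR code; the name `minibatch_indices` matches the PR):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, batch_size = 1_000, 64
sample_weight = rng.uniform(0.1, 1.0, size=n_samples)

# Draw the minibatch with probability proportional to each sample's weight,
# with replacement; downstream, the drawn points are treated as unit-weight.
minibatch_indices = rng.choice(
    n_samples,
    size=batch_size,
    replace=True,
    p=sample_weight / sample_weight.sum(),
)
```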
Any other comments?
This does not solve the issue; I am still getting histograms similar to those shown in the issue, even when using `init="random"`. I did not change the sample weight passing into the `_mini_batch_step`, so currently they are double-counted. This is probably an issue; however, I see that the sample weight is used in the `_minibatch_update_dense` function. Any further thoughts on this would help.
TO DO: