Fix MinibatchKMeans minibatch_indices creation #30751
jeremiedbb merged 40 commits into scikit-learn:main
Conversation
I agree, let's not reuse the weights in the computation of the step, as they are already used to sample the minibatch. Let's remove this and simplify the code of
jeremiedbb
left a comment
This slightly changes the behavior of MiniBatchKMeans, even without using sample weights, so it needs two changelog entries: one for the bug fix and one for the change of behavior.
Some doctests fail because of the change. You just need to use the new result values that these snippets produce as expected results.
sklearn/cluster/_kmeans.py
Outdated
> # Note, I am not sure how sample weights are used here
> # So left it in, it seems like the weight sums are updated using
> # sample weights so need some help here to understand the
> # _minibatch_update_dense/sparse code
Like in KBinsDiscretizer, we should not use sample weights after using them for weighted sampling.
So let's pass an array of ones as sample weights here.
> I am not sure how sample weights are used here
> So left it in, it seems like the weight sums are updated using
> sample weights so need some help here to understand
`weight_sum` is the sum of the weights of all points belonging to each cluster. It's used to track clusters that contain very few points (more precisely, points that add up to a small weight) and reassign them to a different cluster.
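A minimal sketch of how such a per-cluster weight sum could flag low-weight clusters for reassignment. The variable name `reassignment_ratio` mirrors the estimator's parameter; the code itself is illustrative, not scikit-learn's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_clusters = 3
labels = rng.integers(0, n_clusters, size=100)  # cluster assignment of each point
sample_weight = np.ones(100)

# weight_sum[k]: accumulated weight of all points assigned to cluster k
weight_sum = np.bincount(labels, weights=sample_weight, minlength=n_clusters)

# Clusters whose accumulated weight is tiny are candidates for reassignment
reassignment_ratio = 0.01
to_reassign = weight_sum < reassignment_ratio * weight_sum.max()
```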
I don't think we have to modify _mini_batch_step. It's still useful that it handles sample weights because it's also used in partial_fit where there is no sampling so sample weights must be passed.
> I don't think we have to modify _mini_batch_step. It's still useful that it handles sample weights because it's also used in partial_fit where there is no sampling so sample weights must be passed.
That's a very good point indeed.
jeremiedbb
left a comment
I think that the convergence check is also broken (`_mini_batch_convergence`): it uses `n_samples` instead of `sample_weight.sum()`. Let's first make sure that the rest is fixed. You can disable the convergence check by setting `max_no_improvement=None` and `tol=0`.
Then, when we are confident that sample weights are correctly handled by the core of the algorithm, we'll enable the early convergence check again and fix it.
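A hedged sketch of the normalization being suggested (the function name is illustrative, not the actual `_mini_batch_convergence` code): divide the batch inertia by the batch's summed weight rather than by its point count, so batches with different total weights stay comparable:

```python
import numpy as np

def scaled_inertia(batch_inertia, batch_sample_weight):
    # Dividing by sample_weight.sum() instead of n_samples keeps the
    # convergence statistic consistent between weighted and repeated data.
    return batch_inertia / np.sum(batch_sample_weight)

# With unit weights, both normalizations agree:
w = np.ones(64)
assert scaled_inertia(128.0, w) == 128.0 / len(w)
```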
Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
Indeed, it depends both on
EDIT: I read your comment too quickly. I don't think so either: since we're doing weighted sampling, the sampled points have unit weights, so we need the same batch size to be equivalent to the repeated case.
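A toy check of that equivalence argument: weighted sampling with replacement over the original points draws from the same distribution as uniform sampling over the dataset with each row repeated `weight` times (a sketch, not the PR's code):

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.arange(5)                   # five distinct "points"
w = np.array([1, 2, 1, 3, 1])      # integer sample weights

n_draws = 200_000
# Weighted sampling (with replacement) over the original points...
batch_weighted = rng.choice(X, size=n_draws, p=w / w.sum())
# ...versus uniform sampling over the dataset with rows repeated w times
batch_repeated = rng.choice(np.repeat(X, w), size=n_draws)

freq_weighted = np.bincount(batch_weighted) / n_draws
freq_repeated = np.bincount(batch_repeated) / n_draws
# Both empirical frequencies approach w / w.sum()
```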
I tried to run our statistical testing notebook against this branch and the test passes (while it fails on
EDIT: using the following config:
But surprisingly, it fails with:
or with the default hparams (default convergence criterion and default
but it passes (barely) with:
so there might still be a problem only visible with lower values of
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
jeremiedbb
left a comment
I found another issue with the current implementation:
- We use `n_samples` to compute the number of steps to run, i.e. the number of minibatches to process.
- The `max_iter` parameter says that we run the necessary number of steps in order to loop through the whole dataset `max_iter` times.
- Using `n_samples` to compute the total number of steps leads to a smaller number of steps in the weighted case than in the repeated case.
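A toy illustration of that discrepancy, using made-up numbers and the step count computed from `n_samples`:

```python
# Made-up numbers: 1_000 points whose integer weights sum to 3_000,
# versus the same data materialized by repeating each row weight times.
max_iter, batch_size = 10, 100
n_samples_weighted = 1_000
n_samples_repeated = 3_000

# n_steps computed from n_samples, as in the current implementation
steps_weighted = (max_iter * n_samples_weighted) // batch_size
steps_repeated = (max_iter * n_samples_repeated) // batch_size
# The weighted case runs 3x fewer minibatch steps than the repeated case
```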
The suggestion below ensures that both run the same number of steps. It requires some adjustments to compute the fitted attributes at the end, `n_iter_` and `inertia_`.
One major drawback is that it breaks the equivalence with scaled sample weights, i.e. the equivalence between `fit(X, sample_weight=sw)` and `fit(X, sample_weight=2 * sw)`. I haven't been able to find a way to preserve both equivalence properties, unless changing the meaning of `max_iter` or something.
Another thing. With this modification and the following parameters: `max_no_improvement=None, tol=0, n_init=1, reassignment_ratio=0, init_size=100000` (to make sure to take all points), the statistical test passes, but with a quite small min p-value, around 0.10 to 0.30, so not that great. It may mean that we're still missing something, or it could be inherent to MiniBatchKMeans: it's not a convex problem, so small modifications in the input can easily end up in a different local minimum.
Co-authored-by: Jérémie du Boisberranger <jeremie@probabl.ai>
Thank you @jeremiedbb for the insights
EDIT: yes, I see the tests failing. How would you suggest changing `max_iter`?
Hmmmm, interesting, it passes on my side as well. It may be what you say, although I am wondering how `random_reassign` interacts with sample weight, as we do `n_since_last_reassign += batch_size`. This should be OK since we resample with weights. I can't really think of anything else in the code that could be causing this discrepancy. EDIT: I did some tests; it is not a problem with `random_reassign`.
Oh no, sorry, I seem to have pushed and broken the linter as well.
Agh, so I added the sample weight scaling test to the estimator checks a while ago for this PR and it's failing on a lot of estimators; perhaps I should remove it for now. I think we had discussed adding the scaling relationship to the sample-weight-audit-nondet repo?
sklearn/cluster/_kmeans.py
Outdated
> # Rescaling step for sample weights otherwise does not pass test_scaled_weights
> n_steps = int((self.max_iter * n_effective_samples)) // (self._batch_size)
There's actually an easy way to make MiniBatchKMeans pass the scaled weights test: make `n_steps` independent of the weights, as it was before (since `max_iter` doesn't take weights into account). That is:

`n_steps = (self.max_iter * n_samples) // self._batch_size`

Note that this is the same as `max_iter * sum(sample_weights) / (batch_size * mean(sample_weights))`, which is a reasonable expectation. That way, the same number of batches are processed no matter the scaling of sample weights.
The counterpart is that the total weight seen during fit is scaled by the sample weight scaling, but I don't think that's an issue.
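The identity between the two expressions for `n_steps`, and its invariance under rescaling the weights, can be checked numerically (a sketch with arbitrary numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
max_iter, batch_size = 100, 256
sample_weight = rng.uniform(0.5, 2.0, size=10_000)
n_samples = sample_weight.shape[0]

lhs = max_iter * n_samples / batch_size
# sum(w) = n_samples * mean(w), so the weights cancel out...
rhs = max_iter * sample_weight.sum() / (batch_size * sample_weight.mean())
# ...and the expression is invariant under rescaling the weights
w2 = 2 * sample_weight
rhs_scaled = max_iter * w2.sum() / (batch_size * w2.mean())
```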
> The counterpart is that the total weight seen during fit is scaled by the sample weight scaling but I don't think that it's an issue.
Even that is expected to me, actually. If we scale weights by a factor of 2, the total weight of the full dataset is multiplied by 2, and so a full iteration should see twice the total weight.
So I actually think that there's no issue with defining `n_steps` independently of sample weights.
Thank you @jeremiedbb for that, glad to see that the weights scaling test can now pass with this.
I pushed a commit to implement what I tried to explain in my previous comment. I wanted to test https://github.com/snath-xoc/sample-weight-audit-nondet against this branch, but there's a bug for clusterers. We should merge snath-xoc/sample-weight-audit-nondet#36 first.
Here's the result of the sample weight test on the fixed branch (snath-xoc/sample-weight-audit-nondet#36)
I think this PR is good to go. It just needs a changelog entry.
Thank you @jeremiedbb, just merged the branch. This looks good to go now?
@ogrisel and @adrinjalali this should be good to merge now?
It just needs a changelog entry before merging.
@snath-xoc the CI is still red because of the missing changelog entry: https://github.com/scikit-learn/scikit-learn/blob/main/doc/whats_new/upcoming_changes/README.md |
Sorry, I had forgotten about that; added it in now.
sklearn/cluster/_kmeans.py
Outdated
> n_steps = (self.max_iter * n_samples) // self._batch_size
> n_effective_samples = np.sum(sample_weight)
I would rather rename this variable to `sum_of_weights`, as it's more descriptive.
`n_effective_samples` could be interpreted differently in different contexts, in particular in the presence of repeated data points.
jeremiedbb
left a comment
I just pushed a commit to clean up the formatting and remove a bit of implementation detail.
LGTM. Thanks!
Renaming all done, this should be good to merge @adrinjalali

Reference Issues/PRs
Tries (although not successfully) to fix #30750
What does this implement/fix? Explain your changes.
When creating `minibatch_indices` before the `_mini_batch_step`, we employ weighted resampling (with replacement).
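A sketch of that weighted resampling step (illustrative, not the exact PR code; the name `minibatch_indices` matches the PR):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, batch_size = 1_000, 64
sample_weight = rng.uniform(0.1, 1.0, size=n_samples)

# Draw the minibatch with probability proportional to each sample's weight,
# with replacement; downstream, the drawn points are treated as unit-weight.
minibatch_indices = rng.choice(
    n_samples,
    size=batch_size,
    replace=True,
    p=sample_weight / sample_weight.sum(),
)
```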
Any other comments?
This does not solve the issue; I am still getting histograms similar to those shown in the issue, even when using `init="random"`. I did not change the sample weight passing into the `_mini_batch_step`, so currently they are double-counted. This is probably an issue; however, I see that the sample weight is used in the `_minibatch_update_dense` function. Any further thoughts on this would help.
TO DO: