
Conversation

@glevv (Contributor) commented Mar 3, 2023

Fixes #25527

@glevv (Contributor, Author) commented Mar 3, 2023

For reasons as yet unknown, BisectingKMeans does not pass the tests.
Expected:

Iteration 0, inertia 144.0.
Iteration 1, inertia 78.0.
Converged at iteration 1: strict convergence.
New centroids from bisection: [[ 3. 1.] [-6. -2.]]
Iteration 0, inertia 28.0.
Iteration 1, inertia 16.0.
Converged at iteration 1: strict convergence.
New centroids from bisection: [[ 3. -2.] [ 3. 4.]]

expected_centers = [[10, 2], [10, 8], [1, 2]]

Got:

Iteration 0, inertia 96.0.
Iteration 1, inertia 78.0.
Converged at iteration 1: strict convergence.
New centroids from bisection: [[-6. -2.] [ 3. 1.]]
Iteration 0, inertia 28.0.
Iteration 1, inertia 22.0.
Converged at iteration 1: strict convergence.
New centroids from bisection: [[ 3. 3.] [ 3. -3.]]

bisect_means.cluster_centers_ = [[ 1., 2.], [10., 7.], [10., 1.]]

And an assertion error as the result. This happens without any weighting of the input at all.
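For context, the enhancement under test is that `init="random"` should draw the initial centers with probability proportional to `sample_weight` rather than uniformly. A minimal sketch of that idea, with illustrative names (not scikit-learn's internal API):

```python
import numpy as np

def weighted_random_init(X, n_clusters, sample_weight, rng):
    # Illustrative sketch, not the actual scikit-learn implementation:
    # draw initial centers with probability proportional to sample_weight
    # instead of uniformly at random.
    p = np.asarray(sample_weight, dtype=float)
    p = p / p.sum()  # normalize weights into a probability distribution
    idx = rng.choice(X.shape[0], size=n_clusters, replace=False, p=p)
    return X[idx]

rng = np.random.default_rng(0)
X = np.array([[1.0, 2.0], [10.0, 2.0], [10.0, 8.0], [1.0, 2.5]])
w = np.array([1.0, 5.0, 5.0, 1.0])  # heavier points are more likely picks
centers = weighted_random_init(X, 2, w, rng)
```

Note that even with uniform weights the sequence of RNG draws can differ from the old code path, which is presumably why the test above became sensitive to the change.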

@jeremiedbb (Member)
The test was not robust and was very sensitive to the RNG. Since your changes impact the initialization (even without sample weights), this test started to fail.

I pushed a more robust version of the test that is not subject to the RNG: the clusters will always be the same regardless of the initialization (up to a permutation).
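One common way to make such a check robust up to a permutation (a sketch of the general technique; the actual test pushed to scikit-learn may differ) is to sort both sets of centers lexicographically before comparing:

```python
import numpy as np

def assert_centers_equal(actual, expected):
    # Compare two sets of cluster centers regardless of their ordering:
    # sort the rows of each array lexicographically, then compare.
    a = np.asarray(actual, dtype=float)
    e = np.asarray(expected, dtype=float)
    # lexsort uses the LAST key as primary, so reverse the columns to sort
    # rows by first column, then second, and so on.
    a_sorted = a[np.lexsort(a.T[::-1])]
    e_sorted = e[np.lexsort(e.T[::-1])]
    np.testing.assert_allclose(a_sorted, e_sorted)

# The same centers in a different order compare equal.
assert_centers_equal([[10, 8], [1, 2], [10, 2]],
                     [[10, 2], [1, 2], [10, 8]])
```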

@glevv glevv marked this pull request as ready for review March 6, 2023 14:15
@glevv glevv changed the title [WIP] KMeans random initialization account for sample weights [ENH] KMeans random initialization account for sample weights Mar 6, 2023
@glemaitre glemaitre self-requested a review March 15, 2023 16:39
@glemaitre (Member) left a comment

A few documentation nitpicks. Otherwise, this is starting to look good.

@jeremiedbb (Member) left a comment

Some more nitpicks

@jeremiedbb (Member) left a comment

cln

@jeremiedbb (Member) left a comment

I directly pushed a last round of nitpicks.

LGTM. Thanks @glevv !

@glemaitre glemaitre merged commit 69c8489 into scikit-learn:main Mar 17, 2023
@glemaitre (Member)

LGTM. Thanks @glevv

@glemaitre (Member)

In a subsequent PR, we could add the possibility to pass sample_weight to a callable. We should be careful to stay backwards compatible, so we might need to inspect the callable's signature before trying to pass the weights.

@glevv If you feel like you want to have a look, feel free to propose a PR.
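The backwards-compatibility concern above could be handled by inspecting the callable's signature, along these lines (hypothetical helper and callable names, not an actual scikit-learn API):

```python
import inspect
import numpy as np

def call_init(init, X, n_clusters, random_state, sample_weight):
    # Hypothetical helper: pass sample_weight only if the user-provided
    # init callable accepts it, so legacy callables keep working unchanged.
    if "sample_weight" in inspect.signature(init).parameters:
        return init(X, n_clusters, random_state, sample_weight=sample_weight)
    return init(X, n_clusters, random_state)

def legacy_init(X, n_clusters, random_state):
    # Old-style callable: unaware of sample weights.
    return X[:n_clusters]

def weighted_init(X, n_clusters, random_state, sample_weight=None):
    # New-style callable: picks the most heavily weighted points.
    idx = np.argsort(sample_weight)[::-1][:n_clusters]
    return X[idx]

X = np.arange(12.0).reshape(4, 3)
w = np.array([1.0, 4.0, 2.0, 3.0])
c_old = call_init(legacy_init, X, 2, 0, w)    # weights silently ignored
c_new = call_init(weighted_init, X, 2, 0, w)  # weights forwarded
```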


Successfully merging this pull request may close these issues.

KMeans initialization does not use sample weights
