[WIP] Number of threads in KMeans should not be bigger than number of chunks by jeremiedbb · Pull Request #17210 · scikit-learn/scikit-learn

jeremiedbb · 2020-05-13T15:53:42Z

related to #17208

When the number of chunks is smaller than the number of cores (i.e. very small datasets), KMeans launches as many threads as there are cores anyway. It should use n_chunks threads instead.

adrinjalali · 2020-05-13T17:35:40Z

curious, how much of a gain is this? and should it go in 0.23.1?

glemaitre · 2020-05-14T08:05:57Z

> curious, how much of a gain is this? and should it go in 0.23.1?

It fixes a regression reported in #17208. The difference seems to be about 10x on very small datasets.

adrinjalali

Then if you've tested and this is fixing the issue, I'm happy. I don't think we can easily write a test for it.

jeremiedbb · 2020-05-14T08:49:33Z

It's an attempt to fix #17208, but it's still WIP. I can reproduce the slowdown on my laptop and this PR fixes it, but it doesn't seem to work for the person who opened the issue. I need to investigate further with him.

rth · 2020-05-14T08:52:43Z

The fix sounds like a good improvement in any case. Though he has 4 cores, so I would have imagined spawning 8 threads shouldn't be too costly performance-wise?

jeremiedbb · 2020-05-14T09:05:09Z

> The fix sounds like a good improvement in any case

I agree.

> Though he has 4 cores, so I would have imagined spawning 8 threads shouldn't be too costly performance-wise?

In the reproducible snippet, there are only 150 samples, which means there will be only one chunk. On my laptop with 4 cores, it spawns 4 threads, and that makes a huge difference. This only concerns very small datasets, for which the whole fit time is ~0.005 s, so I guess the overhead of thread creation becomes non-negligible.
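To make the arithmetic above concrete, here is a minimal Python sketch of the idea behind the fix: cap the thread count at the number of chunks. The helper names and the chunk size of 256 are assumptions for illustration (the chunk size is not stated in this thread), and the sketch assumes leftover samples are folded into the last chunk. It is not the actual Cython code touched by this PR.

```python
def n_chunks_for(n_samples, chunk_size=256):
    """Number of chunks for n_samples.

    chunk_size=256 is an assumed value, not taken from this thread.
    Assumes leftover samples are folded into the last chunk, so a
    dataset smaller than one chunk still counts as a single chunk.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return max(1, n_samples // chunk_size)


def effective_n_threads(n_samples, n_cores, chunk_size=256):
    """Hypothetical helper: never use more threads than there are chunks."""
    return min(n_cores, n_chunks_for(n_samples, chunk_size))


if __name__ == "__main__":
    n_cores = 4  # the laptop described in the comment above
    for n_samples in (150, 1000, 10_000):
        print(
            f"n_samples={n_samples}: "
            f"chunks={n_chunks_for(n_samples)}, "
            f"threads={effective_n_threads(n_samples, n_cores)}"
        )
```

Under these assumptions, the 150-sample snippet yields a single chunk, so only one thread would be used instead of four.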

rth · 2020-05-14T09:07:56Z

Let's merge this as is? If necessary, you could create a new PR with more improvements. Please add a what's new entry.

BTW, can this affect other parts of the code that do parallel chunking, where we could also apply this fix?

jeremiedbb · 2020-05-14T09:14:40Z

> BTW, can this affect other parts of the code that do parallel chunking, where we could also apply this fix?

Not that I can think of.

jeremiedbb · 2020-05-14T11:14:21Z

> Let's merge this as is?

It can't hurt, and it's still an improvement.

jeremiedbb · 2020-05-14T11:16:18Z

Some profiling showed that threadpoolctl takes 90% of the time on these very small problems. It's called at each iteration. I'll open a PR to move it outside of the loop.
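The idea of moving the call outside of the loop can be sketched as follows. This is only an illustration, not the change made in the follow-up PR: `CountingLimiter` is a hypothetical stand-in for a costly context manager such as `threadpoolctl.threadpool_limits(limits=1, user_api="blas")`, and the loop body is a placeholder for one KMeans iteration.

```python
from contextlib import contextmanager


class CountingLimiter:
    """Hypothetical stand-in for a costly context manager.

    It only counts how many times it is entered, which is the
    setup cost we want to pay once instead of once per iteration.
    """

    def __init__(self):
        self.n_entered = 0

    @contextmanager
    def limit(self):
        self.n_entered += 1  # the expensive setup would happen here
        yield


def one_iteration(state):
    # Placeholder for the work done in a single iteration.
    return state + 1


def fit_limiter_inside_loop(limiter, n_iter):
    """Enter the context manager at every iteration."""
    state = 0
    for _ in range(n_iter):
        with limiter.limit():
            state = one_iteration(state)
    return state


def fit_limiter_outside_loop(limiter, n_iter):
    """Enter the context manager once, around the whole loop."""
    state = 0
    with limiter.limit():
        for _ in range(n_iter):
            state = one_iteration(state)
    return state


if __name__ == "__main__":
    inside, outside = CountingLimiter(), CountingLimiter()
    fit_limiter_inside_loop(inside, 10)
    fit_limiter_outside_loop(outside, 10)
    print("entered inside loop:", inside.n_entered)
    print("entered outside loop:", outside.n_entered)
```

Both variants compute the same result. The second pays the setup cost once per fit, which matters when the whole fit takes only a few milliseconds.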

adrinjalali · 2020-05-15T07:31:30Z

I guess @rth is also happy to have this merged. Merging; hopefully the other PR reducing the threadpoolctl overhead will get in quickly too.

num threads not bigger than num chunks

c72e4fb

github-actions bot added the module:cluster label May 13, 2020

jnothman added this to the 0.23.1 milestone May 14, 2020

adrinjalali approved these changes May 14, 2020

jeremiedbb added 2 commits May 14, 2020 11:15

Merge remote-tracking branch 'upstream/master' into fix-kmeans-thread…

7bcd819

…s-small-data

what's new

787d615

jeremiedbb mentioned this pull request May 14, 2020

KMeans singnificantly slower on 0.23 #17208

Closed

adrinjalali merged commit 90d00da into scikit-learn:master May 15, 2020

gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020

FIX Number of threads in KMeans should not be bigger than number of c…

cb42aec

…hunks (scikit-learn#17210) * num threads not bigger than num chunks * what's new

jeremiedbb mentioned this pull request May 15, 2020

ENH Move threadpoolctl outside of iteration loop in KMeans #17235

Merged

adrinjalali pushed a commit to adrinjalali/scikit-learn that referenced this pull request May 18, 2020

FIX Number of threads in KMeans should not be bigger than number of c…

745d741

…hunks (scikit-learn#17210) * num threads not bigger than num chunks * what's new

adrinjalali pushed a commit that referenced this pull request May 19, 2020

FIX Number of threads in KMeans should not be bigger than number of c…

15716da

…hunks (#17210) * num threads not bigger than num chunks * what's new

viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request Jun 26, 2020

FIX Number of threads in KMeans should not be bigger than number of c…

3410848

…hunks (scikit-learn#17210) * num threads not bigger than num chunks * what's new

adrinjalali merged 3 commits into scikit-learn:master from jeremiedbb:fix-kmeans-threads-small-data

5 participants
