[MRG] Block-wise silhouette calculation to avoid memory consumption #7177

jnothman wants to merge 15 commits into scikit-learn:master
Conversation
Benchmarks with:

```python
import numpy as np
from sklearn.metrics.cluster import silhouette_score

n_features = 100
for n_samples in [10000, 1000, 100]:
    for n_labels in [4, 6, 100]:
        print('n_features', n_features, 'n_samples', n_samples,
              'n_labels', n_labels)
        X = np.random.rand(n_samples, n_features)
        y = np.random.randint(n_labels, size=n_samples)
        %timeit silhouette_score(X, y)
```
```
block_size : int, optional
    The number of rows to process at a time to limit memory usage to
    O(block_size * n_samples). Default is n_samples.
```
Instead of allowing the user to specify size as a unit of memory usage, I am wondering if ...
I've now tried to just have the user specify the pairwise distance memory consumption in bytes...

If it needs to be in memory units, could we have it in MB, as users will need to enter a smaller number?

It doesn't need to be, but I think it is a more practical, data-invariant measure. Yes, we can have it in MiB.
@raghavrv now in MiB

@jnothman I'm a little late, but +1 to having the memory usage. I think it's an easier measure to specify. And also +1 for MiB.

You're earlier than everyone but @raghavrv. Thanks for the input. That's where it's converged anyway.
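The conversion the thread converged on (a MiB budget rather than a row count) might look like the following sketch. The names `block_n_rows` and `BYTES_PER_FLOAT` are illustrative, not necessarily the PR's actual internals; it assumes float64 distances:

```python
BYTES_PER_FLOAT = 8  # float64

def block_n_rows(block_size_mib, n_samples):
    """Rows per block such that one (n_rows, n_samples) float64
    distance block fits within block_size_mib mebibytes."""
    n_rows = block_size_mib * 2 ** 20 // (BYTES_PER_FLOAT * n_samples)
    if n_rows < 1:
        raise ValueError('block_size should be at least n_samples * '
                         '%d bytes' % BYTES_PER_FLOAT)
    # No point in a block taller than the data itself
    return int(min(n_rows, n_samples))

print(block_n_rows(64, 10000))  # 838 rows of 10000 distances per block
```

This keeps the user-facing knob data-invariant: the same 64 MiB setting adapts the row count to whatever `n_samples` happens to be.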
@jnothman I'll do a review in a few moments.
```python
def _process_block(X, labels, start, block_n_rows, block_range, add_at,
                   label_freqs, metric, kwds):
```
Just a nitpick: `_process_block` -> `_silhouette_process_block`.
I know it's a private function, but maybe it would be easier to comprehend if you added a short docstring describing the arguments?
Performance is similar on my computer.
```python
y = [1, 1, 2]
assert_raise_message(ValueError, 'block_size should be at least n_samples '
                     '* 8 bytes = 1 MiB, got 0',
                     silhouette_score, X, y, block_size=0)
```
Should we instead have `block_size=0` denote the default of using all the memory?
For what benefit? The default 64MB will run a 2896 sample problem or smaller in a single block. For problems much larger than that, you're likely to benefit from splitting the problem up as suggested by our benchmark which shows 2x speedup from "use all memory" for a dataset less than 4x that size (and >9x the number of pairwise calculations). Yes, this is only my machine, but it's hard to imagine why we should suggest using all memory possible to the user.
Not dissimilar to n_jobs=-2 often being a better choice than n_jobs=-1
Why not automatically select block_size as min(64MB, np.ceil(n_samples * BYTES_PER_FLOAT * 2 ** -20)), rather than letting the user choose a specific value?
That would be identical to just setting it to 64, no? That's tempting, especially because this isn't a learning routine. I don't expect my benchmarks to be optimal, certainly not for any particular platform, or with n_jobs != 1 (using the default or some other parallel backend); hence my inclination to keep it public. I'll have to chew on this.
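The "2896 sample problem" figure quoted above can be checked directly: a 64 MiB block holds 64 * 2**20 / 8 = 8,388,608 float64 distances, and the largest square n x n distance matrix fitting that budget has n = floor(sqrt(8388608)):

```python
import math

budget_floats = 64 * 2 ** 20 // 8   # float64 entries in a 64 MiB block
n_max = math.isqrt(budget_floats)   # largest n with n * n <= budget_floats
print(n_max)  # 2896
```

So any dataset up to 2896 samples runs in a single block under the 64 MiB default, and only larger problems are split.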
I'd appreciate another opinion on this: should we make the memory-efficiency parameter a constant, given that we know the constant is sensible on some modern platforms (and is hard to explain to the user), or directly configurable by the user? In the context of a learning algorithm, configurability seems more valuable, but for an evaluation metric, is it sensible to just say that 64 MiB is the max pairwise distance evaluation available? Also, should this be ... Not sure who to ask: @GaelVaroquaux, @MechCoder, @amueller?
Also use threading for parallelism
Fixed a problem where ...
Sorry for piggybacking on this conversation... How do you usually decide definitively between multiprocessing and threading? As joblib memmaps the data, I don't see why multiprocessing is any worse than threading, given that it is usually a bit faster than threading (for larger datasets) and much faster when the GIL is not released... Or am I wrong?
I don't have any good answers for you, @raghavrv, but for the scale of problem and processor I've tried, multiprocessing always results in slower times. AFAIK, threading is a good option when there's a lot of work in non-GIL operations, and when copying the inputs and outputs competes with the calculation for cost. I'm considering just dropping ...
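To illustrate the trade-off being discussed: with joblib's threading backend, workers share the input array directly, whereas the multiprocessing backend must pickle (or memmap) data to child processes. A toy sketch of the threading shape, assuming joblib is installed; `row_sums` is a made-up stand-in for the per-block silhouette work:

```python
import numpy as np
from joblib import Parallel, delayed

def row_sums(block):
    # NumPy reductions release the GIL, so threads can overlap this work
    return block.sum(axis=1)

X = np.random.rand(1000, 20)

# backend='threading' shares X between workers with no copying;
# the multiprocessing backend would have to ship each block to a child
blocks = Parallel(n_jobs=2, backend='threading')(
    delayed(row_sums)(X[i:i + 250]) for i in range(0, 1000, 250))
out = np.concatenate(blocks)
```

When per-block work is dominated by GIL-releasing NumPy/BLAS calls, threading tends to win because the copy cost disappears; when the work holds the GIL, processes regain the advantage.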
@jnothman I think for blocked computations, we set "reasonable" defaults and allow configuration, IIRC. Do you want a review on this? It doesn't seem high priority to me. [OT: why are you using this score?] |
I'd like review on this, but no, I'm not using silhouette personally. I worked on it because an issue came in, and because I wanted to ascertain that #1976 could be improved upon. I also think we should be reviewing the potential for blockwise computations everywhere that ...
To be sure, it's not high priority, except that it closes a few issues and avoids more appearing over the next release.
Also, #5988 should be fixed.
Could you refactor #7438 out of this PR and rebase? I can try to take a deeper look at this before the weekend :)
@jnothman Ping :)
Superseding #1976, which breaks the problem down by cluster, here we simply break the input down into fixed-size blocks of rows. Reports of memory issues when calculating silhouette include #7175 and the mailing list. Also incorporates #6089's fix for #5988, and adds a test for `silhouette_samples` self-plagiarised from #4087.

Potential enhancements:

- `np.ufunc.at` in numpy<1.8
- `silhouette_samples` (none exists!); could perhaps be a separate PR...
- specifying `block_size` by target memory usage, e.g. `block_size='1G'`. Probably out of scope of this PR.
- `block_size=100` or `block_size='1G'` or some other sensible constant as the default, to avoid memory errors with default parameters and to improve default speed.
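On the first bullet: one plausible numpy<1.8 fallback for `np.add.at` (the accumulate-at-repeated-indices operation the blockwise code passes around as `add_at`) is `np.bincount` with weights, which sums duplicate indices the same way. A hedged sketch, not the PR's actual implementation:

```python
import numpy as np

def add_at_compat(out, indices, values):
    """Accumulate values into out at (possibly repeated) indices.

    Behaves like np.add.at(out, indices, values), but uses np.bincount
    so it also works on numpy < 1.8, where ufunc.at is unavailable.
    """
    out += np.bincount(indices, weights=values, minlength=out.shape[0])

out = np.zeros(4)
add_at_compat(out, np.array([0, 1, 1, 3]), np.array([1.0, 2.0, 3.0, 4.0]))
# out is now [1., 5., 0., 4.] -- the two index-1 values were summed
```

Note that plain fancy-indexed `out[indices] += values` would silently drop the repeated index-1 contribution, which is exactly why `ufunc.at`/`bincount` is needed here.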