[MRG] Block-wise silhouette calculation to avoid memory consumption #7177

jnothman wants to merge 15 commits into scikit-learn:master
Conversation
Benchmarks with:

```python
import numpy as np
from sklearn.metrics.cluster import silhouette_score

n_features = 100
for n_samples in [10000, 1000, 100]:
    for n_labels in [4, 6, 100]:
        print('n_features', n_features, 'n_samples', n_samples,
              'n_labels', n_labels)
        X = np.random.rand(n_samples, n_features)
        y = np.random.randint(n_labels, size=n_samples)
        %timeit silhouette_score(X, y)
```
```
block_size : int, optional
    The number of rows to process at a time to limit memory usage to
    O(block_size * n_samples). Default is n_samples.
```
Instead of allowing the user to specify size as a unit of memory usage, I am wondering if ...
I've now tried to just have the user specify the pairwise distance memory consumption in bytes...

If it needs to be in memory units, could we have it in MB, as users will need to enter a smaller number?

It doesn't need to be, but I think it is a more practical, data-invariant measure. Yes, we can have it in MiB.
@raghavrv now in MiB

@jnothman I'm a little late, but +1 to having the memory usage. I think it's an easier measure to specify. And also +1 for MiB.

You're earlier than everyone but @raghavrv. Thanks for the input. That's where it's converged anyway.
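The conversion the thread converged on (a MiB budget rather than a row count) might look like the following sketch. The names `block_n_rows` and `BYTES_PER_FLOAT` are illustrative, not necessarily the PR's actual internals; it assumes float64 distances:

```python
BYTES_PER_FLOAT = 8  # float64

def block_n_rows(block_size_mib, n_samples):
    """Rows per block such that one (n_rows, n_samples) float64
    distance block fits within block_size_mib mebibytes."""
    n_rows = block_size_mib * 2 ** 20 // (BYTES_PER_FLOAT * n_samples)
    if n_rows < 1:
        raise ValueError('block_size should be at least n_samples * '
                         '%d bytes' % BYTES_PER_FLOAT)
    # No point in a block taller than the data itself
    return int(min(n_rows, n_samples))

print(block_n_rows(64, 10000))  # 838 rows of 10000 distances per block
```

This keeps the user-facing knob data-invariant: the same 64 MiB setting adapts the row count to whatever `n_samples` happens to be.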
@jnothman I'll do a review in a few moments.
```python
def _process_block(X, labels, start, block_n_rows, block_range, add_at,
                   label_freqs, metric, kwds):
```
Just a nitpick: `_process_block` -> `_silhouette_process_block`.
I know it's a private function, but maybe it would be easier to comprehend if you added a short docstring describing the arguments?
Performance is similar on my computer.
```python
y = [1, 1, 2]
assert_raise_message(ValueError, 'block_size should be at least n_samples '
                     '* 8 bytes = 1 MiB, got 0',
                     silhouette_score, X, y, block_size=0)
```
Should we instead have `block_size=0` denote the default of using all the memory?
For what benefit? The default 64MB will run a 2896 sample problem or smaller in a single block. For problems much larger than that, you're likely to benefit from splitting the problem up as suggested by our benchmark which shows 2x speedup from "use all memory" for a dataset less than 4x that size (and >9x the number of pairwise calculations). Yes, this is only my machine, but it's hard to imagine why we should suggest using all memory possible to the user.
Not dissimilar to n_jobs=-2 often being a better choice than n_jobs=-1
Why not automatically select block_size as min(64MB, np.ceil(n_samples * BYTES_PER_FLOAT * 2 ** -20)), rather than letting the user choose a specific value?
That would be identical to just setting it to 64, no? That's tempting, especially because this isn't a learning routine. I don't expect my benchmarks to be optimal, certainly not for any particular platform, or with n_jobs != 1 (using the default or some other parallel backend); hence my inclination to keep it public. I'll have to chew on this.
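The "2896 sample problem" figure quoted above can be checked directly: a 64 MiB block holds 64 * 2**20 / 8 = 8,388,608 float64 distances, and the largest square n x n distance matrix fitting that budget has n = floor(sqrt(8388608)):

```python
import math

budget_floats = 64 * 2 ** 20 // 8   # float64 entries in a 64 MiB block
n_max = math.isqrt(budget_floats)   # largest n with n * n <= budget_floats
print(n_max)  # 2896
```

So any dataset up to 2896 samples runs in a single block under the 64 MiB default, and only larger problems are split.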
I'd appreciate another opinion on this: should we make the memory-efficiency parameter a constant, given that we know the constant is sensible on some modern platforms (and is hard to explain to the user), or directly configurable by the user? In the context of a learning algorithm, configurability seems more valuable, but for an evaluation metric, is it sensible to just say that 64 MiB is the max pairwise distance evaluation available? Also, should this be ... Not sure who to ask: @GaelVaroquaux, @MechCoder, @amueller?
Also use threading for parallelism
Fixed a problem where ...
Sorry for piggybacking on this conversation... How do you usually decide definitively between multiprocessing and threading? As joblib memmaps the data, I don't see why multiprocessing is any worse than threading, given that it is usually a bit faster than threading (for larger datasets) and much faster when the GIL is not released... Or am I wrong?
I don't have any good answers for you, @raghavrv, but for the scale of problem and processor I've tried, multiprocessing always results in slower times. AFAIK, threading is a good option when there's a lot of work in non-GIL operations, and when copying the inputs and outputs competes with the calculation for cost. I'm considering just dropping ...
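To illustrate the trade-off being discussed: with joblib's threading backend, workers share the input array directly, whereas the multiprocessing backend must pickle (or memmap) data to child processes. A toy sketch of the threading shape, assuming joblib is installed; `row_sums` is a made-up stand-in for the per-block silhouette work:

```python
import numpy as np
from joblib import Parallel, delayed

def row_sums(block):
    # NumPy reductions release the GIL, so threads can overlap this work
    return block.sum(axis=1)

X = np.random.rand(1000, 20)

# backend='threading' shares X between workers with no copying;
# the multiprocessing backend would have to ship each block to a child
blocks = Parallel(n_jobs=2, backend='threading')(
    delayed(row_sums)(X[i:i + 250]) for i in range(0, 1000, 250))
out = np.concatenate(blocks)
```

When per-block work is dominated by GIL-releasing NumPy/BLAS calls, threading tends to win because the copy cost disappears; when the work holds the GIL, processes regain the advantage.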
@jnothman I think for blocked computations, we set "reasonable" defaults and allow configuration, IIRC. Do you want a review on this? It doesn't seem high priority to me. [OT: why are you using this score?] |
I'd like review on this, but no, I'm not using silhouette personally. I worked on it because an issue came in, and because I wanted to ascertain that #1976 could be improved upon. I also think we should be reviewing the potential for blockwise computations everywhere that ...
To be sure, it's not high priority, except that it closes a few issues and avoids more appearing over the next release.
Also, #5988 should be fixed.
Could you refactor #7438 out of this PR and rebase? I can try to take a deeper look at this before the weekend :)
@jnothman Ping :)
Superseding #1976, which breaks the problem down by cluster, here we simply break the input down into fixed-size blocks of rows. Reports of memory issues when calculating silhouette include #7175 and the mailing list. Also incorporates #6089's fix for #5988, and adds a test for `silhouette_samples` self-plagiarised from #4087.

Potential enhancements:

- `np.ufunc.at` in numpy<1.8
- `silhouette_samples` (none exists!); could perhaps be a separate PR...
- specifying `block_size` by target memory usage, e.g. `block_size='1G'`. Probably out of scope of this PR.
- `block_size=100` or `block_size='1G'` or some other sensible constant as the default, to avoid memory errors with default parameters and to improve default speed.
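On the first bullet: one plausible numpy<1.8 fallback for `np.add.at` (the accumulate-at-repeated-indices operation the blockwise code passes around as `add_at`) is `np.bincount` with weights, which sums duplicate indices the same way. A hedged sketch, not the PR's actual implementation:

```python
import numpy as np

def add_at_compat(out, indices, values):
    """Accumulate values into out at (possibly repeated) indices.

    Behaves like np.add.at(out, indices, values), but uses np.bincount
    so it also works on numpy < 1.8, where ufunc.at is unavailable.
    """
    out += np.bincount(indices, weights=values, minlength=out.shape[0])

out = np.zeros(4)
add_at_compat(out, np.array([0, 1, 1, 3]), np.array([1.0, 2.0, 3.0, 4.0]))
# out is now [1., 5., 0., 4.] -- the two index-1 values were summed
```

Note that plain fancy-indexed `out[indices] += values` would silently drop the repeated index-1 contribution, which is exactly why `ufunc.at`/`bincount` is needed here.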