[MRG] Blockwise parallel silhouette computation#1976

Closed
AlexandreAbraham wants to merge 9 commits intoscikit-learn:masterfrom
AlexandreAbraham:blockwise_silhouette

Conversation

@AlexandreAbraham
Contributor

This pull request introduces a blockwise computation of the silhouette score, which uses less memory but is slightly slower.

First, the vocabulary can be debated. I chose the word "global" for the original silhouette strategy (since the global distance matrix is computed; I could not come up with a better word) and "blockwise" for my method.

The other point is that I have three implementations of the silhouette:

  • the original one
  • a blockwise single-threaded version
  • a blockwise multi-threaded version (which uses a bit more memory than the single-threaded version, even if n_jobs=1)

I have decided to leave the original version untouched, as its code is far more readable than mine. I have also not kept the most memory-efficient single-threaded blockwise version, because the memory gain is not worth the added code complexity (I think).

It is in fact possible to have "one implementation to rule them all". This could be done by keeping the same code skeleton and using a "precomputed" distance matrix for the original version. But I think that the code would become too opaque.
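To make the blockwise idea concrete, here is a minimal single-threaded sketch. The function name `silhouette_blockwise` and the `block_size` parameter are illustrative, not the PR's actual API; the point is only that peak memory is O(block_size * n_samples) instead of O(n_samples²):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def silhouette_blockwise(X, labels, block_size=64):
    """Silhouette samples computed one block of rows at a time."""
    uniq, y = np.unique(labels, return_inverse=True)
    n, k = len(X), len(uniq)
    counts = np.bincount(y)                       # cluster sizes
    scores = np.empty(n)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        D = pairwise_distances(X[start:stop], X)  # (block, n) distance block
        # Sum of distances from each row in the block to each cluster.
        sums = np.stack([np.bincount(y, weights=row, minlength=k) for row in D])
        rows = np.arange(stop - start)
        own = y[start:stop]
        # a(i): mean distance to the other members of i's own cluster.
        a = sums[rows, own] / np.maximum(counts[own] - 1, 1)
        # b(i): smallest mean distance to any other cluster.
        means = sums / counts
        means[rows, own] = np.inf
        b = means.min(axis=1)
        scores[start:stop] = (b - a) / np.maximum(a, b)
    return scores
```

Only one (block_size, n_samples) distance block is alive at a time, which is the memory saving the PR is after.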

Member

compute -> computes :)

and maybe add a "the" before "blockwise"

Member

"approximately" should maybe go between "by" and "the squared" in the next line, so it reads: "...dividing memory consumption by approximately the squared number of clusters."
Does that make sense? Currently it's a bit hard to read.

Contributor Author

I agree. Plus, it is a bit superfluous. Do you think I should remove this part and just say that it lowers memory consumption, without giving an order of magnitude?

Member

I think it's okay as is now 👍

@jaquesgrobler
Member

I had a quick look. Thanks for the add-on, @AlexandreAbraham. This seems very useful considering the conversation on the mailing list here

Member

The docstring should be {'global', 'blockwise'}, and not 'string'.

@GaelVaroquaux
Member

Looks good to me.

Member

I think I would remove the parameter documentation and leave only the short description above. Those two functions are private and short. The documentation takes more space than the actual implementation.

Member

> I think I would remove the parameter documentation and leave only the short description above. Those two functions are private and short. The documentation takes more space than the actual implementation.

I was actually thinking the same. So +1 on this remark

Contributor Author

Done. I have also removed the parameter documentation for the other private methods (the original method).

@jaquesgrobler
Member

+1 for merge

@mblondel
Member

mblondel commented Jan 5, 2014

Hmm, this PR has two +1 for merge and is still open?!

Member

Why remove the docstrings?

@AlexandreAbraham
Contributor Author

I fixed the only remaining remark (docstring removed for no reason) and rebased. This should be good to go.

@AlexandreAbraham AlexandreAbraham changed the title Blockwise parallel silhouette computation [MRG] Blockwise parallel silhouette computation Oct 19, 2015
Member

I suggest that we rename this argument to 'blockwise', with options False, True, and 'auto', where 'auto' is the default and does some clever check on the size of the input to decide whether blocking is worthwhile.

@AlexandreAbraham
Contributor Author

I used this code to bench the silhouette computation:

import time

from sklearn import datasets
from sklearn.metrics.cluster.unsupervised import silhouette_score


def test_silhouette(n_samples, n_clusters):
    X, y = datasets.make_blobs(n_samples=n_samples, n_features=500,
                               centers=n_clusters)
    t0 = time.time()
    for i in range(20):
        silhouette_score(X, y, metric='euclidean',
                         blockwise=False)
    t_global = time.time() - t0
    t0 = time.time()
    for i in range(20):
        silhouette_score(X, y, metric='euclidean',
                         blockwise=True)
    t_block = time.time() - t0

    return t_global, t_block

for n_clusters in [50, 100, 200, 300]:
    for n_samples in [200, 300, 400, 500, 1000]:
        if n_clusters >= n_samples - 1:
            continue
        t_g, t_b = test_silhouette(n_samples, n_clusters)
        print('%d,%d,%f,%f' % (n_clusters, n_samples, t_g, t_b))

On my machine, blockwise is always faster below 50 clusters. Above 50 clusters, the global version is faster only if there are fewer than 5 * n_labels samples. I set up a heuristic in the code accordingly.
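Stated as code, the heuristic from this benchmark looks roughly like the following (the function name `use_blockwise` is hypothetical; the thresholds are the ones measured above):

```python
def use_blockwise(n_samples, n_labels):
    """Pick a strategy: blockwise always wins below 50 clusters; above that,
    the global version wins only when n_samples < 5 * n_labels."""
    if n_labels < 50:
        return True
    return n_samples >= 5 * n_labels
```

An 'auto' option would just call such a check and dispatch to one of the two implementations.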

Member

We need the 'auto' logic covered too (just to explore all code paths).

Contributor Author

The 'auto' case is covered below. But I can add an extra verification here if you want.


@AlexandreAbraham
Contributor Author

Yes, given the last comments, I'm working on the docs.

@GaelVaroquaux
Member

Your tests are failing under Windows because of the added parallelism. You need to disable it in the test (check AppVeyor to see the problem).

Member

Can you add a `versionadded` note to both new parameters?

Contributor Author

How do you add the flag to a parameter? I didn't manage to do it :-/

@ogrisel
Member

ogrisel commented Oct 21, 2015

The failure under Windows might be random and not necessarily related to this specific PR. We have had seemingly similar random failures in the past on completely unrelated PRs, and they disappeared without us doing anything.

Apparently here we first get a `WindowsError: [Error 5] Access is denied` and then a series of failures that reference "Attempting to do parallel computing without protecting your import", but which are probably actually triggered by a problem in the runtime environment on the CI server.

Let me relaunch appveyor manually on that PR.

Member

Here, just after this line, add a `.. versionadded:: 0.17`. Not sure if it needs a newline before it, though.
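For reference, the numpydoc convention is to put the directive inside the parameter's description block, indented to match it and separated by a blank line. A sketch (the signature and parameter wording here are illustrative, not the PR's exact docstring):

```python
def silhouette_score(X, labels, metric='euclidean', blockwise='auto', n_jobs=1):
    """Compute the mean Silhouette Coefficient of all samples.

    Parameters
    ----------
    blockwise : bool or 'auto', optional (default='auto')
        Whether to compute the distance matrix block by block.

        .. versionadded:: 0.17

    n_jobs : int, optional (default=1)
        Number of parallel jobs to run.

        .. versionadded:: 0.17
    """
```

Sphinx then renders a "New in version 0.17" note under each parameter.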

@krishnateja614 May 24, 2016

@ogrisel, I am getting the Windows "error 5: access is denied" inconsistently (about 8 out of 10 times) on an Azure VM, but not on my PC. It occurs after GridSearchCV fits all the folds but before it can compute the best parameters and best scores. I raised an issue with all the details here: #6820. The VM is Windows Server 2012 R2. Is it something related to the VM environment, or is there something we can do?

@AlexandreAbraham
Contributor Author

The CircleCI failure is not due to the PR. Should I amend and re-launch the tests?

@GaelVaroquaux
Member

GaelVaroquaux commented Oct 27, 2015 via email

@AlexandreAbraham
Contributor Author

Done.

@AlexandreAbraham
Contributor Author

I'll rebase again but I don't get why CircleCI is failing.

@jnothman
Member

jnothman commented Aug 11, 2016

#7175 suggests that this would still be of interest to users.

However, I have my doubts about how much time we save by doing this block computation cluster-wise, where clusters are very unevenly sized. If we simply processed b samples at a time, we could accumulate as follows:

def process_block(start, intra_cluster):
    """compute minimal inter-cluster distance for points in block and add to intra-cluster totals"""
    block_dists = pairwise_distances(X[start:start+b], X)
    cluster_dists = np.zeros((b, n_clusters))
    # The following is equivalent to:
    #   cluster_dists = np.vstack([np.bincount(y, row, n_clusters) for row in block_dists])
    np.add.at(cluster_dists, (np.repeat(np.arange(b), len(y)), np.tile(y, b)), block_dists.ravel())
    np.add.at(intra_cluster, y[start:start+b], cluster_dists[np.arange(b), y[start:start+b]])
    cluster_dists[np.arange(b), y[start:start+b]] = np.inf
    return (cluster_dists / np.bincount(y)).min(axis=1)

This would also much more easily benefit from parallelism (obviously without sharing intra_cluster accumulators), I think.

@jnothman
Member

That code is wrong (I forgot the definition of intra-cluster distance), but the idea still applies.

@jnothman
Member

jnothman commented Jun 6, 2017

FWIW, I think this is still in the pipeline to fix, via #7979
