[MRG] Blockwise parallel silhouette computation#1976
[MRG] Blockwise parallel silhouette computation#1976AlexandreAbraham wants to merge 9 commits intoscikit-learn:masterfrom AlexandreAbraham:blockwise_silhouette
Conversation
There was a problem hiding this comment.
compute -> computes :)
and maybe add a the before blockwise
There was a problem hiding this comment.
approximately should maybe be between by and the squared in the next line.
So it reads: ...dividing memory consumption by approximately the squared number of clusters.
That make sense? currently its a bit hard to read
There was a problem hiding this comment.
I agree. Plus, it is a bit superfluous. Do you think I should remove this part and just that it lowers memory consumption without giving an order of magnitude ?
There was a problem hiding this comment.
I think it's okay as is now 👍
|
I had a quick look. Thanks for the add-on, @AlexandreAbraham. This seems very useful considering the conversation on the mailing list here |
There was a problem hiding this comment.
The docstring should be {'global', 'blockwise'}, and not 'string'.
|
Looks good to me. |
There was a problem hiding this comment.
I think I would remove the parameter documentation and leave only the short description above. Those two functions are private and short. The documentation takes more space than the actual implementation.
There was a problem hiding this comment.
I think I would remove the parameter documentation and leave only the
short description above. Those two functions are private and short. The
documentation takes more space than the actual implementation.
I was actually thinking the same. So +1 on this remark
There was a problem hiding this comment.
Done. I have also removed the parameters documentation for the other private methods (original method).
|
+1 for merge |
|
Hum this PR has two +1 for merge and is still open?! |
|
I fixed the only remaining remark (docstring removed for no reason) and rebased. This should be good to go. |
There was a problem hiding this comment.
I suggest that we rename this argument to 'blockwise', and have as options False, True, and 'auto', where 'auto' is the default, and you do some clever check on the size of the input to decided whether blocking is good or not.
|
I used this code to bench the silhouette computation: On my machine, blockwise is always faster below 50 clusters. Above 50 clusters, the global version is faster if there is less then |
There was a problem hiding this comment.
We need the auto logic covered too (just to explore all codepaths).
There was a problem hiding this comment.
The 'auto' case is covered below. But I can add an extra verification here if you want.
There was a problem hiding this comment.
|
Yes, given the last comments, I'm working on the docs. |
|
Your tests are failing under windows because of the added parallelism. You need to disable it in the test (check appveyor to see the problem). |
There was a problem hiding this comment.
can you add versionadded to both new parameters.
There was a problem hiding this comment.
How do you add the flag to a parameter? I didn't manage to do it :-/
|
The failure under windows might be random and not necessarily related to this specific PR. We had seemingly similar random failure in the past on completely unrelated PRs and the failure disappeared without having us do anything. Apparently here we first get a Let me relaunch appveyor manually on that PR. |
There was a problem hiding this comment.
here just after this line add a .. versionadded::0.17 not sure if it needs a newline before it though
There was a problem hiding this comment.
@ogrisel , I am getting the windows error 5 access is denied inconsistently (but like 8 out of 10 times) on an Azure VM but not my PC. This is occurring after gridsearchcv fits all the folds but before it can calculate the best parameters and best scores. I raised an issue with all the details here #6820 . The VM is Windows Server 2012 R2. Is it something related to VM environment or is there something we can do?
|
The CircleCI failure is not due to the PR. Should I amend and re-launch the tests? |
|
The CircleCI failure is not due to the PR. Should I amend and re-launch the
tests?
How about rebasing on master
|
|
Done. |
|
I'll rebase again but I don't get why CircleCI is failing. |
|
#7175 suggest this would still be of interest to users. However, I have my doubts about how much time we save by doing this block computation cluster-wise, where clusters are very unevenly sized. If we simply processed def process_block(start, intra_cluster):
"""compute minimal inter-cluster distance for points in block and add to intra-cluster totals"""
block_dists = pairwise_distances(X[start:start+b], X)
cluster_dists = np.zeros((b, n_clusters))
# Following is equivalent to cluster_dists = np.vstack([np.bincount(y, row, n_clusters) for row in block_dists])
np.add.at(cluster_dists, (np.repeat(np.arange(b), len(y)), np.tile(y, b)), block_dists.ravel())
np.add.at(intra_cluster, y[start:start+b], cluster_dists[np.arange(b), y[start:start+b]])
cluster_dists[np.arange(b), y[start:start+b]] = np.inf
return (cluster_dists / np.bincount(y)).min(axis=1)This would also much more easily benefit from parallelism (obviously without sharing |
|
That code is wrong (I forgot the definition of intra-cluster distance), but the idea still applies. |
|
FWIW, I think this is still in the pipeline to fix, via #7979 |
This pull request introduces a blockwise computation of the silhouette, using less memory, but slightly slower.
First, the vocabulary can be debated. I have chosen the word "global" for the original silhouette strategy (as the global distance matrix is computed, I could not come with a better word) and "blockwise" for my method.
The other point is that I have 3 implementations of the silhouette:
I have decided to leave the original version untouched, as the code is far more readable than mine. Then I have not kept the most efficient blockwise single threaded version because the memory gain is not worth the code complication (I think).
It is in fact possible to have "one implementation to rule them all". This could be done by keeping the same code skeleton and using a "precomputed" distance matrix for the original version. But I think that the code would become too opaque.