[MRG+1] ENH Add working_memory global config for chunked operations #10280
jnothman merged 105 commits into scikit-learn:master
Conversation
reverted the comment Resolved merge conflicts
and add comments
Also use threading for parallelism
I see. It's fine, it was just a side comment not very critical to this PR... There is a minor conflict in the what's new. LGTM. Now this would need a second review... maybe @lesteve or @qinhanmin2014 ? :)
doc/whats_new/v0.20.rst (outdated diff):

    - A new configuration parameter, ``working_memory`` was added to control memory
      consumption limits in chunked operations, such as the new
      :func:`metrics.pairwise_distances_chunked`. See :ref:`working_memory`.
You seem to have forgotten the glossary entry.
I think a reference to the User Guide is most relevant in what's new.
I can add a glossary entry, though I'm not sure how it will help beyond the user guide and the config_context docstring.
Fair enough, I just wonder what the goal of the :ref: syntax is here, since it does not render a link.
Did you mean :func:set_config? Or maybe you need a label in doc/modules/computational_performance.rst?
The latter. A glossary reference would be :term:, not :ref: which references sections.
sklearn/metrics/pairwise.py (outdated diff):

    ``reduce_func``.

    Examples
    -------
You need one more dash to have proper rendering.
    assert isinstance(S_chunks, GeneratorType)
    S_chunks = list(S_chunks)
    assert len(S_chunks) > 1
    # atol is for diagonal where S is explicitly zeroed on the diagonal

    min_block_mib = np.array(X).shape[0] * 8 * 2 ** -20

    for block in blockwise_distances:
        memory_used = len(block) * 8
You should use memory_used = block.size * 8 to have the correct memory used in the block.
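To make the reviewer's point concrete, here is a minimal NumPy sketch (the array shape and variable names are invented for illustration) of how ``len()`` undercounts the footprint of a 2-D block:

```python
import numpy as np

# A chunk of a pairwise distance matrix: 10 rows, 500 columns of float64.
block = np.zeros((10, 500))

# len(block) counts only the rows, underestimating memory by a factor of
# the number of columns; block.size counts every element.
rows_only = len(block) * 8     # 80 bytes: not the real footprint
all_elements = block.size * 8  # 40000 bytes: matches block.nbytes
```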
    for block in blockwise_distances:
        memory_used = len(block) * 8
        assert memory_used <= min(working_memory, min_block_mib) * 2 ** 20
And the min should be a max, shouldn't it?
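A hypothetical numeric sketch of why ``max`` is the right bound, assuming (as in the quoted test) that ``min_block_mib`` is the footprint of a single row of the distance matrix: a chunk must contain at least one row, so memory use may legitimately exceed ``working_memory`` up to one row's size.

```python
# Suppose the distance matrix has 1_000_000 columns of float64: one row
# alone needs ~7.63 MiB, so with working_memory=1 no chunk can fit.
n_cols = 1_000_000
min_block_mib = n_cols * 8 * 2 ** -20  # size of a single mandatory row, in MiB
working_memory = 1                     # MiB

one_row_bytes = n_cols * 8
# min(...) bound: 1 MiB -- a single mandatory row already violates it.
assert one_row_bytes > min(working_memory, min_block_mib) * 2 ** 20
# max(...) bound: one row is allowed, as the reviewer suggests.
assert one_row_bytes <= max(working_memory, min_block_mib) * 2 ** 20
```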
    metric='euclidean')
    # Test small amounts of memory
    for power in range(-16, 0):
        check_pairwise_distances_chunked(X, None, working_memory=2 ** power,
This line raises a lot of warnings:
UserWarning: Could not adhere to working_memory config. Currently 0MiB, 1MiB required.
We should silence them as they are expected.
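One way to silence only the expected warning, sketched here with a hypothetical stand-in for the test helper (``warnings.catch_warnings`` scopes the filter, so unrelated warnings elsewhere still surface; in a pytest suite, ``pytest.warns(UserWarning)`` would assert the warning instead):

```python
import warnings

def check_small_memory():
    # Hypothetical stand-in for the test helper: with a tiny
    # working_memory a chunk cannot fit even one row, so the code
    # warns before falling back to one-row chunks.
    warnings.warn("Could not adhere to working_memory config. "
                  "Currently 0MiB, 1MiB required.", UserWarning)
    return "ok"

# Silence only the expected UserWarning, and only inside this scope.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    result = check_small_memory()
```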
Great that this is happening!
Thanks @TomDLT for the review and the approval! I've addressed all your comments except for the glossary one, which I'm not sure is necessary at this point.
I suppose we could include a new Global Configuration section of the glossary. But I'm not sure how that goes beyond the ``config_context`` docstring.
Is there a plan to also use this in ``pairwise_distances`` in the future?
``pairwise_distances_chunked`` is only useful if its result can be reduced, so no, there's no point in including this in ``pairwise_distances``. (But there might be particular distance functions that can be calculated in a chunked way to avoid n_samples * n_samples * n_features arrays, which may be what you are thinking of.)

If you would like to merge this, I'll happily post follow-ups to close more issues!
py3.6 fails ;)
Merging to enable downstream PRs. Let me know if there are further quibbles! Thanks for the reviews, Roman and Tom!
.. and also #11135 for chunking silhouette_score calculations |
We often get issues related to memory consumption and don't deal with them particularly well. Indeed, Scikit-learn should be at home on commodity hardware like developer/researcher laptops.
Some operations can be performed chunked, so that the result is computed in constant (or O(n)) memory relative to some current O(n) (or O(n^2)) consumption. Examples include:

- getting the argmin and min of all pairwise distances (currently done with an ad-hoc parameter to ``pairwise_distances_argmin_min``)
- calculating silhouette score (#1976)
- getting nearest neighbors with brute force (#7287)
- calculating standard deviation of every feature (#5651)

It's not very helpful to provide this "how much constant memory" parameter in each function (because they're often called within nested code), so this PR instead makes a global config parameter of it. The optimisation is then transparent to the user, but still configurable.
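As an illustration of the chunking idea (a sketch only, not the PR's implementation: the helper name, the squared-euclidean reduction, and the chunk-size formula are invented here), a pairwise argmin can be computed without ever materialising the full distance matrix:

```python
import numpy as np

def pairwise_argmin_chunked(X, Y, working_memory_mib=1):
    """Sketch: argmin over rows of the X-to-Y squared-euclidean distance
    matrix, holding at most ~working_memory_mib MiB of distances at once.
    Illustrative only; the PR's pairwise_distances_chunked offers a
    richer reduce_func API in sklearn.metrics."""
    # Rows per chunk so one chunk of float64 distances fits the budget.
    n_rows = max(1, int(working_memory_mib * 2 ** 20 / (Y.shape[0] * 8)))
    out = np.empty(X.shape[0], dtype=np.intp)
    for start in range(0, X.shape[0], n_rows):
        chunk = X[start:start + n_rows]
        # Distances for this chunk only: O(n_rows * len(Y)) memory.
        d = ((chunk[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        out[start:start + n_rows] = d.argmin(axis=1)
    return out
```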
At @rth's request, this PR has been cut back. The proposed changes to silhouette and neighbors can be seen here.
This PR (building upon my work with @dalmia) will therefore:

- add ``set_config(working_memory=n_mib)``
- add ``pairwise_distances_chunked``
- use it in nearest neighbors, silhouette and ``pairwise_distances_argmin_min``
- deprecate ``batch_size`` in ``pairwise_distances_argmin_min``

and thus:
TODO:

- tests for ``get_chunk_n_rows``, ``_check_chunk_size`` and any others for ``pairwise_distances_chunked``
- audit uses of ``gen_batches`` to see if they can use this parameter to be more principled in the choice of batch size: in most cases the batch size affects the model, so we can't just change it; in other cases, it may make little difference. We could change mean_shift's use from fixed batch_size=500 to something sensitive to working_memory if we wish.
- an example for ``pairwise_distances_chunked`` that uses start and a tuple return
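The chunk-size helper mentioned in the TODO can be sketched as follows (a minimal approximation: the signature mirrors the PR's ``sklearn.utils.get_chunk_n_rows``, but the body here is illustrative rather than the merged implementation):

```python
import math
import warnings

def get_chunk_n_rows(row_bytes, max_n_rows=None, working_memory=64):
    """Sketch: how many rows of ``row_bytes`` bytes fit within a
    ``working_memory`` budget given in MiB, never returning fewer
    than one row (with a warning when the budget cannot be met)."""
    chunk_n_rows = int(working_memory * 2 ** 20 // row_bytes)
    if max_n_rows is not None:
        chunk_n_rows = min(chunk_n_rows, max_n_rows)
    if chunk_n_rows < 1:
        # Mirrors the UserWarning seen in the tests above.
        warnings.warn("Could not adhere to working_memory config. "
                      "Currently %dMiB, %dMiB required."
                      % (working_memory, math.ceil(row_bytes * 2 ** -20)))
        chunk_n_rows = 1
    return chunk_n_rows
```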