[MRG+1] Add 'axis' argument to sparsefuncs.mean_variance_axis#3622
[MRG+1] Add 'axis' argument to sparsefuncs.mean_variance_axis#3622untom wants to merge 3 commits intoscikit-learn:masterfrom
Conversation
8036a2e to
f09a94f
Compare
sklearn/utils/sparsefuncs.py
Outdated
There was a problem hiding this comment.
Can you tell to the user what was the given axis?
This is useful for introspecting code and debugging.
There was a problem hiding this comment.
Apparently, you also accept -1 and -2.
There was a problem hiding this comment.
True, out of consistency with other methods in sklearn (and scipy in general) that handle the axis argument this way as well (e.g. count_nonzero in the same file), but those function don't document that usage, either. I assumed this is an sklearn convention.
There was a problem hiding this comment.
e.g. see also https://github.com/scipy/scipy/blob/master/scipy/sparse/compressed.py which uses the same convention thoughout, but never documents it.
There was a problem hiding this comment.
Whereas the numpy.matrix.std has a docstring that says:
Refer to `numpy.std` for full documentation.
There was a problem hiding this comment.
grepping through the numpy and scipy codebases, it seems like the most common way is to describe this as "axis : int" without specifying which values are allowed (which makes sense for numpy given that an ndarray can have any number of axis), while the scipy.sparse module explicitly lists 0 and 1 as valid arguments (never -1 and -2, although the functions in questions do accept those values as well). Personally I think the way I documented it makes sense, as it's consistent with scipy.sparse
There was a problem hiding this comment.
count_nonzero is a backport from NumPy. We don't generally accept funny axes, since data is assumed to be 2-d almost everywhere.
There was a problem hiding this comment.
So what do you suggest would be the right thing to do? Remove -2/-1 as accepted values?
There was a problem hiding this comment.
Yes, I'd get rid of those. They're unlikely to be more useful than confusing.
There was a problem hiding this comment.
Perhaps more to the point, unlike scipy.sparse, utils here are not public.
On 3 September 2014 04:21, Lars Buitinck notifications@github.com wrote:
In sklearn/utils/sparsefuncs.py:
@@ -61,6 +61,9 @@ def mean_variance_axis0(X):
X: CSR or CSC sparse matrix, shape (n_samples, n_features)
Input data.
- axis: int (either 0 or 1)
Axis along which the axis should be computed.Yes, I'd get rid of those. They're unlikely to be more useful than
confusing.—
Reply to this email directly or view it on GitHub
https://github.com/scikit-learn/scikit-learn/pull/3622/files#r17004979.
99c8f48 to
50e20a0
Compare
|
I've removed the support for |
|
Travis CI error was due to out-of-error on Python 3.4, I don't think this is related to my patch |
|
+1 for merge. Also, thank you for taking the time to split up #2514. I would still like to see (most of) that merged. |
sklearn/utils/sparsefuncs.py
Outdated
There was a problem hiding this comment.
Style / nitpick: this could be written as if axis not in (0, 1):
But ok, it's not really important.
|
looks good to merge |
|
I'll fix arjoly's last comment, that should also kick off Travis CI again, just to be sure |
This PR adds an 'axis' argument to sparsefuncs.mean_variance_axis, making it easier to calculate the columnwise mean/variance of sparse matrices.
While switching the codebase over to the new functionality, I noticed that
VarianceThresholdneedlessly converted CSC matrices to CSR. This PR also includes a commit to fix this, as the change fits naturally with the other changes.(If the immediate need for the new functionality is not obvious: This PR is the first in a series that tries to split up #2514 into a few smaller, easy-to-digest PRs. I plan to commit other PRs that will make use of this one)