[MRG+1] Added support for sparse multilabel y for Nearest neighbor classifiers #8096
dalmia wants to merge 19 commits into scikit-learn:master from
Conversation
jnothman
left a comment
The point of a sparse data structure is that, if truly sparse, it takes much less memory than the dense equivalent. Imagine we have very many outputs in a multilabel problem. Storing the n_outputs x n_samples matrix is wasteful. Here you use dense versions as well as sparse versions, so this solution is not helpful at all.
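A quick sketch of why this matters (my own illustration, not code from this PR): with many outputs, a dense multilabel target dwarfs its sparse equivalent.

```python
import numpy as np
from scipy import sparse

n_samples, n_outputs = 1000, 5000
rng = np.random.RandomState(0)
# A multilabel target where positive labels are rare (~0.1% density).
y_dense = (rng.rand(n_samples, n_outputs) < 0.001).astype(np.int64)
y_sparse = sparse.csr_matrix(y_dense)

dense_bytes = y_dense.nbytes
sparse_bytes = (y_sparse.data.nbytes + y_sparse.indices.nbytes
                + y_sparse.indptr.nbytes)
print(dense_bytes, sparse_bytes)  # the sparse copy is orders of magnitude smaller
```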
sklearn/neighbors/base.py
Outdated
    self._issparse = issparse(y)
    if(issparse(y) and self.outputs_2d_):
        y = y.toarray()
we should not make a sparse array dense unnecessarily.
sklearn/neighbors/classification.py
Outdated
      for (pl, w)
-     in zip(pred_labels[inliers], weights[inliers])],
+     in zip(pred_labels[inliers],
+            weights[inliers])],
@jnothman Thank you for giving me such a clear explanation. I'll think of some other way to implement the same.
When a sparse multilabel y is passed to fit:

    classes, self._y[:, k] = np.unique(y[:, k], return_inverse=True)

The above line creates a lot of problems:

    >>> np.unique(y_train[:, 3], return_inverse=True)
    (array([ <75x1 sparse matrix of type '<type 'numpy.int64'>'
        with 36 stored elements in Compressed Sparse Row format>], dtype=object),
     array([0]))

Thereby classes is assigned the value of the particular column of y_train, and thus the value of y is not stored as self._y. For the sparse multilabel case, I am thus trying to do something else separately.
Thought of something. It was while implementing that I clearly understood what you had mentioned while introducing the issue. Learnt something new today :)
Travis is showing some unexpected error. Tests pass locally.

    aman@aman:/media/aman/BE66ECBA66EC7515/Open Source/scikit-learn$ nosetests sklearn/neighbors/tests/test_neighbors.py
    ...............................................
    ----------------------------------------------------------------------
    Ran 47 tests in 2.668s
    OK
Yes, that code won't work neatly for both sparse and dense
But you can use np.unique() on subsets of data in a CSC. Go read how CSC
works internally and see if you can puzzle it out.
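A sketch of the approach hinted at here (illustrative only; the `column_unique` helper is my own, not part of the PR): a CSC matrix stores each column's non-zero values contiguously in `data`, delimited by `indptr`, so `np.unique` can be applied per column without densifying.

```python
import numpy as np
from scipy import sparse

y = sparse.csc_matrix(np.array([[0, 1],
                                [2, 1],
                                [0, 3]]))

def column_unique(y_csc, k):
    """Unique values in column k of a CSC matrix, including implicit zeros."""
    start, end = y_csc.indptr[k], y_csc.indptr[k + 1]
    nonzero = np.unique(y_csc.data[start:end])
    if end - start < y_csc.shape[0]:  # column has at least one implicit zero
        nonzero = np.union1d(nonzero, [0])
    return nonzero

print(column_unique(y, 0))  # [0 2]
print(column_unique(y, 1))  # [1 3]
```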
@jnothman Yes, I had spent the whole of yesterday figuring out what was happening behind the scenes. I think it might have escaped your notice: I have indeed made another commit fixing the issue. Now, I am not using conversion to dense at any point of time, and I think it might just work. Please review.
Travis says you're constructing a matrix with …
jnothman
left a comment
What about RadiusNeighborsClassifier?
sklearn/neighbors/base.py
Outdated
    self.classes_.append(classes)
    if self._issparse and self.outputs_2d_:
        self._y = y
        for k in range(self._y.shape[1]):
just self.classes_ = [np.array([0, 1], dtype=np.int)] * range(y.shape[1])?
I don't think range should be used here. Other than that, this is much better. Thanks.
sklearn/neighbors/classification.py
Outdated
    if self._issparse:
        y_pred_sparse = []

    for k, classes_k in enumerate(classes_):
I think you want two separate code-paths for this routine. Putting the condition inside the loop makes the code pretty messy.
sklearn/neighbors/classification.py
Outdated
    if self._issparse and self.outputs_2d_:
        if weights is None:
            mode, _ = stats.mode(_y[neigh_ind, k].toarray(), axis=1)
It might be interesting to implement a sparse (weighted) mode, but this is fine for now. It requires O(n_samples x n_neighbors) memory, where sparsity is being used to avoid O(n_samples x n_outputs) memory.
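The memory argument can be sketched with made-up data (my own illustration; `_y` and `neigh_ind` are stand-ins, and I gather the column densely rather than slicing the sparse matrix directly as the PR does): only one column plus an (n_queries, n_neighbors) block ever becomes dense, never the full (n_samples, n_outputs) target.

```python
import numpy as np
from scipy import sparse, stats

# Sparse multilabel target with 4 samples and 2 outputs (stand-in data).
_y = sparse.csc_matrix(np.array([[0, 1],
                                 [1, 1],
                                 [1, 0],
                                 [0, 0]]))
# Indices of the 2 nearest neighbors of each of 3 query points (made up).
neigh_ind = np.array([[0, 1],
                      [1, 2],
                      [2, 3]])
k = 0
# Densify only column k, then gather the neighborhood block.
col_k = np.asarray(_y[:, k].todense()).ravel()
neigh_labels = col_k[neigh_ind]          # dense (n_queries, n_neighbors) block
mode, _ = stats.mode(neigh_labels, axis=1)
mode = np.asarray(mode).ravel()          # majority label per query for output k
print(mode)
```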
I thought that once we have a layout for …
sklearn/neighbors/base.py
Outdated
    check_classification_targets(y)

    self._issparsemultilabel = issparse(y) and self.outputs_2d_
I don't get why this should be stored
Or perhaps we should store outputs_2d_ = 'sparse'
Yes, we could store it that way. It saves the need for another variable.
sklearn/neighbors/classification.py
Outdated
    y_pred = y_pred.ravel()

    if self._issparsemultilabel:
        y_pred = hstack(y_pred_sparse_multilabel)
sklearn/neighbors/classification.py
Outdated
    mode = np.asarray(mode.ravel(), dtype=np.intp)
    y_pred[:, k] = classes_k.take(mode)

    if not self.outputs_2d_:
sklearn/neighbors/classification.py
Outdated
    @@ -155,18 +156,39 @@ def predict(self, X):
        weights = _get_weights(neigh_dist, self.weights)

        y_pred = np.empty((n_samples, n_outputs), dtype=classes_[0].dtype)
you shouldn't be constructing this dense output array; it defeats the purpose of using a sparse (low-memory) structure.
The nosetests run fine locally. Could you please tell me why they differ here?
From the look of which tests are failing (not checked the log as I'm on a slow connection), could it be that your code is not Python 2-friendly?
That shouldn't be the case, as I ran my tests on Python 2.7.
The problem here is related to indexing into a sparse matrix. But I've been unable to reproduce the error, so it's a bit hard to debug.
sklearn/neighbors/classification.py
Outdated
-   if not self.outputs_2d_:
-       y_pred = y_pred.ravel()
+   if weights is None:
+       mode, _ = stats.mode(_y[neigh_ind, k].toarray(), axis=1)
could you please confirm what neigh_ind and k are exactly so that we can debug what's not working?
neigh_ind here is the list of indices of the kneighbors, and k signifies the kth column of the y passed to fit.
One of the issues may derive from neigh_ind having 0 elements, and that case not being handled in scipy (indeed, we may still support a version of scipy where sparse matrices with 0 elements on one axis were not supported at all).
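A minimal repro of the suspected failure mode (my own sketch, not from the PR): indexing a sparse matrix with an empty index array. Recent scipy handles this; the very old versions under discussion here did not.

```python
import numpy as np
from scipy import sparse

y = sparse.csc_matrix(np.eye(3, dtype=np.int64))
empty_ind = np.array([], dtype=int)   # e.g. neigh_ind for a query with no inliers
sub = y[empty_ind]                    # empty row selection
print(sub.shape)                      # (0, 3) on scipy versions that support it
```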
Yes, that is what is indeed creating these errors. I logged the value of …
Any workaround that you may have for getting it to work?
Added a failure for scipy < 0.13. Please review.
jnothman
left a comment
I think there should be a way to avoid so much repetition in the code but still have it elegant... I'm just too tired to suggest something specific. :|
sklearn/neighbors/classification.py
Outdated
    mode, _ = weighted_mode(_y[neigh_ind, k], weights, axis=1)
    if self.outputs_2d_ == 'sparse':
        if StrictVersion(version.version) < StrictVersion('0.13.0'):
            raise EnvironmentError('Sparse multilabel y passed in fit. '
    clf.fit(X_train, y_train)

    if (name == 'KNeighborsClassifier' and
            StrictVersion(version.version) < StrictVersion('0.13.0')):
StrictVersion(scipy.__version__) would be clearer
I'll try to come up with something then.
Added the calculation of the mode in a helper function, which greatly reduces the repetitive code. Please have a look.
jnothman
left a comment
yes, making _mode a helper is clearer.
sklearn/neighbors/classification.py
Outdated
    if not self.outputs_2d_:
        y_pred = y_pred.ravel()
    y_pred = hstack(y_pred_sparse_multilabel)
I'd rather see this as sparse.hstack. Thus from scipy import sparse
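The suggestion can be sketched as follows (variable names assumed, not the PR's exact code): scipy's `sparse.hstack` stitches the per-output prediction columns together without ever densifying them.

```python
import numpy as np
from scipy import sparse

# Per-output prediction columns, each of shape (n_samples, 1).
cols = [sparse.csc_matrix(np.array([[1], [0], [1]])),
        sparse.csc_matrix(np.array([[0], [1], [0]]))]
# Stack horizontally into an (n_samples, n_outputs) sparse matrix.
y_pred = sparse.hstack(cols, format='csc')
print(y_pred.shape)  # (3, 2)
```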
sklearn/neighbors/classification.py
Outdated
    y_pred_k[inliers] = classes_k.take(mode)

    y_pred[inliers, k] = classes_k.take(mode)
    if outliers:
Make this if outliers and self.outlier_label != 0
sklearn/neighbors/classification.py
Outdated
    pred_labels = np.array([_y[ind, k].toarray()
                            for ind in neigh_ind[inliers]],
                           dtype=object)
    y_pred_k = np.zeros(n_samples)
sklearn/neighbors/classification.py
Outdated
                           dtype=object)
    y_pred_k = np.zeros(n_samples)
    mode = self._mode(pred_labels, weights, inliers)
    y_pred_k[inliers] = classes_k.take(mode)
I think in the multilabel case, classes should be [0, 1]... I might be wrong.
sklearn/neighbors/classification.py
Outdated
    if outliers:
        y_pred[outliers, :] = self.outlier_label
    y_pred_sparse_multilabel.append(csc_matrix(y_pred_k).T)
this doesn't make sense in terms of efficient data structures. I think you mean csr_matrix(y_pred_k).T, or you mean csc_matrix(y_pred_k.reshape(-1, 1))
sklearn/neighbors/classification.py
Outdated
    for k, classes_k in enumerate(classes_):
        mode = self._mode(_y[neigh_ind, k].toarray(), weights)
        y_pred_sparse_multilabel.append(
            csc_matrix(classes_k.take(mode)).T)
I think in the multilabel case, classes should be [0, 1]... I might be wrong.
this doesn't make sense in terms of efficient data structures. I think you mean csr_matrix(mode).T, or you mean csc_matrix(mode.reshape(-1, 1))
Oh yes, I missed that.
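A small illustration of the reviewer's point (my own example): transposing a CSC matrix yields a CSR matrix, so constructing the column directly from a reshaped array keeps the intended CSC layout.

```python
import numpy as np
from scipy import sparse

mode = np.array([0, 1, 1, 0])
via_transpose = sparse.csc_matrix(mode).T             # transpose flips CSC to CSR
via_reshape = sparse.csc_matrix(mode.reshape(-1, 1))  # stays CSC
print(via_transpose.format, via_reshape.format)       # csr csc
```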
        in zip(pred_labels, weights[inliers])],
       dtype=np.int)

    mode = mode.ravel()
The mode returned by the operation above is of shape (len(inliers), 1).
ping @jnothman
I'd like to upvote merging this; I could use more of this support, as scikit-multilearn can feed sparse matrices to scikit-learn.
jnothman
left a comment
Some very minor changes.
I suppose I'll see if someone else wants to wrap this up.
    y_pred[:, k] = classes_k.take(mode)
    for k, classes_k in enumerate(classes_):
        mode = self._mode(_y[neigh_ind, k].toarray(), weights)
        y_pred_sparse_multilabel.append(csr_matrix(mode).T)
I think it makes more sense to construct a CSC directly from the reshaped array.
    self._y = self._y.ravel()

    if issparse(y) and self.outputs_2d_:
        if StrictVersion(scipy.__version__) < StrictVersion('0.13.0'):
We no longer support the broken scipy
                             'scipy < 0.13')
    self.outputs_2d_ = 'sparse'
    self._y = y
    self.classes_ = [np.array([0, 1], dtype=np.int)] * y.shape[1]
Perhaps we should check that the data really doesn't contain other values.
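One way such a check could look (a hypothetical `check_binary` helper, not part of the PR): since implicit entries of a sparse matrix are 0, only the stored values need inspecting before assuming classes_ = [0, 1].

```python
import numpy as np
from scipy import sparse

def check_binary(y_sparse):
    # Implicit entries are 0, so only the explicitly stored values matter.
    if not np.all(np.isin(y_sparse.data, [0, 1])):
        raise ValueError("Sparse multilabel y must be binary (0/1).")

y_ok = sparse.csc_matrix(np.array([[0, 1], [1, 0]]))
check_binary(y_ok)  # passes silently
```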
    if outliers:
        y_pred[outliers, :] = self.outlier_label
    y_pred_sparse_multilabel.append(csr_matrix(y_pred_k).T)
same here: construct CSC from reshaped array
    y_sparse = csc_matrix(y)
    X_train, X_test, y_train, y_test = train_test_split(X, y_sparse,
                                                        random_state=0)
    if StrictVersion(scipy.__version__) < StrictVersion('0.13.0'):
Hi @jnothman, can I finish this?
Thanks, Dor. It could do with those small changes I request above, but then it will also need a second review before merge. The code changes aren't supremely readable, so any improvements to code quality would help with a second review.
Hi, I just noticed there's another PR for that issue: #9059
Indeed, our record-keeping is pretty terrible.
Reference Issue
Fixes #8057
What does this implement/fix? Explain your changes.
Added support for sparse multilabel y for nearest neighbor classifiers. Firstly, it checks whether the input to fit is sparse and multilabel, and converts it to a dense one for storing. It also stores another parameter indicating whether the original input was sparse and multilabel or not. Then, in predict, if this stored value is true, it converts y_pred to sparse CSC. Also added tests for the same.