[WIP]: Sparse label_binarizer and OneVsRestClassifier #2458
rsivapr wants to merge 29 commits into scikit-learn:master
Conversation
In #2441, you said

Can you give the level of sparsity of the input and the output?

Apparently, you have some conflict with the master branch. Can you rebase?

About 0.3% - 0.4% are non-zero values.

@arjoly Done.
sklearn/multiclass.py
When you call delayed on _fit_binary, you save a reference to all input arguments.
In your case, it means that you densify Y.
To solve the issue, we should pass _fit_binary a reference to the sparse Y and the class index of the job.
So the signature of _fit_binary would become _fit_binary(estimator, X, Y, class_index, class_name).
I think that you should treat the sparse and dense cases in _fit_binary.
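A minimal sketch of what that revised signature could look like, assuming a CSC-format Y; the clone call, LogisticRegression, and the toy data are illustrative here, not taken from the PR:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def _fit_binary(estimator, X, Y, class_index, class_name):
    # Only the one label column is densified, and only inside the
    # worker, so delayed() never stores a dense copy of all of Y.
    y = Y[:, class_index].toarray().ravel()
    estimator = clone(estimator)
    estimator.fit(X, y)
    return estimator

# Illustrative data: 20 samples, sparse indicator matrix with 2 labels.
X = np.random.RandomState(0).rand(20, 3)
Y = sp.csc_matrix(np.tile([[1, 0], [0, 1], [1, 1], [0, 0]], (5, 1)))
est = _fit_binary(LogisticRegression(), X, Y, class_index=1, class_name=1)
```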
I am not sure I completely understand this. So do we change the _fit_binary method to fit for each class individually instead of all the classes?
What do you mean by class index of the job?
When the delayed function is used to decorate a function, you save a reference
to the function that you call and to its arguments:
In [1]: from joblib.parallel import delayed
In [2]: import numpy as np
In [3]: delayed(np.mean)(np.arange(5))
Out[3]: (<function numpy.core.fromnumeric.mean>, (array([0, 1, 2, 3, 4]),), {})
In [4]: list(delayed(np.mean)(np.arange(i)) for i in range(5))
Out[4]:
[(<function numpy.core.fromnumeric.mean>, (array([], dtype=int64),), {}),
(<function numpy.core.fromnumeric.mean>, (array([0]),), {}),
(<function numpy.core.fromnumeric.mean>, (array([0, 1]),), {}),
(<function numpy.core.fromnumeric.mean>, (array([0, 1, 2]),), {}),
(<function numpy.core.fromnumeric.mean>, (array([0, 1, 2, 3]),), {})]
Later on, with the Parallel class, you dispatch each function call with the appropriate arguments.
In this case, it means that you "densify" the sparse column and it will blow your memory.
What I suggest is to pass to the delayed function a reference to the input array,
the output array, the label index and the label name. In the _fit_binary,
you can densify the label column without blowing your memory.
Memory consumption can be further reduced using the latest version of joblib
and memmap, though I am not sure it works with sparse matrices (@ogrisel?).
So do we change the _fit_binary method to fit for each class individually instead of all the classes?
I am not sure that I understand what you mean by individually.
In one versus rest strategy, you fit one binary classifier by class
and later on aggregate the output of each one.
Am I missing something?
Am I missing something?
Sorry. I thought we passed the list of classes every time as an argument and later noticed we just pass [not i, i].
What I suggest is to pass to the delayed function a reference to the input array,
the output array, the label index and the label name. In the _fit_binary,
you can densify the label column without blowing your memory.
Awesome. I will modify that and add it to this PR.
Memory consumption can be further reduced using the latest version of joblib
and memmap, though I am not sure it works with sparse matrices (@ogrisel?).
Yes. I'd be very interested to see if that works. Perhaps @ogrisel can point me toward the right direction.
What I suggest is to pass the delayed function references to the input array,
the output array, the label index and the label name. In the _fit_binary,
you can densify the label column without blowing your memory.
That's probably a good idea. Another useful trick may be to rely on the
'pre_dispatch' argument of Parallel, which helps a bit in such a
situation.
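For illustration, a small (hypothetical) use of pre_dispatch, which bounds how many delayed tasks, and hence argument references, are materialised ahead of the workers:

```python
import numpy as np
from joblib import Parallel, delayed

# Only about 2*n_jobs tasks are instantiated at a time, so the
# generator's arguments are not all held in memory at once.
results = Parallel(n_jobs=2, pre_dispatch='2*n_jobs')(
    delayed(np.sum)(np.arange(i)) for i in range(8))
```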
You have some Travis failures.

Thank you for working on this. :-) Can you put WIP (work in progress) at the beginning of the PR title? Later on, we can swap to MRG to indicate that we need some reviews.

Thank you for guiding me! Apart from the failures, am I going in the right direction? I think I need to write tests for the new functionality.

Yes, new features should be tested. We use nose. You can add the new tests in the test file for this module. You can find a lot of information in the contributing documentation.
sklearn/multiclass.py
@arjoly I made the changes as per your recommendations. Does this look better?
I would simply do
y = Y[:, class_index]
@arjoly I made the changes as per your recommendations. Does this look better?
👍 Are you able to make some runs on your problem?
sklearn/multiclass.py
You can use the toarray method to perform this operation. All classifiers should accept a column as y.
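For instance, with an illustrative CSC indicator matrix, extracting and densifying a single label column might look like:

```python
import numpy as np
import scipy.sparse as sp

# Column slicing is cheap on CSC; toarray() then gives a plain
# ndarray holding just that one column.
Y = sp.csc_matrix(np.array([[1, 0, 0],
                            [0, 1, 0],
                            [1, 0, 1]]))
class_index = 0
y = Y[:, class_index].toarray().ravel()
```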
Hi Arnaud, I will do it. I'm just busy with my thesis defense in a couple of weeks. Is there anything else needed for this PR other than writing tests?
I would rather never use todense() and instead use toarray(), so as to never deal with np.matrix instances in the sklearn code base.
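The difference in a nutshell (a quick check, not code from the PR):

```python
import numpy as np
import scipy.sparse as sp

M = sp.csr_matrix(np.eye(2))
# todense() returns np.matrix, where `*` means matrix product and
# slicing always stays 2-D; toarray() returns a plain ndarray.
as_matrix = M.todense()
as_array = M.toarray()
```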
I am checking that the PR is not dying. The release is likely to be at the end of December or mid-January. If you don't have time, maybe @mahendrakariya can help you finish (by making a PR to merge into your branch). If this PR is not ready for 0.15, it's better to merge it in 0.16.

Yes, mainly testing and a bit of narrative documentation. We should also benchmark against the master version.

Do we need to raise an error in metrics to say "not supported yet" for sparse multilabel indicators?
Actually, I've got a solution for sparse metrics. I haven't yet tested it, and doing so would benefit from this sparse support. See https://gist.github.com/jnothman/7798757 (untested!) where I use class polymorphism to refactor the multilabel metric code paths.
Firstly, setxor1d is overkill. All you need to ensure there are at most 2 labels is len(y.data) == 0 or np.ptp(y.data) == 0.
Secondly, this doesn't work for sparse types without data (e.g. DOK), or lil_matrix, which has a data attribute that is an array of lists. Instead you can use:

def is_sparse_and_binary(y):
    if not hasattr(y, 'data') or y.data.dtype.kind == 'O':
        y = y.tocoo()
    return len(y.data) == 0 or np.ptp(y.data) == 0

(You could use isinstance(y, scipy.sparse.data._data_matrix), but that involves a private class.)
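A quick exercise of this check across formats (the helper is repeated so the snippet stands alone; the data values are illustrative):

```python
import numpy as np
import scipy.sparse as sp

def is_sparse_and_binary(y):
    # DOK stores entries in a dict and has no usable .data array, and
    # LIL's .data is an object array of lists, so convert those to COO
    # before inspecting the stored values.
    if not hasattr(y, 'data') or y.data.dtype.kind == 'O':
        y = y.tocoo()
    return len(y.data) == 0 or np.ptp(y.data) == 0

binary = np.array([[0, 1], [1, 0]])
for fmt in (sp.csr_matrix, sp.coo_matrix, sp.dok_matrix, sp.lil_matrix):
    assert is_sparse_and_binary(fmt(binary))
assert not is_sparse_and_binary(sp.csr_matrix(np.array([[0, 2], [1, 0]])))
```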
Nice. The y.data.dtype.kind == 'O' check is non-trivial, so this code should include a comment explaining that it adds support for the DOK and LIL sparse matrix data structures.
I have a tested sparse multilabel metrics implementation at https://github.com/jnothman/scikit-learn/tree/sparse_multi_metrics. I'll make a PR of it once this is accepted.
Please use scipy.sparse.issparse.
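That is, the public predicate, which covers every sparse format without touching private classes:

```python
import numpy as np
import scipy.sparse as sp

# issparse works for CSR, CSC, COO, DOK, LIL, ... alike.
assert sp.issparse(sp.csr_matrix((2, 2)))
assert sp.issparse(sp.dok_matrix((2, 2)))
assert not sp.issparse(np.zeros((2, 2)))
```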
@arjoly, I've just taken a look through more of the PR than I had before. I think it still requires some non-trivial work apart from testing. I think it needs to either be held back until 0.16, or taken on quickly by a developer who is more familiar with the subtleties.

This looks very nice. Have you tried to bench it a bit?

Is this mostly about #2458 (comment), or do you have other major roadblocks in mind?
@ogrisel wrote about my metrics implementation:
I haven't. I expect the sequence-of-sequences case will be slower because I yet again need to get the set of labels, and may need to iterate through the data more times than before. For the label indicator matrices, I think it's exactly the same vector operations, only with the small O(1) class-construction and method-calling overhead.

Yes, it's mostly with regard to that issue, as well as some code duplication and ensuring that this works for lots of matrix types. I guess this also depends on what the exact release time-scale is for 0.15. Note that @rsivapr is unlikely to do much on this before his thesis defence in a couple of weeks, and that there aren't tests for some of the functionality.
Very nice!!!
I think that this would be very nice if @rsivapr is able to finish his (first?) contribution. Anybody wanting to help is welcome.

Hey @arjoly. I should be able to work on this starting mid-January. I would have plenty of time to perhaps work on more issues as well. Until then I'm afraid I have to focus on my thesis. Thanks for the reviews @jnothman, @ogrisel, and @arjoly. I will attempt to address all of the above suggestions when I get back.

How's this going, Rohit?

Hi @rsivapr, I think it's about time to deprecate the sequence-of-sequences format. A lot of work has already been done in that direction and is waiting in different pull requests (thanks to @jnothman). Since this pull request is blocking the way, I took some time to work on the sparse label binarizer (see this branch https://github.com/arjoly/scikit-learn/tree/sparse-label_binarizer). It's almost finished. Is it ok for you if I finish the sparse label binarizer?

@arjoly I can close this PR if you'd like.

Ok, then I am closing. Good luck with your job!

Thanks for your effort here @rsivapr; even if the code doesn't go in, your first attempt at an inevitable feature has highlighted the challenges in what needs to be done. Good luck with the job.
#2441
- sklearn.utils.multiclass.type_of_target to recognise a sparse binary matrix as the 'multilabel-indicator' format.
- sklearn.utils.multiclass.unique_labels to work with this new format.
- label_binarize function to return a sparse label indicator.
- OVR to work with sparse output matrices, for instance by extracting a dense column of the sparse matrix and fitting the classifier on it. Prediction would be made by creating the sparse matrix incrementally out of each fitted classifier.
- _fit_binary method to take in dense and sparse matrices.
- import scipy.sparse as sp instead of from scipy.sparse import coo_matrix
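For reference, these goals eventually landed in scikit-learn; assuming a reasonably recent version, the resulting behaviour can be sanity-checked like this (not code from this PR):

```python
import scipy.sparse as sp
from sklearn.preprocessing import label_binarize
from sklearn.utils.multiclass import type_of_target

# label_binarize can emit a sparse indicator matrix directly...
Y = label_binarize([0, 1, 2, 1], classes=[0, 1, 2], sparse_output=True)
# ...and type_of_target recognises it as 'multilabel-indicator'.
target_type = type_of_target(Y)
```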