[MRG+1] Sparse One vs. Rest#3276
[MRG+1] Sparse One vs. Rest#3276hamsal wants to merge 54 commits intoscikit-learn:masterfrom hamsal:sprs-ovr
Conversation
|
Ping myself @arjoly |
|
This is in WIP mode proper csc_matrix construction is necessary in |
|
I think this is getting to the point where it would benefit from a review so I will mark it for merge. Please note that the changes specific to this pull request are located in sklearn/multiclass.py and sklearn/tests/testmulticlass.py. |
|
For others' help, this comparison may be useful https://github.com/hamsal/scikit-learn/compare/scikit-learn:sparse_labelbinarizer...sprs-ovr (or you could open a new PR that bases off scikit-learn's |
There was a problem hiding this comment.
please insert another blank line
There was a problem hiding this comment.
This comment has been addressed
sklearn/multiclass.py
Outdated
There was a problem hiding this comment.
Am I right to expect that argmaxima to be an integer and not a label from classes?
There was a problem hiding this comment.
I have revised this line to sample its labels from the label_binarizer.classes_ list and I wrote a test that uses string labels for the multiclass case.
|
Can you also add minimal support for decision_function, predict_proba and predict_log_proba? While this might lead to big dense matrix, it's still interesting for a user to see the probability/decision_score/log_proba associated to a few samples. |
|
I am marking this a WIP while I work on the above revisions |
|
@arjoly Is it not the case that sparse support for these functions only take up more space for the matrix representation? If a user wants to see the from these functions after training with sparse target data they would not need to make any changes to how they use these functions. Is there any benefit to converting the output to a sparse matrix? |
sklearn/multiclass.py
Outdated
There was a problem hiding this comment.
This does not follow naming conventions... perhaps you meant _get_col
Apparently, I wasn't clear. You might want to fit with sparse output Here, it means to add some appropriate tests. |
sklearn/multiclass.py
Outdated
There was a problem hiding this comment.
This will densify the entire matrix in memory, which is the opposite of what's wanted.
Use (_get_col(Y, i) for i in range(Y.shape[1])) (no surrounding [])
There was a problem hiding this comment.
Thank you! I have updated this to use the generator expression.
This is an option of the Parallel function of joblib. |
|
@arjoly I'm familiar with |
|
Oups, have I missunderstood something? In my current understanding, this parameter unroll the generator |
|
I believe I have addressed all of the above comments. The results of Vlads benchmark are a little confusing but I think along with my benchmark it validates that |
sklearn/multiclass.py
Outdated
There was a problem hiding this comment.
Please add a comment (or a sentence in the docs) stating that n_jobs > 1 can be slower than n_jobs == 1 when the individual binary classifiers are very fast to fit (as in the case in @arjoly's benchmark.)
You can add a comment in the source referencing this joblib issue.
Raise a ValueError in the label binarizer when dealing with multiputput target data, test a ValueError is raised in ovr for multioutput target data. Test a binary classification task with ovr
There was a problem hiding this comment.
y : {array-like, sparse matrix}, shape = [n_samples] or [n_samples, n_classes]
Could it be one line?
There was a problem hiding this comment.
This extends over the line limit
There was a problem hiding this comment.
I am working on building the doc, I am getting errors importing ImportError: No module named sklearn.externals.six after running make html. I am building scikit learn python setup.py build_ext --inplace then adding the directory to my PYTHONPATH. I am still trying to figure out the problem.
There was a problem hiding this comment.
This rule is in the scikit-learn folder.
There was a problem hiding this comment.
Ok I have gotten it to start building. I apprently have not run python setup.py install in the scikit-learn folder before.
There was a problem hiding this comment.
It looks bad in the documentation [n_samples, n_classes] is in the body. Maybe it is better to move the entire statement shape = [n_samples] or [n_samples, n_classes] into the body?
There was a problem hiding this comment.
shape = [n_samples] or [n_samples, n_classes] into the body?
Usually it is put in the header.
There was a problem hiding this comment.
Althoug it is not entirely precise another solution could be to shorten shape = [n_samples] or [n_samples, n_classes] to shape = [n_samples, n_classes]
sklearn/multiclass.py
Outdated
|
OK, I am 👍 for merge. Bravo to @hamsal and all the reviewers!!! This is good job. If we don't get a -1, next core dev that reviews this PR should merge it. |
|
Merge by rebase !! Congratulation :-) |
|
Thank you @hamsal! |
Yes, that's an important contribution! |
|
Thank you for the reviews!!! |
|
@hamsal, I am writing a what's new entry for sparse multilabel support. I realise that no attribution has been made in the code, or in the commit log to @rsivapr. Did you build upon his code? If so, it should at least be acknowledged in the changelog, and in the |
|
Hello all, thanks for all your work on this! What was the final change? The documentation still seems to suggest a dense matrix as a legitimate way to read in y_train. But the memory error persits. https://scikit-learn.org/1.5/modules/multiclass.html I didn't quite follow everything above, should y_train be read into a CSR matrix instead? |
This is a PR for the other parts in PR #2458 concerning sparse output support for the ovr classier. This branch was made off of the sprs-lbl-bin branch from PR #3203 and will be rebased to remove all the sparse label binarizer additions from the diff once the pull request is merged.
Benchmark Experiments using PR #2828 for data generation:
fit_ovr:sparse_output=TrueVS.sparse_output=sp.issparse(y)fit_ovrWith input as CSR matrix: Cast to CSC before column extraction VS. Leave as CSRnjobs != 1