[MRG+1] Patch liblinear for sample_weights in LogisticRegression (and CV) #5274
Conversation
force-pushed from 0aff304 to a700d3c
force-pushed from a700d3c to 2ef06ac
|
ping @vstolbunov. Also @fabianp, could you have a look, since you know this part of the code very well? |
force-pushed from 2ef06ac to 268b1bc
|
tests pass now. |
|
I took a look last night and they had passed, so I wasn't sure what the problem was. |
|
I think I forgot to build it locally and hence I got some segmentation faults. But now it's all right |
|
@TomDLT you might want to review this. |
|
You should also compare the results between liblinear and the other solvers, as in this test. And testing it reveals that liblinear does not handle |
|
It looks pretty good to me. (yet I am not a C++ master) |
force-pushed from 6e6ec68 to 867596b
|
@TomDLT Can you update the PR to MRG+1 if you are happy? (Btw, I don't know what your definition of a C++ master is, but whatever it is I'm not one either :P) |
force-pushed from 867596b to 5134286
|
If you are looking for a C++ master, @larsmans is a good candidate :) |
sklearn/svm/base.py
PEP8: put the check_consistent_length import on its own line.
force-pushed from 75758f6 to 84668a7
|
You should also be supporting in |
|
I thought that was for |
I'd thought it was a general patch to liblinear, but maybe...
force-pushed from b11922c to c01faae
|
@jnothman So we'll merge this and add support for other solvers later? |
force-pushed from c01faae to 9d8d7e4
|
Rebased. Would be great if someone can give a final +1 |
|
played with this a bit and it worked great. Merging. |
|
The .pyx file doesn't compile with Cython 0.21, 0.22 or 0.23. You used 0.20; I'll try that next. I'm pretty scared of the casting that is going on there. This was found by @arthurmensch in #5557 |
|
Installed 0.20, still doesn't compile. |
|
I think I just forgot to push the generated C files. Just a second. |
|
How does the objective function change in the case of sample_weight for logistic regression? Can you please provide the mathematical expression? I assume the objective function changes like this: E(\mathbf{w}) = - \sum_{n=1}^{N} \{ s_n t_n \ln y_n + (1 - s_n t_n) \ln(1 - y_n) \}, where s_n is the sample_weight of the nth sample. The equation above is modified from equation 4.90 of Christopher Bishop's PRML book. Clarification: the equation is written in LaTeX; I could not post it as an image. |
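For comparison, under the "samples repeated in proportion to their weight" convention, the weight s_n scales each sample's entire loss term rather than being folded into the target terms, i.e. in the same notation (a sketch of that convention, not a quote from the implementation):

```latex
% Weighted binary cross-entropy: s_n scales the whole per-sample term
E(\mathbf{w}) \;=\; -\sum_{n=1}^{N} s_n \bigl[\, t_n \ln y_n + (1 - t_n) \ln (1 - y_n) \bigr]
```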
|
@niteshroyal this is not the right place to ask usage questions, see http://scikit-learn.org/dev/faq.html#what-s-the-best-way-to-get-help-on-scikit-learn-usage |
|
ordinarily, weighting means solving an objective that is equivalent to
having the samples repeated in proportion to their weight
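That equivalence is easy to check directly: with integer weights, a weighted log loss matches the unweighted loss over a dataset in which each sample is repeated weight-many times. A minimal sketch in plain Python (the weighted_log_loss helper below is hypothetical, written here for illustration, not scikit-learn's log_loss):

```python
import math

def weighted_log_loss(y_true, y_prob, sample_weight=None):
    # Hypothetical helper: binary cross-entropy, each sample's term
    # scaled by its weight (weights default to 1).
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    return -sum(w * (t * math.log(p) + (1 - t) * math.log(1 - p))
                for w, t, p in zip(sample_weight, y_true, y_prob))

y_true = [1, 0, 1]
y_prob = [0.9, 0.2, 0.6]

# Weighting the first sample by 3 ...
weighted = weighted_log_loss(y_true, y_prob, sample_weight=[3, 1, 1])
# ... gives the same loss as repeating that sample three times, all weights 1.
repeated = weighted_log_loss([1, 1, 1, 0, 1], [0.9, 0.9, 0.9, 0.2, 0.6])
assert math.isclose(weighted, repeated)
```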
|
@jnothman I have seen that when the class weights are too imbalanced, adding more degrees of freedom to the model won't always result in a lower (accordingly weighted) log_loss. So I suspect that, barring a numerical issue I'm unaware of, the objective function is not exactly the same as the one log_loss is evaluating (equivalent to "samples repeated in proportion to their weight"). Of course, this isn't a generalization problem; the loss was computed over the training set. The parameter |
|
Ok, I think it was a numerical issue indeed, playing with the |
This adds sample_weight to the liblinear solver for
LogisticRegression and LogisticRegressionCV. It had already been added to the other solvers in another PR
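As a quick usage sketch of the feature this PR adds (the toy data here is invented for illustration, and the behaviour assumes a scikit-learn build that includes this patch):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 1-D data: the first two samples are class 0, the last two class 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w = np.array([1.0, 1.0, 5.0, 1.0])  # up-weight the third sample

# With this patch, the liblinear solver honours sample_weight in fit().
clf = LogisticRegression(solver='liblinear')
clf.fit(X, y, sample_weight=w)

print(clf.predict([[3.0]]))  # expected to be class 1 on this separable data
```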