[WIP] Sample weight consistency #5515
ainafp wants to merge 17 commits into scikit-learn:master from ainafp:sample_weight_consistency
Conversation
…ck whether test should apply to all of the estimators
sklearn/tests/test_common.py
Outdated
I meant that instead of a try/except block, it would be safer to have a list of estimators that have sample_weight in the fit signature but might not support it (which in this case is just LogisticRegression).
What do you think?
|
See if you can make SGDClassifier and Perceptron work, maybe? It might be a convergence issue, or we may be asking for too high a precision on the coef. |
…for SGD estimators to change number of iterations or precision.
|
I couldn't make SGDClassifier and Perceptron work, so I exclude them also. Tests are now passing. Please take a look at the exclusion lists and see if you are OK with that. |
|
Problems with Travis: different errors on different machines, I guess. |
|
Do I exclude them too? test_get_params_invariance has a typo |
sklearn/tests/test_common.py
Outdated
Can you make this casting more explicit, maybe with astype(np.int)?
|
Is there any way to cover the case of float (positive) weights? |
|
I guess one could work with rationals and then use the common denominator.
|
Do you think a test for aug would make sense here? Something like
assert_equal(X_aug_train.shape[0], np.sum(sample_weight[train]))
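As a hedged sketch of the augmentation idea (Ridge and the variable names here are illustrative, not the PR's actual test code): repeating each sample `weight` times should give the same fit as passing integer sample weights, and the suggested shape check holds by construction.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(20, 3)
y = rng.randn(20)
sample_weight = rng.randint(0, 4, size=20)  # integer weights

# Augment the data: repeat each sample `weight` times.
X_aug = np.repeat(X, sample_weight, axis=0)
y_aug = np.repeat(y, sample_weight)

# The suggested check: the augmented set has sum(weights) rows.
assert X_aug.shape[0] == np.sum(sample_weight)

# A weighted fit and a fit on the augmented data should agree.
est_w = Ridge(alpha=1.0).fit(X, y, sample_weight=sample_weight)
est_a = Ridge(alpha=1.0).fit(X_aug, y_aug)
np.testing.assert_allclose(est_w.coef_, est_a.coef_, rtol=1e-6)
```

For Ridge the equivalence is exact, since the weights simply multiply the squared residuals in the objective while the penalty is unchanged.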
|
@eickenberg Right, but that would fall back on an implementation with integer numbers again. I agree that the tests with integers will uncover the biggest issues. However, it's true that most if not all of the algorithms can take positive real numbers as weights. |
I was not suggesting doing that. I think int makes more sense in this case.
My argument is that your tests will not cover the situation in which learning algorithms are weighted with float sample weights, as there is no corresponding interpretation of "weight = number of sample copies". I simply do not think there is a way to cover this situation, though.
Well, you could have a dataset with some duplicate samples, use a sample weight of 0.5 on it, and compare against the data without duplicates? Or use 1.5 and compare it with data that has triples?
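That duplicate trick can be sketched as follows (Ridge is just an illustrative choice; for linear models, weighting every duplicated copy by 0.5 is exactly equivalent to the unweighted fit on the original data):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(42)
X = rng.randn(30, 4)
y = rng.randn(30)

# Duplicate every sample and give each copy weight 0.5 ...
X_dup = np.vstack([X, X])
y_dup = np.concatenate([y, y])
w_half = np.full(X_dup.shape[0], 0.5)

est_plain = Ridge(alpha=1.0).fit(X, y)
est_half = Ridge(alpha=1.0).fit(X_dup, y_dup, sample_weight=w_half)

# ... which should be equivalent to the unweighted fit on the originals.
np.testing.assert_allclose(est_plain.coef_, est_half.coef_, rtol=1e-6)
```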
|
OK, I didn't understand it then. Would it make sense to multiply the weights by a power of ten (to clear the decimals) and then do the same (augment the data)? Is this what you mean? It doesn't change much, though. |
|
I think solving weighting with rescaling (the multiplication by the weights) is only appropriate with linear models. |
|
What do you suggest then? |
|
Your idea is the way to go with this test, I would not do anything else :) |
|
@ogrisel do you know why travis doesn't pass with python 3 but it does with python 2? |
|
We were discussing in #5526 (comment) the addition of another test.
The code should look like this one, looping over the appropriate regressors and testing both dense and sparse inputs. |
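A rough sketch of such a loop, under the assumption that the regressor supports sparse input and sample_weight (the estimator list is a placeholder, not the one from #5526):

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(40, 5)
y = rng.randn(40)
sample_weight = rng.rand(40) + 0.5  # positive float weights

for Estimator in [Ridge]:  # placeholder list of regressors
    coefs = []
    for to_input in (np.asarray, sparse.csr_matrix):
        est = Estimator(alpha=1.0, fit_intercept=False, solver="sparse_cg")
        est.fit(to_input(X), y, sample_weight=sample_weight)
        coefs.append(est.coef_)
    # Dense and sparse input should give the same weighted solution.
    np.testing.assert_allclose(coefs[0], coefs[1], rtol=1e-4)
```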
|
How does sample_weight work in MultinomialNB? Is there any documentation or an equation? Can someone help me, please? I appreciate it! |
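Not an authoritative answer, but as far as I can tell from the implementation, MultinomialNB accumulates weighted counts per class (feature_count_ becomes the sum of w_i * x_i over samples of each class), so an integer weight behaves exactly like repeating the sample. A quick sanity check of that reading:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[2, 1], [0, 3], [1, 1], [4, 0]])
y = np.array([0, 1, 0, 1])
w = np.array([1, 2, 1, 3])

# Weighted fit: counts are accumulated as sum_i w_i * x_i per class.
clf_w = MultinomialNB().fit(X, y, sample_weight=w)

# Equivalent fit on data with each sample repeated w_i times.
clf_r = MultinomialNB().fit(np.repeat(X, w, axis=0), np.repeat(y, w))

np.testing.assert_allclose(clf_w.feature_log_prob_, clf_r.feature_log_prob_)
np.testing.assert_allclose(clf_w.class_log_prior_, clf_r.class_log_prior_)
```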
|
I think the issue is the same I saw in #7618. There should be an AttributeError, not a ValueError. That's a bug in the SVC. |
|
Feel free to fix this here or do a separate PR. |
|
fixes #5367. |
|
Sorry I'm a bit out of the loop with this one. What's the status? Can you rebase? |
|
Hmm... I'd forgotten this was here. Should this be closed given #11558, @sergulaydore? I've not yet checked if it's completely redundant. |
Supersedes #5461
Added a 0/1 sample weight test. Failing estimators are
For the previous test,
The differences between the two are:
How do I proceed? Do I create an exclusion list or an inclusion list (e.g. linear models) for the previous test?
ping @eickenberg, @amueller, @glouppe, @arjoly, @agramfort, @GaelVaroquaux
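For reference, a minimal sketch of what a 0/1 sample weight test checks (the estimator and names are illustrative, not the exact test in this PR): fitting with weights in {0, 1} should match fitting on the subset of samples whose weight is 1.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(30, 3)
y = rng.randn(30)
sample_weight = rng.randint(0, 2, size=30)  # weights in {0, 1}
mask = sample_weight == 1

est_w = Ridge(alpha=1.0).fit(X, y, sample_weight=sample_weight)
est_m = Ridge(alpha=1.0).fit(X[mask], y[mask])

# Samples with weight 0 must have no influence on the fit.
np.testing.assert_allclose(est_w.coef_, est_m.coef_, rtol=1e-6)
```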