Use sample_weight in the metric calculations in model validations and cross validation #2
Conversation
ryan-deak-zefr left a comment
I've flagged a few places that I'm concerned about.
# test_sample_weight_sums.
if self.return_train_score:
    if train_sample_weight_sums is None:
        samples = int(np.sum(test_sample_counts[:n_splits]))
These two lines are the hairiest of all. I'm not 100% certain this is the right way to do it, but unfortunately I don't see a better way, because the total dataset size isn't maintained anywhere inside the function. Any thoughts?
Can we not store the total data set size in some way and pass it in?
I'm not sure that's possible. We can't necessarily just take the dataset's size, because the cross validation iterator (the cv parameter in GridSearchCV) isn't required to iterate over the entire dataset. So the count should be based on the sizes defined by cv.
Should I put a comment to this effect in the code?
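For illustration, here is a hedged sketch of that point (mine, not code from this PR): the cv argument to GridSearchCV accepts any iterable of (train, test) index arrays, and nothing forces those indices to cover every row of X, so a "total" size has to be derived from the splits themselves rather than from len(X).

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X = np.random.RandomState(0).normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)

# Only 60 of the 100 rows ever appear in these hand-built splits.
cv = [(np.arange(0, 40), np.arange(40, 60)),
      (np.arange(20, 60), np.arange(0, 20))]

GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0]}, cv=cv).fit(X, y)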
else:
    out, train_wts, test_wts = ([], [], [])

train_wts = collapse_nones(train_wts)
Turns a list of Nones into a single None.
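A tiny illustration of that behavior, inferred from the comment and the definition quoted later in this thread (not taken verbatim from the PR):

def collapse_nones(xs):
    # Collapse to None if the input is None or contains any None.
    return None if xs is None or any(x is None for x in xs) else xs

assert collapse_nones([None, None, None]) is None
assert collapse_nones([0.5, 1.0, 2.0]) == [0.5, 1.0, 2.0]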
sklearn/model_selection/_search.py (Outdated)
if train_sample_weight_sums is None:
    samples = int(np.sum(test_sample_counts[:n_splits]))
    train_sample_counts = \
        np.array(test_sample_counts[:n_splits], dtype=np.int) - \
Could you explain why this difference is necessary?
@zachary-mcpherson-zefr: Good catch! The arguments are backward. Will fix. In the meantime, here's the reasoning (assuming the code was correct):
Say that you have the following three folds for cross validation:
- fold 1: 50 examples
- fold 2: 51 examples
- fold 3: 49 examples
Cross validation typically works by iterating over the test folds. Say we are on the second loop iteration, so fold 2 is the test fold; folds 1 and 3 are then the training folds, which means there should be 99 training examples for test fold 2.
samples should represent the total number of rows in the entire dataset. To get the number of training rows associated with a test fold, we subtract that fold's number of test rows from the total. Back to our example: if the test fold is fold 2, then samples = 150, fold 2 has 51 rows, and 150 - 51 == 99. Make sense?
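As a quick sanity check, here is a sketch of that arithmetic with NumPy; the variable names mirror the PR's code, but this is illustrative rather than the actual implementation (and with the subtraction in the corrected order):

import numpy as np

test_sample_counts = [50, 51, 49]   # test-fold sizes, as reported by cv
n_splits = 3

samples = int(np.sum(test_sample_counts[:n_splits]))           # 150 rows total
train_sample_counts = samples - np.array(test_sample_counts)   # [100, 99, 101]
# For test fold 2 (51 rows), the training size is 150 - 51 == 99, as above.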
    return x is None


def collapse_nones(xs):
    return None if xs is None or any(map(is_none, xs)) else xs
Perhaps the function can be written with a generator expression instead of a map?
def collapse_nones_g(xs):
    if xs is None or any(x is None for x in xs):
        return None
    return xs
It's already lazy; map returns an iterator in Python 3, so any() short-circuits:
In [32]: %time sum(range(200000000))
CPU times: user 3.86 s, sys: 9.2 ms, total: 3.86 s
Wall time: 3.86 s
In [34]: %timeit any(map(lambda x: x > 2, range(200000000)))
940 ns ± 6.97 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
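For what it's worth, here is a small check (mine, not from the thread) showing the same laziness without timing: the predicate is never applied past the first element that returns True.

calls = []

def pred(x):
    calls.append(x)
    return x > 2

assert any(map(pred, range(200000000)))
assert calls == [0, 1, 2, 3]   # any() stopped right after the first True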
    (precision_score, True, False, [2000000, 1000000, 1, 999999]),
    (log_loss, False, True, [2500000, 500000, 200000, 100000]),
    (brier_score_loss, False, True, [2500000, 500000, 200000, 100000]),
])
I don't think it's necessary. We could, but this is already much better than what currently exists in sklearn. I also have https://gist.github.com/deaktator/94545f807f139eba2c8a15381f2495e0, for which I'd like to submit a future PR to sklearn. It's just test code, not a change to the implementation.
DO NOT MERGE THIS PR WITHOUT ASKING!
Purpose
To correct metric calculations in model validation (and cross validation) by attempting to use sample_weight in the metric calculations when sample_weight is supplied in fit_params. Adds sample weighting for model validation methods like cross validation inside sklearn/model_selection/_validation.py and sklearn/model_selection/_search.py.
Reference Issues/PRs
Fixes scikit-learn#4632. See also scikit-learn#10806.
What does this implement/fix? Explain your changes.
This change always uses the same sample_weight used in training to weight the metric calculations in _fit_and_score and the functions called by _fit_and_score, so it should be thought of as changing the default behavior of things like cross validation. This is similar to PR scikit-learn#10806 but doesn't allow separate test weights to be specified. It changes _validation.py and _search.py. None of the CV search classes (GridSearchCV, etc.) are changed; this operates at the function level.
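As a rough usage sketch (assuming the standard scikit-learn API and not showing any of this PR's internals), the weights flow in through fit_params exactly as they do today for fitting; with this change they would also reach the scoring:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
w = np.where(y == 1, 10.0, 1.0)   # importance weights used during fit

scores = cross_validate(LogisticRegression(), X, y, cv=3,
                        fit_params={"sample_weight": w})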
Any other comments?
This PR provides some simple tests showing how unweighted metrics have very unfavorable consequences when models are trained with importance weights but not scored with the same weights. The tests show that the current omission of sample_weight in metric calculations considerably overestimates the accuracy.
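To make the overestimation concrete, here is a minimal example (mine, not the PR's test code) of how much an unweighted metric can flatter a model when the weights are ignored at scoring time:

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 0, 0, 1])
y_pred = np.array([0, 0, 0, 0])       # the model ignores the rare class
w = np.array([1.0, 1.0, 1.0, 97.0])   # but that class carries most of the weight

print(accuracy_score(y_true, y_pred))                   # 0.75 unweighted
print(accuracy_score(y_true, y_pred, sample_weight=w))  # 0.03 weighted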