TST Replace boston in histgradboost test_predictor#16918
thomasjpfan merged 5 commits into scikit-learn:master
Conversation
assert r2_score(y_train, predictor.predict(X_train)) > 0.69
assert r2_score(y_test, predictor.predict(X_test)) > 0.30
Aren't these rather low? Same with the other PR you have. Maybe using another dataset would be more reasonable?
Yes, that is the problem with the diabetes dataset. I have changed to california housing and it seems to work reasonably well with the original bins.
- train: (bins=200; 0.8233) (bins=256; 0.8340)
- test: (bins=200; 0.8112) (bins=256; 0.8094)
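For readers comparing the thresholds above, here is a minimal pure-Python sketch of the R2 (coefficient of determination) score that `r2_score` reports. The helper name `r2` and the sample values are illustrative assumptions, not code from this PR.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only (hypothetical, not taken from any dataset in this PR).
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
score = r2(y_true, y_pred)
print(f"R2 = {score:.4f}")
```

A score of 1.0 means perfect predictions and 0.0 means no better than predicting the mean, which is why a test threshold of 0.30 reads as weak while about 0.81 reads as reasonable.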
The downside of using fetch_california_housing is that it requires network access, which means we would need to mark these tests with @pytest.mark.network.
With some parameter tuning on the diabetes dataset I can get these results:
n_bins=50
train: 0.4253613178731953
test: 0.38498296812822475
n_bins=100
train: 0.4298426536827863
test: 0.3991035630532065
Parameters:
min_samples_leaf = 50
max_leaf_nodes = None
@thomasjpfan and @adrinjalali which dataset do you guys suggest to use?
would a make_regression with some tuned parameters not be a good option?
Good point, I'll try this tomorrow.
+1 for using |
Thanks @ogrisel and @adrinjalali. I've amended to |
adrinjalali left a comment
Thanks @lucyleeow. This LGTM.
thomasjpfan left a comment
LGTM, thank you @lucyleeow!
Reference Issues/PRs
Towards #16155
What does this implement/fix? Explain your changes.
Replace boston dataset with ~~diabetes~~ California housing dataset in sklearn/ensemble/_hist_gradient_boosting/tests/test_predictor.py
Any other comments?
Unsure of the best n_bins values/what this is testing. I noticed that the boston features are more spread out and generally have a longer right tail cf. the diabetes dataset. Also the R2 values with n_bins of 200 and 256 with diabetes were the same.
The R2 values with California housing are: