TST Replace Boston dataset in test_tree #17290
thomasjpfan merged 7 commits into scikit-learn:master
If the tree is grown without fixing the depth, each leaf in the tree will be a sample. So you clearly overfit, but you get a perfect classification if you train and test on the same data. Basically, this is a behaviour that we want to check (I don't know if it was intended here).
It would be the same behaviour for regression.
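The behaviour described above can be reproduced in a few lines. This is a minimal sketch, not the test from the PR: it assumes the default settings of `DecisionTreeRegressor` and an arbitrary `random_state=0`, and it scores the tree on the same data it was fitted on.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# No max_depth is set, so the tree keeps splitting until every leaf is pure.
reg = DecisionTreeRegressor(random_state=0).fit(X, y)

n_samples = X.shape[0]
n_leaves = reg.get_n_leaves()

# Error measured on the training data itself.
train_mse = float(np.mean((reg.predict(X) - y) ** 2))

print(f"{n_leaves} leaves for {n_samples} training samples")
print(f"training MSE: {train_mse}")
```

Because the tree memorises the training set, a score computed this way says little about how well the estimator learns.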
glemaitre left a comment:
Maybe we can check the score, but the changes LGTM.
Thanks, that makes sense. I think the test was copied from (or at least is the same as) this one in scikit-learn/sklearn/ensemble/tests/test_forest.py, lines 159 to 175 in 2f26540. I'm not sure what this test is checking for, but if it is checking that using fewer features reduces the learning ability (as suggested by the comment and the worse score value in the 2nd assert), then it isn't doing a good job. We should restrict depth.
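To illustrate why restricting depth matters for a comparison across `max_features`, here is a rough sketch. The helper `train_mse`, the depth of 3, and `random_state=0` are assumptions chosen for illustration, not values taken from the PR.

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)


def train_mse(max_depth, max_features):
    """Fit a tree and return its MSE on the training data (hypothetical helper)."""
    reg = DecisionTreeRegressor(
        max_depth=max_depth, max_features=max_features, random_state=0
    )
    reg.fit(X, y)
    return mean_squared_error(y, reg.predict(X))


results = {}
for max_depth in (None, 3):
    for max_features in (None, 1):
        results[(max_depth, max_features)] = train_mse(max_depth, max_features)
        print(
            f"max_depth={max_depth}, max_features={max_features}: "
            f"training MSE = {results[(max_depth, max_features)]:.2f}"
        )
```

With a depth limit the tree can no longer memorise every sample, so the training score is non-zero and a difference between feature settings has a chance to show up.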
ping @glemaitre
It's kind of funny that we only have a single build failing here. The splits are probably different on 32-bit.
But we set the random state? For me (64-bit), overfitting occurs quickly. For max_depth=20, 4/6 scores are 0.
The score is larger than 60 with 32-bit. The large difference between 32 and 64 is odd.
Basically, I recall that the trees built on a 32-bit architecture are different from the ones built on 64-bit. So either we increase the threshold or we skip the test on 32-bit architectures with the decorator
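The second option, skipping on 32-bit, could look roughly like the sketch below. The decorator here is built from plain `pytest` and the standard library; it is an assumption for illustration and not necessarily the decorator the comment refers to. The test name, depth, and assertion are placeholders as well.

```python
import struct

import pytest
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Pointer size in bits: 32 on a 32-bit Python build, 64 otherwise.
IS_32BIT = 8 * struct.calcsize("P") == 32


@pytest.mark.skipif(IS_32BIT, reason="trees are built differently on 32-bit")
def test_depth_limited_tree_score():
    # Hypothetical test body: a depth-limited tree should not fit the data exactly.
    X, y = load_diabetes(return_X_y=True)
    reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
    # Placeholder assertion; a real test would compare against a tuned threshold.
    assert mean_squared_error(y, reg.predict(X)) > 0
```

Skipping keeps the threshold strict on 64-bit builds, at the cost of leaving the 32-bit behaviour untested; raising the threshold makes the opposite trade.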
Thank you @lucyleeow!
Reference Issues/PRs
Towards #16155
What does this implement/fix? Explain your changes.
Replace Boston dataset with diabetes dataset in sklearn/tree/tests/test_tree.py
Any other comments?
I noticed that in test_boston (now test_diabetes) the score was always 0 for all estimators and criteria, for both the Boston and diabetes datasets. I confirmed that reg.predict(diabetes.data) gives exactly diabetes.target, possibly because reg is fitted on the same dataset. I don't think this is what was intended? Happy to amend, here or in another PR, if this is not right.
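One way to check whether fitting and scoring on the same data explains the zero score is to hold some samples out. The sketch below assumes a default `train_test_split` and `random_state=0`, neither of which comes from the PR.

```python
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown tree, fitted on the training portion only.
reg = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_mse = mean_squared_error(y_train, reg.predict(X_train))
test_mse = mean_squared_error(y_test, reg.predict(X_test))

print(f"training MSE: {train_mse}")
print(f"held-out MSE: {test_mse}")
```

A training error of zero next to a clearly non-zero held-out error would be consistent with the explanation given above.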