
TST Replace Boston dataset in test_tree #17290

Merged
thomasjpfan merged 7 commits into scikit-learn:master from lucyleeow:test_tree
Jun 1, 2020

Conversation

@lucyleeow
Member

@lucyleeow lucyleeow commented May 20, 2020

Reference Issues/PRs

Towards #16155

What does this implement/fix? Explain your changes.

Replace the Boston dataset with the diabetes dataset in sklearn/tree/tests/test_tree.py

Any other comments?

I noticed that in test_boston (now test_diabetes) the score was always 0 for all estimators and criteria, for both the boston and diabetes datasets. I confirmed that reg.predict(diabetes.data) returns exactly diabetes.target, possibly because reg is fitted and scored on the same dataset. I don't think this is what was intended? Happy to amend, here or in another PR, if this is not right.

@glemaitre
Member

If the tree is grown without fixing the depth, each leaf in the tree will be a sample. So you clearly overfit but you will do a perfect classification if you train and test on the same data. Basically, this is a behaviour that we want to check (I don't know if it was intended here).
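A quick sketch of this behaviour (illustrative, not code from this PR): a depth-unrestricted tree memorises its training set, so scoring it on the same data it was fitted on gives a perfect R^2, which matches the reg.predict(diabetes.data) == diabetes.target observation above.

```python
# Sketch, not the PR's test code: an unrestricted tree overfits to a
# perfect fit when trained and scored on the same data.
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# No max_depth: the tree keeps splitting until every leaf holds a
# single sample (assuming no duplicate rows with conflicting targets).
reg = DecisionTreeRegressor(random_state=0).fit(X, y)
print(reg.score(X, y))  # R^2 of 1.0 on the training data itself
```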

@glemaitre
Member

it would be the same behaviour for regression.

Member

@glemaitre glemaitre left a comment


Maybe we can check the score, but the changes LGTM.

@lucyleeow
Member Author

lucyleeow commented May 21, 2020

> If the tree is grown without fixing the depth, each leaf in the tree will be a sample.

Thanks, that makes sense.

I think the test was copied from (or at least is the same as) this one in test_forest:

```python
def check_regression_criterion(name, criterion):
    # Check consistency on regression dataset.
    ForestRegressor = FOREST_REGRESSORS[name]

    reg = ForestRegressor(n_estimators=5, criterion=criterion,
                          random_state=1)
    reg.fit(X_reg, y_reg)
    score = reg.score(X_reg, y_reg)
    assert score > 0.93, ("Failed with max_features=None, criterion %s "
                          "and score = %f" % (criterion, score))

    reg = ForestRegressor(n_estimators=5, criterion=criterion,
                          max_features=6, random_state=1)
    reg.fit(X_reg, y_reg)
    score = reg.score(X_reg, y_reg)
    assert score > 0.92, ("Failed with max_features=6, criterion %s "
                          "and score = %f" % (criterion, score))
```

In test_forest the regressors are not completely overfit as each tree uses only a subset of the data and the scores are not perfect.

I'm not sure what this test is checking for, but if it is checking that using fewer features reduces the learning ability (as suggested by the comment and the lower score threshold in the 2nd assert), then it isn't doing a good job. We should restrict depth.
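The idea of restricting depth can be sketched as follows (an illustration under my own assumptions, not the merged test): capping max_depth keeps a single tree from memorising the diabetes data, so the training score drops below 1.0 and a threshold assertion becomes meaningful.

```python
# Sketch (not the merged test): a depth cap prevents full memorisation,
# so a score threshold on the training data actually tests something.
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

shallow = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)
full = DecisionTreeRegressor(random_state=0).fit(X, y)

print(shallow.score(X, y))  # below 1.0: the capped tree cannot memorise
print(full.score(X, y))     # 1.0: the unrestricted tree overfits
```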

@lucyleeow
Member Author

ping @glemaitre

@glemaitre
Member

It is kind of funny that we only have a single build failing here. The splits are probably different on 32 bits. The max_depth is probably not sufficient since the splits are completely random. You can increase the depth of the trees, then.

@lucyleeow
Member Author

lucyleeow commented May 27, 2020

But we set the random state? For me (64-bit), overfitting occurs quickly. For max_depth=20, 4/6 scores are 0:

DecisionTreeRegressor mse score 0.0
DecisionTreeRegressor mae score 5.786199095022624
DecisionTreeRegressor friedman_mse score 0.0
ExtraTreeRegressor mse score 0.0
ExtraTreeRegressor mae score 3.6911764705882355
ExtraTreeRegressor friedman_mse score 0.0

@glemaitre glemaitre self-assigned this May 27, 2020
@lucyleeow
Member Author

lucyleeow commented May 27, 2020

The score is larger than 60 with 32-bit:

Tree       = <class 'sklearn.tree._classes.ExtraTreeRegressor'>
criterion  = 'mse'
max_depth  = 15
name       = 'ExtraTreeRegressor'
reg        = ExtraTreeRegressor(max_depth=15, max_features=6, random_state=0)
score      = 281.43585091379208

The large difference between 32 and 64 is odd.

@glemaitre
Member

> The large difference between 32 and 64 is odd.

Basically, I recall that the trees built on a 32-bit architecture are different from the ones built on 64 bits. So either we increase the threshold, or we skip the test on 32-bit architectures with the decorator @skip_if_32bit from sklearn.utils._testing. Let's do the latter and see what people think about it in a second review.
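The decorator usage described here could look like this (the test body and assertion below are illustrative assumptions, not the code merged in this PR):

```python
# Sketch of the suggested approach: guard an architecture-sensitive
# assertion with skip_if_32bit so 32-bit builds skip it rather than
# fail, instead of loosening the threshold for everyone.
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils._testing import skip_if_32bit


@skip_if_32bit
def test_diabetes_overfit_sketch():
    # Hypothetical test body: trees grown on 32-bit architectures can
    # choose different splits, so this check only runs on 64 bits.
    X, y = load_diabetes(return_X_y=True)
    reg = DecisionTreeRegressor(random_state=0).fit(X, y)
    # Unrestricted depth: the tree memorises the training set exactly.
    assert mean_squared_error(y, reg.predict(X)) == 0.0
```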

Member

@thomasjpfan thomasjpfan left a comment


LGTM

@thomasjpfan thomasjpfan merged commit f27adc5 into scikit-learn:master Jun 1, 2020
@thomasjpfan
Member

Thank you @lucyleeow !

@lucyleeow lucyleeow deleted the test_tree branch June 2, 2020 08:57
viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request Jun 26, 2020
jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020