TST: Decision trees: add test for split optimality #32193
OmarManzoor merged 42 commits into scikit-learn:main from
Conversation
Continuing the discussion from #32351 about testing here, as I think it's relevant for this PR. @cakedev0 said:
@ogrisel said:
My answer: I have that in mind. I don't think randomized tests have to be slow; I try to strike the right balance, but yes, it's not easy. Currently, all the tests of […] Note that the test in this PR takes only ~2s for 20 generated variants.
Thanks for the review and the great comments and suggestions!
I don't think it's worth the effort: this part (the tree-building logic) is low-risk and isn't going to change soon. The splitting part, on the other hand, has many recent changes and potential future changes (my missing-values refactor, categorical features, sparse + NaNs, binning, other optimizations?). Plus there were quite a few small edge-case bugs. So we really needed such a test.
Note: The ~20 generated tests run in ~1s, so it's a very reasonable load on the CI ^^
I agree.
OmarManzoor
left a comment
Thank you for the PR @cakedev0. I only left a few minor comments otherwise LGTM
Minor comments updates Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
Co-authored-by: Arthur Lacote <arthur.lcte@gmail.com>
Motivation
I recently opened 5 issues related to algorithmic bugs in the logic of finding a/the best split in decision trees or calculating the impurity:

- `DecisionTreeRegressor`: invalid impurity for `criterion="poisson"` with missing values #32870 - tests related to this bug are skipped for now.
- `NaN` detection leading to incorrect split #33113 - fixed in this PR.

Related files are mostly `sklearn/tree/_{splitter|partitioner|criterion}.pyx`. I think this shows that the current tests are too weak, and I propose to add a very strong test that will give us confidence in the current code and in future changes.
Reference Issues/PRs
Closes #32175
Fixes #33113
PRs with the fixes:
- `DecisionTree*` partitioning with missing values present #32351
- `criterion="absolute_error"` by greatly simplifying the logic #32119 - planned to be merged after this one, as we want to use this one as a non-regression test

What does this implement/fix? Explain your changes.
Add a test that compares sklearn's implementation of `node_best_split` (from `sklearn/tree/_splitter.pyx`) with a naive implementation in Python/numpy (slow, unusable in practice, but much easier to get right). This allows exact testing with many random inputs, which:
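As a rough illustration of the idea, here is a minimal sketch of what such a naive reference could look like. This is hypothetical code, not the actual test from this PR: it assumes a regression setting with MSE impurity, dense data, no sample weights, and no missing values, and the function name `naive_best_split` is made up for this example.

```python
import numpy as np


def naive_best_split(X, y):
    """Brute-force reference for the best axis-aligned split (hypothetical sketch).

    For every feature and every threshold between consecutive distinct sorted
    values, compute the weighted sum of child variances (MSE impurity) and keep
    the minimum. Returns (feature, threshold, impurity), or None if no valid
    split exists. O(n_features * n_samples^2): fine for tiny test inputs only.
    """
    n_samples, n_features = X.shape
    best = None
    for f in range(n_features):
        order = np.argsort(X[:, f], kind="stable")
        xs, ys = X[order, f], y[order]
        for i in range(1, n_samples):
            if xs[i] == xs[i - 1]:
                continue  # no threshold can separate identical feature values
            left, right = ys[:i], ys[i:]
            # Weighted MSE impurity of the two children.
            impurity = (left.var() * len(left) + right.var() * len(right)) / n_samples
            if best is None or impurity < best[2]:
                # Midpoint threshold, as sklearn's splitter does.
                best = (f, (xs[i - 1] + xs[i]) / 2, impurity)
    return best


# Example: a perfect split exists between x=1 and x=2.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(naive_best_split(X, y))  # → (0, 1.5, 0.0)
```

The actual test can then generate many random datasets, run both this reference and the Cython splitter, and assert that the splitter's chosen split achieves the same (optimal) impurity; exhaustive enumeration is only viable on tiny inputs, which is exactly what a reference implementation needs.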