Skip to content

TST: Decision trees: add test for split optimality#32193

Merged
OmarManzoor merged 42 commits intoscikit-learn:mainfrom
cakedev0:testing-split
Jan 27, 2026
Merged

TST: Decision trees: add test for split optimality#32193
OmarManzoor merged 42 commits intoscikit-learn:mainfrom
cakedev0:testing-split

Conversation

@cakedev0
Copy link
Copy Markdown
Contributor

@cakedev0 cakedev0 commented Sep 15, 2025

Motivation

I recently opened 5 issues related to algorithmic bugs in the logic of finding a/the best split in decision trees or calculating the impurity:

Related files are mostly sklearn/tree/_{splitter|partitioner|criterion}.pyx.

I think this shows that current tests a too weak, and I propose to add a very strong test, that will give us confidence in the current code and in the future changes.

Reference Issues/PRs

Closes #32175
Fixes #33113

PRs with the fixes:

What does this implement/fix? Explain your changes.

Add a test that compares sklearn's implementation of node_best_split (from sklearn/tree/_splitter.pyx) with a naive implementation in Python/numpy (slow, unusable in practice, but much easier to get right).

This allows exact testing with many random inputs, which:

  • helps greatly with detecting bugs in edge/rare cases
  • will help a lot in having confidence that potential future changes are valid (in addition to the PRs mentioned above: quantile regression trees, categorical features, NaNs support with MAE criterion, etc.)

@github-actions
Copy link
Copy Markdown

github-actions bot commented Sep 15, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 9d8b93b. Link to the linter CI: here

@cakedev0 cakedev0 changed the title MNT: Test for optimal splits MNT: Decision trees: add test for split optimality Sep 15, 2025
@cakedev0
Copy link
Copy Markdown
Contributor Author

cakedev0 commented Oct 7, 2025

Continuing the discussion from #32351 about testing here, as I think it's relevant for this PR.

@cakedev0 said:

I personally love tests with many random inputs. Toy examples should only be used for debugging IMO. It's still good to have them as tests, as it makes it easy to debug regressions. But it's really not enough to build confidence.

@ogrisel said:

Unfortunately, such randomized tests are often costly to run so we need to strike a balance. Ideally, individual tests should last no longer than 1 s on a usual CI runner.
Slow test suites makes it painful to run the tests locally or wait for the CI to complete on a PR. It would further make the release process more horrible than it is, given the number of Python versions and architecture combination that we support.

My answer:

I have that in mind, I don't think randomized tests have to be slow, I try to hit the proper balance, but yes it's not easy.

Currently, all the tests of sklearn/tree run in ~6s, while the tests of sklearn/ensemble/ run in ~100s. Given that a lot of things in sklearn/ensemble/ rely on sklearn/tree/, I think it's reasonable to spend a bit more time in testing it.

Note that the test in this PR takes only ~2s for 20 generated variants.

@cakedev0 cakedev0 changed the title MNT: Decision trees: add test for split optimality TST: Decision trees: add test for split optimality Jan 15, 2026
@github-actions github-actions bot removed the CI:Linter failure The linter CI is failing on this PR label Jan 23, 2026
@cakedev0
Copy link
Copy Markdown
Contributor Author

Thanks for the review and the great comments suggestions!

We could also test for deeper trees and recursively check [...]

I don't think it's worth the effort: I think this part (tree-building logic) is low-risk and isn't going to change soon.

While for the splitting part, there are a lot of changes/potential future changes (my missing-values refactor, categorical, sparse + NaNs, binning, other optimizations?). Plus there was quite a few small edge-case bugs. So we really needed such a test.

@cakedev0 cakedev0 marked this pull request as ready for review January 23, 2026 20:32
@cakedev0
Copy link
Copy Markdown
Contributor Author

Note: The ~20 generated tests run in ~1s, so it's a very reasonable load on the CI ^^

@ogrisel
Copy link
Copy Markdown
Member

ogrisel commented Jan 26, 2026

I don't think it's worth the effort

I agree.

Copy link
Copy Markdown
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ogrisel ogrisel added the Waiting for Second Reviewer First reviewer is done, need a second one! label Jan 26, 2026
Copy link
Copy Markdown
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @cakedev0. I only left a few minor comments otherwise LGTM

Copy link
Copy Markdown
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@OmarManzoor OmarManzoor enabled auto-merge (squash) January 27, 2026 13:40
@OmarManzoor OmarManzoor merged commit 66d314a into scikit-learn:main Jan 27, 2026
36 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

module:tree Waiting for Second Reviewer First reviewer is done, need a second one!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: Trees: minor bug in NaN detection leading to incorrect split. Unexpected behavior of tree splits: missing values handling is buggy?

3 participants