TST: Decision trees: add test for split optimality #32193
OmarManzoor merged 42 commits into scikit-learn:main from
Conversation
Continuing the discussion from #32351 about testing here, as I think it's relevant for this PR. @cakedev0 said:
@ogrisel said:
My answer: I have that in mind. I don't think randomized tests have to be slow; I try to strike the right balance, but yes, it's not easy. Currently, all the tests of […] Note that the test in this PR takes only ~2s for 20 generated variants.
Thanks for the review and the great comments and suggestions!
I don't think it's worth the effort: this part (the tree-building logic) is low-risk and isn't going to change soon. The splitting part, on the other hand, has many recent changes and potential future changes (my missing-values refactor, categorical features, sparse + NaNs, binning, other optimizations?). Plus there were quite a few small edge-case bugs. So we really needed such a test.
Note: The ~20 generated tests run in ~1s, so it's a very reasonable load on the CI ^^
I agree.
OmarManzoor
left a comment
Thank you for the PR @cakedev0. I only left a few minor comments otherwise LGTM
Minor comments updates Co-authored-by: Omar Salman <omar.salman@arbisoft.com>
Co-authored-by: Arthur Lacote <arthur.lcte@gmail.com>
Motivation
I recently opened 5 issues related to algorithmic bugs in the logic of finding a/the best split in decision trees or calculating the impurity:

- `DecisionTreeRegressor`: invalid impurity for `criterion="poisson"` with missing values #32870 - tests related to this bug are skipped for now.
- `NaN` detection leading to incorrect split #33113 - fixed in this PR.

Related files are mostly `sklearn/tree/_{splitter|partitioner|criterion}.pyx`. I think this shows that the current tests are too weak, and I propose to add a very strong test that will give us confidence in the current code and in future changes.
Reference Issues/PRs
Closes #32175
Fixes #33113
PRs with the fixes:
- `DecisionTree*` partitioning with missing values present #32351
- `criterion="absolute_error"` by greatly simplifying the logic #32119 - planned to be merged after this one, as we want to use this one as a non-regression test

What does this implement/fix? Explain your changes.
Add a test that compares sklearn's implementation of `node_best_split` (from `sklearn/tree/_splitter.pyx`) with a naive implementation in Python/numpy (slow, unusable in practice, but much easier to get right). This allows exact testing with many random inputs, which:
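As a rough illustration of the idea, here is a minimal sketch of what such a naive reference could look like. This is hypothetical code, not the actual test from this PR: it assumes a regression setting with MSE impurity, dense data, no sample weights, and no missing values, and the function name `naive_best_split` is made up for this example.

```python
import numpy as np


def naive_best_split(X, y):
    """Brute-force reference for the best axis-aligned split (hypothetical sketch).

    For every feature and every threshold between consecutive distinct sorted
    values, compute the weighted sum of child variances (MSE impurity) and keep
    the minimum. Returns (feature, threshold, impurity), or None if no valid
    split exists. O(n_features * n_samples^2): fine for tiny test inputs only.
    """
    n_samples, n_features = X.shape
    best = None
    for f in range(n_features):
        order = np.argsort(X[:, f], kind="stable")
        xs, ys = X[order, f], y[order]
        for i in range(1, n_samples):
            if xs[i] == xs[i - 1]:
                continue  # no threshold can separate identical feature values
            left, right = ys[:i], ys[i:]
            # Weighted MSE impurity of the two children.
            impurity = (left.var() * len(left) + right.var() * len(right)) / n_samples
            if best is None or impurity < best[2]:
                # Midpoint threshold, as sklearn's splitter does.
                best = (f, (xs[i - 1] + xs[i]) / 2, impurity)
    return best


# Example: a perfect split exists between x=1 and x=2.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(naive_best_split(X, y))  # → (0, 1.5, 0.0)
```

The actual test can then generate many random datasets, run both this reference and the Cython splitter, and assert that the splitter's chosen split achieves the same (optimal) impurity; exhaustive enumeration is only viable on tiny inputs, which is exactly what a reference implementation needs.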