FEA Add support for missing values in tree estimators with criterion="absolute_error" by greatly simplifying the logic #32119
Conversation
Note: as of today, tests pass on my laptop and most CI unit-test pipelines are successful. But some are failing; I managed to reproduce one of the failing pipelines locally using a Docker image. I still need to find the bug, though.
Tests pass! 🎊 Well, I learned the difference between
…rn into tree-simpler-missing
- `test_forest` and `test_missing_value_is_predictive`
- `test_split` and `test_split_impurity`
- `test_swap` and `test_py_swap_array_slices_random`
- `test_tree` and `test_regression_tree_missing_values_toy`
Let me know if I need to take another look. Otherwise, I'll let @lorentzenchr and/or @ogrisel converge first.
Do you think you're good to approve it on your side? I think @lorentzenchr can take another look and maybe approve it too? And then maybe we can merge? 🤞 |
ogrisel
left a comment
Assuming the following is correct, let's also expand the docstring of `next_p` to make that more explicit.
Is there a way to test this one way or another? Or is this already tested? If so LGTM.
Note that a merge with
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Explanation of the last commit: While expanding the docstring of `next_p`
lorentzenchr
left a comment
This LGTM, just a few nitpicks and the open question from Olivier and me.
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
…rn into tree-simpler-missing
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
lorentzenchr
left a comment
@cakedev0 Thank you for this really neat PR. Less code, more and better functionality, that's just 🚀
Amazing 🎉 Thank you everyone!
This PR refactors how missing values are handled in trees by moving the missing-values handling out of the `Criterion` subclasses and into the ordering of `sample_indices`. This greatly simplifies the logic and unlocks, for free, support of missing values for MAE trees.
Reference Issues/PRs
Accidentally fixes #32870
Otherwise, I looked but didn't find any issue requesting support of missing values with `criterion="absolute_error"`; just this recent comment from @ogrisel in this PR: #32100 (review). Maybe it's because MAE trees were just too slow in sklearn before PR #32100, so they aren't used a lot for now.

What does this implement/fix? Explain your changes.
Currently, part of the missing-values support is done by each subclass of `Criterion`. I believe that's not a great design because `Criterion` is "X-blind": it's not aware of X values. It just looks at `y` and `sample_weight` in the order defined by the sorted indices (`sample_indices`); it never looks at X values. But by making it handle missing values, it ends up having some dealing with X values anyway. Why not just use the ordering of `sample_indices` to take missing values into account, like we do for any other value (even inf/-inf, for instance)?

So, to the question "Why not just use the ordering of `sample_indices` to take missing values into account?", my answer is: "yes, let's just do that". The result is removing 200 lines from `_criterion.pyx` while not increasing the complexity of the splitter and the partitioner (it actually simplifies the splitter a bit).

Any other comments?
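To illustrate the idea in the description above: a minimal, pure-Python sketch of how a partitioner can encode missing values purely through the ordering of `sample_indices`, so that the criterion stays "X-blind". The function names (`partition_missing_last`, `mae_of_split`) are hypothetical and the median/MAE computation is simplified for clarity; scikit-learn's real partitioner and criterion work in-place on Cython buffers, not Python lists.

```python
import math

def partition_missing_last(X_col, sample_indices):
    # Hypothetical helper: reorder sample_indices so samples that are
    # missing in this feature column come last, and non-missing samples
    # are sorted by their feature value.
    present = [i for i in sample_indices if not math.isnan(X_col[i])]
    missing = [i for i in sample_indices if math.isnan(X_col[i])]
    present.sort(key=lambda i: X_col[i])
    return present + missing, len(missing)

def mae_of_split(y, ordered_indices, pos):
    # The criterion never looks at X: it only consumes y in the order
    # given by ordered_indices. Missing samples end up in the right
    # child purely because of where they sit in the ordering.
    left = [y[i] for i in ordered_indices[:pos]]
    right = [y[i] for i in ordered_indices[pos:]]

    def mae(values):
        # Simplified (unweighted) mean absolute error around the median.
        if not values:
            return 0.0
        values = sorted(values)
        median = values[len(values) // 2]
        return sum(abs(v - median) for v in values) / len(values)

    return (len(left) * mae(left) + len(right) * mae(right)) / len(y)

# Toy example: sample 1 has a missing feature value.
X_col = [0.5, float("nan"), 0.1, 0.3]
y = [1.0, 10.0, 0.0, 1.0]
order, n_missing = partition_missing_last(X_col, [0, 1, 2, 3])
# order == [2, 3, 0, 1]: missing sample 1 is last; splitting at pos=3
# sends it to the right child without the criterion knowing about NaN.
score = mae_of_split(y, order, pos=3)
```

To evaluate "missing values go left" instead, the partitioner would place the missing block at the front of the ordering; the criterion code is unchanged either way, which is the point of the design.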
I think it might unlock adding support for missing values + monotonic constraints without too much effort, but I haven't looked into it yet.
It might also simplify the support for missing values + sparse a bit, but that would still be hard (probably not worth the effort).