Skip to content

FEA Add support for missing values in tree estimators with criterion="absolute_error" by greatly simplifying the logic#32119

Merged
lorentzenchr merged 174 commits intoscikit-learn:mainfrom
cakedev0:tree-simpler-missing
Mar 5, 2026
Merged

FEA Add support for missing values in tree estimators with criterion="absolute_error" by greatly simplifying the logic#32119
lorentzenchr merged 174 commits intoscikit-learn:mainfrom
cakedev0:tree-simpler-missing

Conversation

@cakedev0
Copy link
Copy Markdown
Contributor

@cakedev0 cakedev0 commented Sep 6, 2025

This PR refactors how missing values are handled in trees by:

  • removing missing-values handling from Criterion subclasses
  • making it the responsability of the splitter & partitionner only

This greatly simplifies the logic and unlocks for free the support of missing values for MAE trees.

Reference Issues/PRs

Accidentally fixes #32870

Otherwise, I looked but I didn't find any issue requesting the support of missing values with criterion="absolute_error". Just this recent comment from @ogrisel in this PR: #32100 (review). Maybe it's because MAE trees were just too slow in sklearn before PR #32100, so they aren't used a lot for now.

What does this implement/fix? Explain your changes.

Currently, a part of the missing values support is done by each subclass of Criterion. I believe it's not a great design because:

  • Criterion is "X-blind", it's not aware of X values. It just looks at y and sample_weights in the order defined by the sorted indices (sample_indices). It never looks at X values. But somehow, by making it handle missing values, it does have some dealing with X values. Why not just use the ordering of sample_indices to take into account missing values? Like what we do for any other value (even inf/-inf for instance).
  • It requires each criterion to implement nan-handling logic.

So, to the question "Why not just use the ordering of sample_indices to take into account missing values? " my answer is :"yes, let's just do that". The result is removing 200 lines from _criterion.pxy while not increasing the complexity of the splitter and the partitioner (actually, it also simplifies a bit the splitter).

Any other comments?

I think it might unlock adding the support for missing values + monotonic constraints without too much efforts, but I haven't look into it yet.

It might also simplify a bit the support for missing values + sparse, but this would still be hard (probably not worth the effort).

@github-actions
Copy link
Copy Markdown

github-actions bot commented Sep 6, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 47816d1. Link to the linter CI: here

@cakedev0
Copy link
Copy Markdown
Contributor Author

Note: at this day, tests passes on my laptop and most CI unit tests pipelines are successful. But some are failing, I managed to reproduce one of the failing pipelines locally using a Docker image. I still need to find the bug though.

@cakedev0
Copy link
Copy Markdown
Contributor Author

Tests pass! 🎊

Well, I learned the difference between memcpy and memmove the hard way 😂

  test_forest and test_missing_value_is_predictive
  test_split and test_split_impurity
  test_swap and test_py_swap_array_slices_random
  test_tree and test_regression_tree_missing_values_toy
@adam2392
Copy link
Copy Markdown
Member

adam2392 commented Mar 4, 2026

Let me know if I need to take another look. Otw, I'll let @lorentzenchr and/or @ogrisel converge first.

@cakedev0
Copy link
Copy Markdown
Contributor Author

cakedev0 commented Mar 4, 2026

Do you think you're good to approve it on your side?

I think @lorentzenchr can take another look and maybe approve it too?

And then maybe we can merge? 🤞

Copy link
Copy Markdown
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM besides the following small details but I am still confused (as Christian) by line 450 in _splitter.pyx.

Copy link
Copy Markdown
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming the following is correct, let's also expand the docstring of next_p to make that more explicit.

Is there a way to test this one way or another? Or is this already tested? If so LGTM.

@ogrisel
Copy link
Copy Markdown
Member

ogrisel commented Mar 4, 2026

Note that a merge with main is required to get circle ci to run again.

cakedev0 and others added 2 commits March 4, 2026 20:42
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@cakedev0
Copy link
Copy Markdown
Contributor Author

cakedev0 commented Mar 4, 2026

Explanation of the last commit: While expanding the docstring of next_p, I noticed missing_on_the_left was not needed: it's a duplicate of the variable missing_go_to_left in the best split function. And the call site of next_p is precisely this best split function, so we can directly pass missing_go_to_left to next_p there.

Copy link
Copy Markdown
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, just a few nitpicks and the open question from Olivier and me.

Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Copy link
Copy Markdown
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

cakedev0 and others added 3 commits March 5, 2026 12:10
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Copy link
Copy Markdown
Member

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cakedev0 Thank you for this really neat PR. Less code, more and better functionality, that's just 🚀

@lorentzenchr lorentzenchr changed the title FEA Trees: Add support for missing values with criterion="absolute_error" by greatly simplifying the logic FEA Add support for missing values in tree estimators with criterion="absolute_error" by greatly simplifying the logic Mar 5, 2026
@lorentzenchr lorentzenchr merged commit 33b51a6 into scikit-learn:main Mar 5, 2026
38 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Tree-based models Mar 5, 2026
@cakedev0
Copy link
Copy Markdown
Contributor Author

cakedev0 commented Mar 5, 2026

Amazing 🎉

Thank you everyone!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

BUG: DecisionTreeRegressor: invalid impurity for criterion="poisson" with missing values

5 participants