FIX Decision/Extra trees: fix handling of missing values in detection of constant features #32274
Conversation
Attracting some attention here: @adam2392 @thomasjpfan if you have some bandwidth 🙏 The bug fixed in this PR has a non-negligible impact, I think, and the fix is simple.

This bug isn't isolated to just having NaNs and one constant value in the dataset, is it? It can occur at any depth of the tree, so in some sense we are potentially splitting one less time than we could in all cases with a missing value. Is that correct?
adam2392
left a comment
Need to look at the resulting docs more closely
doc/modules/tree.rst (Outdated)

> the split with all the missing values going to the left node or the right node.
> During training, for each potential threshold on the non-missing data,
> the splitter will evaluate the split with all the missing values going to the left
> node or the right node. Hence all missing values always end up in the same node
Suggested change:

> - node or the right node. Hence all missing values always end up in the same node
> + node or the right node. Hence all missing values always end up in the same node after a split (they can be scattered throughout the tree though)
There is already a "(sometimes with only missing values)." in parentheses after. Better split that in 2 sentences instead.
Hum... What do you think of rewriting this sentence to:
So, for a dataset with just one feature, all missing values always end up in the same node, potentially a node with only missing values.
That was the case I wanted to have readers imagine.
We can also just remove this sentence, I don't think it's crucial.
I don't think we should remove this sentence, but perhaps reword it to make it clearer. I think we should just explicitly state that during training, the splitter evaluates the following and picks one:
- sending all missing data to the left, and non-missing data to the right
- evaluating thresholds on the non-missing data, splitting samples with feature value less than the threshold to the left and >= to the right (I don't recall if it's <= and >, or the other way around), while simultaneously evaluating sending all the missing data to the left or to the right along with that threshold

Thus, missing values are always grouped together in one of the children nodes. Sometimes, a child node consists of only missing values.

This could probably be rewritten to be even clearer, e.g. stating which one takes precedence given equality.
> >>> tree.predict(X)
> array([0, 0, 1, 1])

> - If the criterion evaluation is the same for both nodes,
The example from this part is based on the buggy behavior.
This example has several equivalent splits (same loss) at the root node: (1) [-1] | [1 nan nan] and (2) [nan nan -1] | [1]. And yes, in that case, missing values will go to the right (split 1). But then, once the bug is fixed, another split will occur and the tree will look like this (missing values end up at the bottom right):
and tree.predict([[np.nan]]) returns [1] (before the fix, tree.predict([[np.nan]]) returns [0.5 0.5]).
Generally speaking, when there are several equivalent splits, we choose the first we encounter because of the strict > in if current_proxy_improvement > best_proxy_improvement:. In practice, with precision errors, we might choose an arbitrary split among the set of optimal splits.
We could document that, but I think it's too detailed for the user-guide if we want to keep it relatively short. Maybe just adding the sentence "we choose an arbitrary split among the set of optimal splits" is ok though.
I meant this part:
- If the criterion evaluation is the same for both nodes,
then the tie for missing value at predict time is broken by going to the
right node.
The splitter also checks the split where all the missing
values go to one child and non-missing values go to the other
These seem like important details to keep? I agree it could be rewritten though.
We have the following cases:
- no missing during training: what happens during predict time with missing
- missing during training: what happens during predict time with missing
We have the following cases:
- no missing during training: what happens during predict time with missing
- missing during training: what happens during predict time with missing
Yes indeed. And those two points are documented by point 1 of the user-guide:
By default when predicting, the samples with missing values are classified with the class used in the split found during training:
And point 3:
If no missing values are seen during training for a given feature, then during prediction missing values are mapped to the child with the most samples
Point 2 (the one I'm removing) is:
If the criterion evaluation is the same for both nodes, then the tie for missing value at predict time is broken by going to the right node. The splitter also checks the split where all the missing values go to one child and non-missing values go to the other:
First, "The splitter also checks the split where all the missing values go to one child and non-missing values go to the other" should rather go with point 1; it has nothing to do with the criterion evaluation being the same for both nodes.
Second, "the tie for missing value at predict time is broken by going to the right node" is indeed true, but it's a sub-case of the case of several optimal splits, which is not documented. So it shouldn't be documented here just for missing values. We could document this case of several optimal splits, but personally I don't think it's needed, and it should anyway be discussed in a separate issue/PR.
An example of this case without missing values: X = [1, 2, 3], y = [1, 0, 1] ⇒ the two possible splits are equivalent. We will choose the split 1 | 2 3 because it's encountered first. Similarly, we first examine the splits with missing values on the right, so in the case of equivalent splits, we choose the one with missing values on the right.
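The X = [1, 2, 3], y = [1, 0, 1] case above can be checked directly (a sketch, assuming a recent scikit-learn; `tree_.threshold[0]` is the root node's threshold):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1, 0, 1])

# Both splits (1 | 2 3) and (1 2 | 3) give the same impurity decrease.
# The strict ">" comparison in the splitter keeps the first candidate
# encountered, i.e. the threshold midway between 1 and 2.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.tree_.threshold[0])
```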
Conclusion: I think this point 2 should be removed; as of now, it's confusing, over-detailed compared to the rest of the user guide, and provides an erroneous example.
If we remove point 2, I'd still like a place to document the tie-breaking behavior. Most of the context of how tie breaking works is in the issue tracker, but not in the docs: #23728
Thanks for the pointer to the meta-issue! Is there a "sub-issue" dedicated to decision trees? If not, I'd like to open one and continue this discussion there (one option would be to implement unbiased tie breaking, rather than explaining the bias in the doc).
Though, I think removing point 2 here doesn't require solving this issue first.
Edit: Ah it seems I already read this issue one month ago, I had completely forgotten about it 😅
I see. I don't think I agree with the reasoning to remove point 2. It seems that the issue is more so the incomplete documentation of the order of splits when multiple optimal splits exist. We should instead document all 3 points, but specify the order of priority in which they are evaluated in the code, right?
Am I misunderstanding?
It seems that the issue is more so the incomplete documentation of the order of splits when multiple optimal splits exist
Yes, that's a bit why I wanted to first completely remove the mentions about order of splits, and then maybe add it back in a more complete/uniform way.
But, ok let's not remove point 2 here and let's talk about documentation in another issue. For now I'll do the minimal change in the doc for it to build (it wasn't building because the output expectation of point 2 was based on the bug).
I'll create the other issue and try to summarize there the discussions we had here.
Yes it's correct. That's why the impact is non-negligible.
Up here. There are some concerns about my changes to the doc, but I think I addressed them. I think having an example in the user guide based on a bug is quite damaging for readers' understanding; plus, the bug in itself has quite an important impact on deep trees with missing values (many possible splits will be ignored). Thanks 🙏
lesteve
left a comment
Quick comments after looking at it for maybe 10 minutes ...
I left some comments. Thanks @cakedev0!
Let's merge this one, thanks! I counted @adam2392's and @thomasjpfan's comments as an additional approval.
- Drop support for Python 3.10 - Drop support for scikit-learn <1.8 due to scikit-learn/scikit-learn#32274
Reference Issues/PRs
Fixes #32272
What does this implement/fix? Explain your changes.
Simple fix of the if-condition detecting constant features.
I updated the user guide, as one of the examples was based precisely on this buggy behavior...
Any other comments?
I'm not willing to write a lot of tests in this PR, as I'd rather move forward on this testing-focused PR instead: #32193
Indeed, it's the test from PR #32193 that made me find this bug (and two other bugs), while not being aimed at finding this bug precisely (nor either of the two others). Which, IMO, proves it's a better test (for split-logic correctness) than the tests that are already present, and better than toy examples crafted to check one precise behavior.