
FIX Decision/Extra trees: fix handling of missing values in detection of constant features#32274

Merged
lesteve merged 14 commits into scikit-learn:main from cakedev0:fix/constant-missing
Nov 24, 2025

Conversation

@cakedev0
Contributor

@cakedev0 cakedev0 commented Sep 25, 2025

Reference Issues/PRs

Fixes #32272

What does this implement/fix? Explain your changes.

Simple fix of the if-condition detecting constant features.

I updated the user guide, as one of the examples was precisely based on this buggy behavior...

Any other comments?

I'm not planning to write a lot of tests in this PR, as I'd rather move forward on this testing-focused PR instead: #32193

Indeed, it's the test from PR #32193 that led me to find this bug (and two other bugs), even though it wasn't aimed at finding this bug precisely (nor either of the two others). Which, IMO, shows it's a better test (for split-logic correctness) than the tests already present, and better than toy examples crafted to check one precise behavior.
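The fix can be illustrated with a small plain-Python model of the constant-feature check (a hypothetical sketch, not scikit-learn's actual Cython splitter; the function name is invented for illustration):

```python
import math

def feature_is_splittable(values):
    """Hypothetical model of the splitter's constant-feature check."""
    non_missing = [v for v in values if not math.isnan(v)]
    n_missing = len(values) - len(non_missing)
    n_unique = len(set(non_missing))
    # Buggy condition: only `n_unique >= 2` counted as non-constant, so a
    # feature like [1.0, 1.0, nan] was skipped even though sending the
    # NaNs to one child and the 1.0s to the other is a valid split.
    # Fixed condition: a single non-missing value plus missing values
    # is still splittable.
    return n_unique >= 2 or (n_unique >= 1 and n_missing > 0)

print(feature_is_splittable([1.0, 1.0, float("nan")]))  # True: NaNs vs 1.0s
print(feature_is_splittable([1.0, 1.0, 1.0]))           # False: truly constant
```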

@github-actions

github-actions bot commented Sep 25, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 7b1267d.

@cakedev0 cakedev0 marked this pull request as ready for review September 25, 2025 21:23
@cakedev0
Contributor Author

cakedev0 commented Oct 1, 2025

Attracting some attention here: @adam2392 @thomasjpfan if you have some bandwidth 🙏

The bug fixed in this PR has non-negligible impact I think and the fix is simple.

@adam2392 adam2392 self-requested a review October 2, 2025 12:37
@adam2392
Member

adam2392 commented Oct 2, 2025

This bug isn't isolated to just having NaNs and one constant value in the dataset, right? It can occur at any depth of the tree, so in some sense we are potentially splitting one less time than we could in all cases with a missing value.

Is that correct?

Member

@adam2392 adam2392 left a comment


Need to look at the resulting docs more closely

the split with all the missing values going to the left node or the right node.
During training, for each potential threshold on the non-missing data,
the splitter will evaluate the split with all the missing values going to the left
node or the right node. Hence all missing values always end up in the same node
Member


Suggested change
node or the right node. Hence all missing values always end up in the same node
node or the right node. Hence all missing values always end up in the same node after a split (they can be scattered throughout the tree though)

Member


There is already a "(sometimes with only missing values)" in parentheses after. Better to split that into two sentences instead.

Contributor Author


Hum... What do you think of rewriting this sentence to:

So, for a dataset with just one feature, all missing values always end up in the same node, potentially a node with only missing values.

That was the case I wanted to have readers imagine.
We can also just remove this sentence, I don't think it's crucial.

Member

@adam2392 adam2392 Nov 20, 2025


I don't think we should remove this sentence, but perhaps reword it to make it clearer. I think we should just explicitly state that during training, the splitter evaluates the following and picks one:

  1. sending all missing data to the left, and non-missing data to the right
  2. evaluating thresholds on non-missing data, splitting those with feature value less than the threshold to the left and >= to the right (I don't recall if it's <= and >, or the other way around), while simultaneously evaluating sending all the missing data to the left, or to the right, along with that threshold

Thus, missing values are always grouped together in one of the child nodes. Sometimes, that child node consists of only missing values.

This could probably be rewritten to be even clearer, e.g. which one takes precedence given equality.
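The evaluation described in the two points above can be sketched as a toy re-implementation (an illustrative model only; the real splitter works on sorted Cython buffers, and all names here are invented):

```python
import math

def gini(labels):
    # Gini impurity of a list of class labels.
    if not labels:
        return 0.0
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def candidate_splits(x, y):
    """Enumerate (threshold, missing_goes_left, score) candidates."""
    non_missing = sorted({v for v in x if not math.isnan(v)})
    splits = []
    # Thresholds between consecutive non-missing values, each tried with
    # missing values sent to the left child and to the right child.
    for lo, hi in zip(non_missing, non_missing[1:]):
        thr = (lo + hi) / 2.0
        for missing_left in (True, False):
            left, right = [], []
            for xi, yi in zip(x, y):
                if math.isnan(xi):
                    (left if missing_left else right).append(yi)
                elif xi <= thr:
                    left.append(yi)
                else:
                    right.append(yi)
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            splits.append((thr, missing_left, score))
    # Plus the split sending all missing values one way and all
    # non-missing values the other way.
    if any(math.isnan(v) for v in x) and non_missing:
        miss = [yi for xi, yi in zip(x, y) if math.isnan(xi)]
        rest = [yi for xi, yi in zip(x, y) if not math.isnan(xi)]
        score = (len(miss) * gini(miss) + len(rest) * gini(rest)) / len(y)
        splits.append((None, True, score))
    return splits

# The lowest-score candidate wins; here the perfect split sends the NaN right.
print(min(candidate_splits([0.0, 1.0, float("nan")], [0, 1, 1]),
          key=lambda s: s[2]))
```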

>>> tree.predict(X)
array([0, 0, 1, 1])

- If the criterion evaluation is the same for both nodes,
Member


Why did you remove this part?

Contributor Author

@cakedev0 cakedev0 Oct 3, 2025


The example in this part is based on the buggy behavior.

This example has several equivalent splits (same loss) on the root node: (1) [-1] [1 nan nan] and (2) [nan nan -1] [1]. And yes, in that case, missing values will go to the right (split 1). But then, once the bug is fixed, another split will occur and the tree will look like this (missing values end up at the bottom right):

[image: resulting tree diagram]

and tree.predict([[np.nan]]) returns [1] (while tree.predict_proba([[np.nan]]) returns [[0.5, 0.5]]).

Generally speaking, when there are several equivalent splits, we choose the first we encounter because of the strict > in if current_proxy_improvement > best_proxy_improvement:. In practice, with precision errors, we might choose an arbitrary split among the set of optimal splits.

We could document that, but I think it's too detailed for the user-guide if we want to keep it relatively short. Maybe just adding the sentence "we choose an arbitrary split among the set of optimal splits" is ok though.
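The first-encountered tie-breaking described above can be sketched as follows (a toy version; the real splitter tracks a running best proxy improvement while scanning candidates in Cython):

```python
def pick_best(scores):
    """Return the index of the best score, first-encountered wins ties.

    Mirrors the strict comparison in
    `if current_proxy_improvement > best_proxy_improvement:`.
    """
    best_idx = None
    best = float("-inf")
    for i, s in enumerate(scores):
        # Strict `>`: a later score equal to the best does NOT replace it.
        if s > best:
            best, best_idx = s, i
    return best_idx

print(pick_best([0.3, 0.7, 0.7, 0.5]))  # 1: the first of the two tied optima
```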

Member


I meant this part:

- If the criterion evaluation is the same for both nodes,
  then the tie for missing value at predict time is broken by going to the
  right node. 
The splitter also checks the split where all the missing
  values go to one child and non-missing values go to the other

These seem like important details to keep? I agree it could be rewritten though.

We have the following cases:

  • no missing during training: what happens during predict time with missing
  • missing during training: what happens during predict time with missing

Contributor Author


We have the following cases:

  • no missing during training: what happens during predict time with missing
  • missing during training: what happens during predict time with missing

Yes indeed. And those two points are documented by point 1 of the user guide:

By default when predicting, the samples with missing values are classified with the class used in the split found during training:

And point 3:

If no missing values are seen during training for a given feature, then during prediction missing values are mapped to the child with the most samples

Point 2 (the one I'm removing) is:

If the criterion evaluation is the same for both nodes, then the tie for missing value at predict time is broken by going to the right node. The splitter also checks the split where all the missing values go to one child and non-missing values go to the other:

First, "The splitter also checks the split where all the missing values go to one child and non-missing values go to the other" should rather go with point 1; it has nothing to do with the criterion evaluation being the same for both nodes.

Second, "the tie for missing value at predict time is broken by going to the right node" is indeed true, but it's a sub-case of the case of several optimal splits, which is not documented. So it shouldn't be documented here just for missing values. We could document this case of several optimal splits, but personally I don't think it's needed, and it should anyway be discussed in a separate issue/PR.
An example of this case without missing values: X = [1, 2, 3], y = [1, 0, 1] ⇒ the two possible splits are equivalent. We will choose the split 1 | 2 3 because it's encountered first. Similarly, we first examine the splits with missing values on the right, so in the case of equivalent splits, we choose the one with missing values on the right.
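The equivalence of the two splits in that example can be checked numerically (a small sketch assuming the Gini criterion, with helper names invented for illustration):

```python
def gini(labels):
    # Gini impurity of a list of class labels.
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_impurity(left, right):
    # Sample-weighted impurity of a candidate split.
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

# X = [1, 2, 3], y = [1, 0, 1]
# Split "1 | 2 3": left y = [1], right y = [0, 1]
s1 = weighted_impurity([1], [0, 1])
# Split "1 2 | 3": left y = [1, 0], right y = [1]
s2 = weighted_impurity([1, 0], [1])
print(s1, s2)  # both 1/3: the two splits are equivalent
```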

Conclusion: I think this point 2 should be removed; as of now, it's confusing, over-detailed compared to the rest of the user guide, and provides an erroneous example.

Member


If we remove point 2, I'd still like a place to document tie-breaking behavior. Most of the context on how tie breaking works is in the issue tracker, but not in the docs: #23728

Contributor Author

@cakedev0 cakedev0 Oct 9, 2025


Thanks for the pointer to the meta issue! Is there a "sub issue" dedicated to decision trees? If not, I'd like to open one and continue this discussion there (one option would be to implement unbiased tie breaking, rather than explaining the bias in the doc).

Though, I think removing point 2 here doesn't require solving this issue first.

Edit: Ah it seems I already read this issue one month ago, I had completely forgotten about it 😅

Member


I see. I don't think I agree with the reasoning for removing point 2. It seems the issue is more the incomplete documentation of the order of splits when multiple optimal splits exist. We should instead document all 3 points, but specify the order of priority in which they are evaluated in the code, right?

Am I misunderstanding?

Contributor Author


It seems that the issue is moreso the incomplete documentation of order of splits when multiple optimal splits exist

Yes, that's partly why I wanted to first completely remove mentions of the order of splits and then maybe add it back in a more complete/uniform way.

But, ok let's not remove point 2 here and let's talk about documentation in another issue. For now I'll do the minimal change in the doc for it to build (it wasn't building because the output expectation of point 2 was based on the bug).

I'll create the other issue and try to summarize there the discussions we had here.

@cakedev0
Contributor Author

cakedev0 commented Oct 3, 2025

@adam2392

This bug isn't isolated to just having NaNs and one constant value in the dataset, right? It can occur at any depth of the tree, so in some sense we are potentially splitting one less time than we could in all cases with a missing value.

Is that correct?

Yes, that's correct. That's why the impact is non-negligible.

@cakedev0
Contributor Author

Bumping this. There were some concerns about my changes to the doc, but I think I addressed them.

I think having an example in the user guide based on a bug is quite damaging for readers' understanding. Plus, the bug itself has a significant impact on deep trees with missing values (many possible splits are ignored).

Thanks 🙏

Member

@lesteve lesteve left a comment


Quick comments after looking at it for maybe 10 minutes ...

@adam2392
Member

I left some comments. Thanks @cakedev0 !

Member

@lesteve lesteve left a comment


LGTM, thanks!

@lesteve lesteve added this to the 1.8 milestone Nov 20, 2025
@lesteve lesteve merged commit 28679aa into scikit-learn:main Nov 24, 2025
38 checks passed
@lesteve
Member

lesteve commented Nov 24, 2025

Let's merge this one, thanks!

I combined @adam2392's and @thomasjpfan's comments into an additional approval.

sebp added a commit to sebp/scikit-survival that referenced this pull request Dec 14, 2025
- Drop support for Python 3.10
- Drop support for scikit-learn <1.8 due to scikit-learn/scikit-learn#32274


Development

Successfully merging this pull request may close these issues.

Decision/Extra trees & missing values: split ignored when NaNs + single unique value in feature

5 participants