FIX Decision/Extra trees: fix handling of missing values in detection of constant features #32274
Conversation
Attracting some attention here: @adam2392 @thomasjpfan if you have some bandwidth 🙏 The bug fixed in this PR has a non-negligible impact, I think, and the fix is simple.

This bug isn't isolated to just having NaNs and one constant value in the dataset, is it? It can occur at any depth of the tree, so in some sense we are potentially splitting one less time than we could in all cases with a missing value. Is that correct?
adam2392
left a comment
Need to look at the resulting docs more closely
doc/modules/tree.rst (Outdated)

> the split with all the missing values going to the left node or the right node.
> During training, for each potential threshold on the non-missing data,
> the splitter will evaluate the split with all the missing values going to the left
> node or the right node. Hence all missing values always end up in the same node
Suggested change:

> - node or the right node. Hence all missing values always end up in the same node
> + node or the right node. Hence all missing values always end up in the same node after a split (they can be scattered throughout the tree though)
There is already a "(sometimes with only missing values)." in parentheses after. Better split that in 2 sentences instead.
Hum... What do you think of rewriting this sentence to:
So, for a dataset with just one feature, all missing values always end up in the same node, potentially a node with only missing values.
That was the case I wanted to have readers imagine.
We can also just remove this sentence, I don't think it's crucial.
I don't think we should remove this sentence, but perhaps reword it to make it clearer. I think we should just explicitly state that during training, the splitter evaluates the following and picks one:
- sending all missing data to the left, and non-missing data to the right
- evaluating thresholds on the non-missing data, splitting samples with feature value less than the threshold to the left and >= to the right (I don't recall if it's <= and >, or the other way around), while simultaneously evaluating sending all the missing data to the left or to the right along with that threshold

Thus, missing values are always grouped together in one of the children nodes. Sometimes, a child node consists of only missing values.

This could probably be rewritten to be even clearer, e.g. stating which one takes precedence given equality.
> >>> tree.predict(X)
> array([0, 0, 1, 1])

> - If the criterion evaluation is the same for both nodes,
The example from this part is based on the buggy behavior.
This example has several equivalent splits (same loss) at the root node: (1) [-1] | [1 nan nan] and (2) [nan nan -1] | [1]. And yes, in that case, missing values will go to the right (split 1). But then, once the bug is fixed, another split will occur and the tree will look like this (missing values end up at the bottom right):
and tree.predict([[np.nan]]) returns [1] (before the fix, tree.predict([[np.nan]]) returns [0.5 0.5]).
Generally speaking, when there are several equivalent splits, we choose the first we encounter because of the strict > in if current_proxy_improvement > best_proxy_improvement:. In practice, with precision errors, we might choose an arbitrary split among the set of optimal splits.
We could document that, but I think it's too detailed for the user-guide if we want to keep it relatively short. Maybe just adding the sentence "we choose an arbitrary split among the set of optimal splits" is ok though.
I meant this part:
- If the criterion evaluation is the same for both nodes,
then the tie for missing value at predict time is broken by going to the
right node.
The splitter also checks the split where all the missing
values go to one child and non-missing values go to the other
These seem like important details to keep? I agree it could be rewritten though.
We have the following cases:
- no missing during training: what happens during predict time with missing
- missing during training: what happens during predict time with missing
We have the following cases:
- no missing during training: what happens during predict time with missing
- missing during training: what happens during predict time with missing
Yes indeed. And those two points are documented by point 1 of the user-guide:
By default when predicting, the samples with missing values are classified with the class used in the split found during training:
And point 3:
If no missing values are seen during training for a given feature, then during prediction missing values are mapped to the child with the most samples
Point 2 (the one I'm removing) is:
If the criterion evaluation is the same for both nodes, then the tie for missing value at predict time is broken by going to the right node. The splitter also checks the split where all the missing values go to one child and non-missing values go to the other:
First, "The splitter also checks the split where all the missing values go to one child and non-missing values go to the other" should rather go with point 1; it has nothing to do with the criterion evaluation being the same for both nodes.
Second, "the tie for missing value at predict time is broken by going to the right node" is indeed true, but it's a sub-case of the case of several optimal splits, which is not documented. So it shouldn't be documented here just for missing values. We could document this case of several optimal splits, but personally I don't think it's needed, and it should anyway be discussed in a separate issue/PR.
An example of this case without missing values: X = [1, 2, 3], y = [1, 0, 1] ⇒ the two possible splits are equivalent. We will choose the split 1 | 2 3 because it's encountered first. Similarly, we first examine the splits with missing values on the right, so in the case of equivalent splits, we choose the one with missing values on the right.
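The X = [1, 2, 3], y = [1, 0, 1] case above can be checked directly (a sketch, assuming a recent scikit-learn; `tree_.threshold[0]` is the root node's threshold):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1, 0, 1])

# Both splits (1 | 2 3) and (1 2 | 3) give the same impurity decrease.
# The strict ">" comparison in the splitter keeps the first candidate
# encountered, i.e. the threshold midway between 1 and 2.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.tree_.threshold[0])
```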
Conclusion: I think this point 2 should be removed; as of now, it's confusing, over-detailed compared to the rest of the user guide, and provides an erroneous example.
If we remove point 2, I'd still like a place to document the tie-breaking behavior. Most of the context of how tie breaking works is in the issue tracker, but not in the docs: #23728
Thanks for the pointer to the meta-issue! Is there a "sub-issue" dedicated to decision trees? If not, I'd like to open one and continue this discussion there (one option would be to implement unbiased tie breaking, rather than explaining the bias in the doc).
Though, I think removing point 2 here doesn't require solving this issue first.
Edit: Ah it seems I already read this issue one month ago, I had completely forgotten about it 😅
I see. I don't think I agree with the reasoning to remove point 2. It seems that the issue is more so the incomplete documentation of the order of splits when multiple optimal splits exist. We should instead document all 3 points, but specify the order of priority in which they are evaluated in the code, right?
Am I misunderstanding?
It seems that the issue is more so the incomplete documentation of the order of splits when multiple optimal splits exist
Yes, that's a bit why I wanted to first completely remove the mentions about order of splits, and then maybe add it back in a more complete/uniform way.
But, ok let's not remove point 2 here and let's talk about documentation in another issue. For now I'll do the minimal change in the doc for it to build (it wasn't building because the output expectation of point 2 was based on the bug).
I'll create the other issue and try to summarize there the discussions we had here.
Yes it's correct. That's why the impact is non-negligible.
Up here. There are some concerns about my changes to the doc, but I think I addressed them. I think having an example in the user guide based on a bug is quite damaging for readers' understanding; plus, the bug in itself has quite an important impact on deep trees with missing values (many possible splits will be ignored). Thanks 🙏
lesteve
left a comment
Quick comments after looking at it for maybe 10 minutes ...
I left some comments. Thanks @cakedev0!
Let's merge this one, thanks! I counted @adam2392's and @thomasjpfan's comments as an additional approval.
- Drop support for Python 3.10 - Drop support for scikit-learn <1.8 due to scikit-learn/scikit-learn#32274
Reference Issues/PRs
Fixes #32272
What does this implement/fix? Explain your changes.
Simple fix of the if-condition detecting constant features.
I updated the user guide, as one of the examples was based precisely on this buggy behavior...
Any other comments?
I'm not willing to write a lot of tests in this PR, as I'd rather move forward on this testing-focused PR instead: #32193
Indeed, it's the test from PR #32193 that made me find this bug (and two other bugs), while not being aimed at finding this bug precisely (nor either of the two others). Which, IMO, proves it's a better test (for split-logic correctness) than the tests that are already present, and better than toy examples crafted to check one precise behavior.