FEA Add support for missing values in tree estimators with criterion="absolute_error" by greatly simplifying the logic #32119
Conversation
Note: as of today, tests pass on my laptop and most CI unit-test pipelines are successful. But some are failing; I managed to reproduce one of the failing pipelines locally using a Docker image. I still need to find the bug, though.
Tests pass! 🎊 Well, I learned the difference between
…rn into tree-simpler-missing
- `test_forest` and `test_missing_value_is_predictive`
- `test_split` and `test_split_impurity`
- `test_swap` and `test_py_swap_array_slices_random`
- `test_tree` and `test_regression_tree_missing_values_toy`
Let me know if I need to take another look. Otherwise, I'll let @lorentzenchr and/or @ogrisel converge first.
Do you think you're good to approve it on your side? I think @lorentzenchr can take another look and maybe approve it too? And then maybe we can merge? 🤞 |
ogrisel
left a comment
Assuming the following is correct, let's also expand the docstring of `next_p` to make that more explicit.
Is there a way to test this one way or another? Or is this already tested? If so LGTM.
Note that a merge with
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Explanation of the last commit: While expanding the docstring of `next_p`
lorentzenchr
left a comment
This LGTM, just a few nitpicks and the open question from Olivier and me.
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
…rn into tree-simpler-missing
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
lorentzenchr
left a comment
@cakedev0 Thank you for this really neat PR. Less code, more and better functionality, that's just 🚀
Amazing 🎉 Thank you everyone!
This PR refactors how missing values are handled in trees by moving the missing-values handling out of the `Criterion` subclasses and into the ordering of `sample_indices`. This greatly simplifies the logic and unlocks, for free, support of missing values for MAE trees.
Reference Issues/PRs
Accidentally fixes #32870
Otherwise, I looked but didn't find any issue requesting support of missing values with `criterion="absolute_error"`; just this recent comment from @ogrisel in this PR: #32100 (review). Maybe it's because MAE trees were just too slow in sklearn before PR #32100, so they aren't used a lot for now.

What does this implement/fix? Explain your changes.
Currently, part of the missing-values support is done by each subclass of `Criterion`. I believe that's not a great design because `Criterion` is "X-blind": it's not aware of X values. It just looks at `y` and `sample_weight` in the order defined by the sorted indices (`sample_indices`); it never looks at X values. But by making it handle missing values, it ends up having some dealing with X values anyway. Why not just use the ordering of `sample_indices` to take missing values into account, like we do for any other value (even inf/-inf, for instance)?

So, to the question "Why not just use the ordering of `sample_indices` to take missing values into account?", my answer is: "yes, let's just do that". The result is removing 200 lines from `_criterion.pyx` while not increasing the complexity of the splitter and the partitioner (it actually simplifies the splitter a bit).

Any other comments?
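To illustrate the idea in the description above: a minimal, pure-Python sketch of how a partitioner can encode missing values purely through the ordering of `sample_indices`, so that the criterion stays "X-blind". The function names (`partition_missing_last`, `mae_of_split`) are hypothetical and the median/MAE computation is simplified for clarity; scikit-learn's real partitioner and criterion work in-place on Cython buffers, not Python lists.

```python
import math

def partition_missing_last(X_col, sample_indices):
    # Hypothetical helper: reorder sample_indices so samples that are
    # missing in this feature column come last, and non-missing samples
    # are sorted by their feature value.
    present = [i for i in sample_indices if not math.isnan(X_col[i])]
    missing = [i for i in sample_indices if math.isnan(X_col[i])]
    present.sort(key=lambda i: X_col[i])
    return present + missing, len(missing)

def mae_of_split(y, ordered_indices, pos):
    # The criterion never looks at X: it only consumes y in the order
    # given by ordered_indices. Missing samples end up in the right
    # child purely because of where they sit in the ordering.
    left = [y[i] for i in ordered_indices[:pos]]
    right = [y[i] for i in ordered_indices[pos:]]

    def mae(values):
        # Simplified (unweighted) mean absolute error around the median.
        if not values:
            return 0.0
        values = sorted(values)
        median = values[len(values) // 2]
        return sum(abs(v - median) for v in values) / len(values)

    return (len(left) * mae(left) + len(right) * mae(right)) / len(y)

# Toy example: sample 1 has a missing feature value.
X_col = [0.5, float("nan"), 0.1, 0.3]
y = [1.0, 10.0, 0.0, 1.0]
order, n_missing = partition_missing_last(X_col, [0, 1, 2, 3])
# order == [2, 3, 0, 1]: missing sample 1 is last; splitting at pos=3
# sends it to the right child without the criterion knowing about NaN.
score = mae_of_split(y, order, pos=3)
```

To evaluate "missing values go left" instead, the partitioner would place the missing block at the front of the ordering; the criterion code is unchanged either way, which is the point of the design.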
I think it might unlock adding support for missing values + monotonic constraints without too much effort, but I haven't looked into it yet.
It might also simplify the support for missing values + sparse a bit, but that would still be hard (probably not worth the effort).