-
-
Notifications
You must be signed in to change notification settings - Fork 26.9k
Unexpected behavior of tree splits: missing values handling is buggy? #32175
Copy link
Copy link
Closed
Labels
Description
Describe the bug
When adding a sanity check in the best split function (_splitter.pxy), I get a bunch of tests failing. This probably reveal a bug in missing values handling.
Steps/Code to Reproduce
Add those lines after this line.
current_proxy_improvement = criterion.proxy_impurity_improvement()
if current_proxy_improvement < best_proxy_improvement - 1:
raise ValueError(f"Unconsistent improvement {current_proxy_improvement} < {best_proxy_improvement}" )And then run the following tests: pytest sklearn/ensemble/tests/test_forest.py sklearn/tree/tests/
Expected Results
No error is thrown, proving the final partitionning of samples is optimal and the children impurities are the correct/optimal ones.
Actual Results
Many tests fail. I will split them into two categories, based on the alleged cause:
- MAE criterion
Errors that have likely the same cause to than those two issues: DecisionTreeRegressor with absolute error criterion: non-optimal split #32099 BUG Median not always being calculated correctly for DecisionTrees in the WeightedMedianCalculator #10725. The current implementation of the MAE criterion is slightly buggy. My PR Fix: improve speed of trees with MAE criterion from O(n^2) to O(n log n) #32100 will fix it.
FAILED sklearn/ensemble/tests/test_forest.py::test_importances[ExtraTreesRegressor-absolute_error-float64] - ValueError: Unconsistent improvement -9.0 < -4.0
FAILED sklearn/ensemble/tests/test_forest.py::test_importances[ExtraTreesRegressor-absolute_error-float32] - ValueError: Unconsistent improvement -9.0 < -4.0
On the branch of my PR those tests don't fail.
- Missing values
Many tests related to missing values are failing. As explained in my PR FEA Add support for missing values in tree estimators withcriterion="absolute_error"by greatly simplifying the logic #32119, the current way missing values are handled is a bit convoluted, and probably a bit buggy
Edit: now documented in #32178
FAILED sklearn/ensemble/tests/test_forest.py::test_missing_values_is_resilient[make_regression-ExtraTreesRegressor] - ValueError: Unconsistent improvement 222739.5025534111 < 230516.5948817273
FAILED sklearn/ensemble/tests/test_forest.py::test_missing_values_is_resilient[make_classification-ExtraTreesClassifier] - ValueError: Unconsistent improvement -5.2 < -3.1999999999999993
FAILED sklearn/ensemble/tests/test_forest.py::test_missing_value_is_predictive[ExtraTreesRegressor] - ValueError: Unconsistent improvement 108.8032853873815 < 110.47401111474339
FAILED sklearn/ensemble/tests/test_forest.py::test_missing_value_is_predictive[ExtraTreesClassifier] - ValueError: Unconsistent improvement -14.078947368421051 < -11.323308270676694
FAILED sklearn/tree/tests/test_tree.py::test_missing_values_poisson[ExtraTreeRegressor] - ValueError: Unconsistent improvement 2961.5788534750745 < 2972.4751934414253
FAILED sklearn/tree/tests/test_tree.py::test_missing_values_is_resilience[42-None-make_friedman1-ExtraTreeRegressor-0.07] - ValueError: Unconsistent improvement 955.3035680542237 < 992.3319785056362
FAILED sklearn/tree/tests/test_tree.py::test_missing_values_is_resilience[42-None-make_friedman1_classification-ExtraTreeClassifier-0.12] - ValueError: Unconsistent improvement -5.083333333333333 < -4.0
FAILED sklearn/tree/tests/test_tree.py::test_missing_values_is_resilience[42-ones-make_friedman1-ExtraTreeRegressor-0.07] - ValueError: Unconsistent improvement 955.3035680542237 < 992.3319785056362
FAILED sklearn/tree/tests/test_tree.py::test_missing_values_is_resilience[42-ones-make_friedman1_classification-ExtraTreeClassifier-0.12] - ValueError: Unconsistent improvement -5.083333333333333 < -4.0
FAILED sklearn/tree/tests/test_tree.py::test_missing_value_is_predictive[42-ExtraTreeClassifier-0.53] - ValueError:
On the branch for my PR, none of those tests fails.
Versions
System:
python: 3.12.11 (main, Aug 18 2025, 19:19:11) [Clang 20.1.4 ]
executable: /home/arthur/dev-perso/scikit-learn/sklearn-env/bin/python
machine: Linux-6.14.0-29-generic-x86_64-with-glibc2.39
Python dependencies:
sklearn: 1.8.dev0
pip: None
setuptools: 80.9.0
numpy: 2.3.3
scipy: 1.16.2
Cython: 3.1.3
pandas: None
matplotlib: 3.10.6
joblib: 1.5.2
threadpoolctl: 3.6.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: openblas
num_threads: 16
prefix: libscipy_openblas
filepath: /home/arthur/dev-perso/scikit-learn/sklearn-env/lib/python3.12/site-packages/numpy.libs/libscipy_openblas64_-8fb3d286.so
version: 0.3.30
threading_layer: pthreads
architecture: Haswell
user_api: blas
internal_api: openblas
num_threads: 16
prefix: libscipy_openblas
filepath: /home/arthur/dev-perso/scikit-learn/sklearn-env/lib/python3.12/site-packages/scipy.libs/libscipy_openblas-b75cc656.so
version: 0.3.29.dev
threading_layer: pthreads
architecture: Haswell
user_api: openmp
internal_api: openmp
num_threads: 16
prefix: libgomp
filepath: /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0
version: NoneReactions are currently unavailable