Skip to content

Unexpected behavior of tree splits: missing values handling is buggy? #32175

@cakedev0

Description

@cakedev0

Describe the bug

When adding a sanity check in the best split function (_splitter.pxy), I get a bunch of tests failing. This probably reveal a bug in missing values handling.

Steps/Code to Reproduce

Add those lines after this line.

current_proxy_improvement = criterion.proxy_impurity_improvement()
if current_proxy_improvement < best_proxy_improvement - 1:
    raise ValueError(f"Unconsistent improvement {current_proxy_improvement} < {best_proxy_improvement}" )

And then run the following tests: pytest sklearn/ensemble/tests/test_forest.py sklearn/tree/tests/

Expected Results

No error is thrown, proving the final partitionning of samples is optimal and the children impurities are the correct/optimal ones.

Actual Results

Many tests fail. I will split them into two categories, based on the alleged cause:

  1. MAE criterion
    Errors that have likely the same cause to than those two issues: DecisionTreeRegressor with absolute error criterion: non-optimal split #32099 BUG Median not always being calculated correctly for DecisionTrees in the WeightedMedianCalculator #10725. The current implementation of the MAE criterion is slightly buggy. My PR Fix: improve speed of trees with MAE criterion from O(n^2) to O(n log n) #32100 will fix it.
FAILED sklearn/ensemble/tests/test_forest.py::test_importances[ExtraTreesRegressor-absolute_error-float64] - ValueError: Unconsistent improvement -9.0 < -4.0
FAILED sklearn/ensemble/tests/test_forest.py::test_importances[ExtraTreesRegressor-absolute_error-float32] - ValueError: Unconsistent improvement -9.0 < -4.0

On the branch of my PR those tests don't fail.

  1. Missing values
    Many tests related to missing values are failing. As explained in my PR FEA Add support for missing values in tree estimators with criterion="absolute_error" by greatly simplifying the logic #32119, the current way missing values are handled is a bit convoluted, and probably a bit buggy

Edit: now documented in #32178

FAILED sklearn/ensemble/tests/test_forest.py::test_missing_values_is_resilient[make_regression-ExtraTreesRegressor] - ValueError: Unconsistent improvement 222739.5025534111 < 230516.5948817273
FAILED sklearn/ensemble/tests/test_forest.py::test_missing_values_is_resilient[make_classification-ExtraTreesClassifier] - ValueError: Unconsistent improvement -5.2 < -3.1999999999999993
FAILED sklearn/ensemble/tests/test_forest.py::test_missing_value_is_predictive[ExtraTreesRegressor] - ValueError: Unconsistent improvement 108.8032853873815 < 110.47401111474339
FAILED sklearn/ensemble/tests/test_forest.py::test_missing_value_is_predictive[ExtraTreesClassifier] - ValueError: Unconsistent improvement -14.078947368421051 < -11.323308270676694
FAILED sklearn/tree/tests/test_tree.py::test_missing_values_poisson[ExtraTreeRegressor] - ValueError: Unconsistent improvement 2961.5788534750745 < 2972.4751934414253
FAILED sklearn/tree/tests/test_tree.py::test_missing_values_is_resilience[42-None-make_friedman1-ExtraTreeRegressor-0.07] - ValueError: Unconsistent improvement 955.3035680542237 < 992.3319785056362
FAILED sklearn/tree/tests/test_tree.py::test_missing_values_is_resilience[42-None-make_friedman1_classification-ExtraTreeClassifier-0.12] - ValueError: Unconsistent improvement -5.083333333333333 < -4.0
FAILED sklearn/tree/tests/test_tree.py::test_missing_values_is_resilience[42-ones-make_friedman1-ExtraTreeRegressor-0.07] - ValueError: Unconsistent improvement 955.3035680542237 < 992.3319785056362
FAILED sklearn/tree/tests/test_tree.py::test_missing_values_is_resilience[42-ones-make_friedman1_classification-ExtraTreeClassifier-0.12] - ValueError: Unconsistent improvement -5.083333333333333 < -4.0
FAILED sklearn/tree/tests/test_tree.py::test_missing_value_is_predictive[42-ExtraTreeClassifier-0.53] - ValueError: 

On the branch for my PR, none of those tests fails.

Versions

System:
    python: 3.12.11 (main, Aug 18 2025, 19:19:11) [Clang 20.1.4 ]
executable: /home/arthur/dev-perso/scikit-learn/sklearn-env/bin/python
   machine: Linux-6.14.0-29-generic-x86_64-with-glibc2.39

Python dependencies:
      sklearn: 1.8.dev0
          pip: None
   setuptools: 80.9.0
        numpy: 2.3.3
        scipy: 1.16.2
       Cython: 3.1.3
       pandas: None
   matplotlib: 3.10.6
       joblib: 1.5.2
threadpoolctl: 3.6.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 16
         prefix: libscipy_openblas
       filepath: /home/arthur/dev-perso/scikit-learn/sklearn-env/lib/python3.12/site-packages/numpy.libs/libscipy_openblas64_-8fb3d286.so
        version: 0.3.30
threading_layer: pthreads
   architecture: Haswell

       user_api: blas
   internal_api: openblas
    num_threads: 16
         prefix: libscipy_openblas
       filepath: /home/arthur/dev-perso/scikit-learn/sklearn-env/lib/python3.12/site-packages/scipy.libs/libscipy_openblas-b75cc656.so
        version: 0.3.29.dev
threading_layer: pthreads
   architecture: Haswell

       user_api: openmp
   internal_api: openmp
    num_threads: 16
         prefix: libgomp
       filepath: /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0
        version: None

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions