
FEA Categorical split support for DecisionTree*, and RandomForest*#33354

Open
adam2392 wants to merge 20 commits into scikit-learn:main from adam2392:nocats-v2

Conversation

@adam2392
Member

@adam2392 adam2392 commented Feb 22, 2026

Reference Issues/PRs

Supersedes #12866 #4899 and #7068 , #3346

Supersedes #29437 because the diff was too complicated relative to main.

Took some inspiration from cakedev0#2, so this should be considered co-authored with @cakedev0. Instead of using a single uint64_t to encode the bitset, I opted for a uint64_t*, which future-proofs our ability to handle much higher-cardinality categories; this will be useful for categorical support in ExtraTrees*.
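For intuition, the uint64_t* idea can be sketched in plain Python (the names here are illustrative, not the PR's actual Cython identifiers): a single 64-bit word caps cardinality at 64 categories, while an array of words scales to arbitrary cardinality.

```python
# Sketch: category c goes to the left child iff bit c of the split's
# bitset is set. A single uint64 caps out at 64 categories; a list of
# 64-bit words (the uint64_t* approach) removes that limit.

def set_category(bitset, c):
    """Route category c to the left child."""
    bitset[c // 64] |= 1 << (c % 64)

def goes_left(bitset, c):
    """Does category c go left at this split?"""
    return bool(bitset[c // 64] >> (c % 64) & 1)

n_categories = 300                          # > 64 demands multiple words
bitset = [0] * ((n_categories + 63) // 64)  # 5 words of 64 bits
set_category(bitset, 5)
set_category(bitset, 255)
print(goes_left(bitset, 5), goes_left(bitset, 255), goes_left(bitset, 6))
# → True True False
```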

What does this implement/fix? Explain your changes.

Implements splitting rules for categorical data in decision tree classifiers/regressors. For now, it is limited to the Breiman shortcut, to demonstrate the value and the main implementation changes. Downstream features (outlined in the next section) can be added more simply after this PR.

  • Tree data structures are moved to maintain the cimport/import sequence (ParentInfo, SplitRecord, Node) -> _utils.pxd
  • Adds partitioning of samples given a category (this complements nicely the partitioning of samples given a numerical threshold)
  • Adds splitting logic given a category
  • Adds deterministic unit tests, performance tests, and basic error checks. The deterministic tree test is, I believe, correct. It's very useful because it checks that we track categorical splitting over multiple split nodes (at least two levels).
  • Implements the Breiman shortcut.
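As a rough illustration of the Breiman shortcut for regression (not the PR's Cython implementation): sort categories by their mean target value, then search only the k-1 ordered split points instead of all 2^(k-1) subsets. A minimal NumPy sketch:

```python
import numpy as np

def best_categorical_split(x_cat, y):
    """Breiman shortcut sketch: order categories by their mean target,
    then scan only the k-1 ordered split points, scoring each split
    by weighted within-child variance (MSE impurity)."""
    cats = np.unique(x_cat)
    means = np.array([y[x_cat == c].mean() for c in cats])
    order = cats[np.argsort(means)]
    best_impurity, best_left = np.inf, None
    for i in range(1, len(order)):
        mask = np.isin(x_cat, order[:i])
        yl, yr = y[mask], y[~mask]
        impurity = len(yl) * yl.var() + len(yr) * yr.var()
        if impurity < best_impurity:
            best_impurity = impurity
            best_left = {int(c) for c in order[:i]}
    return best_left

x = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y = np.array([0.0, 0.1, 5.0, 5.1, 0.2, 0.1, 5.2, 5.0])
print(best_categorical_split(x, y))  # → {0, 2}
```

Note the returned split {0, 2} vs {1, 3} is not contiguous in the original category codes, which is exactly what an ordinal encoding cannot express in one split.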

Also ran this benchmark, which compares one-hot encoding (OHE), ordinal encoding (ord), and native categorical support (cat); I had to group-encode rare categories to fit everything within the 256-max-categories constraint:

There are two results worth looking at. The first uses default parameters, where max_features='sqrt'. I shuffled the categories after ordinal encoding.

(figure: categorical_benchmark)

The second fixes max_features=None, so all features are considered at each split:

(figure: categorical_benchmark)

(Interestingly, if you order the categories by their occurrence counts, the ordinal approach does quite well. This probably opens the door to heuristics we can add later for categorical support; there is no one right answer for everything, so I think that's totally reasonable.)

Here, OHE is abysmally slow, while predictive performance is essentially equivalent across strategies, with native categorical support being the best on average. Surprisingly, the ordinal strategy is quite good. That could be because group-encoding rare categories, which may themselves be informative, handicaps the native categorical approach. OHE also yields ~10x deeper trees and almost 3-5x longer fit times.

Open TODOs for downstream PRs

  1. [Feature] Add a bitset cache (i.e. a uint64_t memory view) that allows us to compute the category split as we traverse the tree for inference speedup
  2. [Feature] Brute-force search; expands categorical support to multi-class and multi-output problems
  3. [Refactor] Consolidate the goes_left operation to work at inference time too
  4. [Refactor] Understand the bitset operations and merge these with the HistGradientBoosting bitset operations
  5. [Feature] Implement categorical splitting for RandomSplitter for ExtraTrees. Pretty easy now that we have an arbitrarily sized bitset, which we can expand to hold many categories in the case of a randomized tree
  6. [Feature] Add support for sparse categorical splitting with best and random splitter
  7. [Feature] Add support for categorical features when exporting trees
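TODO item 1 could look roughly like the following (a purely hypothetical layout, not code from this PR): each categorical node's bitset words are stored contiguously in one flat array, so traversal at inference time needs only index arithmetic and a bit test.

```python
import numpy as np

# Hypothetical inference-time bitset cache: node i's words live at
# [i * WORDS_PER_NODE, (i + 1) * WORDS_PER_NODE) in one flat uint64 array.
WORDS_PER_NODE = 2                       # supports up to 128 categories/node
bitset_cache = np.zeros(2 * WORDS_PER_NODE, dtype=np.uint64)  # 2 nodes

def cache_set(node_id, category):
    """Record that `category` goes left at node `node_id`."""
    w, b = divmod(category, 64)
    bitset_cache[node_id * WORDS_PER_NODE + w] |= np.uint64(1) << np.uint64(b)

def cached_goes_left(node_id, category):
    """Bit test used while traversing the tree at predict time."""
    w, b = divmod(category, 64)
    word = bitset_cache[node_id * WORDS_PER_NODE + w]
    return bool((word >> np.uint64(b)) & np.uint64(1))

cache_set(0, 70)                         # node 0 sends category 70 left
cache_set(1, 3)                          # node 1 sends category 3 left
print(cached_goes_left(0, 70), cached_goes_left(0, 3), cached_goes_left(1, 3))
# → True False True
```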

Any other comments?

May be related to #24967

@adam2392 adam2392 changed the title Nocats v2 FEA Categorical split support for DecisionTree*, and RandomForest* Feb 22, 2026
@adam2392
Member Author

I got bored over the weekend, so here is this feature I've been dying to get into scikit-learn! Yay snow.

@lorentzenchr
Member

@cakedev0 Your review would help a lot.

[Feature] Add a bitset cache (i.e. a uint64_t memory view) that allows us to compute the category split as we traverse the tree for inference speedup

The HGBT uses bitsets, see sklearn/ensemble/_hist_gradient_boosting/_bitset.pyx. Maybe, one can put it into utils and reuse here.

@adam2392
Member Author

Good point @lorentzenchr! I was hoping we could do that in a follow-on PR to keep the diff relatively contained. I tried it in the previous PR, but there's a lot of diff from moving things around.

I.e. item 4 :p

@cakedev0
Contributor

I will for sure give it a review. I took a quick look already and it looks pretty nice at first sight 👍 you took the best of our preceding PRs I think ^^ (though I don't fully remember them 😅 )

I took a look at the possible TODOs, and here is my opinion:

  1. I don't really understand what this is (but it should probably be done in another PR, as it's just an optimization)
  2. Let's do it in another PR.
  3. That sounds like a great idea! I think you can give it a try in this PR if you want, I don't think it will increase the diff too much. But it can also wait for another PR (let's not forget about it though! ^^).
  4. I agree that it's better for the diff to do it later (should be a quick merge just after this one)
  5. I don't have it clearly in mind, but if it's really easy / doesn't increase the diff much, maybe you can do it in this PR?
  6. Sparse can really wait I think ^^
  7. Let's do it in another PR (note: along with correctly handling the "missing go to left/right" display).

One thing that concerns me, and that we'll need to address before really publicizing the categorical support is the risk of over-fitting, see cakedev0#2 (comment). But I think it's fine to address it later, in another PR.

Also, I'd appreciate it if we merged #32119 first; it's been open for a while and it will have some conflicts with this one (nothing hard to solve, but still...). If you have some boredom left, can you give it a review? 😉

@cakedev0
Contributor

cakedev0 commented Feb 23, 2026

Can you copy the adapted test_split.py from here? I'd like this test to continue covering the split logic entirely.

@ogrisel
Member

ogrisel commented Feb 23, 2026

(Interestingly, if you order the categories by their occurrence, then the ordinal approach does quite well. This probably opens up a door of heuristics that we can later on add for categorical support. there is no 1 right-answer for everything, so I think it's totally reasonable).

I think we should add an option for ordinal encoding by number of occurrences in the training set (independently of this PR), see: #32161. Feel free to say so there so that we can remove the "Needs decision: include features" label and unleash the flow of contributors ;)
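The frequency-ordering heuristic mentioned above can be sketched as follows (this is a hypothetical helper illustrating the idea discussed in #32161, not an existing scikit-learn API):

```python
import numpy as np

def frequency_ordinal_encode(x):
    """Encode categories as ranks of their training-set frequency:
    the most frequent category gets code 0, the next gets 1, etc.
    Sketch of the heuristic discussed in #32161; not a scikit-learn API."""
    cats, counts = np.unique(x, return_counts=True)
    order = cats[np.argsort(-counts, kind="stable")]
    mapping = {c: i for i, c in enumerate(order)}
    return np.array([mapping[v] for v in x]), mapping

x = np.array(["b", "a", "b", "c", "b", "a"])
codes, mapping = frequency_ordinal_encode(x)
print(codes)  # → [0 1 0 2 0 1]  ("b" is most frequent, then "a", then "c")
```

With this ordering, ordinal thresholds tend to separate common categories from rare ones, which may explain why the ordinal arm of the benchmark does well under this arrangement.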

Contributor

@cakedev0 cakedev0 left a comment


Here is a first pass: many small things, no big blockers. I'll let you clean up, add the split tests, and fix the CI before I make a second pass.

from sklearn.utils._random cimport our_rand_r

# =============================================================================
# Bitset for decision-tree functions
Contributor


Add a comment to mention it's copied from HGB (it is, right?).

Member Author


Slightly copied, but independently implemented to reduce complexity and dependencies in this PR. Added a TODO comment to consolidate.

node_ndarray = d['nodes']
value_ndarray = d['values']

# If the array is big-endian, swap it back to native first:
Contributor


Can you add a comment about that being needed because we have bitsets in nodes (or maybe because we have a union type?).

Member Author


I'm not sure, I suspect it's because of:

# cdef Node dummy
# NODE_DTYPE = np.asarray(<Node[:1]>(&dummy)).dtype

being removed in favor of the more explicit type definition. I re-ran the unit tests without these LOC and got an error on test_different_endianness_pickle, confirming that we do need to ensure the byte ordering is handled to be robust on systems with different endianness.

ChatGPT didn't give a great answer as to why this didn't occur with the old code.
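For context, the byte-order issue under discussion can be reproduced with a generic structured array (plain NumPy here, not the tree's actual NODE_DTYPE):

```python
import numpy as np

# A node array pickled on a big-endian machine arrives with '>' field
# dtypes; reading its raw bytes as native C structs would scramble the
# values, so we swap back to native byte order first.
nodes = np.array([(7, 0.5)], dtype=[("feature", ">i8"), ("threshold", ">f8")])
if any(nodes.dtype[name].byteorder == ">" for name in nodes.dtype.names):
    nodes = nodes.astype(nodes.dtype.newbyteorder("="))
print(nodes["feature"][0], nodes["threshold"][0])  # values preserved: 7 0.5
```

The `astype` conversion preserves values while rewriting the underlying bytes, which is exactly what a C-level view of the buffer requires.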


.. versionadded:: 1.4

categorical_features : array-like of int or bool of shape (n_features,) or
Contributor


Do you want to handle support in forests in this PR? I'd rather leave that for a second PR (one that would also add support for categorical_features="from_dtype" like in HGB).

Btw, I think it would be nice if ExtraTrees/RF/GB called the tree data validation; the logic should be shareable. For now, I see duplicated logic, and I'm not sure why it's this way (feels like a good exploration topic for a possible future PR ^^)

Member Author


Yeah, I can remove the forest stuff and leave it for a downstream PR. I added it to enable the demo shown in the PR description.

Member Author


I'll remove it once we are closer to merging, as I might want to re-run the benchmarks (if people ask).

@cakedev0
Contributor

cakedev0 commented Mar 4, 2026

Can you copy the adapted test_split.py from here? I'd like this test to continue covering the split logic entirely.

Sorry I had forgotten to put the link in this message 😅

https://github.com/cakedev0/scikit-learn/blob/adam_categorical/sklearn/tree/tests/test_split.py

@lorentzenchr
Member

Is there a test for equivalence of DecisionTreeRegressor and HGBT with just one tree (learning_rate=1)?

@adam2392
Member Author

adam2392 commented Mar 4, 2026

I can definitely add one. I suppose that as long as a feature has fewer than 64 categories, the trees should be "equivalent"?

Let me try and see if there's any devil in the details.

@adam2392 adam2392 requested a review from cakedev0 March 7, 2026 03:23

# store the number of categories per feature
# Note: it is expected to be passed in with a defined number
# per feature via the Python caller. So this array will never be
Contributor


Suggested change
# per feature via the Python caller. So this array will never be
# per feature via the Python caller. So this array will never be

@adam2392
Member Author

adam2392 commented Mar 7, 2026

OK, addressed most comments and added a test of equivalence vs. HGBT. The PR is quite long, so I'm open to doing some of the Cython refactoring in another PR.

The unit tests pass, augmented with @cakedev0's more explicit tree unit tests, so this should give high assurance that we're doing things right.

Contributor

@cakedev0 cakedev0 left a comment


I started a quick review but I don't have the bandwidth to finish it for now.

Comment on lines +504 to +505
# node.threshold = _TREE_UNDEFINED
node.split_value.threshold = <float64_t> _TREE_UNDEFINED
Contributor


Let's remove this commented-out line. And do you need the <float64_t> cast here?

Suggested change
# node.threshold = _TREE_UNDEFINED
node.split_value.threshold = <float64_t> _TREE_UNDEFINED
node.split_value.threshold = _TREE_UNDEFINED

@cakedev0
Contributor

cakedev0 commented Mar 9, 2026

Higher-level considerations: when reading issue #32161 from @ogrisel, I discovered the TargetEncoder, which is basically equivalent to the Breiman shortcut idea (but only at the first split). The benefit I see to TargetEncoder is the decreased risk of overfitting; the main downside is that you don't split on categories (you collapse/mix them).

For now, on the same dataset as @adam2392 used for his benchmark plots, I get better results with TargetEncoder. But to make a fair comparison, I would need to do some HPO. I'll try to make something clean and share it.

@lorentzenchr
Member

lorentzenchr commented Mar 9, 2026

I discovered the TargetEncoder which is basically equivalent to the breiman shortcut idea (but only at the first split). But the benefit I see to TargetEncoder is the decreased risk of overfitting, the main downside is that you don't split on categories (you collapse/mix categories).

I think that native support for categoricals is essential, instead of a pipeline with one-hot encoding. This has nothing to do with model performance but rather with modelling convenience. A nice additional benefit is that fit time improves for tree-based models.

The use of target encoding should be a conscious one by a user, not automatic.

@cakedev0
Contributor

cakedev0 commented Mar 9, 2026

@lorentzenchr I completely agree with you on that! For explainability too, native support is so much better.

But I'd still like to have a clear idea of the model's performance gains compared to other alternatives. It's not blocking for this PR, but I think it's a valuable piece of information for documentation/communication/future directions...

@lorentzenchr lorentzenchr mentioned this pull request Mar 16, 2026
@lorentzenchr
Member

There are already two of you, @adam2392 and @cakedev0. Can you ping me again after one reviewer approves?

