FEA Categorical split support for `DecisionTree*`, and `RandomForest*` by adam2392 · Pull Request #33354 · scikit-learn/scikit-learn

adam2392 · 2026-02-22T04:40:33Z

Reference Issues/PRs

Supersedes #12866 #4899 and #7068 , #3346

Supersedes #29437 because the diff was too complicated relative to main.

Took some inspiration from: cakedev0#2, so I would say we should say this is co-authored with @cakedev0. Instead of only using a uint64_t to encode the bitset, I opt for uint64_t*, which future proofs our ability to handle much larger cardinality categories, which will be useful for the categorical support in ExtraTrees*.

What does this implement/fix? Explain your changes.

Implements splitting rules when categorical data is given in a decision tree classifier/regressor. For now, limits it to use the breiman shortcut to demonstrate the value and the main implementation changes. Downstream features (outlined in the next section) can be more simply added after this PR.

Tree data structures are moved to maintain the cimport/import sequence (ParentInfo, SplitRecord, Node) -> _utils.pxd
Adds partitioning of samples given a category (this complements nicely the partitioning of samples given a numerical threshold)
Adds splitting logic given a category
add deterministic unit-tests, performance tests, and basic error checks. The deterministic test for a tree I think is correct... Its very useful because it checks that we are tracking categorical splitting over multiple split nodes (at least 2 levels).
implemented the breiman shortcut.

Also ran this benchmark, which compares one-hot-encoding (OHE), ordinal (ord), and native-categorical support (cat); I had to group-encode rare categories to make this all fit within the constraint of "256" max categories:

gist

There are two results worth looking at. The first is using default parameters where max_features = 'sqrt'. I shuffled up the categories after ordinal encoding.

The second is where we fix max_features=None, so it samples all features:

(Interestingly, if you order the categories by their occurrence, then the ordinal approach does quite well. This probably opens up a door of heuristics that we can later on add for categorical support. there is no 1 right-answer for everything, so I think it's totally reasonable).

Here, OHE is abysmally slow, and performance is essentially equivalent with native categorical support being on average the best. Surprisingly the ordinal strategy is quite good. It could be due to the fact that maybe I'm cutting categorical support down by group-encoding rare categories, which could be informative. It has ~10+x deeper trees, and almost 3-5x longer fittime.

Open TODO In a downstream PR

[Feature] Add a bitset cache (i.e. a uint64_t memory view) that allows us to compute the category split as we traverse the tree for inference speedup
[Feature] Brute-force search; expands categorical support to multi-class and multi-output problems
[Refactor] Consolidate the goes_left operation to work at inference time too
[Refactor] Understand the bitset operations and merge these with the HistGradientBoosting bitset operations
[Feature] Implement categorical splitting for RandomSplitter for ExtraTrees. Pretty easy now that we have an arbitrary sized bitset, which we can expand to hold many categories in the case of a randomized tree
[Feature] Add support for sparse categorical splitting with best and random splitter
[Feature] Add support for categorical in exporting tree

Any other comments?

May be related to #24967

adam2392 · 2026-02-22T04:42:03Z

I got bored over the weekend, so here is this feature I've been dying to get into scikit-learn! Yay snow.

lorentzenchr · 2026-02-22T09:32:30Z

@cakedev0 Your review would help a lot.

[Feature] Add a bitset cache (i.e. a uint64_t memory view) that allows us to compute the category split as we traverse the tree for inference speedup

The HGBT uses bitsets, see sklearn/ensemble/_hist_gradient_boosting/_bitset.pyx. Maybe, one can put it into utils and reuse here.

adam2392 · 2026-02-22T12:49:39Z

Good point @lorentzenchr! I was hoping we can do that in a follow on PR to keep the diff relatively contained. I tried it in the previous PR but there's a lot of diff from moving things around essentially.

I.e. item 4 :p

cakedev0 · 2026-02-23T09:22:34Z

I will for sure give it a review. I took a quick look already and it looks pretty nice at first sight 👍 you took the best of our preceding PRs I think ^^ (though I don't fully remember them 😅 )

I took a look at the possible TODOs, and here is my opinon

I don't really understand what it is (but probably should be done in an other PR as it's just an optimization)
Let's do it in another PR.
That sounds like a great idea! I think you can give it a try in this PR if you want, I don't think it will increase the diff too much. But it can also wait for another PR (let's not forget about it though! ^^).
I agree that it's better for the diff to do it later (should be a quick merge just after this one)
I don't have it clearly in mind, but if it's really easy / doesn't increase the diff much, maybe you can do it in this PR?
Sparse can really wait I think ^^
Let's do it in another PR (note: along with correctly handling the "missing go to left/right" display).

One thing that concerns me, and that we'll need to address before really publicizing the categorical support is the risk of over-fitting, see cakedev0#2 (comment). But I think it's fine to address it later, in another PR.

Also, I'd appreciate that we merge #32119 first, it's been opened for a while and it will have some conflicts with this one... (nothing hard to solve, but still...). If have some boredom left, can you give it a review? 😉

cakedev0 · 2026-02-23T09:41:52Z

Can you copy the adapted test_split.py from here? I'd like to have this test to continue covering the split logic entirely.

ogrisel · 2026-02-23T17:08:03Z

(Interestingly, if you order the categories by their occurrence, then the ordinal approach does quite well. This probably opens up a door of heuristics that we can later on add for categorical support. there is no 1 right-answer for everything, so I think it's totally reasonable).

I think we should add an option for ordinal encoding by number of occurrences in the training set (independently of this PR), see: #32161. Feel free to say so there so that we can remove the "Needs decision: include features" label and unleash the flow of contributors ;)

cakedev0

Here is a first pass, many small things, no big blockers. I'll let you clean up, add the split tests, and fix the CI before making a second path.

sklearn/tree/meson.build

cakedev0 · 2026-03-04T15:58:03Z

sklearn/tree/_utils.pyx

 from sklearn.utils._random cimport our_rand_r

+# =============================================================================
+# Bitset for decision-tree functions


Add a comment to mention it's copied from HGB (it is, right?).

Slightly copied, but independently implemented to reduce complexity and dependency in this PR. Added a TODO comment to consolidate

sklearn/tree/_tree.pyx

cakedev0 · 2026-03-04T16:19:10Z

sklearn/tree/_tree.pyx

        node_ndarray = d['nodes']
        value_ndarray = d['values']

+        # If the array is big-endian, swap it back to native first:


Can you add a comment about that being needed because with have bitsets in nodes (or maybe it's because we have a union type?).

I'm not sure, I suspect it's because of:

# cdef Node dummy # NODE_DTYPE = np.asarray(<Node[:1]>(&dummy)).dtype

being removed in favor of the more explicit type definition. I re-tried the unit-tests without these LOC, and I get an error on the test_different_endianness_pickle just to confirm that indeed we need to ensure the ordering of the bytes to be robust in different endianness systems.

ChatGPT didn't give a great answer as to why this didn't occur with the old code.

sklearn/tree/_tree.pyx

sklearn/tree/_partitioner.pyx

cakedev0 · 2026-03-04T17:08:21Z

sklearn/ensemble/_forest.py


        .. versionadded:: 1.4

+    categorical_features : array-like of int or bool of shape (n_features,) or


Do you want to handle support in forests in this PR? I'd rather leave that for a second PR (one that would also add the support for categorical_features="from_dtype" like in HGB).

Btw, I think it would be nice that ExtaTrees/RF/GB call tree data validation, the logic should be shareable. For now, I see duplicated logic, and I'm sure why it's this way... (feels like a good exploration topic for a possible future PR ^^)

Yeah I can remove the forest stuff, and pend for a downstream PR. I added it to enable the demo shown in the PR description.

I'll remove it once we are closer to merge as I might want to re-run benchmarks (if ppl ask).

sklearn/tree/_classes.py

sklearn/tree/_splitter.pyx

sklearn/tree/_splitter.pxd

cakedev0 · 2026-03-04T17:49:40Z

Can you copy the adapted test_split.py from here? I'd like to have this test to continue covering the split logic entirely.

Sorry I had forgotten to put the link in this message 😅

https://github.com/cakedev0/scikit-learn/blob/adam_categorical/sklearn/tree/tests/test_split.py

lorentzenchr · 2026-03-04T19:12:04Z

Is there a test for equivalence of DecisionTreeRegressor and HGBT with just one tree (learning rate=1).

adam2392 · 2026-03-04T21:29:07Z

I can definitely add one. I suppose as long as the categories have less than 64 categories, then the trees should be "equivalent"?

Let me try and see if there's any devil in details.

cakedev0 · 2026-03-07T09:54:40Z

sklearn/tree/_tree.pyx


        # store the number of categories per feature
+        # Note: it is expected to be passed in with a defined number
+        #  per feature via the Python caller. So this array will never be


Suggested change

# per feature via the Python caller. So this array will never be

# per feature via the Python caller. So this array will never be

adam2392 · 2026-03-07T15:41:27Z

Kay addressed most comments, and added a test of equivalence vs HGBT. The PR is quite long, so I'm open to doing some of the refactoring Cython in another PR.

The unit-tests pass, augmented with @cakedev0's more explicit tree unit-tests, so should give a high assurance we're doing things right.

cakedev0

I started a quick review but I don't have the bandwidth to finish it for now.

cakedev0 · 2026-03-07T15:55:00Z

sklearn/tree/_tree.pyx

+                    # node.threshold = _TREE_UNDEFINED
+                    node.split_value.threshold = <float64_t> _TREE_UNDEFINED


Let's remove this commented line. And do you need <float64_t> here?

Suggested change

# node.threshold = _TREE_UNDEFINED

node.split_value.threshold = <float64_t> _TREE_UNDEFINED

node.split_value.threshold = _TREE_UNDEFINED

cakedev0 · 2026-03-09T12:48:59Z

Higher level considerations: when reading this issue #32161 from @ogrisel, I discovered the TargetEncoder which is basically equivalent to the breiman shortcut idea (but only at the first split). But the benefit I see to TargetEncoder is the decreased risk of overfitting, the main downside is that you don't split on categories (you collapse/mix categories).

For now, on the same dataset than @adam2392 used for his benchmark/plots, I get better results with TargetEncoder. But to make a fair comparison, I would need to make some HPO. I'd try to make something clean and share it.

lorentzenchr · 2026-03-09T13:00:31Z

I discovered the TargetEncoder which is basically equivalent to the breiman shortcut idea (but only at the first split). But the benefit I see to TargetEncoder is the decreased risk of overfitting, the main downside is that you don't split on categories (you collapse/mix categories).

I think that native support for categorical is essential, instead of a pipeline with one-hot-encoding. This has nothing to do with model performance but rather with modelling convenience. A nice additional benefit is that fit time improves for tree based models.

The use of target encoding should be a conscious one by a user, not automatic.

cakedev0 · 2026-03-09T13:11:06Z

@lorentzenchr I completely agree with you on that! For explainability too, native support is so much better.

But I'd still like to have a clear idea of the model's performance gains compared to other alternatives. It's not blocking for this PR, but I think it's a valuable piece of information for documentation/communication/future directions...

lorentzenchr · 2026-03-16T14:31:12Z

You are already 2, @adam2392, @cakedev0. Can you ping me again after one reviewer approval?

adamjli added 4 commits February 21, 2026 13:52

Adding working python to _tree.pyxpxd

03a5c89

Adding updates to make it work

b41b789

Merge branch 'main' into nocats-v2

003c0a7

Ran linter on everything

44e316d

github-actions bot added cython module:ensemble module:tree labels Feb 22, 2026

adam2392 changed the title ~~Nocats v2~~ FEA Categorical split support for DecisionTree*, and RandomForest* Feb 22, 2026

adam2392 requested review from adrinjalali, lesteve, ogrisel and thomasjpfan February 22, 2026 04:42

adam2392 mentioned this pull request Feb 22, 2026

FEA Categorical split support for DecisionTree*, ExtraTree*, RandomForest* and `ExtraTrees* #29437

Closed

adam2392 added 2 commits February 22, 2026 09:49

Fix unit-tests

4d236de

Fix chnagelog and unit-teset errors

118145b

github-actions bot added the CI:Linter failure The linter CI is failing on this PR label Feb 22, 2026

Fix lint

f85168d

github-actions bot removed the CI:Linter failure The linter CI is failing on this PR label Feb 22, 2026

Fix pickle and docs

987f03f

github-actions bot added the CI:Linter failure The linter CI is failing on this PR label Feb 22, 2026

adam2392 added 2 commits February 22, 2026 11:40

Clean diff

8ff13c9

Clean diff

777ba3c

github-actions bot removed the CI:Linter failure The linter CI is failing on this PR label Feb 22, 2026

cakedev0 reviewed Mar 4, 2026

View reviewed changes

Fix the classes

ce51840

adam2392 mentioned this pull request Mar 6, 2026

Add an option to OrdinalEncoder to sort encoding by decreasing frequencies #32161

Open

adam2392 added 2 commits March 6, 2026 17:21

Lint

e36dd13

Adding the test equivalence

f58b88f

github-actions bot added the CI:Linter failure The linter CI is failing on this PR label Mar 6, 2026

adam2392 added 4 commits March 6, 2026 18:40

Remove diff in meson.build

a0238cb

Address comments

13e9f23

Remove some unnecessary issues

95920e5

Fix lint

a6b6f47

github-actions bot removed the CI:Linter failure The linter CI is failing on this PR label Mar 7, 2026

adam2392 requested a review from cakedev0 March 7, 2026 03:23

cakedev0 reviewed Mar 7, 2026

View reviewed changes

adam2392 added 2 commits March 7, 2026 10:37

Fix docstring and type error in testing equivalence

40edcb5

Merge branch 'main' into nocats-v2

b91cbdf

cakedev0 reviewed Mar 9, 2026

View reviewed changes

lorentzenchr mentioned this pull request Mar 16, 2026

[RFC] Tree module improvements #5212

Closed

12 tasks

Merge branch 'main' into nocats-v2

4aef21b


		.. versionadded:: 1.4

		categorical_features : array-like of int or bool of shape (n_features,) or

	# per feature via the Python caller. So this array will never be
	# per feature via the Python caller. So this array will never be

		# node.threshold = _TREE_UNDEFINED
		node.split_value.threshold = <float64_t> _TREE_UNDEFINED

	# node.threshold = _TREE_UNDEFINED
	node.split_value.threshold = <float64_t> _TREE_UNDEFINED
	node.split_value.threshold = _TREE_UNDEFINED

Uh oh!

Conversation

adam2392 commented Feb 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Open TODO In a downstream PR

Any other comments?

Uh oh!

adam2392 commented Feb 22, 2026

Uh oh!

lorentzenchr commented Feb 22, 2026

Uh oh!

adam2392 commented Feb 22, 2026

Uh oh!

cakedev0 commented Feb 23, 2026

Uh oh!

cakedev0 commented Feb 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ogrisel commented Feb 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cakedev0 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cakedev0 commented Mar 4, 2026

Uh oh!

lorentzenchr commented Mar 4, 2026

Uh oh!

adam2392 commented Mar 4, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adam2392 commented Mar 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cakedev0 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cakedev0 commented Mar 9, 2026

Uh oh!

lorentzenchr commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cakedev0 commented Mar 9, 2026

Uh oh!

lorentzenchr commented Mar 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

adam2392 commented Feb 22, 2026 •

edited

Loading

cakedev0 commented Feb 23, 2026 •

edited

Loading

ogrisel commented Feb 23, 2026 •

edited

Loading

adam2392 commented Mar 7, 2026 •

edited

Loading

lorentzenchr commented Mar 9, 2026 •

edited

Loading