FEA Categorical split support for DecisionTree*, and RandomForest* #33354
adam2392 wants to merge 20 commits into scikit-learn:main
I got bored over the weekend, so here is this feature I've been dying to get into scikit-learn! Yay snow.
@cakedev0 Your review would help a lot.
The HGBT uses bitsets, see …
Good point @lorentzenchr! I was hoping we could do that in a follow-on PR to keep the diff relatively contained. I tried it in the previous PR, but there's a lot of diff from moving things around, essentially. I.e. item 4 :p
I will for sure give it a review. I took a quick look already and it looks pretty nice at first sight 👍 you took the best of our preceding PRs, I think ^^ (though I don't fully remember them 😅). I took a look at the possible TODOs, and here is my opinion:
One thing that concerns me, and that we'll need to address before really publicizing the categorical support, is the risk of over-fitting, see cakedev0#2 (comment). But I think it's fine to address it later, in another PR. Also, I'd appreciate it if we merged #32119 first; it's been open for a while and it will have some conflicts with this one... (nothing hard to solve, but still...). If you have some boredom left, can you give it a review? 😉
Can you copy the adapted …
I think we should add an option for ordinal encoding by number of occurrences in the training set (independently of this PR), see: #32161. Feel free to say so there so that we can remove the "Needs decision: include features" label and unleash the flow of contributors ;) |
cakedev0
left a comment
Here is a first pass: many small things, no big blockers. I'll let you clean up, add the split tests, and fix the CI before I make a second pass.
from sklearn.utils._random cimport our_rand_r

# =============================================================================
# Bitset for decision-tree functions
Add a comment to mention it's copied from HGB (it is, right?).
Slightly copied, but independently implemented to reduce complexity and dependencies in this PR. Added a TODO comment to consolidate.
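For readers following along, here is a hedged Python sketch of the multi-word bitset idea the tree code uses (the actual implementation is in Cython; the function names here are illustrative, not the PR's API). A split on a categorical feature stores a bitset, and category `c` goes to the left child iff bit `c` is set:

```python
import numpy as np

def init_bitset(n_categories):
    # one 64-bit word per 64 categories, so cardinality > 64 just
    # means more words rather than a hard limit
    n_words = (n_categories + 63) // 64
    return np.zeros(n_words, dtype=np.uint64)

def set_bit(bitset, c):
    # mark category c as "goes left"
    bitset[c // 64] |= np.uint64(1) << np.uint64(c % 64)

def in_bitset(bitset, c):
    # does category c go left at this split?
    return bool((bitset[c // 64] >> np.uint64(c % 64)) & np.uint64(1))

bs = init_bitset(100)  # needs 2 words since 100 > 64
set_bit(bs, 3)
set_bit(bs, 70)
```

This is the same word-and-offset arithmetic regardless of whether the storage is a single `uint64_t` or a `uint64_t*`.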
node_ndarray = d['nodes']
value_ndarray = d['values']

# If the array is big-endian, swap it back to native first:
Can you add a comment about that being needed because we have bitsets in nodes (or maybe it's because we have a union type?).
I'm not sure, I suspect it's because of:
# cdef Node dummy
# NODE_DTYPE = np.asarray(<Node[:1]>(&dummy)).dtype
being removed in favor of the more explicit type definition. I re-tried the unit tests without these LOC, and I get an error on test_different_endianness_pickle, confirming that we do need to ensure the byte ordering is robust across systems with different endianness.
ChatGPT didn't give a great answer as to why this didn't occur with the old code.
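To make the endianness point concrete, here is a minimal standalone sketch (not the PR's actual unpickling code) of how NumPy swaps a buffer pickled on an opposite-endian machine back to native order before the bytes are reinterpreted:

```python
import numpy as np

# simulate a float64 array that was written on a big-endian machine
native = np.arange(4, dtype=np.float64)
big_endian = native.astype(native.dtype.newbyteorder(">"))

# byteswap() swaps the raw bytes; view(...newbyteorder()) flips the
# dtype's declared order to match, so the values are preserved
restored = big_endian.byteswap().view(big_endian.dtype.newbyteorder())
```

With a structured node dtype the same idea applies field by field, which is why the explicit dtype definition mentioned above has to carry the byte-order information.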
.. versionadded:: 1.4

categorical_features : array-like of int or bool of shape (n_features,) or
Do you want to handle support in forests in this PR? I'd rather leave that for a second PR (one that would also add support for categorical_features="from_dtype" like in HGB).
Btw, I think it would be nice if ExtraTrees/RF/GB called the tree's data validation; the logic should be shareable. For now, I see duplicated logic, and I'm not sure why it's this way... (feels like a good exploration topic for a possible future PR ^^)
Yeah, I can remove the forest stuff and leave it pending for a downstream PR. I added it to enable the demo shown in the PR description.
I'll remove it once we are closer to merge as I might want to re-run benchmarks (if ppl ask).
Sorry, I had forgotten to put the link in this message 😅 https://github.com/cakedev0/scikit-learn/blob/adam_categorical/sklearn/tree/tests/test_split.py
Is there a test for equivalence of …
I can definitely add one. I suppose as long as the features have fewer than 64 categories, then the trees should be "equivalent"? Let me try and see if there's any devil in the details.
# store the number of categories per feature
# Note: it is expected to be passed in with a defined number
# per feature via the Python caller. So this array will never be
Okay, addressed most comments, and added a test of equivalence vs HGBT. The PR is quite long, so I'm open to doing some of the Cython refactoring in another PR. The unit tests pass, augmented with @cakedev0's more explicit tree unit tests, so this should give high assurance we're doing things right.
cakedev0
left a comment
I started a quick review but I don't have the bandwidth to finish it for now.
# node.threshold = _TREE_UNDEFINED
node.split_value.threshold = <float64_t> _TREE_UNDEFINED
Let's remove this commented line. And do you need <float64_t> here?
Suggested change:
- # node.threshold = _TREE_UNDEFINED
- node.split_value.threshold = <float64_t> _TREE_UNDEFINED
+ node.split_value.threshold = _TREE_UNDEFINED
Higher-level considerations: when reading issue #32161 from @ogrisel, I discovered the … For now, on the same dataset as @adam2392 used for his benchmark/plots, I get better results with …
I think that native support for categoricals is essential, instead of a pipeline with one-hot encoding. This has nothing to do with model performance, but rather with modelling convenience. A nice additional benefit is that fit time improves for tree-based models. The use of target encoding should be a conscious choice by the user, not automatic.
@lorentzenchr I completely agree with you on that! For explainability too, native support is so much better. But I'd still like to have a clear idea of the model's performance gains compared to other alternatives. It's not blocking for this PR, but I think it's a valuable piece of information for documentation/communication/future directions...
Reference Issues/PRs
Supersedes #12866, #4899, #7068, and #3346.
Supersedes #29437 because the diff was too complicated relative to main.
Took some inspiration from cakedev0#2, so I would say this is co-authored with @cakedev0. Instead of only using a uint64_t to encode the bitset, I opt for a uint64_t*, which future-proofs our ability to handle much higher-cardinality categories; this will be useful for categorical support in ExtraTrees*.

What does this implement/fix? Explain your changes.
Implements splitting rules when categorical data is given to a decision tree classifier/regressor. For now, it is limited to the Breiman shortcut, to demonstrate the value and the main implementation changes. Downstream features (outlined in the next section) can be added more simply after this PR.
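For context, a hedged pure-Python sketch of the Breiman shortcut (illustrative only; the PR implements this in Cython with impurity criteria, not this helper): for regression or binary classification, sort the categories by their mean target value, and the optimal binary partition is one of the K-1 "prefix" splits in that order, so only K-1 candidates are evaluated instead of 2^(K-1) - 1 subsets:

```python
import numpy as np

def best_categorical_split(x_cat, y):
    """Return the set of categories sent to the left child (SSE criterion)."""
    cats = np.unique(x_cat)
    # Breiman's trick: order categories by their mean target value
    means = np.array([y[x_cat == c].mean() for c in cats])
    order = cats[np.argsort(means)]
    best_sse, best_left = np.inf, None
    # only K-1 contiguous "prefix" partitions need to be checked
    for k in range(1, len(order)):
        left = np.isin(x_cat, order[:k])
        sse = ((y[left] - y[left].mean()) ** 2).sum()
        sse += ((y[~left] - y[~left].mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_left = sse, set(order[:k].tolist())
    return best_left

x = np.array([0, 0, 1, 1, 2, 2])
y = np.array([1.0, 1.2, 5.0, 5.2, 1.1, 0.9])
# category 1 has a clearly different mean target, so the best
# partition isolates it from categories 0 and 2
```

The search stays linear in the number of categories, which is what makes native categorical splits tractable at higher cardinality.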
- ParentInfo, SplitRecord, Node -> _utils.pxd

Also ran this benchmark, which compares one-hot encoding (OHE), ordinal (ord), and native categorical support (cat); I had to group-encode rare categories to fit within the constraint of 256 max categories:
There are two results worth looking at. The first uses the default parameters, where max_features = 'sqrt'; I shuffled the categories after ordinal encoding:



The second is where we fix max_features=None, so it samples all features.

(Interestingly, if you order the categories by their occurrence, then the ordinal approach does quite well. This probably opens up a door of heuristics that we can add later for categorical support; there is no one right answer for everything, so I think that's totally reasonable.)
Here, OHE is abysmally slow, and performance is essentially equivalent, with native categorical support being on average the best. Surprisingly, the ordinal strategy is quite good. It could be because I'm cutting categorical support short by group-encoding rare categories, which could be informative. OHE also has ~10x deeper trees and almost 3-5x longer fit time.
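The "order categories by their occurrence" heuristic mentioned above can be sketched with the existing scikit-learn API (the column name and data are made up for illustration): pass frequency-sorted category lists to OrdinalEncoder so more frequent categories get smaller codes.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"color": ["red", "blue", "red", "green", "red", "blue"]})

# sort each column's categories by descending frequency so the most
# frequent category receives code 0
by_freq = [df[col].value_counts().index.tolist() for col in df.columns]
enc = OrdinalEncoder(categories=by_freq)
codes = enc.fit_transform(df)
# red (most frequent) -> 0.0, blue -> 1.0, green -> 2.0
```

Unlike a shuffled ordinal encoding, this gives the threshold splitter an ordering that is at least correlated with category prevalence, which may explain part of the benchmark gap.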
Open TODOs in a downstream PR:

- goes_left operation to work at inference time too

Any other comments?
May be related to #24967