MAINT Pull apart Splitter and Partitioner in the sklearn/tree code#29458
MAINT Pull apart Splitter and Partitioner in the sklearn/tree code#29458thomasjpfan merged 24 commits intoscikit-learn:mainfrom
Conversation
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
|
Lmk if this diff is too large… I couldn't see another way to do it (e.g. incrementally), since all the functions sort of rely on each other. I guess one thing I could do is do the "sorting" functions first, and then one partitioner at a time, but that seems kind of tedious(?) |
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
|
What do you think of this refactoring proposal (no code functionality change). cc: @thomasjpfan, @glemaitre |
sklearn/tree/_partitioner.pyx
Outdated
| cdef float32_t INFINITY_32t = np.inf | ||
|
|
||
|
|
||
| cdef class Partitioner: |
There was a problem hiding this comment.
According to your opening comment, is this suppose to be BasePartitioner?
Also, is defining this base class just for defining the interface? I didn't define the interface so we can be sure there is no v-table look up. I do not think Cython would generate one, but there could be a bug in Cython that will generate it in the future, and those are hard to detect.
If we want to be safe, we can comment out the interface and let it be a comment.
There was a problem hiding this comment.
According to your opening comment, is this suppose to be
BasePartitioner?
Yes, I should change it back.
Also, is defining this base class just for defining the interface? I didn't define the interface so we can be sure there is no v-table look up. I do not think Cython would generate one, but there could be a bug in Cython that will generate it in the future, and those are hard to detect.
Yes it's just to define the interface, but I noticed no runtime regression. Just to confirm, your suggestion is to remove the abstract interface definition, and just leave one there as a multi-in-line comment?
There was a problem hiding this comment.
Just to confirm, your suggestion is to remove the abstract interface definition, and just leave one there as a multi-in-line comment?
Yea, that was my suggestion.
| # but it would have resulted in a ~10% overall tree fitting performance | ||
| # degradation caused by the overhead frequent virtual method lookups. | ||
| ctypedef fused Partitioner: | ||
| DensePartitioner |
There was a problem hiding this comment.
Since this is still a fused type, I think there should not be any v-table look ups.
But to be sure, can you check the C++ code in node_split_best to see if there is a v-table look up?
There was a problem hiding this comment.
The generated C++ code shows:
static CYTHON_INLINE int __pyx_fuse_0__pyx_f_7sklearn_4tree_9_splitter_node_split_best(struct __pyx_obj_7sklearn_4tree_9_splitter_Splitter *, struct __pyx_obj_7sklearn_4tree_12_partitioner_DensePartitioner *, struct __pyx_obj_7sklearn_4tree_10_criterion_Criterion *, struct __pyx_t_7sklearn_4tree_9_splitter_SplitRecord *, struct __pyx_t_7sklearn_4tree_5_tree_ParentInfo *); /*proto*/
static CYTHON_INLINE int __pyx_fuse_1__pyx_f_7sklearn_4tree_9_splitter_node_split_best(struct __pyx_obj_7sklearn_4tree_9_splitter_Splitter *, struct __pyx_obj_7sklearn_4tree_12_partitioner_SparsePartitioner *, struct __pyx_obj_7sklearn_4tree_10_criterion_Criterion *, struct __pyx_t_7sklearn_4tree_9_splitter_SplitRecord *, struct __pyx_t_7sklearn_4tree_5_tree_ParentInfo *); /*proto*/
static CYTHON_INLINE int __pyx_fuse_0__pyx_f_7sklearn_4tree_9_splitter_node_split_random(struct __pyx_obj_7sklearn_4tree_9_splitter_Splitter *, struct __pyx_obj_7sklearn_4tree_12_partitioner_DensePartitioner *, struct __pyx_obj_7sklearn_4tree_10_criterion_Criterion *, struct __pyx_t_7sklearn_4tree_9_splitter_SplitRecord *, struct __pyx_t_7sklearn_4tree_5_tree_ParentInfo *); /*proto*/
static CYTHON_INLINE int __pyx_fuse_1__pyx_f_7sklearn_4tree_9_splitter_node_split_random(struct __pyx_obj_7sklearn_4tree_9_splitter_Splitter *, struct __pyx_obj_7sklearn_4tree_12_partitioner_SparsePartitioner *, struct __pyx_obj_7sklearn_4tree_10_criterion_Criterion *, struct __pyx_t_7sklearn_4tree_9_splitter_SplitRecord *, struct __pyx_t_7sklearn_4tree_5_tree_ParentInfo *); /*proto*/
so there is no v-table lookup as there is a specialized function generated for each combination of node_split_best/node_split_random and Dense/Sparse Partitioner.
There was a problem hiding this comment.
For now, I reverted the BasePartitioner in e24cda9 and implemented your suggestion: #29458 (comment)
Signed-off-by: Adam Li <adam2392@gmail.com>
…into partition
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
|
Perhaps @adrinjalali you're interested in this too? This will help the categorical splitting for Decision Trees be a bit easier to review #29437. |
OmarManzoor
left a comment
There was a problem hiding this comment.
Thanks for the PR @adam2392. I think the overall structure looks nice. Could you resolve the conflicts and incorporate uint8_t here as well?
Signed-off-by: Adam Li <adam2392@gmail.com>
Thanks for the review @OmarManzoor! I have addressed these issues. |
sklearn/tree/_partitioner.pyx
Outdated
|
|
||
|
|
||
| # Sort n-element arrays pointed to by feature_values and samples, simultaneously, | ||
| # by the values in feature_values. Algorithm: Introsort (Musser, SP&E, 1997). |
There was a problem hiding this comment.
Would it be feasible to move all these sorting functions towards the end of the file so that we have the common functions in one place and the classes are present together.
There was a problem hiding this comment.
I left this alone as this was how it was structured in _splitter.pyx. These functions end up being after their corresponding "Splitter". I don't have strong feelings though.
Perhaps @thomasjpfan has some thoughts since he's browsing the tree code often :p
There was a problem hiding this comment.
Looking it over, I like @OmarManzoor suggestion and moving the functions to be under SparsePartitioner.
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
…into partition
Signed-off-by: Adam Li <adam2392@gmail.com>
…cikit-learn#29458) Signed-off-by: Adam Li <adam2392@gmail.com>
Reference Issues/PRs
Fixes: #29459
Note: this is entirely just a moving of code, nothing changes functionally, except the abstract implementation (see below). This should hopefully be pretty easy to review since no unit-tests will change, or fail.
What does this implement/fix? Explain your changes.
BasePartitionerto limit the number of repeated function definitions. I can remove this or add as an in-line comment as suggested by @thomasjpfan. I have it currently added, so lmk what you think.These are the summary of the code moves:
_splitter.pyx->_partitioner.pyx_splitter.pyx->_partitioner.pyxshift_missing_values_to_left_if_required) moved to_partitioner.pyxBasePartitionerclass, so the partitioner definition is handled in this abstract class. I can remove this if people want. See: MAINT Pull apart Splitter and Partitioner in the sklearn/tree code #29458 (comment)Any other comments?
This shouldn't introduce any performance regressions as the computational tricks used to make the code the same speed are still there:
DensePartitionerandSparsePartitionerare still decorated with@final, so they are not able to be subclassed and Cython will optimize the code_splitter.pyxis still used to define a join of theDensePartitionerandSparsePartitioner. I'm actually not 100% sure this is needed… But I think if we use thePartitionerclass in_partiitioner.pxd, this may incur performance issues via vtable lookup?I ran some benchmarks using asv on this PR branch and the one on
main, and I actually don't see that many diffs:asv run --bench RandomForestClassifierBenchmark.time_fit —verbose