Add partial_fit function to decision trees #35
PSSF23 wants to merge 58 commits into neurodata:fork from
Conversation
    y,
    sample_weight=None,
    check_input=True,
    classes=None,
Do we need to add classes keyword argument, or is this due to an outdated API?
This is to accommodate the tree building, not the sklearn API I think. The tree needs to know all the potential classes from the beginning, as not all classes are necessarily included in the first batch of data. If new classes are added later, there are errors on the y dimensions.
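The dimension issue can be illustrated with a small sketch (plain NumPy, not the PR's code): counting labels against a class list that is declared once keeps the per-node value arrays the same width across batches, even when an individual batch is missing some classes.

```python
import numpy as np

# Sketch: why an incremental fit needs the full class list up front.
# A tree's value arrays have one column per class; if the first batch
# only contains a subset of classes, later batches would otherwise
# change that dimension.

def encode_counts(y_batch, classes):
    """Count labels in a batch against a fixed, pre-declared class list."""
    counts = np.zeros(len(classes), dtype=int)
    for label in y_batch:
        counts[np.searchsorted(classes, label)] += 1
    return counts

classes = np.array([0, 1, 2])   # all potential classes, declared once
batch1 = np.array([0, 0, 1])    # class 2 is absent from the first batch
batch2 = np.array([2, 2, 1])

total = encode_counts(batch1, classes) + encode_counts(batch2, classes)
print(total)  # [2 2 2] -- stable 3-class shape across batches
```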
    builder_ : TreeBuilder instance
        The underlying TreeBuilder object.
Is there any state preserved in the initial builder that we want to preserve when we call update?
I.e. do we need to preserve the builder necessarily?
I need to check but maybe not.
The builder object is normally initialized in fit and contains some parameters like min_samples_split, splitter, and criterion. The initialization (checks and such) of these parameters spans many lines of code, so I'll just leave the object as it is for now. We can optimize it later by modularizing those steps.
I think I understand a bit more what needs to be changed. I think if you want a light-weight addition, you can implement:

- an extra cpdef method on the "builder", `initialize_node_queue`, which takes a list of false roots, which we will call `initial_roots`. To determine a node, all you store in `false_roots` seems to be the "parent node" and the "is_left" flag(?), so this can just be a list of tuples of integer and bool.
- add `initial_roots` as an attribute to the builder class. This should be instantiated during `__cinit__` to some empty data structure.
- modify the existing `build` function to check if `initial_roots` is empty: if so, build as is; otherwise, add the false roots to the PriorityHeap or Stack respectively, depending on whether it is the BestFirstTreeBuilder or the DepthFirstTreeBuilder. This is the "extra code" you have to add to initialize the queues in each builder.

In this way, `partial_fit` can just call `initialize_node_queue` before calling `build`, which handles initializing a queue of the false roots. WDYT?
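A plain-Python sketch of this suggestion (the names `initialize_node_queue` and `initial_roots`, and the `(parent, is_left)` tuples, come from this thread; the real builders are Cython classes and the queue is a Stack or PriorityHeap, not a Python list):

```python
# Minimal sketch of the proposed builder change, under the assumptions
# stated above; not the actual Cython implementation.

class DepthFirstBuilderSketch:
    def __init__(self):
        # per the suggestion, instantiated to an empty structure in __cinit__
        self.initial_roots = []

    def initialize_node_queue(self, false_roots):
        # each false root is just (parent_node_id, is_left)
        self.initial_roots = [(int(p), bool(left)) for p, left in false_roots]

    def build(self):
        stack = []
        if not self.initial_roots:
            # no false roots: build from scratch, as the builder does today
            stack.append(("root", None, False))
        else:
            # seed the stack with the false roots so building resumes there
            for parent, is_left in self.initial_roots:
                stack.append(("false_root", parent, is_left))
        self.initial_roots = []  # consumed; next build starts fresh
        return stack

builder = DepthFirstBuilderSketch()
builder.initialize_node_queue([(3, True), (5, False)])
print(builder.build())  # stack seeded with the two false roots
```

With this split, a `partial_fit` method only has to compute the false roots for the new batch, call `initialize_node_queue`, and then reuse the existing `build` entry point unchanged.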
Looks good to me. Thanks a lot. I will see if the technical part can work out this way. How do you feel about the `partial_fit` method in
Not as familiar w/ the API for partial_fit, but it looks pretty simple, so I would say it should be okay.
FYI @PSSF23 I've migrated the entire set of changes into the
Thanks @adam2392 ! I was focusing on ICML reviews and will have more time to optimize the code.
Started working on the root initialization and method modulation. The original My current attempt begins by updating
I don't think it matters. You only use
Yeah that sounds right I think? Cuz the builder just needs to keep track of where you are in the tree. The splitter just splits the data that it's given and pumps out a new node. The builder then should know where to put the node and make sure the tree is connected well. Perhaps start with writing the By the way, I would also start from
Actually, I screwed up that branch currently cuz I was trying to incorporate the latest API changes in scikit-learn 1.4dev0 (i.e. monotonicity constraints in trees), but it didn't work. I would start a new branch off of You can name it
Updated version in #50 |
Reference Issues/PRs
neurodata/treeple#40
What does this implement/fix? Explain your changes.
Enables `DecisionTreeClassifier` to incrementally update the tree structure with new data batches.

Any other comments?
@adam2392 This is a branch copy I made so feel free to modify it.
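For reference, a hypothetical sketch of the call pattern this change enables. The stand-in class below only mimics the interface; `partial_fit` on the real `DecisionTreeClassifier` is what this PR implements, and the `classes` keyword follows the discussion above about declaring all classes on the first call.

```python
import numpy as np

# Stand-in with the same partial_fit signature discussed in this PR.
# It only records the class list; the real implementation also updates
# the underlying tree structure batch by batch.

class IncrementalTreeStub:
    def partial_fit(self, X, y, classes=None):
        if classes is not None and not hasattr(self, "classes_"):
            self.classes_ = np.asarray(classes)  # declared once, first call
        return self

clf = IncrementalTreeStub()
X1, y1 = np.zeros((3, 2)), np.array([0, 0, 1])
X2, y2 = np.zeros((3, 2)), np.array([2, 1, 2])
clf.partial_fit(X1, y1, classes=[0, 1, 2])  # all classes declared up front
clf.partial_fit(X2, y2)                     # later batches add data only
print(clf.classes_)  # [0 1 2]
```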