Add partial_fit function to decision trees #35
PSSF23 wants to merge 58 commits into neurodata:fork from
Conversation
    y,
    sample_weight=None,
    check_input=True,
    classes=None,
Do we need to add classes keyword argument, or is this due to an outdated API?
This is to accommodate the tree building, not the sklearn API I think. The tree needs to know all the potential classes from the beginning, as not all classes are necessarily included in the first batch of data. If new classes are added later, there are errors on the y dimensions.
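The dimension issue can be illustrated with a small sketch (plain NumPy, not the PR's code): counting labels against a class list that is declared once keeps the per-node value arrays the same width across batches, even when an individual batch is missing some classes.

```python
import numpy as np

# Sketch: why an incremental fit needs the full class list up front.
# A tree's value arrays have one column per class; if the first batch
# only contains a subset of classes, later batches would otherwise
# change that dimension.

def encode_counts(y_batch, classes):
    """Count labels in a batch against a fixed, pre-declared class list."""
    counts = np.zeros(len(classes), dtype=int)
    for label in y_batch:
        counts[np.searchsorted(classes, label)] += 1
    return counts

classes = np.array([0, 1, 2])   # all potential classes, declared once
batch1 = np.array([0, 0, 1])    # class 2 is absent from the first batch
batch2 = np.array([2, 2, 1])

total = encode_counts(batch1, classes) + encode_counts(batch2, classes)
print(total)  # [2 2 2] -- stable 3-class shape across batches
```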
    builder_ : TreeBuilder instance
        The underlying TreeBuilder object.
Is there any state preserved in the initial builder that we want to preserve when we call update?
I.e. do we need to preserve the builder necessarily?
I need to check but maybe not.
The builder object is normally initialized in fit and contains some parameters like min_samples_split, splitter, and criterion. The initialization (checks and such) of these parameters spans many lines of code, so I'll just leave the object as it is for now. We can optimize it later by modularizing those steps.
I think I understand a bit more what needs to be changed. I think if you want a light-weight addition, you can implement:

- an extra cpdef method on the "builder", `initialize_node_queue`, which takes a list of false roots, which we will call `initial_roots`. To determine a node, all you store in `false_roots` seems to be the "parent node" and the "is_left" flag(?), so this can just be a list of tuples of integer and bool.
- add `initial_roots` as an attribute to the builder class. This should be instantiated during `__cinit__` to some empty data structure.
- modify the existing `build` function to check if `initial_roots` is empty: if so, build as is; otherwise, add the false roots to the PriorityHeap or Stack respectively, depending on whether it is the BestFirstTreeBuilder or the DepthFirstTreeBuilder. This is the "extra code" you have to add to initialize the queues in each builder.

In this way, `partial_fit` can just call `initialize_node_queue` before calling `build`, which handles initializing a queue of the false roots. WDYT?
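A plain-Python sketch of this suggestion (the names `initialize_node_queue` and `initial_roots`, and the `(parent, is_left)` tuples, come from this thread; the real builders are Cython classes and the queue is a Stack or PriorityHeap, not a Python list):

```python
# Minimal sketch of the proposed builder change, under the assumptions
# stated above; not the actual Cython implementation.

class DepthFirstBuilderSketch:
    def __init__(self):
        # per the suggestion, instantiated to an empty structure in __cinit__
        self.initial_roots = []

    def initialize_node_queue(self, false_roots):
        # each false root is just (parent_node_id, is_left)
        self.initial_roots = [(int(p), bool(left)) for p, left in false_roots]

    def build(self):
        stack = []
        if not self.initial_roots:
            # no false roots: build from scratch, as the builder does today
            stack.append(("root", None, False))
        else:
            # seed the stack with the false roots so building resumes there
            for parent, is_left in self.initial_roots:
                stack.append(("false_root", parent, is_left))
        self.initial_roots = []  # consumed; next build starts fresh
        return stack

builder = DepthFirstBuilderSketch()
builder.initialize_node_queue([(3, True), (5, False)])
print(builder.build())  # stack seeded with the two false roots
```

With this split, a `partial_fit` method only has to compute the false roots for the new batch, call `initialize_node_queue`, and then reuse the existing `build` entry point unchanged.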
Looks good to me. Thanks a lot. I will see if the technical part can work out this way. How do you feel about the `partial_fit` method in
Not as familiar w/ the API for partial_fit, but it looks pretty simple, so I would say it should be okay.
FYI @PSSF23 I've migrated the entire set of changes into the
Thanks @adam2392 ! I was focusing on ICML reviews and will have more time to optimize the code.
Started working on the root initialization and method modulation. The original My current attempt begins by updating
I don't think it matters. You only use
Yeah that sounds right I think? Cuz the builder just needs to keep track of where you are in the tree. The splitter just splits the data that it's given and pumps out a new node. The builder then should know where to put the node and make sure the tree is connected well. Perhaps start with writing the By the way, I would also start from
Actually, I screwed up that branch currently cuz I was trying to incorporate the latest API changes in scikit-learn 1.4dev0 (i.e. monotonicity constraints in trees), but it didn't work. I would start a new branch off of You can name it
Updated version in #50 |
Reference Issues/PRs
neurodata/treeple#40
What does this implement/fix? Explain your changes.
Enables `DecisionTreeClassifier` to incrementally update the tree structure with new data batches.

Any other comments?
@adam2392 This is a branch copy I made so feel free to modify it.
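For reference, a hypothetical sketch of the call pattern this change enables. The stand-in class below only mimics the interface; `partial_fit` on the real `DecisionTreeClassifier` is what this PR implements, and the `classes` keyword follows the discussion above about declaring all classes on the first call.

```python
import numpy as np

# Stand-in with the same partial_fit signature discussed in this PR.
# It only records the class list; the real implementation also updates
# the underlying tree structure batch by batch.

class IncrementalTreeStub:
    def partial_fit(self, X, y, classes=None):
        if classes is not None and not hasattr(self, "classes_"):
            self.classes_ = np.asarray(classes)  # declared once, first call
        return self

clf = IncrementalTreeStub()
X1, y1 = np.zeros((3, 2)), np.array([0, 0, 1])
X2, y2 = np.zeros((3, 2)), np.array([2, 1, 2])
clf.partial_fit(X1, y1, classes=[0, 1, 2])  # all classes declared up front
clf.partial_fit(X2, y2)                     # later batches add data only
print(clf.classes_)  # [0 1 2]
```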