[MRG] ENH Add support for missing values to Tree based Classifiers #5974
raghavrv wants to merge 8 commits into scikit-learn:main
Conversation
@agramfort @glouppe Apologies for the ridiculous delay! Training now takes the missing values into account. Could you please take a look and tell me whether my approach is correct?
sklearn/tree/_criterion.pyx (outdated diff)
"non-missing valued" sounds weird. How about "correspond to values which are not missing"?
There are also conflicts that need to be resolved.

@glouppe Thanks heaps for the review :D
```python
        self.verbose = verbose
        self.warm_start = warm_start
        self.class_weight = class_weight
        self.allow_missing = missing_values is not None
```
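For orientation, a hypothetical, simplified sketch of the constructor pattern above; only the attribute names come from the diff, everything else is illustrative:

```python
# Hypothetical sketch; only the attribute names above come from the diff.
# missing_values=None means "no missing-value support"; any other value
# (e.g. "NaN" or a sentinel number) switches it on.
class ForestClassifierSketch:
    def __init__(self, missing_values=None, verbose=0,
                 warm_start=False, class_weight=None):
        self.verbose = verbose
        self.warm_start = warm_start
        self.class_weight = class_weight
        self.missing_values = missing_values
        self.allow_missing = missing_values is not None
```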
```python
        # (We need to reverse-reset to get this partition,
        # as the criterion does not update beyond the last sample.)
        self.criterion.reverse_reset()
        self.criterion.move_missing(MISSING_DIR_RIGHT)
```
(This use case is the reason why I didn't do the rename to `move_missing_left`.)
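To make the left/right bookkeeping concrete, here is a rough NumPy sketch of the idea as I read it: evaluate the split with the missing-valued samples grouped on each side in turn, and keep the direction with the lower weighted impurity. Only `MISSING_DIR_LEFT`/`MISSING_DIR_RIGHT` come from the diff; the `gini` helper and `best_missing_direction` are illustrative and not the PR's Cython code:

```python
import numpy as np

MISSING_DIR_LEFT, MISSING_DIR_RIGHT = 0, 1

def gini(y):
    """Gini impurity of a label array (an empty array counts as pure)."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_missing_direction(y_left, y_right, y_missing):
    """Try sending the missing-valued samples left, then right, and
    return the direction with the lower weighted impurity."""
    scores = {}
    for direction in (MISSING_DIR_LEFT, MISSING_DIR_RIGHT):
        if direction == MISSING_DIR_LEFT:
            l, r = np.concatenate([y_left, y_missing]), y_right
        else:
            l, r = y_left, np.concatenate([y_right, y_missing])
        n = len(l) + len(r)
        scores[direction] = (len(l) * gini(l) + len(r) * gini(r)) / n
    return min(scores, key=scores.get)

y_left, y_right = np.array([0, 0, 1]), np.array([1, 1])
y_miss = np.array([1, 1])
best_missing_direction(y_left, y_right, y_miss)  # -> MISSING_DIR_RIGHT
```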
```python
        return self.estimators_[0]._validate_X_predict(X, check_input=True)

    def _validate_missing_mask(self, X, missing_mask=None):
```
Without a missing mask this would have to be isnan. And the reason why we decided to go with missing_mask was that NaN representations differ, which makes comparing a float to NaN costly. (Ref: #5870 (comment))
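A sketch of that trade-off (the `make_missing_mask` helper is illustrative, not the PR's actual `_validate_missing_mask`): the mask is built in one pass up front, so the hot splitting loop only reads booleans, and non-NaN sentinels are handled by plain equality:

```python
import numpy as np

def make_missing_mask(X, missing_values="NaN"):
    # NaN never compares equal to itself, so it needs np.isnan;
    # any other sentinel value can use plain equality.
    if isinstance(missing_values, str) and missing_values == "NaN":
        return np.isnan(X)
    return X == missing_values

X = np.array([[1.0, np.nan], [3.0, 4.0]])
mask = make_missing_mask(X)  # computed once, reused by every split
```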
@glouppe @jmschrei What do you think about refactoring a minimally merge-able subset ("all tree builders, just the dense best splitter, all classification criteria") out as a separate PR? Avoiding the sparse splitter should make this PR more reviewable... And in the new PR, I can try to make each commit self-contained and ask for a review as and when I push. That way it should be easier to review things and fast-track the process.
Yes, a small and focused PR would be much easier to review.

Okay, thanks!

So is the status here that some of the PR is being forked out?

Yes!! And I hope to make the commits incremental to help ease the review burden...
incremental commits are less important
@raghavrv I'm curious what the status of the smaller PRs is. Which of your branches do those PRs correspond to? I've been depending on this branch in my current work to handle missing values in my random forest training. It would be nice to see this fully merged into master, and I'd be willing to help out to get it there.
Oh really? Nice to hear! :) How did you find the performance and usability? Do you have any benchmarks, comments, or shortcomings to share? I'd be really interested... I could update it with current master if you want me to? I got busy with another thing, but the state of this PR is: done and waiting for reviews. Except we have planned to avoid supporting all cases in the first go, and instead add a small subset of what is here (say, only for dense matrices, the best splitter, and depth-first tree building) to help make the review a less daunting task...
Ah sorry, that was a specific question... :) I've not made those smaller branches yet... I'll try to make them soon. Next week? (It's always motivating if someone finds your work useful ;))
I've only been using this at a basic level. I would link to an example of my code, but unfortunately we've had to close-source the repo for political reasons until the work is published. Instead, I can link a gist that shows a toy example I made when testing out this PR: https://gist.github.com/Erotemic/c9532be23e21a44a63d0cc77ed2e65fd As far as rebasing the branch on master goes, that would be extremely helpful. I've had to write a script to do this and recover from conflicts, because I need to integrate this functionality into sklearn whenever I set up a Python environment on a new machine.
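For readers who cannot open the gist, a toy example in the same spirit. This assumes the `missing_values` constructor argument added by this branch; the exact spelling is an assumption on my part, and stock scikit-learn of this era rejects NaN input instead:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = (X[:, 0] > 0.5).astype(int)     # labels computed before knockout
X[rng.rand(*X.shape) < 0.2] = np.nan  # knock out ~20% of the entries

# missing_values="NaN" is the flag added by this branch (assumed
# spelling); unpatched scikit-learn would raise on NaN input.
clf = RandomForestClassifier(n_estimators=10, missing_values="NaN")
clf.fit(X, y)
print(clf.predict(X[:5]))
```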
@raghavrv the latest rebase lives here: https://github.com/unravelin/scikit-learn/tree/rf_missing. I need this branch for better control over how missing values are handled at the company I work for.
@Erotemic and others - @glemaitre and I are working on a rewrite of the tree code to make it parallel (and, if possible, faster even in single-threaded mode)... I'm unable to spare the bandwidth to keep this branch rebased :/ @afiodorov Thanks for sharing the rebase! Much appreciated... If you could keep it rebased and synced with the latest master for another month or two, that would be awesome! I'll take it from there... possibly by introducing the missing-value functionality into the new rewrite...
@raghavrv I completely understand the scarcity of bandwidth. It's much more important to have the rewrite of the branch (and I'm very much hoping it includes missing-value support).
Will this PR eventually support a random-forest-based missing-data imputer in the vein of missForest? A quick read through the discussion suggests that it will not, and that it is mostly focused on fitting decision trees / RFs in the presence of NaN rather than imputing them. Is that correct, or did I miss something? I ask because I was thinking about working on a missForest-type Imputer but wanted to make sure I was not duplicating any work. Would much appreciate your feedback. Thanks!
@ashimb9 this branch will allow feature vectors passed to the fit/predict methods to contain missing values. I'm not familiar with what a missForest-style imputer involves, but I don't believe an imputer is part of this PR.
@Erotemic Yeah, I did not think an Imputer was part of things, but I just wanted to make sure. Anyway, thank you for responding. PS: In case you were wondering, missForest is an R package for imputing missing values using random forests.
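For reference, the missForest idea (iteratively re-imputing each column using a forest trained on the remaining columns) can be approximated in current scikit-learn with `IterativeImputer`; a small sketch:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(50, 3)
X[rng.rand(*X.shape) < 0.1] = np.nan

# Each feature with missing entries is modeled as a function of the
# others, and the fill-ins are refined over several rounds.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=0),
    max_iter=5, random_state=0)
X_filled = imputer.fit_transform(X)
```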
Hey @raghavrv and others, do you know when this PR will make it into production? It's definitely helpful. Thanks!
Raghav won't be completing this.

I think the main concern here was that it was hard to show that it was, in practice, helpful to the extent of justifying the added complexity. Do you have available example datasets to show that this is better than imputation (with the algorithms of https://github.com/hammerlab//fancyimpute for example)?
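For anyone who wants to run that comparison, a sketch of the imputation baseline with today's API; the native-handling estimator in the final comment is an assumed API from this branch, not stock scikit-learn:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # labels before knockout
X[rng.rand(*X.shape) < 0.25] = np.nan

# Imputation baseline: mean-fill, then a stock forest.
baseline = make_pipeline(SimpleImputer(strategy="mean"),
                         RandomForestClassifier(random_state=0))
print(cross_val_score(baseline, X, y, cv=5).mean())

# A branch with native missing-value support would be scored the same
# way, e.g. RandomForestClassifier(missing_values="NaN") (assumed API).
```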
What is the status of this? Like many people, I would be interested in having the surrogate-splitting method in sklearn's trees.
@DrEhrfurchtgebietend perhaps see #5870. But note that raghavrv is no longer with us.
Thank you for having explored this support. In the meantime, the codebase has evolved, and #23595 is now more relevant than this PR, which I think can be closed.
Fixes #5870 (Adds support to tree based classifiers, excluding ensemble methods)
For current status, notes, references - https://github.com/raghavrv/sklearn_dev_sandbox/tree/master/tree_methods_missing_val_support
TODO:

- `*Tree(s)Classifier` / `*ForestClassifier`
- `DepthFirstTreeBuilder`
- `BestFirstTreeBuilder`
- `ClassificationCriterion`
- `BestSplitter`
- `RandomSplitter`
- `BestSparseSplitter`
- `RandomSparseSplitter`
- `apply_dense`
- `apply_sparse_csc`
- `drop_values` function to generate missing values - [MRG+2-1] ENH add a ValueDropper to artificially insert missing values (NMAR or MCAR) to the dataset #7084

NOTE:

- rpart (surrogate splits) seems promising with respect to the relative accuracy scores reported in Ding and Simonoff's paper. It needs some refactoring to fit our API, it is widely used, and, importantly, it will work even if the training data had no missing values. (See the sketch below.)

CC: @agramfort @glouppe @jmschrei @arjoly @tguillemot
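Since surrogate splits come up both in the note above and earlier in the thread, here is a rough illustrative sketch of the selection step (not rpart's exact procedure, which also ranks surrogates against a majority-direction baseline):

```python
import numpy as np

def best_surrogate(X, primary_feature, primary_thr, candidate_features):
    """Pick the candidate split that most often routes samples the same
    way as the primary split (X[:, primary_feature] <= primary_thr).
    Missing-valued samples would then be routed by this surrogate."""
    observed = ~np.isnan(X[:, primary_feature])
    went_left = X[observed, primary_feature] <= primary_thr
    best_feat, best_thr, best_agree = None, None, 0.0
    for f in candidate_features:
        col = X[observed, f]
        ok = ~np.isnan(col)          # surrogate must itself be observed
        if not ok.any():
            continue
        for thr in np.unique(col[ok]):
            agree = np.mean((col[ok] <= thr) == went_left[ok])
            agree = max(agree, 1.0 - agree)  # a mirrored surrogate counts too
            if agree > best_agree:
                best_feat, best_thr, best_agree = f, thr, agree
    return best_feat, best_thr, best_agree
```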
Thanks a lot to @glouppe, @agramfort, @TomDLT & @vighneshbirodkar for all the patience and help (in and out of github)!