[MRG] ENH Add support for missing values to Tree based Classifiers #5974
raghavrv wants to merge 8 commits into scikit-learn:main
Conversation
@agramfort @glouppe Apologies for the ridiculous delay! Training now takes the missing values into account. Could you please take a look and tell me whether my approach is correct?
sklearn/tree/_criterion.pyx (outdated diff)
"non-missing valued" sounds weird. How about "correspond to values which are not missing"?
There are also conflicts that need to be resolved.

@glouppe Thanks heaps for the review :D
```python
        self.verbose = verbose
        self.warm_start = warm_start
        self.class_weight = class_weight
        self.allow_missing = missing_values is not None
```
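For orientation, a hypothetical, simplified sketch of the constructor pattern above; only the attribute names come from the diff, everything else is illustrative:

```python
# Hypothetical sketch; only the attribute names above come from the diff.
# missing_values=None means "no missing-value support"; any other value
# (e.g. "NaN" or a sentinel number) switches it on.
class ForestClassifierSketch:
    def __init__(self, missing_values=None, verbose=0,
                 warm_start=False, class_weight=None):
        self.verbose = verbose
        self.warm_start = warm_start
        self.class_weight = class_weight
        self.missing_values = missing_values
        self.allow_missing = missing_values is not None
```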
```python
        # (We need to reverse-reset to get this partition,
        # as the criterion does not update beyond the last sample.)
        self.criterion.reverse_reset()
        self.criterion.move_missing(MISSING_DIR_RIGHT)
```
(This use case is the reason why I didn't do the rename to `move_missing_left`.)
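To make the left/right bookkeeping concrete, here is a rough NumPy sketch of the idea as I read it: evaluate the split with the missing-valued samples grouped on each side in turn, and keep the direction with the lower weighted impurity. Only `MISSING_DIR_LEFT`/`MISSING_DIR_RIGHT` come from the diff; the `gini` helper and `best_missing_direction` are illustrative and not the PR's Cython code:

```python
import numpy as np

MISSING_DIR_LEFT, MISSING_DIR_RIGHT = 0, 1

def gini(y):
    """Gini impurity of a label array (an empty array counts as pure)."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_missing_direction(y_left, y_right, y_missing):
    """Try sending the missing-valued samples left, then right, and
    return the direction with the lower weighted impurity."""
    scores = {}
    for direction in (MISSING_DIR_LEFT, MISSING_DIR_RIGHT):
        if direction == MISSING_DIR_LEFT:
            l, r = np.concatenate([y_left, y_missing]), y_right
        else:
            l, r = y_left, np.concatenate([y_right, y_missing])
        n = len(l) + len(r)
        scores[direction] = (len(l) * gini(l) + len(r) * gini(r)) / n
    return min(scores, key=scores.get)

y_left, y_right = np.array([0, 0, 1]), np.array([1, 1])
y_miss = np.array([1, 1])
best_missing_direction(y_left, y_right, y_miss)  # -> MISSING_DIR_RIGHT
```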
```python
        return self.estimators_[0]._validate_X_predict(X, check_input=True)

    def _validate_missing_mask(self, X, missing_mask=None):
```
Without a missing mask this would have to be isnan. And the reason why we decided to go with missing_mask was that NaN representations differ, which makes comparing a float to NaN costly. (Ref: #5870 (comment))
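A sketch of that trade-off (the `make_missing_mask` helper is illustrative, not the PR's actual `_validate_missing_mask`): the mask is built in one pass up front, so the hot splitting loop only reads booleans, and non-NaN sentinels are handled by plain equality:

```python
import numpy as np

def make_missing_mask(X, missing_values="NaN"):
    # NaN never compares equal to itself, so it needs np.isnan;
    # any other sentinel value can use plain equality.
    if isinstance(missing_values, str) and missing_values == "NaN":
        return np.isnan(X)
    return X == missing_values

X = np.array([[1.0, np.nan], [3.0, 4.0]])
mask = make_missing_mask(X)  # computed once, reused by every split
```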
@glouppe @jmschrei What do you think about refactoring a minimally merge-able subset ("all tree builders, just the dense best splitter, all classification criteria") out as a separate PR? Avoiding the sparse splitter should make this PR more reviewable... And in the new PR, I can try to make each commit self-contained and ask for a review as and when I push. That way it should be easier to review things and fast-track the process.
Yes, a small and focused PR would be much easier to review.

Okay, thanks!

So is the status here that some of the PR is being forked out?

Yes!! And I hope to make the commits incremental to help ease the review burden...
incremental commits are less important
@raghavrv I'm curious what the status of the smaller PRs is. Which of your branches do those PRs correspond to? I've been depending on this branch in my current work to handle missing values in my random forest training. It would be nice to see this fully merged into master, and I'd be willing to help out to get it there.
Oh really? Nice to hear! :) How did you find the performance and usability? Do you have any benchmarks, comments, or shortcomings to share? I'd be really interested... I could update it with current master if you want me to? I got busy with another thing, but the state of this PR is: done and waiting for reviews. Except we have planned to avoid supporting all cases in the first go, and instead add a small subset of what is here (say, only for dense matrices, the best splitter, and depth-first tree building) to help make the review a less daunting task...
Ah sorry, that was a specific question... :) I've not made those smaller branches yet... I'll try to make them soon. Next week? (It's always motivating if someone finds your work useful ;))
I've only been using this at a basic level. I would link to an example of my code, but unfortunately we've had to close-source the repo for political reasons until the work is published. Instead, I can link a gist that shows a toy example I made when testing out this PR: https://gist.github.com/Erotemic/c9532be23e21a44a63d0cc77ed2e65fd As far as rebasing the branch on master goes, that would be extremely helpful. I've had to write a script to do this and recover from conflicts, because I need to integrate this functionality into sklearn whenever I set up a Python environment on a new machine.
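For readers who cannot open the gist, a toy example in the same spirit. This assumes the `missing_values` constructor argument added by this branch; the exact spelling is an assumption on my part, and stock scikit-learn of this era rejects NaN input instead:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = (X[:, 0] > 0.5).astype(int)     # labels computed before knockout
X[rng.rand(*X.shape) < 0.2] = np.nan  # knock out ~20% of the entries

# missing_values="NaN" is the flag added by this branch (assumed
# spelling); unpatched scikit-learn would raise on NaN input.
clf = RandomForestClassifier(n_estimators=10, missing_values="NaN")
clf.fit(X, y)
print(clf.predict(X[:5]))
```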
@raghavrv the latest rebase lives here: https://github.com/unravelin/scikit-learn/tree/rf_missing. I need this branch for better control over how missing values are handled at the company I work for.
@Erotemic and others - @glemaitre and I are working on a rewrite of the tree code to make it parallel (and, if possible, faster even in single-threaded mode)... I'm unable to spare the bandwidth to keep this branch rebased :/ @afiodorov Thanks for sharing the rebase! Much appreciated... If you could keep it rebased and synced with the latest master for another month or two, that would be awesome! I'll take it from there... possibly by introducing the missing-value functionality into the new rewrite...
@raghavrv I completely understand the scarcity of bandwidth. It's much more important to have the rewrite of the branch (and I'm very much hoping it includes missing-value support).
Will this PR eventually support a random-forest-based missing-data imputer in the vein of missForest? A quick read through the discussion suggests that it will not, and that it is mostly focused on fitting decision trees / RFs in the presence of NaN rather than imputing them. Is that correct, or did I miss something? I ask because I was thinking about working on a missForest-type Imputer but wanted to make sure I was not duplicating any work. Would much appreciate your feedback. Thanks!
@ashimb9 this branch will allow feature vectors passed to the fit/predict methods to contain missing values. I'm not familiar with what a missForest-style imputer involves, but I don't believe an imputer is part of this PR.
@Erotemic Yeah, I did not think an Imputer was part of things, but I just wanted to make sure. Anyway, thank you for responding. PS: In case you were wondering, missForest is an R package for imputing missing values using random forests.
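For reference, the missForest idea (iteratively re-imputing each column using a forest trained on the remaining columns) can be approximated in current scikit-learn with `IterativeImputer`; a small sketch:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(50, 3)
X[rng.rand(*X.shape) < 0.1] = np.nan

# Each feature with missing entries is modeled as a function of the
# others, and the fill-ins are refined over several rounds.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=0),
    max_iter=5, random_state=0)
X_filled = imputer.fit_transform(X)
```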
Hey @raghavrv and others, do you know when this PR will make it into production? It's definitely helpful. Thanks!
Raghav won't be completing this.

I think the main concern here was that it was hard to show that it was, in practice, helpful to the extent of justifying the added complexity. Do you have available example datasets to show that this is better than imputation (with the algorithms of https://github.com/hammerlab//fancyimpute for example)?
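For anyone who wants to run that comparison, a sketch of the imputation baseline with today's API; the native-handling estimator in the final comment is an assumed API from this branch, not stock scikit-learn:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # labels before knockout
X[rng.rand(*X.shape) < 0.25] = np.nan

# Imputation baseline: mean-fill, then a stock forest.
baseline = make_pipeline(SimpleImputer(strategy="mean"),
                         RandomForestClassifier(random_state=0))
print(cross_val_score(baseline, X, y, cv=5).mean())

# A branch with native missing-value support would be scored the same
# way, e.g. RandomForestClassifier(missing_values="NaN") (assumed API).
```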
What is the status of this? Like many people, I would be interested in having the surrogate-splitting method in sklearn's trees.
@DrEhrfurchtgebietend perhaps see #5870. But note that raghavrv is no longer with us.
Thank you for having explored this support. In the meantime, the codebase has evolved, and #23595 is now more relevant than this PR, which I think can be closed.
Fixes #5870 (Adds support to tree based classifiers, excluding ensemble methods)
For current status, notes, references - https://github.com/raghavrv/sklearn_dev_sandbox/tree/master/tree_methods_missing_val_support
TODO:

- `*Tree(s)Classifier` / `*ForestClassifier`
- `DepthFirstTreeBuilder`
- `BestFirstTreeBuilder`
- `ClassificationCriterion`
- `BestSplitter`
- `RandomSplitter`
- `BestSparseSplitter`
- `RandomSparseSplitter`
- `apply_dense`
- `apply_sparse_csc`
- `drop_values` function to generate missing values - [MRG+2-1] ENH add a ValueDropper to artificially insert missing values (NMAR or MCAR) to the dataset #7084

NOTE:

- rpart (surrogate splits) seems promising with respect to the relative accuracy scores reported in Ding and Simonoff's paper. It needs some refactoring to fit our API, it is widely used, and, importantly, it will work even if the training data had no missing values. (See the sketch below.)

CC: @agramfort @glouppe @jmschrei @arjoly @tguillemot
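Since surrogate splits come up both in the note above and earlier in the thread, here is a rough illustrative sketch of the selection step (not rpart's exact procedure, which also ranks surrogates against a majority-direction baseline):

```python
import numpy as np

def best_surrogate(X, primary_feature, primary_thr, candidate_features):
    """Pick the candidate split that most often routes samples the same
    way as the primary split (X[:, primary_feature] <= primary_thr).
    Missing-valued samples would then be routed by this surrogate."""
    observed = ~np.isnan(X[:, primary_feature])
    went_left = X[observed, primary_feature] <= primary_thr
    best_feat, best_thr, best_agree = None, None, 0.0
    for f in candidate_features:
        col = X[observed, f]
        ok = ~np.isnan(col)          # surrogate must itself be observed
        if not ok.any():
            continue
        for thr in np.unique(col[ok]):
            agree = np.mean((col[ok] <= thr) == went_left[ok])
            agree = max(agree, 1.0 - agree)  # a mirrored surrogate counts too
            if agree > best_agree:
                best_feat, best_thr, best_agree = f, thr, agree
    return best_feat, best_thr, best_agree
```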
Thanks a lot to @glouppe, @agramfort, @TomDLT & @vighneshbirodkar for all the patience and help (in and out of github)!