[MRG] Proxy improvement methods added to entropy/gini #5233
jmschrei wants to merge 1 commit into scikit-learn:master from
Conversation
Can you add benchmarks using covertype and mnist?
I think that the three previous lines could be combined without loss of clarity.
Hm, Travis is not happy. I tried that briefly in #5220 and got beaten by exactly the same error. I believe this is due to numerical instabilities and to approximations of big numbers in floating-point arithmetic.
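A minimal sketch (my own illustration, not code from this PR) of why the collapsed `count * log(count)` form can be numerically touchier than the normalized entropy: it cancels large terms of similar magnitude, whereas the normalized form only ever manipulates probabilities in `[0, 1]`.

```python
import numpy as np

# Hypothetical large class counts for one node; not taken from the PR.
counts = np.array([3e8, 7e8], dtype=np.float64)
total = counts.sum()

# Collapsed proxy: subtracts two ~2e10-sized quantities.
proxy = np.sum(counts * np.log(counts)) - total * np.log(total)

# Normalized entropy: works with probabilities in [0, 1].
entropy = -np.sum((counts / total) * np.log(counts / total))

# Mathematically proxy == -total * entropy exactly; in float64 they agree
# only up to the rounding error introduced by the cancelling large terms.
print(proxy, -total * entropy)
```

The identity holds to high relative accuracy here, but the absolute cancellation grows with the magnitude of the counts, which is consistent with the instability suspected above.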
I have checked whether we have the same numerical instabilities for regression, but it doesn't seem to be the case. PS: note that
I don't have much time today to dig into this, but it is very important to make sure that variable importances converge to their true theoretical values. (We should add a test for that -- I'll do it.) In addition, variable importances should be invariant with respect to scaling the sample weights, as
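The invariance claim above can be checked with a quick sketch (my own example, not part of this PR): uniformly scaling `sample_weight` should leave the impurity-based importances unchanged, since every weighted count scales by the same factor and the importances are normalized.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data; any classification dataset would do.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
w = np.ones(len(y))

imp_a = RandomForestClassifier(n_estimators=25, random_state=0).fit(
    X, y, sample_weight=w).feature_importances_
imp_b = RandomForestClassifier(n_estimators=25, random_state=0).fit(
    X, y, sample_weight=10 * w).feature_importances_

# Importances are normalized to sum to 1, so a uniform rescaling of the
# sample weights should not change them.
print(np.allclose(imp_a, imp_b))
```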
The more trees you pass in, the closer to the theoretical values you get, but you don't get them exactly. |
Strange, this should be removed, no?
Ah, yes. It's weird that the multi-output tests passed with this.
I am not getting a significant speed-up on covtype or mnist, and there are other errors, so I am going to close this PR.
Thanks for your hard work!
This pull request implements explicit proxy improvement methods for the entropy and gini criteria, to complement the one added in #5203. Entropy is changed by refactoring the equation

```
-weighted_n_left / weighted_n_total * -sum( count_left_i / weighted_n_left * log( count_left_i / weighted_n_left ) )
- weighted_n_right / weighted_n_total * -sum( count_right_i / weighted_n_right * log( count_right_i / weighted_n_right ) )
```

into

```
sum( count_left_i * log( count_left_i ) ) + sum( count_right_i * log( count_right_i ) )
- weighted_n_left * log( weighted_n_left ) - weighted_n_right * log( weighted_n_right )
```

which collapses some terms and caches the calculations involving `weighted_n_left` and `weighted_n_right`. The gini calculation is similarly collapsed into
Here are timing tests. Entropy seems to take basically the same amount of time, whereas gini seems to run slightly faster. I was unsure whether this was worth merging, so I figured I'd see what you had to say.
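To sanity-check the refactoring described above, here is a small sketch (my own, not code from this PR) verifying that the collapsed expression equals the original weighted-entropy form up to the constant positive factor `1 / weighted_n_total`, so both rank candidate splits identically:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical class counts on each side of a candidate split.
count_left = rng.integers(1, 50, size=3).astype(float)
count_right = rng.integers(1, 50, size=3).astype(float)
w_left, w_right = count_left.sum(), count_right.sum()
w_total = w_left + w_right

# Original form: negative weighted sum of the children's entropies.
p_l = count_left / w_left
p_r = count_right / w_right
original = (-w_left / w_total * -np.sum(p_l * np.log(p_l))
            - w_right / w_total * -np.sum(p_r * np.log(p_r)))

# Collapsed proxy from the PR description.
proxy = (np.sum(count_left * np.log(count_left))
         + np.sum(count_right * np.log(count_right))
         - w_left * np.log(w_left)
         - w_right * np.log(w_right))

# The proxy equals the original up to the positive factor 1 / w_total,
# which does not depend on the split, so split rankings are preserved.
print(proxy / w_total, original)
```

Dropping the split-independent `1 / weighted_n_total` factor is exactly what makes this a "proxy" improvement: cheaper to compute per candidate split, while selecting the same best split.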
@glouppe @arjoly @ogrisel