Modified BaseDecisionTree so that min_weight_fraction_leaf works when… #6947
ben519 wants to merge 1 commit into scikit-learn:master
Conversation
… sample_weight is None and improved parameter description
-    if self.min_weight_fraction_leaf != 0. and sample_weight is not None:
+    if self.min_weight_fraction_leaf != 0.:
+        if sample_weight is None:
+            sample_weight = np.repeat(1., n_samples)
Firstly, we don't need to do this explicitly, since all we want is the sum of the weights, i.e. the "total weight". So you could just do min_weight_leaf = self.min_weight_fraction_leaf * n_samples.

Secondly, I see why you might want this, but it now replicates the function of min_samples_leaf. So if this is the way we go, i.e. rather than a warning, you can actually just use

min_samples_leaf = max(min_samples_leaf, int(ceil(self.min_weight_fraction_leaf * n_samples)))

In terms of testing, you should look at the existing tests for min_samples_leaf and check that min_weight_fraction_leaf offers the same behaviour.
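To make the reviewer's point concrete, here is a small sketch (made-up numbers, not scikit-learn internals): when every sample has unit weight, the total weight is just n_samples, so a weight-fraction threshold reduces to a plain sample-count threshold.

```python
# Sketch only: with unit weights, "fraction of total weight" is the same
# constraint as "fraction of n_samples".
from math import ceil

import numpy as np

n_samples = 100
min_weight_fraction_leaf = 0.25

# Explicit unit weights (what the PR constructs via np.repeat)...
sample_weight = np.repeat(1., n_samples)
min_weight_leaf = min_weight_fraction_leaf * np.sum(sample_weight)

# ...give the same threshold as using n_samples directly, as suggested.
assert min_weight_leaf == min_weight_fraction_leaf * n_samples

# And that threshold matches a min_samples_leaf of ceil(fraction * n).
equivalent_min_samples_leaf = int(ceil(min_weight_fraction_leaf * n_samples))
print(min_weight_leaf, equivalent_min_samples_leaf)  # 25.0 25
```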
When sample_weight is None, then samples are equally weighted in the Cython code. Does this addition really change anything? I believe behaviours should be identical. Do you have counter-examples where the trees that are built are actually different?
If sample_weight is None then min_weight_fraction_leaf has no effect. I think that's quite clear from the else case here.
Oh my bad, I was not aware of that else clause. Indeed. Ouch.
Could we fix this in the Cython code?
As for advice via Skype and the like, you might find the scikit-learn channel on Gitter helpful.
     # Set min_weight_leaf from min_weight_fraction_leaf
-    if self.min_weight_fraction_leaf != 0. and sample_weight is not None:
+    if self.min_weight_fraction_leaf != 0.:
You don't actually need this `if`.
@ben519 do you intend to finish off your work on this?
@jnothman I definitely can't this week. I can give it a shot this weekend, but I'd be happier if someone more knowledgeable than me took this over.
If no one else is available, I can do it. I'll be traveling next week, but starting Sept 3 I'll have about a week free that I can use to fix this up. It might be a good test of whether all that GSoC work with the tree code paid off.
@nelson-liu please go ahead and submit a PR!
Closing this in favor of #7301. |
Reference Issue
Addresses #6945
Changes
min_weight_fraction_leaf should work even when sample_weight is not given (in which case samples are assumed to have equal weight). I also tweaked the description of min_weight_fraction_leaf to make its purpose a bit clearer.
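A minimal check of the intended behaviour, assuming a scikit-learn build where this is fixed (the fix eventually landed via the follow-up work): fitting with sample_weight=None should produce the same tree as fitting with explicit uniform weights.

```python
# Check that min_weight_fraction_leaf behaves the same with sample_weight=None
# as with explicit unit weights (the equivalence this PR is about).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# sample_weight defaults to None here.
tree_none = DecisionTreeClassifier(
    min_weight_fraction_leaf=0.1, random_state=0
).fit(X, y)

# Explicit uniform weights should not change the result.
tree_unit = DecisionTreeClassifier(
    min_weight_fraction_leaf=0.1, random_state=0
).fit(X, y, sample_weight=np.ones(len(y)))

# The two trees should have identical structure.
assert tree_none.tree_.node_count == tree_unit.tree_.node_count
```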
Other comments
I'm fairly new to open-source collaboration as well as to scikit-learn, and I think my changes have room for improvement. For example, if sample_weight is given as None by the user, I reset it to an array of all 1s; I'm guessing this is a bad solution, but I could use help figuring out where and how to implement the correct one. I'm also slightly unsure of the best way to test my changes. I would really appreciate someone offering 5 or 10 minutes of their time via Skype or Google Hangouts so I can become a contributor and hopefully add more contributions in the future.