[MRG+3] Fix min_weight_fraction_leaf to work when sample_weights are not provided (#7301)
Conversation
sklearn/tree/tree.py
Outdated
    if sample_weight is None:
        min_weight_leaf = int(ceil(self.min_weight_fraction_leaf *
                                   n_samples))
        min_samples_leaf = max(min_samples_leaf, min_weight_leaf)
I'm not sure whether it's better to have this block as what it is, or
    min_samples_leaf = max(min_samples_leaf,
                           int(ceil(self.min_weight_fraction_leaf * n_samples)))
    min_weight_leaf = 0  # (seeing as min_weight_leaf is unnecessary when sample_weight is None)
or simply omit the min_samples_leaf = max(...) and just set min_weight_leaf since both min_weight_leaf and min_samples_leaf are provided to the Criterion object.
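For concreteness, the second option above could be sketched as a standalone helper. This is a hypothetical function for illustration only (`resolve_leaf_constraints` is not a real scikit-learn name), not the actual `sklearn/tree/tree.py` code:

```python
from math import ceil

def resolve_leaf_constraints(min_samples_leaf, min_weight_fraction_leaf,
                             n_samples, sample_weight=None):
    # Hypothetical helper mirroring the second option discussed above.
    if sample_weight is None:
        # Uniform weights: the weight fraction maps directly onto a
        # sample count, so min_weight_leaf itself can stay 0.
        min_samples_leaf = max(min_samples_leaf,
                               int(ceil(min_weight_fraction_leaf * n_samples)))
        min_weight_leaf = 0.0
    else:
        # With explicit weights, the threshold is a fraction of the
        # total weight, not of the number of samples.
        min_weight_leaf = min_weight_fraction_leaf * sum(sample_weight)
    return min_samples_leaf, min_weight_leaf
```

Both resulting values would then be handed to the Criterion object, which is why the third option (dropping the `max` entirely) also works.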
I don't understand what you mean by:
or simply omit the min_samples_leaf = max(...) and just set min_weight_leaf since both min_weight_leaf and min_samples_leaf are provided to the Criterion object.
But I think I am fine with the first two suggestions.
ok, i'll just leave it as it is.
The max seems odd to me. Why not just remove it?
Yeah, that's what I was trying to propose in my last sentence on my original comment (was probably unclear, sorry). I haven't tested it but it should work... And it seems a bit more logical too
yeah, it seems that removing the call to max works as well. The max here is implicit because it's implemented in Criterion anyway.
LGTM.

You don't currently test the interaction between min_samples_leaf and min_weight_fraction_leaf.

Good idea, their interaction is simply that the max is the bound, right?

maybe test that with no sample weights, the two parameters have the same effect?
@amueller so I'm actually of the opinion now that they shouldn't have the same effect... I pushed a test that checks if they are the same; it crashes and burns, but in a justified manner I think.

First, I changed the [...]

So onto why they shouldn't have the same effect: say that we fit on a dataset with 5 samples, and provide no sample_weight [...]

Which is very different. This approach does have the downside that (in this case), setting [...]

I think we should take this approach, but I feel like a warning might be good if [...]

edit: there are cases where the two parameters have the same effect, but they do not always.
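The 5-sample scenario above can be made concrete with a little arithmetic (the fraction 0.5 is chosen here purely for illustration):

```python
from math import ceil

# Illustrative numbers: 5 samples, no sample_weight provided,
# so each sample implicitly has weight 1.
n_samples = 5
min_weight_fraction_leaf = 0.5

# The weight threshold is fractional: a leaf must carry total
# weight >= 2.5 ...
min_weight_leaf = min_weight_fraction_leaf * n_samples      # 2.5

# ... whereas translating it into a sample count first rounds up
# to a whole number of samples, a strictly stronger constraint
# whenever the product is not an integer.
rounded_min_samples = int(ceil(min_weight_leaf))            # 3
```

A leaf with exactly weight 2.5 is impossible with unit weights, so the weight-based check (>= 2.5) effectively requires 3 samples too here, but the two thresholds are computed differently and diverge in general.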
There are probably some hand-crafted values that could yield equal values... I suppose that those could serve as a useful test? Not sure if it is necessary, though. What do you guys think?
since the two parameters don't give the same effect on uniform weighted data when [...]

It does seem reasonable for the two grown trees to be equal under this scenario, though.
this is ready to be looked at again; removing the +1 because there have been significant changes to the tests.
I am sorry but I don't get what you are trying to say. |
sklearn/ensemble/forest.py
Outdated
    min_weight_fraction_leaf : float, optional (default=0.)
        The minimum weighted fraction of the input samples required to be at a
-       leaf node.
+       leaf node where weights are determined by ``sample_weight`` provided
I'd write it as -
The minimum weighted fraction of the sum total of weights (of all the input samples) required
to be at a leaf node.
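Read literally, that wording means the threshold is computed from the sum of the weights rather than from the sample count, e.g. (weights below are hypothetical values for illustration):

```python
# Hypothetical weights illustrating the "sum total of weights" wording:
sample_weight = [0.5, 1.0, 1.0, 2.5]
min_weight_fraction_leaf = 0.2

# The threshold is a fraction of the total weight (here 5.0),
# not of the number of samples (here 4).
min_weight_leaf = min_weight_fraction_leaf * sum(sample_weight)   # 1.0
```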
On 08/31/2016 07:52 PM, Nelson Liu wrote:
@jnothman and I discussed this a bit in #7338 (specifically #7338 (comment)) , could you check it out?
I believe it's because weights can be split, but samples cannot.
The min_weight_leaf condition shouldn't be there. It's plain wrong. The min_samples_leaf condition is only an optimisation, and the same kind of optimisation can't be easily achieved for weight. The real work happens in the splitter.
I'm tempted to follow the second option at #6945 and either raise an error or a warning if [...]
But I'm okay with the "assume sample_weight=1" solution too...?

That solution would definitely be far easier to implement, but I think that assuming sample_weight=1 is more intuitive. However, it's equally counterintuitive that the results for [...]
No, the difference is not a bug and should not be fixed.

On 8 September 2016 at 08:09, Nelson Liu notifications@github.com wrote: [...]
I'm happy to accept what you implemented too.

On 8 September 2016 at 10:58, Joel Nothman joel.nothman@gmail.com wrote: [...]
ok
In that vein, I've addressed @raghavrv's comments. Perhaps this would be good for 0.18?
sklearn/ensemble/forest.py
Outdated
-       The minimum weighted fraction of the input samples required to be at a
-       leaf node.
+       The minimum weighted fraction of the sum total of weights (of all
+       the input samples) required to be at a leaf node.
It is worth noting, "Samples have equal weight when sample_weight is not provided, but min_samples_leaf is more efficient."
sklearn/ensemble/forest.py
Outdated
-       leaf node.
+       The minimum weighted fraction of the sum total of weights (of all
+       the input samples) required to be at a leaf node. Samples have
+       equal weight when sample_weight is not provided, but
I think we should now drop "but min_samples_leaf is more efficient" if we've realised we can use the 2 * min_weight_leaf change.
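The 2 * min_weight_leaf bound mentioned here can be sketched as a standalone check. This is a simplification of the split-feasibility test the Cython splitter performs, not the real implementation:

```python
def node_can_split(node_weight, min_weight_leaf):
    # Simplified sketch of the splitter-side bound: a split is only
    # worth attempting if both children could each receive at least
    # min_weight_leaf, i.e. the node carries at least twice that weight.
    return node_weight >= 2 * min_weight_leaf
```

This is why the documented efficiency caveat becomes unnecessary: the weight-based pruning happens up front, before candidate splits are enumerated.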
This actually has @ogrisel's +1 above as well. So it's got +2 assuming nothing substantial has changed since then. Also LGTM

Pending that minor change, I should say.

sorry for getting back to this so late, been a bit busy recently. I pushed the changes to the docstrings that were requested, is there anything else needed?

A what's new entry is needed. Please put under 0.19, as I think this has [...]

On 28 September 2016 at 03:29, Nelson Liu notifications@github.com wrote: [...]
Force-pushed from 209f967 to cf72fd4
@jnothman thanks for the reminder. I put this under "Enhancements"; do you think "Bug Fixes" would be better?

If you state "previously it was silently ignored", then it belongs in bug fixes :)

@jnothman good point, that info is important. added and moved to bugfixes.

thanks @nelson-liu @amueller, I've currently assumed this is not for 0.18. Feel free to backport and move the what's new if you disagree.
…not provided (scikit-learn#7301)

* fix min_weight_fraction_leaf when sample_weights is None
* fix flake8 error
* remove added newline and unnecessary assignment
* remove max bc it's implemented in cython and add interaction test
* edit weight calculation formula and add test to check equality
* remove test that sees if two parameter build the same tree
* reword min_weight_fraction_leaf docstring
* clarify uniform weight in forest docstrings
* update docstrings for all classes
* add what's new entry
* move whatsnew entry to bug fixes and explain previous behavior
Reference Issue
Fixes #6945, previous PR at #6947