Check sample weight equivalence on sparse data#30137
Check sample weight equivalence on sparse data#30137ogrisel merged 17 commits intoscikit-learn:mainfrom
Conversation
|
The proposed change (add Ideally @ogrisel what test API would we like to have ? I'm currently thinking of:
Is there a proper way to list the sparse containers supported by an estimator ? |
|
|
||
| @ignore_warnings(category=FutureWarning) | ||
| def check_sample_weight_equivalence(name, estimator_orig): | ||
| def _check_sample_weight_equivalence(name, estimator_orig, sparse_container): |
There was a problem hiding this comment.
| def _check_sample_weight_equivalence(name, estimator_orig, sparse_container): | |
| def _check_sample_weight_equivalence(name, estimator_orig, sparse_container=None): |
There was a problem hiding this comment.
I wonder if it wouldn't be better to have two functions for the dense and sparse inputs:
check_sample_weight_equivalence_on_dense_data (or just check_sample_weight_equivalence) and check_sample_weight_equivalence_on_sparse_data.
This way, we could configure different hyperparameter configs in PER_ESTIMATOR_CHECK_PARAMS if needed.
On the other hand, this might introduce many redundant entries in that dict.
Any opinion @adrinjalali?
There was a problem hiding this comment.
yeah we had another case where a parameter would change the behavior of the test and it made sense to create two tests for them. I think it makes sense to try to keep the signature of the tests to name, estimator_orig and the rest should be defined by the test itself.
There was a problem hiding this comment.
I think the two distinct check_sample_weight_equivalence_on_dense_data and check_sample_weight_equivalence_on_sparse_data names would be more convenient for filtering with the -k flag in pytest (we can then easily do only dense, only sparse, or both).
ogrisel
left a comment
There was a problem hiding this comment.
LGTM. @antoinebaker this PR is still marked as "draft" but I do not see unaddressed TODO item anymore.
|
Sorry @ogrisel I forgot to add the TODO item. I still need to yield the tests only for estimators accepting sparse inputs (by filtering on the |
|
@lesteve I have one merged PR #29818 that changes the function What is the recommended pratice in this case for the new changelog system ? or delete the previous changelog and consolidate in the new one: |
If I recall what @lesteve said elsewhere correctly, it's probably best to keep per-PR incremental log entries. Maybe we could add a TODO note in the new entry with a reference to the PR number of the older change to remember to do this consolidation of the changelog at the time of the release. |
|
Even better would be to update the file with the changelog of entry from #29818 in this PR to use the new names and same contents as the contents of a new file for the changelog entry of #30137. When the |
|
Since the sparse tag fix is going to take longer than anticipated, I would rather decouple the two PRs and consider a review (remove the draft flag) while keeping the I think |
|
I clicked the "Update branch" button to merge |
ogrisel
left a comment
There was a problem hiding this comment.
I think I am +1 for merging as such.
This PR does cause some redundancy in the tags._xfail_checks declaration, though. But I don't see any better way. Since some estimators (in particular linear models) use very different code-path for sparse and dense data, I think this is still worth it.
Any second opinion @adrinjalali @jeremiedbb @snath-xoc?
jeremiedbb
left a comment
There was a problem hiding this comment.
LGTM. I just have one question below
|
@ogrisel this LGTM. But do you mind merging the xfail checks PR first? I think you can merge conflicts with that PR here much easier than I can do there. |
…_equivalence_on_sparse_data
| integer (including zero) weights. | ||
| - :func:`utils.estimator_checks.check_sample_weights_invariance` | ||
| replaced by | ||
| :func:`check_sample_weight_equivalence_on_dense_data` |
There was a problem hiding this comment.
It would be more like
| :func:`check_sample_weight_equivalence_on_dense_data` | |
| :func:`~utils.estimator_checks.check_sample_weight_equivalence_on_dense_data` |
but we don't really need to annotate them (or even have a changelog for them) since they're not really public and don't get rendered in the API pages. So this would only raise a warning in sphinx build.
That said, I'm in favor of adding nice docstrings to tests and adding them to the API pages (not in this PR). However, this changelog needs to remove the :func: directives, or the mentions of them completely.
There was a problem hiding this comment.
If I understood correctly the sphinx doc, the tilde suppresses the cross-reference, and I should therefore add it for all the functions of this PR as they are not rendered on the API pages ?
There was a problem hiding this comment.
The tilde only changes the rendered version from the full qual path to only rendering the name of the function, but still trying to resolve the hyperlink, which raises a warning in this case.
There was a problem hiding this comment.
Thanks for the clarification ! I remove the :func: directives completely.
| # FIXME: filter on tags.input_tags.sparse | ||
| # (estimator accepts sparse arrays) | ||
| # once issue #30139 is fixed. | ||
| yield check_sample_weight_equivalence_on_sparse_data |
There was a problem hiding this comment.
shouldn't the check already be there? I'm not sure if I understand why we don't have it here.
There was a problem hiding this comment.
Not yet, the PR #30187 fixing the tags.input_tags.sparse needs to be merged.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Reference Issues/PRs
Related to issues #30131 and #16298
Continuation of PR #29818
What does this implement/fix? Explain your changes.
Following #30040 (comment) we would like to add a new
check_sample_weight_equivalence_on_sparse_datain the common estimator checks, similar tocheck_sample_weight_equivalencebut on sparse data, as several estimators (for exampleLinearRegression) handle differently sparse or dense data.TODO
filter on estimators accepting sparse inputs by using the tags system ('tags.input_tags.sparse=True`) instead of catching the errorwill be done in a follow-up PR.