-
-
Notifications
You must be signed in to change notification settings - Fork 26.6k
Use check_array to validate y
#25089
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Would this hack still work after this change? |
|
I've not tried to run the code from the SO answer, so technically "I don't know". From a quick read of the code it seems that it relies on the From a super high level view, this change replaces a |
| y = check_array( | ||
| y, ensure_2d=False, dtype=dtype, input_name="y", force_all_finite=False | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CI jobs fail because some tests have y be one-element arrays.
One solution is to adapt check_array regarding
scikit-learn/sklearn/utils/validation.py
Lines 926 to 933 in cbfb6ab
| if ensure_min_samples > 0: | |
| n_samples = _num_samples(array) | |
| if n_samples < ensure_min_samples: | |
| raise ValueError( | |
| "Found array with %d sample(s) (shape=%s) while a" | |
| " minimum of %d is required%s." | |
| % (n_samples, array.shape, ensure_min_samples, context) | |
| ) |
probably by:
- changing the default value of
ensure_min_samplesto0 - changing the check for
ensure_min_samples > 1
Alternatively, we can change this call to:
| y = check_array( | |
| y, ensure_2d=False, dtype=dtype, input_name="y", force_all_finite=False | |
| ) | |
| y = check_array( | |
| y, | |
| ensure_2d=False, | |
| ensure_min_samples=0, | |
| dtype=dtype, | |
| input_name="y", | |
| force_all_finite=False, | |
| ) |
The first proposal is more sensible to me while probably necessitating more changes and adaptation in cascade whilst the second proposal is relatively simple but probably semantically not always correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example one test test_classification_report_output_dict_empty_input checks that you can call classification_report with [] for both y_true and y_pred. So I think we need to allow ensure_min_samples=0.
bfa11e7 to
b19979b
Compare
jjerphan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM given that tests pass.
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> closes scikit-learn#25073
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> closes scikit-learn#25073
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> closes scikit-learn#25073
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> closes scikit-learn#25073
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> closes #25073
Reference Issues/PRs
closes #25073 (more precisely this PR combined with #25080 closes it)
What does this implement/fix? Explain your changes.
Uses
check_arrayin_check_yso that we get the same behaviour for converting pandas special data types (that can represent missing values) likeInt64as forX. This is done in the part of the code aroundpandas_requires_conversion.Is this what you had in mind?
cc @glemaitre