MNT Add type annotations for OpenML fetcher#17053
Conversation
| A stream to the OpenML resource | ||
| """ | ||
| def is_gzip(_fsrc): | ||
| def is_gzip_encoded(_fsrc): |
There was a problem hiding this comment.
Renamed because I found the name confusing. It's not actually a gzipped file, but rather that Content-Encoding == 'gzip'in the header. So we need to use gzip for local files when this function is False.
| Whether to raise an error if OpenML returns an acceptable error (e.g., | ||
| date not found). If this argument is set to False, a None is returned | ||
| in case of acceptable errors. Note that all other errors (e.g., 404) | ||
| will still be raised as normal. |
There was a problem hiding this comment.
Remove it and catch the exception directly in one case when it's necessary. This makes this function return a single type, instead of Dict or None.
| if raise_if_error: | ||
| raise ValueError(error_message) | ||
| return None | ||
| raise OpenMLError(error_message) |
There was a problem hiding this comment.
OpenMLError inherits from ValueError, so it is backward compatible.
azure-pipelines.yml
Outdated
| set -ex | ||
| conda create --name flake8_env --yes python=3.8 | ||
| conda activate flake8_env | ||
| source activate flake8_env |
There was a problem hiding this comment.
conda activate was actually not working in master with,
CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
To initialize your shell, run
$ conda init <SHELL_NAME>
and conda init was not working either. We were just not checking for exit status (now done with set -ex) So reverted to source activate as I don't want to debug this.
There was a problem hiding this comment.
one issue with conda is that it expects the shell function to be the entry point and not the executable. So if we have the executable in the PATH. This has already given me a headache in other places.
|
|
||
| ArffDataType = Tuple[List, ...] | ||
|
|
||
| if typing.TYPE_CHECKING: |
There was a problem hiding this comment.
According to TYPE_CHECKING docs this is used for performance reasons. If this is case, let's have a comment here.
There was a problem hiding this comment.
No the reason is that https://pypi.org/project/typing-extensions/ (imported below) is a mypy dependency so we can only import it if we are checking for types and therefore it is installed. I'll add a comment.
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
| raise ValueError("Unreachable code. arff['data'] must be a generator.") | ||
|
|
||
| # calculate chunksize | ||
| first_row = next(arff['data']) |
There was a problem hiding this comment.
The above check is necessary as technically arff['data'] could be a tuple for sparse data, and then next would fails as it's not implemented for tuples.
There was a problem hiding this comment.
Can we directly check that arff['data' is not iterable?
if not isinstance(arff['data'], Iterable)(Not for this PR: This reminds me that we can likely return dataframes with sparse extension arrays)
|
@thomasjpfan I addressed your comments, please let me know if you have others. |
thomasjpfan
left a comment
There was a problem hiding this comment.
I am happy to try out typehints here. Thank you @rth!
| raise ValueError("Unreachable code. arff['data'] must be a generator.") | ||
|
|
||
| # calculate chunksize | ||
| first_row = next(arff['data']) |
There was a problem hiding this comment.
Can we directly check that arff['data' is not iterable?
if not isinstance(arff['data'], Iterable)(Not for this PR: This reminds me that we can likely return dataframes with sparse extension arrays)
| arff_data = arff['data'] | ||
| if isinstance(arff_data, Generator): | ||
| if shape is None: | ||
| raise ValueError( |
There was a problem hiding this comment.
Shouldn't this have test coverage?
| arff_columns = list(attributes) | ||
|
|
||
| if not isinstance(arff['data'], Generator): | ||
| raise ValueError( |
|
Thanks for the review @jnothman , I added a test to cover those two lines and coverage is now green. Merging. |
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
I find that OpenML fetcher code is currently very difficult to read, there is a bunch of private functions with undocumented signatures and manipulating objects with non trivial types.
This PR adds some type annotations for function signatures, to at least make it a bit clearer what is the expected function input and output.
I'll contribute some of it to liac-arff upstream as soon as it drops Python 2 support renatopp/liac-arff#107