ENH improve ARFF parser using pandas#21938
Conversation
```python
def _split_sparse_columns(
    arff_data: ArffSparseDataType, include_columns: List
) -> ArffSparseDataType:
    """Obtains several columns from sparse ARFF representation. Additionally,
    the column indices are re-labelled, given the columns that are not
    included (e.g., when including [1, 2, 3], the columns will be relabelled
    to [0, 1, 2]).

    Parameters
    ----------
    arff_data : tuple
        A tuple of three lists of equal size; the first list holds the values,
        the second the row indices and the third the column indices.

    include_columns : list
        A list of columns to include.

    Returns
    -------
    arff_data_new : tuple
        Subset of arff data with only the columns indicated by the
        include_columns argument.
    """
    arff_data_new: ArffSparseDataType = (list(), list(), list())
    reindexed_columns = {
        column_idx: array_idx for array_idx, column_idx in enumerate(include_columns)
    }
    for val, row_idx, col_idx in zip(arff_data[0], arff_data[1], arff_data[2]):
        if col_idx in include_columns:
            arff_data_new[0].append(val)
            arff_data_new[1].append(row_idx)
            arff_data_new[2].append(reindexed_columns[col_idx])
    return arff_data_new


def _sparse_data_to_array(
    arff_data: ArffSparseDataType, include_columns: List
) -> np.ndarray:
    # turns the sparse data back into an array (we can't use toarray(), as it
    # only works on numeric data)
    num_obs = max(arff_data[1]) + 1
    y_shape = (num_obs, len(include_columns))
    reindexed_columns = {
        column_idx: array_idx for array_idx, column_idx in enumerate(include_columns)
    }
    # TODO: improve for efficiency
    y = np.empty(y_shape, dtype=np.float64)
    for val, row_idx, col_idx in zip(arff_data[0], arff_data[1], arff_data[2]):
        if col_idx in include_columns:
            y[row_idx, reindexed_columns[col_idx]] = val
    return y
```
These lines are moved from the previous implementation.
sklearn/datasets/_arff_parser.py (outdated)

```python
frame = pd.concat(dfs, ignore_index=True)
del dfs, first_df

frame = _cast_frame(frame, columns_info_openml, infer_casting)
```
This is the only difference with the previous implementation: we attempt to infer casting if requested.
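The chunked-read-then-`concat` pattern in the diff above can be sketched in isolation (the CSV content and chunk size here are illustrative, not the parser's actual input):

```python
import io

import pandas as pd

# Read a small CSV in chunks, then stitch the pieces into a single frame,
# as the parser does with `pd.concat(dfs, ignore_index=True)`.
csv = io.StringIO("a,b\n1,x\n2,y\n3,z\n")
dfs = list(pd.read_csv(csv, chunksize=2))
frame = pd.concat(dfs, ignore_index=True)
print(frame.shape)  # (3, 2)
```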
sklearn/datasets/_arff_parser.py (outdated)

```python
    return y


def _cast_frame(frame, columns_info, infer_casting=False):
```
It is the equivalent of `_feature_to_dtype`, but it makes smarter casting decisions.
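As a hypothetical sketch of the kind of casting involved (the `columns_info` layout below is made up for illustration, not the actual OpenML metadata format): nominal ARFF attributes become pandas `category` columns, numeric ones get proper numeric dtypes.

```python
import pandas as pd

frame = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["1", "2", "3"]})
# Hypothetical per-column metadata; the real columns_info comes from OpenML.
columns_info = {"color": {"data_type": "nominal"}, "size": {"data_type": "integer"}}

for name, info in columns_info.items():
    if info["data_type"] == "nominal":
        # Nominal attributes carry a fixed set of categories.
        frame[name] = frame[name].astype("category")
    elif info["data_type"] == "integer":
        frame[name] = pd.to_numeric(frame[name])

print(frame.dtypes["color"])  # category
```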
thomasjpfan left a comment:

Thank you for working on this @glemaitre. I left a comment on a possible way to make this PR smaller.
Still a big diff, but this is mainly due to the refactoring of the test file indeed.
ogrisel left a comment:

Some first comments, I haven't completed my review yet.
ogrisel left a comment:

More comments about the tests and the behavior of parser="auto". I will do a quick final pass once addressed, but the PR LGTM otherwise.
sklearn/datasets/_arff_parser.py (outdated)

```python
)
columns_to_select = feature_names_to_select + target_names_to_select

nominal_attributes = {
```
Maybe this should be renamed to "categorical", because we do not really know whether categorical variables should be treated as nominal or ordinal in ARFF (e.g., for categories encoded with integer values).
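In pandas terms, both cases end up as `category` dtype; the nominal/ordinal distinction is only whether the categories carry an order (the values below are illustrative):

```python
import pandas as pd

# A nominal variable: categories have no meaningful order.
nominal = pd.Categorical(["red", "blue", "red"])
# An ordinal variable: categories are ranked low < high.
ordinal = pd.Categorical(
    ["low", "high", "low"], categories=["low", "high"], ordered=True
)
print(nominal.ordered, ordinal.ordered)  # False True
```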
thomasjpfan left a comment:

At a glance, there looks to be a refactor of _liac_arff_parser, where _convert_arff_data + _convert_arff_data_dataframe are moved into _liac_arff_parser. There is also a _post_process_frame + tests.

My sense is that the refactor can be done in another PR. How do you feel about splitting the refactor out? (I know we've done this once already, but the diff is quite big now.)
I can try to refactor the
Okay, I'll review this PR. My remaining concern for this PR is how parquet support may be coming soon: openml/OpenML#1133. I suspect we will switch to parquet when it becomes available on OpenML. Do we want to maintain our own ARFF parser using pandas when we switch to parquet?
We could also make that decision when it happens? :D
I think that we can add support for parquet when it comes. However, it would be nice to have a reader that is fast enough in the meanwhile. Adding support for parquet will also add a new dependency (
adrinjalali left a comment:

I think something's gone wrong with your merge from main. There are changes from other PRs here.
sklearn/datasets/_openml.py (outdated)

```python
if parser == "auto":
    parser = "liac-arff" if return_sparse else "pandas"
```
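The resolution of "auto" can be exercised standalone; the helper name below is made up for illustration:

```python
def resolve_parser(parser: str, return_sparse: bool) -> str:
    # Mirror of the selection logic above: sparse datasets keep the
    # LIAC-ARFF parser, since the pandas parser targets dense ARFF data.
    if parser == "auto":
        return "liac-arff" if return_sparse else "pandas"
    return parser


print(resolve_parser("auto", return_sparse=True))   # liac-arff
print(resolve_parser("auto", return_sparse=False))  # pandas
```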
Shouldn't we use liac-arff if pandas is not installed, instead of raising?
You can refer to the discussion with @ogrisel in #21938 (comment).

Basically, since the two parsers behave slightly differently, switching behaviour depending on which libraries are installed would make failures difficult to debug. I recall such a bug in pandas depending on whether numexpr is installed, and this is not a good thing.
A potential fix is given in #22354, but it is ugly, causes a drop in performance, and I am not sure we should spend more time trying to improve the LIAC-ARFF parser.
Also, fetch_openml with default arguments on a dense dataset will already raise an exception in scikit-learn 1.0, because as_frame="auto" already requires pandas if the dataset is dense.

So better to have parser and as_frame follow the same logic by default.
Force-pushed from a525575 to f9bb5f3.
The remaining failure is linked with casting and handling of
I think that's an old enough pandas, and it's not an essential dependency. We could bump the min version.
thomasjpfan left a comment:

Gave this a deeper pass. Overall, looks good!
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
thomasjpfan left a comment:

I went over each example that was changed and the output is the same as main.

I think this is ready to ship. LGTM
ogrisel left a comment:

LGTM, just a final batch of nitpicks.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
adrinjalali left a comment:

Such a PR! It turned out to be a LOT larger than anticipated at first, but very happy with it. It'd be nice if we could merge and have it in the release @jeremiedbb
CI fails
The CircleCI arm64 failure might be caused by a dependency upgrade?

Indeed, and it's being addressed in #23336.
Then shall we merge when CI is green? Not sure where the release is (cc @jeremiedbb)
I already generated the wheels (#23321) and tagged. And the PR on conda-forge is already opened (conda-forge/scikit-learn-feedstock#186) :/
Pity, then the versions need to be fixed and then we can merge :)
Congrats on the merge!
Builds upon #22026
closes #19774
closes #11821
closes #14855
closes #11817
It also addresses 3 remaining examples from #21598 by reducing the computational time:

- plot_poisson_regression_non_normal_loss.py
- plot_sgd_early_stopping.py
- plot_stack_predictors.py

Some other examples should benefit from these changes.
Add an option `parser` that can take `"liac-arff"` to get the current behaviour, `"pandas"` to get a C-parser that is much faster, more memory efficient, and infers the proper data types, and `"auto"` (which would default to `"pandas"` if installed, otherwise to `"liac-arff"`).

A couple of benchmarks:
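As a minimal sketch of why a pandas-based parser can be fast (this is not scikit-learn's actual implementation, and the tiny ARFF sample is made up): the `@data` section of a dense ARFF file is comma-separated, so it can be handed to the C engine behind `pandas.read_csv` once the header has been scanned for column names.

```python
import io

import pandas as pd

arff = """@relation toy
@attribute sepal_length numeric
@attribute species {setosa,versicolor}
@data
5.1,setosa
6.2,versicolor
"""
lines = arff.splitlines()
# Column names come from the @attribute declarations in the header.
names = [line.split()[1] for line in lines if line.startswith("@attribute")]
data_start = lines.index("@data") + 1
# Hand the CSV-shaped @data section to pandas' fast parser.
frame = pd.read_csv(io.StringIO("\n".join(lines[data_start:])), names=names)
print(frame.shape)  # (2, 2)
```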
parserthat can take"liac-arff"to get the current behaviour,"pandas"to get a C-parser that is much faster, more memory efficient, and infer the proper data type, and"auto"(would default on"pandas"if installed otherwise on"liac-arff").A couple of benchmark: