ENH improve ARFF parser using pandas#21938
Conversation
```python
def _split_sparse_columns(
    arff_data: ArffSparseDataType, include_columns: List
) -> ArffSparseDataType:
    """Obtains several columns from sparse ARFF representation. Additionally,
    the column indices are re-labelled, given the columns that are not
    included (e.g., when including [1, 2, 3], the columns will be relabelled
    to [0, 1, 2]).

    Parameters
    ----------
    arff_data : tuple
        A tuple of three lists of equal size; the first list holds the values,
        the second the row indices and the third the column indices.

    include_columns : list
        A list of columns to include.

    Returns
    -------
    arff_data_new : tuple
        Subset of arff data with only the columns indicated by the
        include_columns argument.
    """
    arff_data_new: ArffSparseDataType = (list(), list(), list())
    reindexed_columns = {
        column_idx: array_idx for array_idx, column_idx in enumerate(include_columns)
    }
    for val, row_idx, col_idx in zip(arff_data[0], arff_data[1], arff_data[2]):
        if col_idx in include_columns:
            arff_data_new[0].append(val)
            arff_data_new[1].append(row_idx)
            arff_data_new[2].append(reindexed_columns[col_idx])
    return arff_data_new


def _sparse_data_to_array(
    arff_data: ArffSparseDataType, include_columns: List
) -> np.ndarray:
    # turns the sparse data back into an array (we can't use toarray(), as it
    # only works on numeric data)
    num_obs = max(arff_data[1]) + 1
    y_shape = (num_obs, len(include_columns))
    reindexed_columns = {
        column_idx: array_idx for array_idx, column_idx in enumerate(include_columns)
    }
    # TODO: improve for efficiency
    y = np.empty(y_shape, dtype=np.float64)
    for val, row_idx, col_idx in zip(arff_data[0], arff_data[1], arff_data[2]):
        if col_idx in include_columns:
            y[row_idx, reindexed_columns[col_idx]] = val
    return y
```
These lines are moved from the previous implementation.
sklearn/datasets/_arff_parser.py (outdated)

```python
frame = pd.concat(dfs, ignore_index=True)
del dfs, first_df

frame = _cast_frame(frame, columns_info_openml, infer_casting)
```
This is the only difference with the previous implementation: we attempt to infer casting if requested.
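The chunked-read-then-`concat` pattern in the diff above can be sketched in isolation (the CSV content and chunk size here are illustrative, not the parser's actual input):

```python
import io

import pandas as pd

# Read a small CSV in chunks, then stitch the pieces into a single frame,
# as the parser does with `pd.concat(dfs, ignore_index=True)`.
csv = io.StringIO("a,b\n1,x\n2,y\n3,z\n")
dfs = list(pd.read_csv(csv, chunksize=2))
frame = pd.concat(dfs, ignore_index=True)
print(frame.shape)  # (3, 2)
```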
sklearn/datasets/_arff_parser.py (outdated)

```python
    return y


def _cast_frame(frame, columns_info, infer_casting=False):
```
It is the equivalent of `_feature_to_dtype`, but it makes smarter casting decisions.
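As a hypothetical sketch of the kind of casting involved (the `columns_info` layout below is made up for illustration, not the actual OpenML metadata format): nominal ARFF attributes become pandas `category` columns, numeric ones get proper numeric dtypes.

```python
import pandas as pd

frame = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["1", "2", "3"]})
# Hypothetical per-column metadata; the real columns_info comes from OpenML.
columns_info = {"color": {"data_type": "nominal"}, "size": {"data_type": "integer"}}

for name, info in columns_info.items():
    if info["data_type"] == "nominal":
        # Nominal attributes carry a fixed set of categories.
        frame[name] = frame[name].astype("category")
    elif info["data_type"] == "integer":
        frame[name] = pd.to_numeric(frame[name])

print(frame.dtypes["color"])  # category
```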
thomasjpfan left a comment:

Thank you for working on this @glemaitre. I left a comment on a possible way to make this PR smaller.
Still a big diff, but this is mainly due to the refactoring of the test file indeed.
ogrisel left a comment:

Some first comments, I haven't completed my review yet.
ogrisel left a comment:

More comments about the tests and the behavior of parser="auto". I will do a quick final pass once addressed, but the PR LGTM otherwise.
sklearn/datasets/_arff_parser.py (outdated)

```python
)
columns_to_select = feature_names_to_select + target_names_to_select

nominal_attributes = {
```
Maybe this should be renamed to "categorical", because we do not really know whether categorical variables should be treated as nominal or ordinal in ARFF (e.g., for categories encoded with integer values).
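In pandas terms, both cases end up as `category` dtype; the nominal/ordinal distinction is only whether the categories carry an order (the values below are illustrative):

```python
import pandas as pd

# A nominal variable: categories have no meaningful order.
nominal = pd.Categorical(["red", "blue", "red"])
# An ordinal variable: categories are ranked low < high.
ordinal = pd.Categorical(
    ["low", "high", "low"], categories=["low", "high"], ordered=True
)
print(nominal.ordered, ordinal.ordered)  # False True
```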
thomasjpfan left a comment:

At a glance, there looks to be a refactor of _liac_arff_parser, where _convert_arff_data + _convert_arff_data_dataframe are moved into _liac_arff_parser. There is also a _post_process_frame + tests.

My sense is that the refactor can be done in another PR. How do you feel about splitting the refactor out? (I know we've done this once already, but the diff is quite big now.)
I can try to refactor the
Okay, I'll review this PR. My remaining concern for this PR is how parquet support may be coming soon: openml/OpenML#1133. I suspect we will switch to parquet when it becomes available on OpenML. Do we want to maintain our own ARFF parser using pandas when we switch to parquet?
We could also make that decision when it happens? :D
I think that we can add support for parquet when it comes. However, it would be nice to have a reader that is fast enough in the meanwhile. Adding support for parquet will also add a new dependency (
adrinjalali left a comment:

I think something's gone wrong with your merge from main. There are changes from other PRs here.
sklearn/datasets/_openml.py (outdated)

```python
if parser == "auto":
    parser = "liac-arff" if return_sparse else "pandas"
```
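The resolution of "auto" can be exercised standalone; the helper name below is made up for illustration:

```python
def resolve_parser(parser: str, return_sparse: bool) -> str:
    # Mirror of the selection logic above: sparse datasets keep the
    # LIAC-ARFF parser, since the pandas parser targets dense ARFF data.
    if parser == "auto":
        return "liac-arff" if return_sparse else "pandas"
    return parser


print(resolve_parser("auto", return_sparse=True))   # liac-arff
print(resolve_parser("auto", return_sparse=False))  # pandas
```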
Shouldn't we use liac-arff if pandas is not installed, instead of raising?
You can refer to the discussion with @ogrisel in #21938 (comment).

Basically, since the two parsers behave slightly differently, switching behaviour depending on which libraries are installed would make failures difficult to debug. I recall such a bug in pandas depending on whether numexpr is installed, and this is not a good thing.
A potential fix is given in #22354, but it is ugly, causes a drop in performance, and I am not sure we should spend more time trying to improve the LIAC-ARFF parser.
Also, fetch_openml with default arguments on a dense dataset will already raise an exception in scikit-learn 1.0, because as_frame="auto" already requires pandas if the dataset is dense.

So better to have parser and as_frame follow the same logic by default.
Force-pushed from a525575 to f9bb5f3.
The remaining failure is linked with casting and handling of
I think that's an old enough pandas, and it's not an essential dependency. We could bump the min version.
thomasjpfan left a comment:

Gave this a deeper pass. Overall, looks good!
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
thomasjpfan left a comment:

I went over each example that was changed and the output is the same as main.

I think this is ready to ship. LGTM
ogrisel left a comment:

LGTM, just a final batch of nitpicks.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
adrinjalali left a comment:

Such a PR! It turned out to be a LOT larger than anticipated at first, but very happy with it. It'd be nice if we could merge and have it in the release @jeremiedbb
CI fails
The CircleCI arm64 failure might be caused by a dependency upgrade?

Indeed, and it's being addressed in #23336.
Then shall we merge when CI is green? Not sure where the release is (cc @jeremiedbb)
I already generated the wheels (#23321) and tagged. And the PR on conda-forge is already opened (conda-forge/scikit-learn-feedstock#186) :/
Pity, then the versions need to be fixed and then we can merge :)
Congrats on the merge!
Builds upon #22026
closes #19774
closes #11821
closes #14855
closes #11817
It also addresses 3 remaining examples from #21598 by reducing the computational time:

- plot_poisson_regression_non_normal_loss.py
- plot_sgd_early_stopping.py
- plot_stack_predictors.py

Some other examples should benefit from these changes.
Add an option `parser` that can take `"liac-arff"` to get the current behaviour, `"pandas"` to get a C-parser that is much faster, more memory efficient, and infers the proper data types, and `"auto"` (which would default to `"pandas"` if installed, otherwise to `"liac-arff"`).

A couple of benchmarks:
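As a minimal sketch of why a pandas-based parser can be fast (this is not scikit-learn's actual implementation, and the tiny ARFF sample is made up): the `@data` section of a dense ARFF file is comma-separated, so it can be handed to the C engine behind `pandas.read_csv` once the header has been scanned for column names.

```python
import io

import pandas as pd

arff = """@relation toy
@attribute sepal_length numeric
@attribute species {setosa,versicolor}
@data
5.1,setosa
6.2,versicolor
"""
lines = arff.splitlines()
# Column names come from the @attribute declarations in the header.
names = [line.split()[1] for line in lines if line.startswith("@attribute")]
data_start = lines.index("@data") + 1
# Hand the CSV-shaped @data section to pandas' fast parser.
frame = pd.read_csv(io.StringIO("\n".join(lines[data_start:])), names=names)
print(frame.shape)  # (2, 2)
```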
parserthat can take"liac-arff"to get the current behaviour,"pandas"to get a C-parser that is much faster, more memory efficient, and infer the proper data type, and"auto"(would default on"pandas"if installed otherwise on"liac-arff").A couple of benchmark: