Datasets cached pre-#548 can no longer be used for `run_model_on_task`

It looks like data pickled pre-#548 was stored as `np.ndarray`, whereas the default is now `pandas.DataFrame`. When I try to use `run_model_on_task` on a task for which I still have a cached dataset with an `np.ndarray` as data instead of a `pd.DataFrame`, [this line](https://github.com/openml/openml-python/blob/develop/openml/datasets/dataset.py#L520) is called with `data`, `dataset_format == "array"`, and a list of attribute names. As far as I can tell, `_convert_array_format` assumes that the input data is a `pd.DataFrame` whenever the requested format is `"array"`, so [this line](https://github.com/openml/openml-python/blob/develop/openml/datasets/dataset.py#L370) raises an error because `np.ndarray` has no `columns` attribute.

The fix seems easy enough: check whether `data` is already of the requested type, e.g. start the function with
```
def _convert_array_format(data, array_format, attribute_names):
    if array_format == "array" and not scipy.sparse.issparse(data):
        if isinstance(data, np.ndarray):
            return data
        ...
```
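
To show the idea outside of openml-python, here is a minimal standalone sketch. `to_array_sketch` is a hypothetical stand-in, not the library's `_convert_array_format`; it leaves out sparse handling, and the `float32` conversion of the DataFrame branch is an assumption made for illustration.

```python
import numpy as np
import pandas as pd


def to_array_sketch(data, array_format, attribute_names):
    """Hypothetical stand-in for the conversion step (not the openml-python code)."""
    if array_format == "array":
        # Old caches: data was pickled as an ndarray, so there is nothing to convert.
        if isinstance(data, np.ndarray):
            return data
        # New caches: data is a DataFrame and still has to be converted.
        if isinstance(data, pd.DataFrame):
            return data.to_numpy(dtype=np.float32)
    return data


# Simulate an old-style cache (ndarray) and a new-style cache (DataFrame).
old_cache = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
new_cache = pd.DataFrame(old_cache, columns=["a", "b"])

from_old = to_array_sketch(old_cache, "array", ["a", "b"])
from_new = to_array_sketch(new_cache, "array", ["a", "b"])
```

With the early return, both cache styles end up as an `np.ndarray`, and the `.columns` access is never reached for old caches.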

Does this make sense? Shall I set up a PR?
@glemaitre @mfeurer 
