ENH allows to overwrite read_csv parameter in fetch_openml #25488

Closed
glemaitre wants to merge 3 commits into scikit-learn:main from glemaitre:read_csv_overwrite_params

@glemaitre
Member

Allows overriding the parameters passed to read_csv when reading the data into a dataframe.
It is not intended to be used widely, but it could be useful when things go sideways.
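
For illustration only, the override idea can be sketched as a dictionary merge in which user-supplied values win over the internal defaults. The helper name `merge_read_csv_kwargs` and the constant `DEFAULT_READ_CSV_KWARGS` are hypothetical, and the defaults listed are only the ones visible in the diff excerpts quoted in this thread; this is not the code from the PR.

```python
from typing import Any, Mapping, Optional

# Only the defaults that appear in this thread's diff excerpts;
# the real list in fetch_openml may be longer (assumption).
DEFAULT_READ_CSV_KWARGS = {
    "header": None,
    "skipinitialspace": True,  # skip spaces after delimiter to follow ARFF specs
}


def merge_read_csv_kwargs(
    overrides: Optional[Mapping[str, Any]] = None,
    defaults: Mapping[str, Any] = DEFAULT_READ_CSV_KWARGS,
) -> dict:
    """Return a new dict with the defaults, then the overrides applied on top."""
    return {**defaults, **(overrides or {})}


# A user-supplied key is added, and an existing key can be replaced.
merged = merge_read_csv_kwargs({"quotechar": "'", "skipinitialspace": False})
```

The merged dictionary would then be unpacked into the reader call, as in the `pd.read_csv(gzip_file, **read_csv_kwargs)` line quoted later in this conversation. Building a new dict keeps the defaults untouched between calls.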

@glemaitre glemaitre marked this pull request as draft January 26, 2023 15:39
@glemaitre glemaitre marked this pull request as ready for review January 27, 2023 13:01
Member

@thomasjpfan thomasjpfan left a comment

Thank you for the PR! I am okay with adding this option.

the default options. Internally, we used the default parameters of
:func:`pandas.read_csv` except for the following parameters:

- `header`: set to `None`
Member

Should this be in fetch_openml as part of the public API?

dtype=dtypes,
skipinitialspace=True, # skip spaces after delimiter to follow ARFF specs
)
frame = pd.read_csv(gzip_file, **read_csv_kwargs)
Member

Currently, if there is an exception while reading the data, one would need to enter a debugger to find out where the file is and what the read_csv_kwargs are. I think it would be helpful to reraise an exception that outputs the read_csv_kwargs and gzip_file to help with debugging the issue.
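
A minimal sketch of that suggestion, assuming a generic wrapper: the helper name `read_with_context` and the choice of `ValueError` are hypothetical, not taken from the PR. It calls the reader and, on failure, raises a new exception whose message includes the source and the keyword arguments, chained to the original error.

```python
from typing import Any, Callable


def read_with_context(
    reader: Callable[..., Any], source: Any, **read_kwargs: Any
) -> Any:
    """Call reader(source, **read_kwargs); on failure, re-raise with the inputs."""
    try:
        return reader(source, **read_kwargs)
    except Exception as exc:
        # Chain with `from exc` so the original traceback is preserved.
        raise ValueError(
            f"Failed to read {source!r} with parameters {read_kwargs!r}: {exc}"
        ) from exc
```

With pandas available, the call quoted above would become something like `read_with_context(pd.read_csv, gzip_file, **read_csv_kwargs)`.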

@glemaitre
Member Author

I will close this one. Let's keep in mind that it exists in case we really need more flexibility to tweak the parameters in the future.

@glemaitre
Member Author

OK, so I am reopening this one. It seems that we will need it if we want to handle some read_csv breaking changes between pandas 1.X and 2.X ourselves: #25878 (comment)
