Skip to content

Conversation

@mfeurer
Copy link
Collaborator

@mfeurer mfeurer commented Jul 6, 2020

Closes #918

Issue #918 was caused by having a pickle file created with pandas < 1.0 and then being read with pandas >= 1.0. pandas changed the internals of handling categorical values and therefore the dataframe cannot be unpickled any more. This PR changes the behavior of OpenML-Python, and instead of crashing when failing on unpickling the dataset, it reads the data from arff and emits a warning that the pickle is not readable.

@mfeurer mfeurer requested a review from PGijsbers July 6, 2020 12:25
@mfeurer
Copy link
Collaborator Author

mfeurer commented Jul 6, 2020

Should also fix #898.

# The file is likely corrupt, see #780.
# We deal with this when loading the data in `_load_data`.
return data_pickle_file, data_feather_file, feather_attribute_file
except Exception:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you specifically catch ModuleNotFoundError and any others that are expected? I would prefer not to use a catch-all as if new issues arise, we could possibly just fix them instead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can add a check for the error of #898 too. And document in comments in one sentence why an error is expected. Including the issue number is good, but since they are easily summarized it is nice to be able to have the documentation within the code, e.g.

catch ModuleNotFoundError:
    # 780: Pickled dataframe is likely of pandas<1.0 while attempting to load with pandas>=1.0
    return ...
catch ValueError:
   (maybe check for the specific message)
   # 898: Pickled dataframe pickled with protocol 5 (Py3.8), but loaded with protocol 4 (<Py3.8).
   return ...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I just updated the code.

@codecov-commenter
Copy link

codecov-commenter commented Jul 6, 2020

Codecov Report

Merging #925 into develop will decrease coverage by 0.40%.
The diff coverage is 0.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #925      +/-   ##
===========================================
- Coverage    88.05%   87.65%   -0.41%     
===========================================
  Files           37       37              
  Lines         4363     4383      +20     
===========================================
  Hits          3842     3842              
- Misses         521      541      +20     
Impacted Files Coverage Δ
openml/datasets/dataset.py 81.58% <0.00%> (-4.40%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8f99ff6...4b04349. Read the comment docs.

Copy link
Collaborator

@PGijsbers PGijsbers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See attached comments

Copy link
Collaborator

@PGijsbers PGijsbers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth to have another look to see if we can centralize data loading logic so we do not have to catch the same error in two places. That said, I am also fine merging the PR like this, one step at a time.

@mfeurer mfeurer merged commit 368700e into develop Jul 6, 2020
@mfeurer mfeurer deleted the fix_918 branch July 6, 2020 14:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Pickle error: No module named 'pandas.core.categorical'

4 participants