-
-
Notifications
You must be signed in to change notification settings - Fork 12.3k
BUG: unable to read .npy file with non-eval'able metadata #23169
Description
Describe the issue:
I have a .npy file generated from an h5py dataset. The data is a pure Numpy array (no subclass) and the file has been written by Numpy, but the dtype contains metadata set by h5py.
In this particular case, the metadata contains a <class 'bytes'> expression, which cannot be handled by safe_eval . I don't care about the metadata, but it renders the entire file unreadable.
File "/home/sjoerd/miniconda3/envs/active-papers/lib/python3.9/site-packages/numpy/lib/format.py", line 776, in read_array
shape, fortran_order, dtype = _read_array_header(
File "/home/sjoerd/miniconda3/envs/active-papers/lib/python3.9/site-packages/numpy/lib/format.py", line 632, in _read_array_header
raise ValueError(msg.format(header)) from e
ValueError: Cannot parse header: "{'descr': [('paper_ref', ('|O', {'vlen': <class 'bytes'>})), ('path', ('|O', {'vlen': <class 'bytes'>}))], 'fortran_order': False, 'shape': (), } \n"
I would consider this a bug in np.save . Serialization of a dtype descriptor that results in an unreadable header should fail loudly. It would be nice if the np.save could then be re-tried with the metadata stripped.
Reproduce the code example:
import numpy as np
np.load("file.npy")Error message:
Traceback (most recent call last):
File "/home/sjoerd/bayesian_inference_fm/aptool-dump.py", line 65, in dump
arr = np.load(child_target + ".npy")
File "/home/sjoerd/miniconda3/envs/active-papers/lib/python3.9/site-packages/numpy/lib/npyio.py", line 432, in load
return format.read_array(fid, allow_pickle=allow_pickle,
File "/home/sjoerd/miniconda3/envs/active-papers/lib/python3.9/site-packages/numpy/lib/format.py", line 776, in read_array
shape, fortran_order, dtype = _read_array_header(
File "/home/sjoerd/miniconda3/envs/active-papers/lib/python3.9/site-packages/numpy/lib/format.py", line 632, in _read_array_header
raise ValueError(msg.format(header)) from e
ValueError: Cannot parse header: "{'descr': [('paper_ref', ('|O', {'vlen': <class 'bytes'>})), ('path', ('|O', {'vlen': <class 'bytes'>}))], 'fortran_order': False, 'shape': (), } \n"Runtime information:
Numpy version:
1.24.0
3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:39:03)
[GCC 11.3.0]
show_runtime:
WARNING: threadpoolctl not found in system! Install it by pip install threadpoolctl. Once installed, try np.show_runtime again for more detailed build information
[{'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
'found': ['SSSE3',
'SSE41',
'POPCNT',
'SSE42',
'AVX',
'F16C',
'FMA3',
'AVX2'],
'not_found': ['AVX512F',
'AVX512CD',
'AVX512_KNL',
'AVX512_KNM',
'AVX512_SKX',
'AVX512_CLX',
'AVX512_CNL',
'AVX512_ICL']}}]
Context for the issue:
A pure Numpy array (no subclasses) saved with np.save should be readable by np.load. This issue gives a counter-example.
I could try to write a patch to fix this issue, if you are interested.