BUG: unable to read .npy file with non-eval'able metadata #23169

@sjdv1982

Describe the issue:

I have a .npy file generated from an h5py dataset. The data is a pure NumPy array (no subclass) and the file was written by NumPy, but the dtype contains metadata set by h5py.
In this particular case, the metadata contains a <class 'bytes'> expression, which safe_eval cannot handle. I don't care about the metadata, but it renders the entire file unreadable.

 File "/home/sjoerd/miniconda3/envs/active-papers/lib/python3.9/site-packages/numpy/lib/format.py", line 776, in read_array
    shape, fortran_order, dtype = _read_array_header(
  File "/home/sjoerd/miniconda3/envs/active-papers/lib/python3.9/site-packages/numpy/lib/format.py", line 632, in _read_array_header
    raise ValueError(msg.format(header)) from e
ValueError: Cannot parse header: "{'descr': [('paper_ref', ('|O', {'vlen': <class 'bytes'>})), ('path', ('|O', {'vlen': <class 'bytes'>}))], 'fortran_order': False, 'shape': (), }                                    \n"
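
The parse failure can be illustrated without NumPy. This is a minimal sketch, assuming that safe_eval accepts only Python literals in the same way ast.literal_eval does (an assumption about the NumPy version in use, not something the traceback states). The header string is shortened from the traceback above, and `parses_as_literal` is a hypothetical helper name.

```python
import ast

# Shortened form of the header shown in the traceback above.
bad_header = (
    "{'descr': [('path', ('|O', {'vlen': <class 'bytes'>}))], "
    "'fortran_order': False, 'shape': ()}"
)

# The same header with the metadata dropped from the field descriptor.
stripped_header = "{'descr': [('path', '|O')], 'fortran_order': False, 'shape': ()}"

# repr() of a class object is not a Python literal, which is how the
# unparseable text can end up in a header built from repr().
metadata_repr = repr({"vlen": bytes})


def parses_as_literal(text):
    """Return True if ast.literal_eval accepts the text (hypothetical helper)."""
    try:
        ast.literal_eval(text)
        return True
    except (SyntaxError, ValueError):
        return False


header_ok = parses_as_literal(bad_header)
stripped_ok = parses_as_literal(stripped_header)
print(metadata_repr, header_ok, stripped_ok)
```

Only the version without the metadata dictionary is a valid literal, which is consistent with the file becoming unreadable as soon as a class object appears in the dtype metadata.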

I would consider this a bug in np.save. Serializing a dtype descriptor into a header that cannot be read back should fail loudly. It would be nice if np.save could then be retried with the metadata stripped.
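
Until then, a possible user-side workaround is to strip the metadata before saving. The sketch below is not a NumPy API: `strip_metadata` is a hypothetical helper that rebuilds the dtype field by field and ignores field titles. The example dtype is reconstructed from the header in the traceback, and the values are made up. How np.save treats dtype metadata may differ between NumPy versions, so treat this as an illustration rather than a verified fix.

```python
import io

import numpy as np


def strip_metadata(dt):
    """Rebuild a dtype without metadata (hypothetical helper, not a NumPy API)."""
    if dt.fields is not None:
        # Structured dtype: rebuild each field, keeping names, offsets and itemsize.
        names = list(dt.names)
        return np.dtype({
            "names": names,
            "formats": [strip_metadata(dt.fields[n][0]) for n in names],
            "offsets": [dt.fields[n][1] for n in names],
            "itemsize": dt.itemsize,
        })
    if dt.subdtype is not None:
        # Sub-array dtype: strip the base dtype and keep the shape.
        base, shape = dt.subdtype
        return np.dtype((strip_metadata(base), shape))
    # Plain dtype: rebuilding from the type string leaves the metadata behind.
    return np.dtype(dt.str)


# A dtype resembling the one in the traceback (assumed, with made-up values).
vlen_bytes = np.dtype("O", metadata={"vlen": bytes})
dt = np.dtype([("paper_ref", vlen_bytes), ("path", vlen_bytes)])
arr = np.array((b"ref", b"/some/path"), dtype=dt)

clean_dt = strip_metadata(dt)
clean = arr.astype(clean_dt)

# Object fields are pickled, so allow_pickle is needed on both sides.
buf = io.BytesIO()
np.save(buf, clean, allow_pickle=True)
buf.seek(0)
loaded = np.load(buf, allow_pickle=True)
```

The conversion uses astype rather than a view so that it does not rely on NumPy's rules for reinterpreting object fields in place.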

Reproduce the code example:

import numpy as np
np.load("file.npy")

Error message:

Traceback (most recent call last):
  File "/home/sjoerd/bayesian_inference_fm/aptool-dump.py", line 65, in dump
    arr = np.load(child_target + ".npy")
  File "/home/sjoerd/miniconda3/envs/active-papers/lib/python3.9/site-packages/numpy/lib/npyio.py", line 432, in load
    return format.read_array(fid, allow_pickle=allow_pickle,
  File "/home/sjoerd/miniconda3/envs/active-papers/lib/python3.9/site-packages/numpy/lib/format.py", line 776, in read_array
    shape, fortran_order, dtype = _read_array_header(
  File "/home/sjoerd/miniconda3/envs/active-papers/lib/python3.9/site-packages/numpy/lib/format.py", line 632, in _read_array_header
    raise ValueError(msg.format(header)) from e
ValueError: Cannot parse header: "{'descr': [('paper_ref', ('|O', {'vlen': <class 'bytes'>})), ('path', ('|O', {'vlen': <class 'bytes'>}))], 'fortran_order': False, 'shape': (), }                                    \n"

Runtime information:

Numpy version:

1.24.0
3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:39:03)
[GCC 11.3.0]

show_runtime:

WARNING: threadpoolctl not found in system! Install it by pip install threadpoolctl. Once installed, try np.show_runtime again for more detailed build information
[{'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
'found': ['SSSE3',
'SSE41',
'POPCNT',
'SSE42',
'AVX',
'F16C',
'FMA3',
'AVX2'],
'not_found': ['AVX512F',
'AVX512CD',
'AVX512_KNL',
'AVX512_KNM',
'AVX512_SKX',
'AVX512_CLX',
'AVX512_CNL',
'AVX512_ICL']}}]

Context for the issue:

A pure Numpy array (no subclasses) saved with np.save should be readable by np.load. This issue gives a counter-example.

I could try to write a patch to fix this issue, if you are interested.
