Conversation
|
If I understand correctly, this allows a programmer to side-step #23169, but doesn't fix it directly. To fix it, |
|
It's true that I use it just before saving HDF5 array in zarr store.
But in your case, if you have already files with metadata, i do not know if
you can force the dtype in np.load or take only the buffer part of the
array, not the dtype part
Le jeu. 9 févr. 2023 à 15:00, Sjoerd de Vries ***@***.***> a
écrit :
… If I understand correctly, this allows a programmer to side-step #23169
<#23169>, but doesn't fix it
directly. To fix it, utils.dtype_without_metadata must be invoked by
np.save.
As it is, every .npy file with metadata is *already* unreadable by np.load,
so no harm would be done.
Please correct me if I am wrong.
—
Reply to this email directly, view it on GitHub
<#23185 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AH75LVJG25HW5GIRFZ7QQBLWWT2ABANCNFSM6AAAAAAUWQRNNA>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
|
@ninousf you can. You could also edit the file pretty easily, but both requires deeper dives into the header parsing code. If anyone wants to try and help with a solution, it might make sense to create a simple recipe for how to solve it and then refer to it from the current error. |
I could help, but I am still not sure which problem(s) you would like to have solved.
Personally, I don't need any of these, the current PR will solve all of my problems. |
|
I would be in favor of doing 1., potentially with a warning (not sure its actually worth it since its annoying). All else seems a lot of hassle for no clear gain (since I doubt you need to round-trip via the h5py metadata). I could imagine the error to have some bread-crumbs for unlucky folks who ended up storing with metadata. |
|
In my case, this function solves the problem when a library (like zarr) use descr to build dtype in the save function (not in load for npy format). ` dt = np.dtype({'names': ['r','g','b','a'], It can resolve the problem n.1. It is true that allowing a file to be stored in the knowledge that it cannot be re-read, is a bug for me |
seberg
left a comment
There was a problem hiding this comment.
Was your plan to just use this internally for now? In that case things are maybe a little bit simpler to settle quickly.
While dealing with structured dtypes can be in Python in either case, it does need to recurse via dtypes.fields (or names, but fields might almost be easier, there are also "titles"?).
I understand the urge to run auto-formatting, but we don't do it yet in NumPy and the it does make the diff a bit unwieldy in this case.
numpy/lib/tests/test_utils.py
Outdated
| for k in dir(dt1): | ||
| if not ((k.startswith('__') and k.endswith('__')) | ||
| or k in ['descr', 'metadata', 'newbyteorder']): | ||
| ret &= (getattr(dt1, k) == getattr(dt2, k)) |
There was a problem hiding this comment.
I would write this as:
for ...:
if getattr(dt1, k) != getattr(dt2, k):
return False
return True
There was a problem hiding this comment.
You could also check np.can_can(dt1, dt2, casting="no"), since that should only really work only metadata mismatches. (The only issue should be e.g. "long" and "longlong" which on most systems, except windows, are the same).
So, I am fine with this also, it is at least thorough and easy enough to read.
numpy/lib/utils.py
Outdated
| result : dtype | ||
| Dtype containing no metadata | ||
| """ | ||
| descr = dt.descr |
There was a problem hiding this comment.
We could do this in python as a utility, implementing it as dtype.drop_metadata() or so would have the advantage of easier generalizing user DTypes.
Fields can be nested, so I think this must use a recursive approach. I.e. check dtype.names is None if it is we should iterate dtype.fields, if not check dtype.shape != () (not a subarray). After that, we would have to assume dtype.str works (for now).
Both structured (names and shape) will need to then call the function again recursively.
There was a problem hiding this comment.
ok but this means that drop_metadata is a method of dtype class, right ?
Is there a python file where we can put this method in numpy project ? or we have to code it in cython. I'm not confortable with this language
There was a problem hiding this comment.
If we are OK with private, it could probably stay as a function for now. If not, I think I like a method? Even then it could be implemented in Python (for most of the logic at least), we would need a C stub that forwards it to Python. That should already be done for some other methods or getters/setters.
But if that is what we do, and you are not happy to dig into Python C-API a bit on your own, I can help with that part.
There was a problem hiding this comment.
there something that i do not understand. You suggest dtype.drop_metadata() with no parameter, so i suppose it is a method of dtype class.
And as metadata is a readonly attribute, drop_metadata(self) must return a copy, no?.
Or you want to add a setter to metadata in order to allow self to clean all its metadatas ? In this case drop_metadata(self) returns nothing
There was a problem hiding this comment.
Yes, it would have to return a new dtype. Honestly, dtypes should be immutable anyway. I suppose if there is no metadata to drop, returning the same one would be fine/good.
bdf5773 to
5225af2
Compare
5225af2 to
89d6441
Compare
|
Hmmmm, seems like all commits went away by accident. I think you might have to just open a new PR, since github does this auto-closing when this happens. |
|
next episode #23371 |
To avoid problems with metadata in dtypes when copying hdf5 arrays to zarr or npy file or in other cases:
i add a numpy tool function that creates a dtype from an original one but without any metada inside.
it does the same for all the underlying dtypes too