Add minimal drafts of a few more dtype extensions#135
Conversation
| is to have an array where each element is another array (and each array can | ||
| have a different length). Another use case is to store an array of variable | ||
| length strings. It is important to note that such an array actually just stores the references to the Python objects and not the objects themselves. Accessing | ||
| an element of the array returns the Python object it refers to. |
There was a problem hiding this comment.
This should specify how this data type is actually encoded in the store --- presumably via pickling.
In general this sort of encoding issue might fall under the domain of "filters", which are not yet part of zarr v3 but might be important to add.
In general I have a number of concerns about this data type:
- The pickle format by its nature is inherently tied to Python and cannot be supported by non-Python implementations.
- By the nature of the format, the Python unpickle operation can lead to arbitrary code execution and therefore requires that the source data be fully trusted. I think if it is supported it is likely to be used out of convenience even in situations that don't call for that level of trust. In particular, if it is available by default in zarr-python then opening any untrusted zarr array becomes a potential security vulnerability.
- In cases where the intent is not to store an arbitrary Python object, but e.g. just a variable length string or json value, just using the "object" type leaves the actual type unspecified.
I think it would be much better instead to decouple the zarr v3 data types from numpy data types --- and define as explicit zarr v3 data types:
- variable length unicode and byte strings
- json values
I think it would be best to avoid supporting a general Python object type at all.
There was a problem hiding this comment.
Note: These variable length strings and json values when stored in a numpy array would have to be stored using the numpy object data type. But that does not need to match the data type of the zarr array.
There was a problem hiding this comment.
Yes, I agree that the extension should not be specifically for general Python objects and needs to be revised substantially. We actually did not have any tests in core zarr-python involving object arrays, but I found that they were used in the Xarray test suite. Specifically this method on the DatasetIOBase class that the Zarr backend tests are a subclass of.
| ------------------------------------------- | ||
|
|
||
| These are fixed width strings corresponding to NumPy dtypes with `kind` 'S'. | ||
| For backward compatibility with Python 2's ``str`` these are zero-terminated |
There was a problem hiding this comment.
Zero-terminated strings normally refers to the C null-terminated string convention where the length is not explicitly indicated and instead the end of the string is indicated by a '\0' value. I would call this zero-padded rather than zero-terminated, since there is no zero terminator if the string is exactly equal to the specified length.
There was a problem hiding this comment.
Yeah, I was a bit confused regarding this as the NumPy dtype docs refer to this type as "zero-terminated bytes".
| - Numerical type | ||
| - Size (no. bytes) | ||
| - Byte order | ||
| * - list of (<name>, <type>) tuples |
There was a problem hiding this comment.
NumPy (and zarr v2) supports a third variant, where you have (name, type, shape). The shape specifies the dimensions of an "inner" array of elements.
There was a problem hiding this comment.
Yeah, I had forgotten to list that one and will update it here to describe an optional [, shape]. I think we do test it already.
joshmoore
left a comment
There was a problem hiding this comment.
Great to have these discussions started, @grlee77! Also 👍 for @jbms' comments, but I could also see getting this into the dev branch before they are resolved as with zarr-developers/zarr-python#898 so that we have more people working from the same starting point.
| extensions/filters/v1.0 | ||
| extensions/complex-dtypes/v1.0 | ||
| extensions/datetime-dtypes/v1.0 | ||
| extensions/object-dtypes/v1.0 |
There was a problem hiding this comment.
Mostly as a side note as we work out the process, we may want to lower these numbers depending in our confidence and the intended interpretation.
|
Wonderfully, this has no conflicts despite the recent updates! I'm merging and will leave the comments above unresolved for capturing as the editing of v3 moves forward. (If anyone feels moved to create issues, please just cross-link) Note: as mentioned in gitter, merging PRs into core-protocol-v3.0-dev doesn't imply that a feature is final. |
This adds draft extensions for the remaining dtypes supported for Zarr v2 arrays in zarr-python that are not part of the core v3 spec. Merging this would at least give us a URL to point to for these dtypes in zarr-developers/zarr-python#898.
The language here will eventually need updating to be more spec-like and more implementation neutral (e.g. less NumPy/Python specific). What is there now, at least describes the current behavior in zarr-python.
In the last community call @jbms (from from the tensorstore team) mentioned that NumPy's fixed width unicode arrays are not very desirable and variable length strings are likely of more interest. There is not a variable length string dtype in NumPy, but variable length strings can be stored as object arrays (H5Py for example does this). I know Pandas 1.0 also introduced their own
StringDType, but I haven't looked into the specifics of that.