Description
Proposed new feature or change:
Based on #25347 (comment)
Of course, this concept is up for discussion; I'd like to have some feedback, and I will therefore also provide multiple options for various aspects.
Currently, NumPy allows storing arbitrary data within arrays via the object dtype. However, this comes with a lot of overhead when writing contents to disk: contents need to be pickled, which consumes both a lot of space and a lot of CPU time, since real serialization is necessary. Also, since the in-memory representation differs from the serialized one, object-array .npy files cannot be (efficiently) memory mapped.
On the other hand, there are NumPy datatypes (byte, int, float, double, ... and structs thereof) that can be written to disk very efficiently, with barely any serialization beyond the creation of a header that is just a few bytes long: memory is simply dumped into a file without any processing, which can happen at the speed of the interface. At the time of writing, modern consumer hardware (PCIe 5.0 SSDs with 4 lanes) can read and write at more than 10 GB/s. Not too many years ago this used to be the speed of RAM, so memory mapping becomes even more viable nowadays. Meanwhile, e.g. large language models (LLMs) and video consume memory like never before with no end in sight, so there will be demand. Even when arrays are not larger than main memory, memory mapping is convenient because it eliminates load times between program runs.
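To illustrate the zero-copy path for plain dtypes that this paragraph describes: `np.save` writes a small header followed by the raw buffer, so the file can later be memory mapped instead of deserialized (filenames and shapes below are made up for the example):

```python
import numpy as np
import os, tempfile

# A plain-dtype "array of images": saving it is essentially a raw memory dump.
path = os.path.join(tempfile.mkdtemp(), "frames.npy")
frames = np.zeros((1000, 20, 30), dtype=np.uint8)
np.save(path, frames)

# Memory map instead of loading: no deserialization, pages are read on demand.
mapped = np.load(path, mmap_mode="r")
print(type(mapped).__name__, mapped.shape)   # memmap (1000, 20, 30)
```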
Certain applications bring problems, however: both text (LLMs) and video can vary in length (ragged arrays), which is currently only supported via the object dtype. While text is being worked on in #25347, there is no solution for its serialization, and there is also no solution for ragged arrays (like e.g. video) yet.
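For concreteness, this is what raggedness looks like today with the object dtype, and why it forces pickling (the "clips" data is invented for illustration):

```python
import numpy as np
import os, tempfile

# Ragged "videos" today: an object array whose elements are arrays of
# different lengths along the first axis.
clips = np.empty(3, dtype=object)
clips[0] = np.zeros((5, 20, 30), dtype=np.uint8)   # 5-frame clip
clips[1] = np.zeros((2, 20, 30), dtype=np.uint8)   # 2-frame clip
clips[2] = np.zeros((9, 20, 30), dtype=np.uint8)   # 9-frame clip

path = os.path.join(tempfile.mkdtemp(), "clips.npy")
np.save(path, clips)                       # elements are pickled into the file
back = np.load(path, allow_pickle=True)    # and must be unpickled on load;
                                           # mmap_mode is not usable for object arrays
print([c.shape[0] for c in back])          # [5, 2, 9]
```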
This issue suggests how to handle certain datatypes that are currently only available via the object dtype in a way that lets them be (de)serialized efficiently - in other words, "dumped" from memory and memory mapped. It would also be interesting to preserve the ability to append efficiently to .npy files, and to not modify the file format specified in NEP 1 but just add new dtypes.
Appendability, although not explicitly mentioned in NEP 1, is quite a sought-after property; compare the discussion and the top answer at https://stackoverflow.com/a/30379177. This also shows that the .npy file format is popular not despite but because of its simplicity.
I would like to suggest the following dtypes:
1. Reference (alternatively: Enumeration, Pointer, Offset or Index)

   Array indices would refer to the first axis (0 for C order, -1 for Fortran order), which also determines the final shape of the array. For example, if the referred-to array has shape (100, 20, 30) ("array of images") and C order, then a Reference denotes a (20, 30) "image". If the referring array has shape (120, 5, 10), then the final shape would be (120, 5, 10, 20, 30). I would not recommend using byte offsets, since they make certain modifications of the referred-to array more difficult: with indices, one could modify all images of a video (e.g. using a different crop, i.e. different dimensions) without invalidating the References, which is not possible with byte offsets.

2. Range (two References from 1., or one Reference and one length)

   To create ragged arrays, Ranges are required. This would require extending the concept of a shape: raggedness can occur in multiple dimensions, depending on how many levels of References/Ranges are used. For the shapes above, using Ranges instead of References would result in the final array having a shape like (120, 5, 10, *, 20, 30), with * indicating varying lengths.

3. Strings (either zero-delimited UNIX-style strings, in which case a String is a Reference (see 1.), or a Range into a byte array)

   3.1. Zero-delimited strings are more space efficient, both for the references and the storage.

   3.2. Ranges into byte arrays use more space, but certain operations can be faster, and strings can contain zero bytes (which has advantages and disadvantages).

   3.3. One could theoretically implement both String variants and let the user decide.
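The shape arithmetic of the Reference example above can be previewed today with an ordinary uint64 index array and fancy indexing; a real Reference dtype would make this resolution implicit and memory-mappable (the arrays below just mirror the shapes from the text):

```python
import numpy as np

# Emulating a Reference dtype: a (100, 20, 30) "array of images" and a
# (120, 5, 10) array of uint64 indices into its first axis.
images = np.zeros((100, 20, 30), dtype=np.float32)
refs = np.random.default_rng(0).integers(0, 100, size=(120, 5, 10),
                                         dtype=np.uint64)

# Fancy indexing resolves every Reference: each index is replaced by the
# (20, 30) "image" it points to, giving the final shape from the proposal.
resolved = images[refs]
print(resolved.shape)   # (120, 5, 10, 20, 30)
```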
So basically, all of these dtypes are uint64 numbers or pairs of uint64 numbers. They point into other arrays that can be stored in different .npy files and/or in memory, making these dtypes memory mappable. Since only multiple regular .npy files are used, they can always be appended to. Doing many append operations on multiple files may stress the file system (fragmentation), but it keeps the implementation simple.
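The Range-into-byte-array String variant (3.2) can be sketched with plain arrays available today: one flat uint8 "heap" plus a structured array of (offset, length) pairs (the layout and field names are my assumption, not part of the proposal):

```python
import numpy as np

# Variant 3.2 sketch: all string bytes in one flat uint8 heap array,
# plus (offset, length) Ranges stored as a structured uint64 array.
heap = np.frombuffer(b"helloworldnumpy", dtype=np.uint8)
ranges = np.array([(0, 5), (5, 5), (10, 5)],
                  dtype=[("offset", "<u8"), ("length", "<u8")])

def get(i):
    # Resolve Range i: slice the heap and decode the bytes.
    o, n = ranges[i]["offset"], ranges[i]["length"]
    return heap[o:o + n].tobytes().decode()

print(get(1))   # world
```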
Additional .npy files can be organized in a hierarchical way (one main .npy file, with all other filenames derived from it) or as multiple files referring to each other like in Microsoft Excel, with the difference that references are not per cell but per column. The two options do not exclude each other: filenames in dtypes could simply be optional. If a filename is provided, it is option 2; otherwise it is option 1 and filenames are derived automatically.
Option 1: Organize .npy hierarchically with automatic filenames
Let's say we save some "main.npy" which refers to other arrays. All those other arrays can then be saved as "main.npy.heap1", "main.npy.heap2", ... if the dtype of main.npy is a struct with multiple Reference, Range or String fields, or just as "main.npy.heap" for simple, non-struct dtypes. One could also use a different ending for Strings, like "main.npy.strings", which would be a (regular .npy) byte array. Although the heap files do not end in .npy, they would be regular .npy files. When main.npy gets loaded, the files of all reference dtypes are loaded or memory mapped as well. Even without extra documentation, a user could quickly see that those files belong together, since they share the main .npy file's name. String dtypes would require such an approach if one did not want to embed strings into the .npy file format by introducing a new NumPy file format version and/or sacrificing appendability.
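A hedged sketch of how a save/load pair might implement this naming scheme today, with Ranges stored as ordinary uint64 arrays; the `save_ragged`/`load_ragged` helpers and the exact "`.heap`" suffix handling are my invention, only the naming convention comes from the text:

```python
import numpy as np
import os, tempfile

# Option 1 convention: "<name>.npy" stores (offset, length) pairs as uint64,
# "<name>.npy.heap" stores the referred-to data; both are ordinary .npy files.
def save_ragged(path, chunks):
    heap = np.concatenate(chunks)
    lengths = np.array([len(c) for c in chunks], dtype=np.uint64)
    offsets = np.concatenate(([0], np.cumsum(lengths)[:-1])).astype(np.uint64)
    np.save(path, np.stack([offsets, lengths], axis=1))
    with open(path + ".heap", "wb") as f:   # file object keeps the name verbatim
        np.save(f, heap)                    # (np.save would otherwise add ".npy")

def load_ragged(path):
    ranges = np.load(path)
    heap = np.load(path + ".heap", mmap_mode="r")   # heap stays memory mapped
    return [heap[o:o + n] for o, n in ranges]

path = os.path.join(tempfile.mkdtemp(), "main.npy")
save_ragged(path, [np.arange(3, dtype=np.uint8),
                   np.arange(5, dtype=np.uint8)])
print([len(c) for c in load_ragged(path)])   # [3, 5]
```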
Option 2: Excel-like cross referenced .npy files
If one array has a Reference, Range or String dtype that refers to another array, both could be saved in different files, with the filenames specified explicitly in the dtype description. This would be comparable to cross-file cell references in Excel, just on a per-column basis: e.g. "video.npy" could have one Range dtype that refers to "image.npy", with "image.npy" containing all (uncompressed, maybe low-res) images of all videos in their order. Another file "video_events.npy" could contain Ranges into the parts of those videos where some events happen. So one could have multiple Reference/Range sets into the same data. With option 1 this would require "image.npy" to be duplicated, which might or might not be convenient/possible.
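To make the sharing concrete, here are two independent Range sets over one shared image array; the filenames in the comments are taken from the text, while an actual cross-file Reference dtype would carry the target filename in its dtype metadata, which is only mimicked here with in-memory arrays:

```python
import numpy as np

# Stands in for image.npy: all frames of all videos, concatenated in order.
images = np.zeros((50, 20, 30), dtype=np.uint8)

# Stands in for video.npy: two videos as (start, length) Ranges into images.
video_ranges = np.array([(0, 30), (30, 20)],
                        dtype=[("start", "<u8"), ("length", "<u8")])
# Stands in for video_events.npy: a second, independent Range set
# into the very same frames.
event_ranges = np.array([(5, 3), (42, 4)],
                        dtype=[("start", "<u8"), ("length", "<u8")])

def frames(ranges, i):
    # Resolve Range i of a Range set into a view of the shared frames.
    s, n = ranges[i]["start"], ranges[i]["length"]
    return images[s:s + n]

print(frames(video_ranges, 0).shape, frames(event_ranges, 1).shape)
# (30, 20, 30) (4, 20, 30)
```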
That's it for now. Any feedback, suggestions, questions?