Decouple memory buffers from columns. #40222
Description
Use case
There are the following possibilities:
- allocate buffers in shared memory;
- allocate buffers in hugetlbfs;
- allocate buffers in GPU memory or other non-uniform memory;
- borrow a buffer for manipulation by external code;
- make a buffer refcounted and allow references to a sub-range of it;
- the buffer can own or not own its memory allocation;
- make a buffer read-only with mprotect and clear the flag on destruction (note: we already do this in debug builds);
- calculate a checksum, then verify it and assert in the destructor;
- make it compressed or offloaded and lazily reconstruct the contents on page fault;
- use a cache for column's data, but make its contents usable directly in-place, without memcpy;
- attach the data from Arrow directly into a column;
- attach the data from NumPy directly into a column;
- in some cases - avoid memcpy from ReadBuffer into Column while reading.
Describe the solution you'd like
A Buffer interface with (char *, size_t) as public members, but with a virtual destructor and some virtual methods (create a new empty buffer of another size, realloc if possible, etc).
A factory can be used to create Buffers of various types, and an already created Buffer (even an empty one) can be used as a factory to create more Buffers of the same type.
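A minimal sketch of what such an interface could look like. All names here (`Buffer`, `MallocBuffer`, `create`, `tryRealloc`) are hypothetical and chosen for illustration; they are not existing ClickHouse classes. The point is that the pointer and size are plain public members, and only creation, reallocation and destruction go through virtual calls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <new>

/// Hypothetical interface: raw span as public members, virtual lifecycle only.
struct Buffer
{
    char * data = nullptr;
    size_t size = 0;

    virtual ~Buffer() = default;

    /// Create a new buffer of the same concrete type with the given size.
    virtual std::unique_ptr<Buffer> create(size_t new_size) const = 0;

    /// Resize if this kind of buffer supports it; returns false otherwise.
    virtual bool tryRealloc(size_t new_size) = 0;
};

/// One possible implementation: an owning buffer on the ordinary heap.
struct MallocBuffer final : Buffer
{
    explicit MallocBuffer(size_t n = 0)
    {
        if (n == 0)
            return;
        data = static_cast<char *>(std::malloc(n));
        if (!data)
            throw std::bad_alloc();
        size = n;
    }

    ~MallocBuffer() override { std::free(data); }

    MallocBuffer(const MallocBuffer &) = delete;
    MallocBuffer & operator=(const MallocBuffer &) = delete;

    std::unique_ptr<Buffer> create(size_t new_size) const override
    {
        return std::make_unique<MallocBuffer>(new_size);
    }

    bool tryRealloc(size_t new_size) override
    {
        if (new_size == 0)
        {
            std::free(data);
            data = nullptr;
            size = 0;
            return true;
        }
        void * p = std::realloc(data, new_size);
        if (!p)
            return false; // the old allocation is still valid
        data = static_cast<char *>(p);
        size = new_size;
        return true;
    }
};

/// Usage: an empty buffer serves as a factory for more buffers of its type.
inline size_t usageExample()
{
    MallocBuffer prototype;
    std::unique_ptr<Buffer> buf = prototype.create(16);
    std::memset(buf->data, 0, buf->size); // direct span access, no virtual call
    if (!buf->tryRealloc(64))
        return 0;
    return buf->size;
}
```

A shared-memory, hugetlbfs or borrowed (non-owning) buffer would be another subclass under this sketch; a non-owning one would presumably return `false` from `tryRealloc` and skip deallocation in its destructor.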
A parameterization of PODArray can be used to provide a version of this class backed by a polymorphic Buffer, while the normal PODArray remains without any virtual methods.
PODArray, Column, and DataType (for the methods that create Columns) can take a Buffer object and use it to create Buffers of the needed type.
The overhead should be only on construction and destruction; for reading and writing the contents, it is just a naked span of memory (zero overhead).
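The parameterization idea can be sketched roughly as follows. This is not the real PODArray; `SimpleArray`, `PlainStorage` and `PolymorphicStorage` are hypothetical names, and the `Buffer`/`MallocBuffer` types are reduced stand-ins for the interface described above. The storage policy is a template parameter, so the plain variant has no virtual methods, and both variants expose the same raw pointer to the hot loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>
#include <type_traits>
#include <utility>

/// Reduced stand-in for the polymorphic Buffer interface (hypothetical).
struct Buffer
{
    char * data = nullptr;
    size_t size = 0;
    virtual ~Buffer() = default;
    virtual std::unique_ptr<Buffer> create(size_t new_size) const = 0;
};

struct MallocBuffer final : Buffer
{
    explicit MallocBuffer(size_t n = 0)
    {
        if (n == 0)
            return;
        data = static_cast<char *>(std::malloc(n));
        if (!data)
            throw std::bad_alloc();
        size = n;
    }
    ~MallocBuffer() override { std::free(data); }
    MallocBuffer(const MallocBuffer &) = delete;
    MallocBuffer & operator=(const MallocBuffer &) = delete;
    std::unique_ptr<Buffer> create(size_t n) const override { return std::make_unique<MallocBuffer>(n); }
};

/// Policy 1: plain heap storage with no virtual methods at all.
struct PlainStorage
{
    char * data = nullptr;
    size_t size = 0;
    explicit PlainStorage(size_t n) : data(static_cast<char *>(std::malloc(n ? n : 1))), size(n)
    {
        if (!data)
            throw std::bad_alloc();
    }
    ~PlainStorage() { std::free(data); }
    PlainStorage(const PlainStorage &) = delete;
    PlainStorage & operator=(const PlainStorage &) = delete;
};
static_assert(!std::is_polymorphic_v<PlainStorage>);

/// Policy 2: storage obtained from a prototype Buffer acting as a factory.
struct PolymorphicStorage
{
    std::unique_ptr<Buffer> buffer;
    char * data;  // cached once, so element access never goes through the vtable
    size_t size;
    PolymorphicStorage(const Buffer & prototype, size_t n)
        : buffer(prototype.create(n)), data(buffer->data), size(buffer->size) {}
};

/// Simplified array: the virtual cost, if any, is paid in the constructor/destructor.
template <typename T, typename Storage>
class SimpleArray
{
    static_assert(std::is_trivially_copyable_v<T>);
    Storage storage;

public:
    template <typename... Args>
    explicit SimpleArray(size_t count, Args &&... args)
        : storage(std::forward<Args>(args)..., count * sizeof(T)) {}

    T * data() { return reinterpret_cast<T *>(storage.data); }
    size_t size() const { return storage.size / sizeof(T); }
};

/// The hot loop sees only a raw span, regardless of where the memory came from.
inline long sumInts(const int * begin, const int * end)
{
    long total = 0;
    for (const int * p = begin; p != end; ++p)
        total += *p;
    return total;
}

inline long usageExample()
{
    MallocBuffer prototype;
    SimpleArray<int, PlainStorage> plain(4);
    SimpleArray<int, PolymorphicStorage> poly(4, prototype);
    for (size_t i = 0; i < 4; ++i)
    {
        plain.data()[i] = static_cast<int>(i);
        poly.data()[i] = static_cast<int>(i);
    }
    return sumInts(plain.data(), plain.data() + plain.size())
        + sumInts(poly.data(), poly.data() + poly.size());
}
```

In this sketch a Column or DataType would simply forward the prototype Buffer to the array constructor; how that would be threaded through the real class hierarchy is left open here.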