Commit a7fab04

rok, jorisvandenbossche, and pitrou authored
GH-24868: [C++] Add a Tensor logical value type with varying dimensions, implemented using ExtensionType (#37166)
### Rationale for this change

For use cases where the underlying datatype and the number of dimensions of the tensors are equal but the actual shape is not, we want to add a `VariableShapeTensorType`. See #24868 and huggingface/datasets#5272

### What changes are included in this PR?

This introduces the definition of the `arrow.variable_shape_tensor` extension, its C++ implementation, and a Python wrapper.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

This introduces a new extension type to the user.

* Closes: #24868

Lead-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
1 parent 223739a commit a7fab04

1 file changed: +103 -0 lines changed

docs/source/format/CanonicalExtensions.rst

Lines changed: 103 additions & 0 deletions
@@ -148,6 +148,109 @@ Fixed shape tensor
by this specification. Instead, this extension type lets one use fixed shape tensors
as elements in a field of a RecordBatch or a Table.

.. _variable_shape_tensor_extension:

Variable shape tensor
=====================

* Extension name: ``arrow.variable_shape_tensor``.

* The storage type of the extension is a ``StructArray`` whose struct
  is composed of **data** and **shape** fields describing a single
  tensor per row:

  * **data** is a ``List`` holding the tensor elements (each list element is
    a single tensor). The List's value type is the value type of the tensor,
    such as an integer or floating-point type.
  * **shape** is a ``FixedSizeList<int32>[ndim]`` of the tensor shape, where
    the list size ``ndim`` is equal to the number of dimensions of the
    tensor.

169+
* Extension type parameters:
170+
171+
* **value_type** = the Arrow data type of individual tensor elements.
172+
173+
Optional parameters describing the logical layout:
174+
175+
* **dim_names** = explicit names to tensor dimensions
176+
as an array. The length of it should be equal to the shape
177+
length and equal to the number of dimensions.
178+
179+
``dim_names`` can be used if the dimensions have well-known
180+
names and they map to the physical layout (row-major).
181+
  * **permutation** = indices of the desired ordering of the
    original dimensions, defined as an array.

    The indices contain a permutation of the values [0, 1, .., N-1] where
    N is the number of dimensions. The permutation indicates which
    dimension of the logical layout corresponds to which dimension of the
    physical tensor (the i-th dimension of the logical view corresponds
    to the dimension with number ``permutation[i]`` of the physical tensor).

    Permutation can be useful in case the logical order of
    the tensor is a permutation of the physical order (row-major).

    When logical and physical layout are equal, the permutation will always
    be ([0, 1, .., N-1]) and can therefore be left out.

197+
* **uniform_shape** = sizes of individual tensor's dimensions which are
198+
guaranteed to stay constant in uniform dimensions and can vary in
199+
non-uniform dimensions. This holds over all tensors in the array.
200+
Sizes in uniform dimensions are represented with int32 values, while
201+
sizes of the non-uniform dimensions are not known in advance and are
202+
represented with null. If ``uniform_shape`` is not provided it is assumed
203+
that all dimensions are non-uniform.
204+
An array containing a tensor with shape (2, 3, 4) and whose first and
205+
last dimensions are uniform would have ``uniform_shape`` (2, null, 4).
206+
This allows for interpreting the tensor correctly without accounting for
207+
uniform dimensions while still permitting optional optimizations that
208+
take advantage of the uniformity.
209+
210+
* Description of the serialization:
211+
212+
The metadata must be a valid JSON object that optionally includes
213+
dimension names with keys **"dim_names"** and ordering of dimensions
214+
with key **"permutation"**.
215+
Shapes of tensors can be defined in a subset of dimensions by providing
216+
key **"uniform_shape"**.
217+
Minimal metadata is an empty string.
218+
  - Example with ``dim_names`` metadata for NCHW ordered data (note that the first
    logical dimension, ``N``, is mapped to the **data** List array: each element in the List
    is a CHW tensor and the List of tensors implicitly constitutes a single NCHW tensor):

    ``{ "dim_names": ["C", "H", "W"] }``

  - Example with ``uniform_shape`` metadata for a set of color images
    with fixed height, variable width and three color channels:

    ``{ "dim_names": ["H", "W", "C"], "uniform_shape": [400, null, 3] }``

  - Example of a permuted 3-dimensional tensor:

    ``{ "permutation": [2, 0, 1] }``

    For example, if the physical **shape** of an individual tensor
    is ``[100, 200, 500]``, this permutation would denote a logical shape
    of ``[500, 100, 200]``.

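The serialization rules above can be sketched in plain Python. This is only an illustration, not Arrow's implementation: the helpers ``parse_metadata`` and ``shape_matches_uniform`` are hypothetical names, and treating the empty string as "no optional keys" is an assumption drawn from the remark about minimal metadata.

```python
import json


def parse_metadata(serialized, ndim):
    """Parse extension metadata and check it against ``ndim`` (hypothetical helper)."""
    # Assumption: an empty string (the minimal metadata) means "no optional keys".
    meta = json.loads(serialized) if serialized else {}
    if not isinstance(meta, dict):
        raise ValueError("metadata must be a JSON object")

    dim_names = meta.get("dim_names")
    if dim_names is not None and len(dim_names) != ndim:
        raise ValueError("dim_names must have one entry per dimension")

    permutation = meta.get("permutation")
    if permutation is not None and sorted(permutation) != list(range(ndim)):
        raise ValueError("permutation must be a permutation of [0, 1, .., N-1]")

    uniform_shape = meta.get("uniform_shape")
    if uniform_shape is not None and len(uniform_shape) != ndim:
        raise ValueError("uniform_shape must have one entry per dimension")
    return meta


def shape_matches_uniform(shape, uniform_shape):
    """Return True if ``shape`` agrees with every non-null entry of ``uniform_shape``."""
    if uniform_shape is None:  # all dimensions are treated as non-uniform
        return True
    return all(u is None or u == s for s, u in zip(shape, uniform_shape))


# Metadata taken from the color-image example: fixed height, variable width.
meta = parse_metadata(
    '{ "dim_names": ["H", "W", "C"], "uniform_shape": [400, null, 3] }', ndim=3
)
print(shape_matches_uniform([400, 640, 3], meta["uniform_shape"]))  # True
print(shape_matches_uniform([300, 640, 3], meta["uniform_shape"]))  # False
```

JSON ``null`` maps to Python ``None``, so a non-uniform dimension accepts any size while the uniform ones must match exactly.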
.. note::

  With the exception of ``permutation``, the parameters and storage
  of VariableShapeTensor relate to the *physical* storage of the tensor.

  For example, consider a tensor with::

    shape = [10, 20, 30]
    dim_names = [x, y, z]
    permutation = [2, 0, 1]

  This means the logical tensor has names [z, x, y] and shape [30, 10, 20].

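The mapping from the physical to the logical view in the note above can be written out as a short Python sketch. The function name ``logical_view`` is hypothetical and only illustrates the rule that the i-th logical dimension is the physical dimension numbered ``permutation[i]``.

```python
def logical_view(shape, dim_names, permutation):
    """Apply ``permutation`` to a physical shape and its dimension names."""
    # The i-th logical dimension is physical dimension permutation[i].
    logical_shape = [shape[p] for p in permutation]
    logical_names = [dim_names[p] for p in permutation]
    return logical_names, logical_shape


names, shape = logical_view([10, 20, 30], ["x", "y", "z"], [2, 0, 1])
print(names, shape)  # ['z', 'x', 'y'] [30, 10, 20]
```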
.. note::

  Values inside each **data** tensor element are stored in row-major/C-contiguous
  order according to the corresponding **shape**.

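To make the row-major rule concrete, here is a minimal pure-Python sketch of how a multi-dimensional index could be resolved against one row's flat **data** using its **shape**. Plain dicts and lists stand in for the struct storage; ``row_major_strides`` and ``element_at`` are hypothetical helpers, not Arrow functions.

```python
def row_major_strides(shape):
    """Element strides for a C-contiguous tensor of the given shape."""
    strides = [1] * len(shape)
    # The last dimension is contiguous; each earlier stride is the product
    # of the sizes of all later dimensions.
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


def element_at(data, shape, index):
    """Look up one element of a flat, row-major ``data`` list."""
    strides = row_major_strides(shape)
    return data[sum(i * s for i, s in zip(index, strides))]


# Two rows with the same number of dimensions but different shapes.
rows = [
    {"data": [1, 2, 3, 4, 5, 6], "shape": [2, 3]},
    {"data": [7, 8], "shape": [1, 2]},
]
print(row_major_strides(rows[0]["shape"]))  # [3, 1]
print(element_at(rows[0]["data"], rows[0]["shape"], (1, 2)))  # 6
print(element_at(rows[1]["data"], rows[1]["shape"], (0, 1)))  # 8
```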
=========================
Community Extension Types
=========================
