Commit 9dec272
jorisvandenbossche, paleolimbot, pitrou and zeroshade authored
GH-38325: [Python] Expand the Arrow PyCapsule Interface with C Device Data support (#40708)
### Rationale for this change

We defined a protocol exposing the C Data Interface (schema, array and stream) in Python through PyCapsule objects and dunder methods `__arrow_c_schema/array/stream__` (#35531 / #37797). We also expanded the C Data Interface with device capabilities: https://arrow.apache.org/docs/dev/format/CDeviceDataInterface.html (#34972).

This expands the Python exposure of the interface with support for the newer Device structs.

### What changes are included in this PR?

Update the specification to define two additional dunders:

* `__arrow_c_device_array__` returns a pair of PyCapsules containing a C ArrowSchema and ArrowDeviceArray, where the latter uses "arrow_device_array" for the capsule name
* `__arrow_c_device_stream__` returns a PyCapsule containing a C ArrowDeviceArrayStream, where the capsule must have a name of "arrow_device_array_stream"

### Are these changes tested?

Spec-only change

* GitHub Issue: #38325

Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
1 parent a42ec1d commit 9dec272

2 files changed

Lines changed: 130 additions & 8 deletions

docs/source/format/CDataInterface/PyCapsuleInterface.rst

Lines changed: 129 additions & 8 deletions
@@ -27,8 +27,8 @@ The Arrow PyCapsule Interface
 Rationale
 =========
 
-The :ref:`C data interface <c-data-interface>` and
-:ref:`C stream interface <c-stream-interface>` allow moving Arrow data between
+The :ref:`C data interface <c-data-interface>`, :ref:`C stream interface <c-stream-interface>`
+and :ref:`C device interface <c-device-data-interface>` allow moving Arrow data between
 different implementations of Arrow. However, these interfaces don't specify how
 Python libraries should expose these structs to other libraries. Prior to this,
 many libraries simply provided export to PyArrow data structures, using the
@@ -43,7 +43,7 @@ Goals
 -----
 
 * Standardize the `PyCapsule`_ objects that represent ``ArrowSchema``, ``ArrowArray``,
-  and ``ArrowArrayStream``.
+  ``ArrowArrayStream``, ``ArrowDeviceArray`` and ``ArrowDeviceArrayStream``.
 * Define standard methods that export Arrow data into such capsule objects,
   so that any Python library wanting to accept Arrow data as input can call the
   corresponding method instead of hardcoding support for specific Arrow
@@ -80,7 +80,10 @@ Arrow structures are recognized, the following names must be used:
      - ``arrow_array``
    * - ArrowArrayStream
      - ``arrow_array_stream``
-
+   * - ArrowDeviceArray
+     - ``arrow_device_array``
+   * - ArrowDeviceArrayStream
+     - ``arrow_device_array_stream``
 
 Lifetime Semantics
 ------------------
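To illustrate the capsule names added in this hunk, here is a minimal sketch of how a consumer might verify a capsule's name before unpacking it. It calls CPython's capsule C API through `ctypes.pythonapi`. The helper names `make_capsule` and `is_capsule_named` and the dummy payload are assumptions made for this example and are not part of the specification.

```python
import ctypes

# Bind the CPython capsule functions used below.
_new = ctypes.pythonapi.PyCapsule_New
_new.restype = ctypes.py_object
_new.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

_is_valid = ctypes.pythonapi.PyCapsule_IsValid
_is_valid.restype = ctypes.c_int
_is_valid.argtypes = [ctypes.py_object, ctypes.c_char_p]

# Capsule names from the table. The bytes objects are kept at module level
# because a capsule stores the name pointer without copying the string.
DEVICE_ARRAY_NAME = b"arrow_device_array"
DEVICE_STREAM_NAME = b"arrow_device_array_stream"

# Stand-in payload; a real producer would point at an ArrowDeviceArray struct.
_payload = ctypes.create_string_buffer(8)


def make_capsule(name: bytes) -> object:
    """Wrap the dummy payload in a capsule with the given name (no destructor)."""
    return _new(ctypes.addressof(_payload), name, None)


def is_capsule_named(obj: object, name: bytes) -> bool:
    """Return True if obj is a valid PyCapsule whose name equals name."""
    return bool(_is_valid(obj, name))


capsule = make_capsule(DEVICE_ARRAY_NAME)
```

A real producer would also pass a destructor that calls the struct's release callback if the capsule is never consumed; the sketch omits this because its payload is only a placeholder.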
@@ -95,6 +98,10 @@ the data and marked the release callback as null, so there isn’t a risk of
 releasing data the consumer is using.
 :ref:`Read more in the C Data Interface specification <c-data-interface-released>`.
 
+In case of a device struct, the above mentioned release callback is the
+``release`` member of the embedded ``ArrowArray`` structure.
+:ref:`Read more in the C Device Interface specification <c-device-data-interface-semantics>`.
+
 Just like in the C Data Interface, the PyCapsule objects defined here can only
 be consumed once.
 
@@ -110,12 +117,17 @@ The interface consists of three separate protocols:
 * ``ArrowArrayExportable``, which defines the ``__arrow_c_array__`` method.
 * ``ArrowStreamExportable``, which defines the ``__arrow_c_stream__`` method.
 
+Two additional protocols are defined for the Device interface:
+
+* ``ArrowDeviceArrayExportable``, which defines the ``__arrow_c_device_array__`` method.
+* ``ArrowDeviceStreamExportable``, which defines the ``__arrow_c_device_stream__`` method.
+
 ArrowSchema Export
 ------------------
 
 Schemas, fields, and data types can implement the method ``__arrow_c_schema__``.
 
-.. py:method:: __arrow_c_schema__(self) -> object
+.. py:method:: __arrow_c_schema__(self)
 
     Export the object as an ArrowSchema.
 
@@ -129,7 +141,7 @@ ArrowArray Export
 Arrays and record batches (contiguous tables) can implement the method
 ``__arrow_c_array__``.
 
-.. py:method:: __arrow_c_array__(self, requested_schema: object | None = None) -> Tuple[object, object]
+.. py:method:: __arrow_c_array__(self, requested_schema=None)
 
     Export the object as a pair of ArrowSchema and ArrowArray structures.
 
@@ -142,13 +154,32 @@ Arrays and record batches (contiguous tables) can implement the method
         respectively. The schema capsule should have the name ``"arrow_schema"``
         and the array capsule should have the name ``"arrow_array"``.
 
+Libraries supporting the Device interface can implement a ``__arrow_c_device_array__``
+method on those objects, which works the same as ``__arrow_c_array__`` except
+for returning an ArrowDeviceArray structure instead of an ArrowArray structure:
+
+.. py:method:: __arrow_c_device_array__(self, requested_schema=None, **kwargs)
+
+    Export the object as a pair of ArrowSchema and ArrowDeviceArray structures.
+
+    :param requested_schema: A PyCapsule containing a C ArrowSchema representation
+        of a requested schema. Conversion to this schema is best-effort. See
+        `Schema Requests`_.
+    :type requested_schema: PyCapsule or None
+    :param kwargs: Additional keyword arguments should only be accepted if they have
+        a default value of ``None``, to allow for future addition of new keywords.
+        See :ref:`arrow-pycapsule-interface-device-support` for more details.
+
+    :return: A pair of PyCapsules containing a C ArrowSchema and ArrowDeviceArray,
+        respectively. The schema capsule should have the name ``"arrow_schema"``
+        and the array capsule should have the name ``"arrow_device_array"``.
 
 ArrowStream Export
 ------------------
 
 Tables / DataFrames and streams can implement the method ``__arrow_c_stream__``.
 
-.. py:method:: __arrow_c_stream__(self, requested_schema: object | None = None) -> object
+.. py:method:: __arrow_c_stream__(self, requested_schema=None)
 
     Export the object as an ArrowArrayStream.
 
@@ -160,6 +191,26 @@ Tables / DataFrames and streams can implement the method ``__arrow_c_stream__``.
     :return: A PyCapsule containing a C ArrowArrayStream representation of the
         object. The capsule must have a name of ``"arrow_array_stream"``.
 
+Libraries supporting the Device interface can implement a ``__arrow_c_device_stream__``
+method on those objects, which works the same as ``__arrow_c_stream__`` except
+for returning an ArrowDeviceArrayStream structure instead of an ArrowArrayStream
+structure:
+
+.. py:method:: __arrow_c_device_stream__(self, requested_schema=None, **kwargs)
+
+    Export the object as an ArrowDeviceArrayStream.
+
+    :param requested_schema: A PyCapsule containing a C ArrowSchema representation
+        of a requested schema. Conversion to this schema is best-effort. See
+        `Schema Requests`_.
+    :type requested_schema: PyCapsule or None
+    :param kwargs: Additional keyword arguments should only be accepted if they have
+        a default value of ``None``, to allow for future addition of new keywords.
+        See :ref:`arrow-pycapsule-interface-device-support` for more details.
+
+    :return: A PyCapsule containing a C ArrowDeviceArrayStream representation of the
+        object. The capsule must have a name of ``"arrow_device_array_stream"``.
+
 Schema Requests
 ---------------
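From the caller's side, the ``**kwargs`` rule documented for the device methods could behave roughly as follows. This is only a sketch: `StreamProducer`, `reject_unknown_kwargs` and `some_future_option` are hypothetical names, and the return value is a placeholder string rather than a real capsule.

```python
def reject_unknown_kwargs(kwargs: dict) -> None:
    """Raise NotImplementedError for any keyword whose value is not None."""
    non_default = [name for name, value in kwargs.items() if value is not None]
    if non_default:
        raise NotImplementedError(
            f"Received unsupported keyword argument(s): {non_default}"
        )


class StreamProducer:
    """Hypothetical producer; returns a placeholder instead of a real capsule."""

    def __arrow_c_device_stream__(self, requested_schema=None, **kwargs):
        reject_unknown_kwargs(kwargs)
        return "placeholder for an arrow_device_array_stream capsule"


producer = StreamProducer()

# A keyword left at None is tolerated, even if this producer predates it.
accepted = producer.__arrow_c_device_stream__(some_future_option=None)

# A keyword carrying a real value is refused.
try:
    producer.__arrow_c_device_stream__(some_future_option=42)
    refused = False
except NotImplementedError:
    refused = True
```

A consumer written against a later revision of the specification can therefore always pass new keywords at their ``None`` default without breaking older producers.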

@@ -185,10 +236,64 @@ raise an exception. The requested schema mechanism is only meant to negotiate
 between different representations of the same data and not to allow arbitrary
 schema transformations.
 
-
 .. _PyCapsule: https://docs.python.org/3/c-api/capsule.html
 
 
+.. _arrow-pycapsule-interface-device-support:
+
+Device Support
+--------------
+
+The PyCapsule interface has cross hardware support through using the
+:ref:`C device interface <c-device-data-interface>`. This means it is possible
+to exchange data on non-CPU devices (e.g. CUDA GPUs) and to inspect on what
+device the exchanged data lives.
+
+For exchanging the data structures, this interface has two sets of protocol
+methods: the standard CPU-only versions (:meth:`__arrow_c_array__` and
+:meth:`__arrow_c_stream__`) and the equivalent device-aware versions
+(:meth:`__arrow_c_device_array__`, and :meth:`__arrow_c_device_stream__`).
+
+For CPU-only producers, it is allowed to either implement only the standard
+CPU-only protocol methods, or either implement both the CPU-only and device-aware
+methods. The absence of the device version methods implies CPU-only data. For
+CPU-only consumers, it is encouraged to be able to consume both versions of the
+protocol.
+
+For a device-aware producer whose data structures can only reside in
+non-CPU memory, it is recommended to only implement the device version of the
+protocol (e.g. only add ``__arrow_c_device_array__``, and not add ``__arrow_c_array__``).
+Producers that have data structures that can live both on CPU or non-CPU devices
+can implement both versions of the protocol, but the CPU-only versions
+(:meth:`__arrow_c_array__` and :meth:`__arrow_c_stream__`) should be guaranteed
+to contain valid pointers for CPU memory (thus, when trying to export non-CPU data,
+either raise an error or make a copy to CPU memory).
+
+Producing the ``ArrowDeviceArray`` and ``ArrowDeviceArrayStream`` structures
+is expected to not involve any cross-device copying of data.
+
+The device-aware methods (:meth:`__arrow_c_device_array__`, and :meth:`__arrow_c_device_stream__`)
+should accept additional keyword arguments (``**kwargs``), if they have a
+default value of ``None``. This allows for future addition of new optional
+keywords, where the default value for such a new keyword will always be ``None``.
+The implementor is responsible for raising a ``NotImplementedError`` for any
+additional keyword being passed by the user which is not recognised. For
+example:
+
+.. code-block:: python
+
+    def __arrow_c_device_array__(self, requested_schema=None, **kwargs):
+
+        non_default_kwargs = [
+            name for name, value in kwargs.items() if value is not None
+        ]
+        if non_default_kwargs:
+            raise NotImplementedError(
+                f"Received unsupported keyword argument(s): {non_default_kwargs}"
+            )
+
+        ...
+
 Protocol Typehints
 ------------------
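The Device Support text encourages consumers to handle both versions of the protocol and states that a missing device method implies CPU-only data. One way a consumer could act on that is sketched below. Trying the device-aware method first is an assumption of this sketch, not something the specification mandates, and the producer classes and the `export_array` helper are hypothetical and return placeholder tuples instead of real capsules.

```python
from typing import Any, Tuple


class CpuOnlyProducer:
    """Hypothetical producer that implements only the CPU-only method."""

    def __arrow_c_array__(self, requested_schema=None):
        return ("schema placeholder", "arrow_array placeholder")


class DeviceProducer:
    """Hypothetical producer whose data may live on a non-CPU device."""

    def __arrow_c_device_array__(self, requested_schema=None, **kwargs):
        return ("schema placeholder", "arrow_device_array placeholder")


def export_array(obj: Any) -> Tuple[str, Tuple[Any, Any]]:
    """Return which protocol was used and the exported (schema, array) pair."""
    # Try the device-aware method first (a choice made by this sketch).
    device_export = getattr(obj, "__arrow_c_device_array__", None)
    if device_export is not None:
        return "device", device_export()
    # No device method: per the text above, the data is CPU-only.
    cpu_export = getattr(obj, "__arrow_c_array__", None)
    if cpu_export is not None:
        return "cpu", cpu_export()
    raise TypeError(f"{type(obj).__name__} does not export Arrow array data")
```

A real consumer would go on to unpack the returned capsules and, for the device case, inspect the device type recorded in the ``ArrowDeviceArray`` struct before touching any buffers.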

@@ -217,6 +322,22 @@ function accepts an object implementing one of these protocols.
         ) -> object:
             ...
 
+    class ArrowDeviceArrayExportable(Protocol):
+        def __arrow_c_device_array__(
+            self,
+            requested_schema: object | None = None,
+            **kwargs,
+        ) -> Tuple[object, object]:
+            ...
+
+    class ArrowDeviceStreamExportable(Protocol):
+        def __arrow_c_device_stream__(
+            self,
+            requested_schema: object | None = None,
+            **kwargs,
+        ) -> object:
+            ...
+
 Examples
 ========
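The protocol classes above are intended for static typing. If a runtime check is also wanted, one option is `typing.runtime_checkable`, shown in the sketch below. The decorator is an addition made for this example and does not appear in the specification, `Optional[object]` is used so the snippet also runs on Python versions before 3.10, and `GpuArray` is a hypothetical class.

```python
from typing import Optional, Protocol, Tuple, runtime_checkable


@runtime_checkable
class ArrowDeviceArrayExportable(Protocol):
    def __arrow_c_device_array__(
        self,
        requested_schema: Optional[object] = None,
        **kwargs,
    ) -> Tuple[object, object]:
        ...


class GpuArray:
    """Hypothetical producer; returns placeholders instead of real capsules."""

    def __arrow_c_device_array__(self, requested_schema=None, **kwargs):
        return ("schema placeholder", "arrow_device_array placeholder")


def supports_device_export(obj: object) -> bool:
    """True if obj exposes __arrow_c_device_array__ (signature is not checked)."""
    return isinstance(obj, ArrowDeviceArrayExportable)
```

Note that `isinstance` against a runtime-checkable protocol only tests that the method exists; it does not validate the signature or the returned capsules.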

docs/source/format/CDeviceDataInterface.rst

Lines changed: 1 addition & 0 deletions
@@ -348,6 +348,7 @@ Notes:
 synchronization is needed for an extension device, the producer
 should document the type.
 
+.. _c-device-data-interface-semantics:
 
 Semantics
 =========
