ARROW-8989: [C++][Doc] Document available compute functions #7695

pitrou · 2020-07-09T19:05:02Z

Also fix glaring bugs in arithmetic kernels
(signed overflow detection was broken).

pitrou · 2020-07-09T19:06:54Z

github-actions · 2020-07-09T19:16:51Z

https://issues.apache.org/jira/browse/ARROW-8989

nealrichardson

I do like this! A lot! And I learned about some new functions I'm eager to try out.

A bunch of questions and suggestions throughout, things that came to me as I was reading. Would love to see them happen here, but feel free to add in some more TODO comments for anything you can't get to and we can revisit later.

nealrichardson · 2020-07-10T15:20:37Z

docs/source/cpp/compute.rst

Yes please, at least an example

nealrichardson · 2020-07-10T15:21:40Z

docs/source/cpp/compute.rst

Will this automatically link to some generated docs that say what the possible options are? If not, can you add them here?

It will. That's the whole point of this markup, and the API docs I added :-)

nealrichardson · 2020-07-10T15:22:15Z

docs/source/cpp/compute.rst

What happens if you don't use a checked version and it does overflow? Do we recommend one or the other for general use?

If we made a recommendation for general use it'd probably be _checked so that users become aware of overflow sooner and can be intentional about handling it

I'd also recommend the checked versions. But I'm not sure we need to recommend anything. AFAIU, the compute layer isn't supposed to be a user-facing API like Pandas.
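To make the difference concrete, here is a minimal sketch in plain C++ of wrapping versus checked 32-bit addition. It is not Arrow's kernel code; the names `AddWrapping` and `AddChecked` are hypothetical and only illustrate the two behaviours discussed above.

```cpp
#include <cstdint>
#include <limits>

// Wrapping addition: add in the unsigned domain, where overflow is well
// defined, then convert back. The conversion back to int32_t is modular in
// C++20 and implementation-defined (but wrapping on mainstream compilers)
// before that.
inline int32_t AddWrapping(int32_t a, int32_t b) {
  return static_cast<int32_t>(static_cast<uint32_t>(a) + static_cast<uint32_t>(b));
}

// Checked addition: returns false and leaves *out untouched if the
// mathematical sum does not fit in int32_t; otherwise stores the sum.
inline bool AddChecked(int32_t a, int32_t b, int32_t* out) {
  constexpr int32_t kMax = std::numeric_limits<int32_t>::max();
  constexpr int32_t kMin = std::numeric_limits<int32_t>::min();
  if ((b > 0 && a > kMax - b) || (b < 0 && a < kMin - b)) {
    return false;  // overflow would occur
  }
  *out = a + b;
  return true;
}
```

With the wrapping variant, adding 1 to the maximum value silently yields the minimum value; the checked variant reports the problem instead, which is the argument made above for preferring it.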

nealrichardson · 2020-07-10T15:25:02Z

docs/source/cpp/compute.rst

What are the constraints on the pairs of inputs? I haven't tested all combinations, but it seems that generally you can do Array OPERATOR Scalar, Array OPERATOR Array iff the arrays are the same length, etc., and also for ChunkedArray. Some are also defined for RecordBatch and Table too right?
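The input-shape rule described in this comment (array with scalar, or two arrays of equal length) can be sketched as follows. This is a plain C++ illustration, not Arrow's dispatch code; `Operand`, `MakeArray`, `MakeScalar` and `AddElementwise` are hypothetical names, and `Operand` is only a stand-in for a Datum.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for a Datum that holds either an array or a scalar.
struct Operand {
  std::vector<int64_t> values;  // exactly one element when is_scalar is true
  bool is_scalar;
};

inline Operand MakeArray(std::vector<int64_t> v) { return Operand{std::move(v), false}; }
inline Operand MakeScalar(int64_t v) { return Operand{std::vector<int64_t>{v}, true}; }

// Element-wise addition: a scalar is broadcast against an array, and two
// arrays must have the same length.
inline Operand AddElementwise(const Operand& left, const Operand& right) {
  if (left.is_scalar && right.is_scalar) {
    return MakeScalar(left.values[0] + right.values[0]);
  }
  if (!left.is_scalar && !right.is_scalar &&
      left.values.size() != right.values.size()) {
    throw std::invalid_argument("array operands must have the same length");
  }
  const std::size_t n = left.is_scalar ? right.values.size() : left.values.size();
  std::vector<int64_t> out(n);
  for (std::size_t i = 0; i < n; ++i) {
    const int64_t l = left.is_scalar ? left.values[0] : left.values[i];
    const int64_t r = right.is_scalar ? right.values[0] : right.values[i];
    out[i] = l + r;
  }
  return MakeArray(std::move(out));
}
```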

nealrichardson · 2020-07-10T15:36:03Z

docs/source/cpp/compute.rst

What determines whether it's int32 or int64?

Whether the type is List or LargeList.
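The reason is that List stores 32-bit offsets and LargeList stores 64-bit offsets, so the length naturally has the same width as the offsets it is derived from. A minimal sketch in plain C++, ignoring null handling; `ListValueLengths` is a hypothetical helper, not an Arrow function:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Computes the length of each list slot from consecutive offsets.
// OffsetType would be int32_t for List and int64_t for LargeList, and the
// result uses the same integer type as the offsets.
template <typename OffsetType>
std::vector<OffsetType> ListValueLengths(const std::vector<OffsetType>& offsets) {
  std::vector<OffsetType> lengths;
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    lengths.push_back(static_cast<OffsetType>(offsets[i + 1] - offsets[i]));
  }
  return lengths;
}
```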

nealrichardson · 2020-07-10T15:36:29Z

docs/source/cpp/compute.rst

You get the answer by clicking on the CastOptions link not rendered :-) I'm not sure it's worth repeating here.

nealrichardson · 2020-07-10T15:36:51Z

docs/source/cpp/compute.rst

What do StrptimeOptions look like?

I think you can guess the answer now :-)

nealrichardson · 2020-07-10T15:39:21Z

docs/source/cpp/compute.rst

OMG yes please, probably more than one

Suggested change

-.. TODO: add C++ cast example
+C++ cast example::
+
+    std::shared_ptr<arrow::StringArray> inputs = ...;
+    ARROW_ASSIGN_OR_RAISE(arrow::Datum converted, arrow::compute::Cast(inputs, arrow::int32()));
+    auto inputs32 = std::static_pointer_cast<arrow::Int32Array>(converted.make_array());
+    // inputs32 is a std::shared_ptr, so its members are reached with ->
+    for (int64_t i = 0; i < inputs32->length(); ++i) {
+      ReportInput(inputs32->Value(i));
+    }

nealrichardson · 2020-07-10T15:40:23Z

docs/source/cpp/compute.rst

Why isn't dictionary_encode just cast to Dictionary?

Perhaps because it's special enough, though @wesm would know the answer better than me.

dictionary_encode infers the dictionary's type from the argument and always uses int32 indices. Cast requires that you specify the destination type
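As a rough illustration of what "infers the dictionary and uses int32 indices" means, here is a plain C++ sketch of dictionary encoding for strings. It is not Arrow's implementation; `DictionaryEncoded` and `DictionaryEncode` are hypothetical names, and ordering the dictionary by first appearance is a choice made for this sketch.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Result of encoding: the unique values plus one int32 index per input value.
struct DictionaryEncoded {
  std::vector<std::string> dictionary;  // unique values, in order of first appearance
  std::vector<int32_t> indices;         // position of each input value in `dictionary`
};

inline DictionaryEncoded DictionaryEncode(const std::vector<std::string>& values) {
  DictionaryEncoded out;
  std::unordered_map<std::string, int32_t> positions;
  for (const std::string& v : values) {
    auto it = positions.find(v);
    if (it == positions.end()) {
      // First time this value is seen: append it to the dictionary.
      it = positions.emplace(v, static_cast<int32_t>(out.dictionary.size())).first;
      out.dictionary.push_back(v);
    }
    out.indices.push_back(it->second);
  }
  return out;
}
```

The caller supplies only the values; the dictionary is derived from them. A cast, by contrast, needs the full destination type to be spelled out up front.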

bkietz · 2020-07-10T19:31:56Z

docs/source/cpp/compute.rst

Suggested change

-(null if input is null).
+(null if input is null). The output type is Int32 for List, Int64 for LargeList.

bkietz · 2020-07-10T19:38:06Z

docs/source/cpp/compute.rst

dictionary_encode infers the dictionary's type from the argument and always uses int32 indices. Cast requires that you specify the destination type

bkietz · 2020-07-10T19:46:17Z

docs/source/cpp/compute.rst

Suggested change

-.. TODO: add C++ cast example
+C++ cast example::
+
+    std::shared_ptr<arrow::StringArray> inputs = ...;
+    ARROW_ASSIGN_OR_RAISE(arrow::Datum converted, arrow::compute::Cast(inputs, arrow::int32()));
+    auto inputs32 = std::static_pointer_cast<arrow::Int32Array>(converted.make_array());
+    // inputs32 is a std::shared_ptr, so its members are reached with ->
+    for (int64_t i = 0; i < inputs32->length(); ++i) {
+      ReportInput(inputs32->Value(i));
+    }

bkietz · 2020-07-10T19:47:42Z

docs/source/cpp/compute.rst

Suggested change

-inputs is null.
+inputs is null, similar to the way any operation involving ``NaN`` devolves to ``NaN``.

bkietz · 2020-07-10T19:49:16Z

docs/source/cpp/compute.rst

Suggested change

-``_kleene``) where null is taken to mean "undefined". For those variants
+``_kleene``) where null is taken to mean "undefined". (This is the interpretation of null
+used in ``R`` and ``SQL``, for example.) For those variants
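The two null interpretations can be contrasted with a small three-valued-logic sketch in plain C++. This is only an illustration of the semantics, not Arrow's kernels; the `Tri` enum and the function names are hypothetical.

```cpp
// Three-valued boolean: false, true, or null.
enum class Tri { kFalse, kTrue, kNull };

// Default behaviour: the result is null as soon as either input is null.
inline Tri And(Tri a, Tri b) {
  if (a == Tri::kNull || b == Tri::kNull) return Tri::kNull;
  return (a == Tri::kTrue && b == Tri::kTrue) ? Tri::kTrue : Tri::kFalse;
}

// Kleene AND: null means "undefined", so a false input decides the result
// regardless of the other input.
inline Tri AndKleene(Tri a, Tri b) {
  if (a == Tri::kFalse || b == Tri::kFalse) return Tri::kFalse;
  if (a == Tri::kNull || b == Tri::kNull) return Tri::kNull;
  return Tri::kTrue;
}

// Kleene OR: a true input decides the result regardless of the other input.
inline Tri OrKleene(Tri a, Tri b) {
  if (a == Tri::kTrue || b == Tri::kTrue) return Tri::kTrue;
  if (a == Tri::kNull || b == Tri::kNull) return Tri::kNull;
  return Tri::kFalse;
}
```

For example, `And(kFalse, kNull)` is null under the default interpretation, while `AndKleene(kFalse, kNull)` is false because no value of the unknown input could make the conjunction true.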

bkietz · 2020-07-10T19:55:14Z

docs/source/cpp/compute.rst

Suggested change

-operation to each pair of elements gathered from the inputs. Each function
-is also available in an overflow-checking variant, suffixed ``_checked``.
+operation to each pair of elements gathered from the inputs. Integer overflow is
+handled by wrapping; the sum of ``1`` and the maximum value will evaluate to the
+minimum value for any integer type. Each function is also available in an
+overflow-checking variant, suffixed ``_checked``, which raises an error if
+overflow would occur.

bkietz · 2020-07-10T19:58:43Z

docs/source/cpp/compute.rst

If we made a recommendation for general use it'd probably be _checked so that users become aware of overflow sooner and can be intentional about handling it

bkietz · 2020-07-10T20:13:48Z

docs/source/cpp/compute.rst

Suggested change

-.. TODO: describe API and how to invoke compute functions
+The inputs and outputs of compute functions are of type :class:`arrow::Datum`, which is
+a discriminated union of several shapes of data, including :class:`arrow::Scalar`,
+:class:`arrow::Array`, and :class:`arrow::ChunkedArray`.
+
+Compute functions can be invoked by name using :func:`arrow::compute::CallFunction`
+or via dedicated C++ convenience functions.
+
+.. code-block:: cpp
+
+   std::shared_ptr<arrow::Int32Array> ints = ...;
+   std::shared_ptr<arrow::Int32Scalar> increment = ...;
+
+   // Invoke the overflow-checking kernel by name
+   ARROW_ASSIGN_OR_RAISE(arrow::Datum incremented, arrow::compute::CallFunction("add_checked",
+       {arrow::Datum(ints), arrow::Datum(increment)}));
+
+   // Equivalent call through the convenience function
+   arrow::compute::ArithmeticOptions options;
+   options.check_overflow = true;
+   ARROW_ASSIGN_OR_RAISE(arrow::Datum equivalent, arrow::compute::Add(
+       arrow::Datum(ints), arrow::Datum(increment), options));

Note: it isn't necessary to use Datum(...) explicitly in most cases because of implicit conversions.

This is true, the explicit typing was for clarity of exposition

wesm

This is a great start, I mostly have nit picks about how the documentation for the functions is written (I personally loathe RST table format, and having duplicate entries for functions for different input types seems also quite tedious)

wesm · 2020-07-12T21:17:56Z

docs/source/cpp/compute.rst

Note: it isn't necessary to use Datum(...) explicitly in most cases because of implicit conversions.

wesm · 2020-07-12T21:21:26Z

docs/source/cpp/compute.rst

I've always found these reStructuredText tables to be immensely tedious. What would you say about putting the source of the documentation in e.g. a JSON file (which would be much easier to edit) and then generating the RST markup from the JSON? Then if we need to restructure the output in some way we won't have to tear our hair out manually editing these tables

I don't find JSON files easy to edit at all. I'd much rather keep the reST table format. It's not the most natural format to edit in a text editor, but it's still reasonable if your editor has a block selection mode.

wesm · 2020-07-12T21:22:26Z

docs/source/cpp/compute.rst

What is the rationale for having multiple lines for each function?

wesm · 2020-07-12T21:23:19Z

docs/source/cpp/compute.rst

Same comment here, having multiple lines per function seems really tedious

wesm · 2020-07-12T21:24:22Z

docs/source/cpp/compute.rst

These are metafunctions. Probably also want to list array_filter and array_take

I don't know. Does it help the user to know about the non-metafunctions? Personally, I wouldn't know what to do with them.

wesm · 2020-07-12T21:27:23Z

I think either CSV or List tables would be an improvement over the current

https://docutils.sourceforge.io/docs/ref/rst/directives.html#tables

pitrou · 2020-07-13T09:42:52Z

I could give a try to list tables, but otherwise I think CSV or JSON would be a major PITA to edit later.

pitrou · 2020-07-13T09:50:28Z

Ok, list tables may be workable, but they don't make it easy to review docs simply by reading the source reST code. I'd rather keep the usual reST layout, unless you're using an editor that doesn't have a block selection mode at all.

pitrou · 2020-07-13T10:12:15Z

So there are three formats to choose from:

"full" reST table layout:

+--------------------------+------------+---------------------------------------------+---------------------+
| Function names           | Arity      | Input types                                 | Output type         |
+==========================+============+=============================================+=====================+
| equal, not_equal         | Binary     | Numeric, Temporal, Binary- and String-like  | Boolean             |
+--------------------------+------------+---------------------------------------------+---------------------+
| greater, greater_equal,  | Binary     | Numeric, Temporal, Binary- and String-like  | Boolean             |
| less, less_equal         |            |                                             |                     |
+--------------------------+------------+---------------------------------------------+---------------------+

"simplified" reST table layout:

========================================= =========== ============================================== =================
Function names                            Arity       Input types                                    Output type
========================================= =========== ============================================== =================
equal, not_equal                          Binary      Numeric, Temporal, Binary- and String-like     Boolean
greater, greater_equal, less, less_equal  Binary      Numeric, Temporal, Binary- and String-like     Boolean
========================================= =========== ============================================== =================

list table layout:

.. list-table::
   :header-rows: 1

   * - Function names
     - Arity
     - Input types
     - Output type
   * - equal, not_equal
     - Binary
     - Numeric, Temporal, Binary- and String-like
     - Boolean
   * - greater, greater_equal, less, less_equal
     - Binary
     - Numeric, Temporal, Binary- and String-like
     - Boolean

The simplified table layout doesn't allow multi-line cells or cell fusion. It's also not much easier to edit than full table layout.

The list table layout isn't easily reviewable in source format, you have to build the docs to see clearly what happens. And it will get quite unwieldy if you have 10 rows instead of 3 here.

Personally, I would favour the full table layout. Mostly you need to get used to it.
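If the table contents were ever kept as data and rendered to reST, as suggested earlier in this thread, the list-table layout is the simplest of the three to emit mechanically. A minimal sketch in plain C++; `ToListTable` is a hypothetical helper, not part of the Arrow docs tooling:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Renders rows of cells as a reST list-table. The first row is the header.
inline std::string ToListTable(const std::vector<std::vector<std::string>>& rows) {
  std::string out = ".. list-table::\n   :header-rows: 1\n\n";
  for (const auto& row : rows) {
    for (std::size_t i = 0; i < row.size(); ++i) {
      // The first cell of a row opens a new list item; the rest are nested items.
      out += (i == 0) ? "   * - " : "     - ";
      out += row[i];
      out += "\n";
    }
  }
  return out;
}
```

This does not address the reviewability concern raised above: the generated source is still hard to read without building the docs.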

bkietz · 2020-07-13T13:20:41Z

I could give a try to list tables, but otherwise I think CSV or JSON would be a major PITA to edit later.

Does that include tables which use the file: directive to refer to an out-of-line table source?

pitrou · 2020-07-13T13:21:39Z

Does that include tables which use the file: directive to refer to an out-of-line table source?

Do you mean you would like to edit a CSV table in a spreadsheet?

wesm · 2020-07-13T17:32:47Z

Do you mean you would like to edit a CSV table in a spreadsheet?

Yeah I think that would be the idea

pitrou · 2020-07-13T17:40:48Z

I think that would be a rather terrible doc writing experience: each time you want to update those docs, you have to fire a different tool for some part of the page. Also, you would have a number of small CSV files to open, not a single one...

wesm · 2020-07-13T17:42:13Z

I guess it's a matter of perspective. I don't feel at all comfortable editing the RST tables, whereas I would be fine editing a CSV file. Many text editors (e.g. emacs, vim) have a CSV editing mode so in many cases a separate tool is not needed

pitrou · 2020-07-13T17:47:06Z

The three options I would be comfortable with are those I proposed in my comment above. I think it would be clumsy to have to edit separate files using a spreadsheet editor to update a single page of the docs, though (especially as there are footnotes from the tables to the main text).

I would also suggest we discuss this later, since this PR probably deserves to be in 1.0 :-) Someone may want to submit a later draft PR converting this doc to a different table format.

wesm

+1, yes let's deal with improving the doc-writing UX as a follow up. Will go ahead and merge this

pitrou force-pushed the ARROW-8989-doc-compute-functions branch from cb40960 to b24c17d Compare July 10, 2020 08:50

nealrichardson reviewed Jul 10, 2020

bkietz reviewed Jul 10, 2020

wesm reviewed Jul 12, 2020

ARROW-8989: [C++][Doc] Document available compute functions

f0962d0

Also fix glaring bugs in arithmetic kernels (signed overflow detection was broken).

pitrou force-pushed the ARROW-8989-doc-compute-functions branch from b24c17d to f0962d0 Compare July 13, 2020 10:16

Address review comments, add fill_null

721daa6

wesm approved these changes Jul 13, 2020

wesm closed this in 9d2079c Jul 13, 2020

pitrou deleted the ARROW-8989-doc-compute-functions branch July 13, 2020 17:50

This was referenced Jul 13, 2020

[C++] Document available functions in compute::FunctionRegistry #25110

Closed

[C++/Doc] The IsIn kernel ignores the skip_nulls option of SetLookupOptions #26618

Closed
