Skip to content

Conversation

@pitrou
Copy link
Member

@pitrou pitrou commented Jul 9, 2020

Also fix glaring bugs in arithmetic kernels
(signed overflow detection was broken).

@pitrou
Copy link
Member Author

pitrou commented Jul 9, 2020

@nealrichardson You should like this.

@github-actions
Copy link

github-actions bot commented Jul 9, 2020

@pitrou pitrou force-pushed the ARROW-8989-doc-compute-functions branch from cb40960 to b24c17d Compare July 10, 2020 08:50
Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do like this! A lot! And I learned about some new functions I'm eager to try out.

A bunch of questions and suggestions throughout, things that came to me as I was reading. Would love to see them happen here, but feel free to add in some more TODO comments for anything you can't get to and we can revisit later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes please, at least an example

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this automatically link to some generated docs that say what the possible options are? If not, can you add them here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will. That's the whole point of this markup, and the API docs I added :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if you don't use a checked version and it does overflow? Do we recommend one or the other for general use?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we made a recommendation for general use it'd probably be _checked so that users become aware of overflow sooner and can be intentional about handling it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also recommend the checked versions. But I'm not sure we need to recommend anything. AFAIU, the compute layer isn't supposed to be a user-facing API like Pandas.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the constraints on the pairs of inputs? I haven't tested all combinations, but it seems that generally you can do Array OPERATOR Scalar, Array OPERATOR Array iff the arrays are the same length, etc., and also for ChunkedArray. Some are also defined for RecordBatch and Table too right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What determines whether it's int32 or int64?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether the type is List or LargeList.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You get the answer by clicking on the CastOptions link not rendered :-) I'm not sure it's worth repeating here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do StrptimeOptions look like?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can guess the answer now :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OMG yes please, probably more than one

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.. TODO: add C++ cast example
C++ cast example::
std::shared_ptr<arrow::StringArray> inputs = ...;
ARROW_ASSIGN_OR_RAISE(arrow::Datum converted, arrow::compute::Cast(inputs, arrow::int32()));
auto inputs32 = std::static_pointer_cast<arrow::Int32Array>(converted.make_array());
for (int64_t i = 0; i < inputs32.length(); ++i) {
ReportInput(inputs32.Value(i));
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't dictionary_encode just cast to Dictionary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps because it's special enough, though @wesm would know the answer better than me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dictionary_encode infers the dictionary's type from the argument and always uses int32 indices. Cast requires that you specify the destination type

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(null if input is null).
(null if input is null). The output type is Int32 for List, Int64 for LargeList

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dictionary_encode infers the dictionary's type from the argument and always uses int32 indices. Cast requires that you specify the destination type

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.. TODO: add C++ cast example
C++ cast example::
std::shared_ptr<arrow::StringArray> inputs = ...;
ARROW_ASSIGN_OR_RAISE(arrow::Datum converted, arrow::compute::Cast(inputs, arrow::int32()));
auto inputs32 = std::static_pointer_cast<arrow::Int32Array>(converted.make_array());
for (int64_t i = 0; i < inputs32.length(); ++i) {
ReportInput(inputs32.Value(i));
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
inputs is null.
inputs is null, similar to the way any operation involving ``NaN`` devolves to ``NaN``.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
``_kleene``) where null is taken to mean "undefined". For those variants
``_kleene``) where null is taken to mean "undefined". (This is the interpretation of null
used in ``R`` and ``SQL``, for example.) For those variants

Comment on lines 60 to 61
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
operation to each pair of elements gathered from the inputs. Each function
is also available in an overflow-checking variant, suffixed ``_checked``.
operation to each pair of elements gathered from the inputs. Integer overflow is
handled by wrapping; the sum of ``1`` and the maximum value will evaluate to the
minimum value for any integer type. Each function is also available in an overflow-
checking variant which raises an error if overflow would occur, suffixed ``_checked``.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we made a recommendation for general use it'd probably be _checked so that users become aware of overflow sooner and can be intentional about handling it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.. TODO: describe API and how to invoke compute functions
The inputs and outputs of compute functions are of type :class:`arrow::Datum`, which are
a discriminated union of several shapes of data, including :class:`arrow::Scalar`,
:class:`arrow::Array`, and :class:`arrow::ChunkedArray`.
Compute functions can be invoked by name using the :func:`arrow::compute::CallFunction`
or via dedicated c++ convenience functions.
.. code-block:: cpp
std::shared_ptr<arrow::Int32Array> ints = ...;
std::shared_ptr<arrow::Int32Scalar> increment = ...;
ARROW_ASSIGN_OR_RAISE(arrow::Datum incremented, arrow::compute::CallFunction("add_checked",
{arrow::Datum(ints), arrow::Datum(increment)}));
arrow::ArithmeticOptions options;
options.check_overflow = true;
ARROW_ASSIGN_OR_RAISE(arrow::Datum equivalent, arrow::compute::Add(
{arrow::Datum(ints), arrow::Datum(increment)}, arithmetic_options));

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: it isn't necessary to use Datum(...) explicitly on most cases because of implicit conversions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is true, the explicit typing was for clarity of exposition

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start, I mostly have nit picks about how the documentation for the functions is written (I personally loathe RST table format, and having duplicate entries for functions for different input types seems also quite tedious)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: it isn't necessary to use Datum(...) explicitly on most cases because of implicit conversions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've always found these reStructuredText tables to be immensely tedious. What would you say about putting the source of the documentation in e.g. a JSON file (which would be much easier to edit) and then generating the RST markup from the JSON? Then if we need to restructure the output in some way we won't have to tear our hair out manually editing these tables

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't find JSON files easy to edit at all. I'd much rather keep the reST table format. It's not the most natural format to edit in a text editor, but it's still reasonable if your editor has a block selection mode.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale for having multiple lines for each function?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here, having multiple lines per function seems really tedious

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are metafunctions. Probably also want to list array_filter and array_take

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. Does it help the user to know about the non-metafunctions? Personally, I wouldn't know what to do with them.

@wesm
Copy link
Member

wesm commented Jul 12, 2020

I think either CSV or List tables would be an improvement over the current

https://docutils.sourceforge.io/docs/ref/rst/directives.html#tables

@pitrou
Copy link
Member Author

pitrou commented Jul 13, 2020

I could give a try to list tables, but otherwise I think CSV or JSON would be a major PITA to edit later.

@pitrou
Copy link
Member Author

pitrou commented Jul 13, 2020

Ok, list tables may be workable, but they don't make it easy to review docs simply by reading the source reST code. I'd rather keep the usual reST layout, unless you're using an editor that doesn't have a block selection mode at all.

@pitrou
Copy link
Member Author

pitrou commented Jul 13, 2020

So there are three formats to choose from:

  • "full" reST table layout:
+--------------------------+------------+---------------------------------------------+---------------------+
| Function names           | Arity      | Input types                                 | Output type         |
+==========================+============+=============================================+=====================+
| equal, not_equal         | Binary     | Numeric, Temporal, Binary- and String-like  | Boolean             |
+--------------------------+------------+---------------------------------------------+---------------------+
| greater, greater_equal,  | Binary     | Numeric, Temporal, Binary- and String-like  | Boolean             |
| less, less_equal         |            |                                             |                     |
+--------------------------+------------+---------------------------------------------+---------------------+
  • "simplified" reST table layout:
========================================= =========== ============================================== =================
Function names                            Arity       Input types                                    Output type
========================================= =========== ============================================== =================
equal, not_equal                          Binary      Numeric, Temporal, Binary- and String-like     Boolean
greater, greater_equal, less, less_equal  Binary      Numeric, Temporal, Binary- and String-like     Boolean
========================================= =========== ============================================== =================
  • list table layout:
.. list-table::
   :header-rows: 1

   * - Function names
     - Arity
     - Input types
     - Output type
   * - equal, not_equal
     - Binary
     - Numeric, Temporal, Binary- and String-like
     - Boolean
   * - greater, greater_equal, less, less_equal
     - Binary
     - Numeric, Temporal, Binary- and String-like
     - Boolean

The simplified table layout doesn't allow multi-line cells or cell fusion. It's also not much easier to edit than full table layout.

The list table layout isn't easily reviewable in source format, you have to build the docs to see clearly what happens. And it will get quite unwieldy if you have 10 rows instead of 3 here.

Personally, I would favour the full table layout. Mostly you need to get used to it.

Also fix glaring bugs in arithmetic kernels
(signed overflow detection was broken).
@pitrou pitrou force-pushed the ARROW-8989-doc-compute-functions branch from b24c17d to f0962d0 Compare July 13, 2020 10:16
@bkietz
Copy link
Member

bkietz commented Jul 13, 2020

I could give a try to list tables, but otherwise I think CSV or JSON would be a major PITA to edit later.

Does that include tables which use the file: directive to refer to an out-of-line table source?

@pitrou
Copy link
Member Author

pitrou commented Jul 13, 2020

Does that include tables which use the file: directive to refer to an out-of-line table source?

Do you mean you would like to edit a CSV table in a spreadsheet?

@wesm
Copy link
Member

wesm commented Jul 13, 2020

Do you mean you would like to edit a CSV table in a spreadsheet?

Yeah I think that would be the idea

@pitrou
Copy link
Member Author

pitrou commented Jul 13, 2020

I think that would be a rather terrible doc writing experience: each time you want to update those docs, you have to fire a different tool for some part of the page. Also, you would have a number of small CSV files to open, not a single one...

@wesm
Copy link
Member

wesm commented Jul 13, 2020

I guess it's a matter of perspective. I don't feel at all comfortable editing the RST tables, whereas I would be fine editing a CSV file. Many text editors (e.g. emacs, vim) have a CSV editing mode so in many cases a separate tool is not needed

@pitrou
Copy link
Member Author

pitrou commented Jul 13, 2020

The three options I would be comfortable with are those I proposed in my comment above. I think it would be clumsy to have to edit separate files using a spreadsheet editor to update a single page of the docs, though (especially as there are footnotes from the tables to the main text).

I would also suggest we discuss this later, since this PR probably deserves to be in 1.0 :-) Someone may want to submit a later draft PR converting this doc to a different table format.

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, yes let's deal with improving the doc-writing UX as a follow up. Will go ahead and merge this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants