-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-8989: [C++][Doc] Document available compute functions #7695
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@nealrichardson You should like this. |
cb40960 to
b24c17d
Compare
nealrichardson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do like this! A lot! And I learned about some new functions I'm eager to try out.
A bunch of questions and suggestions throughout, things that came to me as I was reading. Would love to see them happen here, but feel free to add in some more TODO comments for anything you can't get to and we can revisit later.
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes please, at least an example
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this automatically link to some generated docs that say what the possible options are? If not, can you add them here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will. That's the whole point of this markup, and the API docs I added :-)
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if you don't use a checked version and it does overflow? Do we recommend one or the other for general use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we made a recommendation for general use it'd probably be _checked so that users become aware of overflow sooner and can be intentional about handling it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd also recommend the checked versions. But I'm not sure we need to recommend anything. AFAIU, the compute layer isn't supposed to be a user-facing API like Pandas.
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the constraints on the pairs of inputs? I haven't tested all combinations, but it seems that generally you can do Array OPERATOR Scalar, Array OPERATOR Array iff the arrays are the same length, etc., and also for ChunkedArray. Some are also defined for RecordBatch and Table too right?
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What determines whether it's int32 or int64?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whether the type is List or LargeList.
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You get the answer by clicking on the CastOptions link not rendered :-) I'm not sure it's worth repeating here.
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do StrptimeOptions look like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can guess the answer now :-)
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OMG yes please, probably more than one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| .. TODO: add C++ cast example | |
| C++ cast example:: | |
| std::shared_ptr<arrow::StringArray> inputs = ...; | |
| ARROW_ASSIGN_OR_RAISE(arrow::Datum converted, arrow::compute::Cast(inputs, arrow::int32())); | |
| auto inputs32 = std::static_pointer_cast<arrow::Int32Array>(converted.make_array()); | |
| for (int64_t i = 0; i < inputs32.length(); ++i) { | |
| ReportInput(inputs32.Value(i)); | |
| } |
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't dictionary_encode just cast to Dictionary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps because it's special enough, though @wesm would know the answer better than me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dictionary_encode infers the dictionary's type from the argument and always uses int32 indices. Cast requires that you specify the destination type
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| (null if input is null). | |
| (null if input is null). The output type is Int32 for List, Int64 for LargeList |
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dictionary_encode infers the dictionary's type from the argument and always uses int32 indices. Cast requires that you specify the destination type
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| .. TODO: add C++ cast example | |
| C++ cast example:: | |
| std::shared_ptr<arrow::StringArray> inputs = ...; | |
| ARROW_ASSIGN_OR_RAISE(arrow::Datum converted, arrow::compute::Cast(inputs, arrow::int32())); | |
| auto inputs32 = std::static_pointer_cast<arrow::Int32Array>(converted.make_array()); | |
| for (int64_t i = 0; i < inputs32.length(); ++i) { | |
| ReportInput(inputs32.Value(i)); | |
| } |
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| inputs is null. | |
| inputs is null, similar to the way any operation involving ``NaN`` devolves to ``NaN``. |
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ``_kleene``) where null is taken to mean "undefined". For those variants | |
| ``_kleene``) where null is taken to mean "undefined". (This is the interpretation of null | |
| used in ``R`` and ``SQL``, for example.) For those variants |
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| operation to each pair of elements gathered from the inputs. Each function | |
| is also available in an overflow-checking variant, suffixed ``_checked``. | |
| operation to each pair of elements gathered from the inputs. Integer overflow is | |
| handled by wrapping; the sum of ``1`` and the maximum value will evaluate to the | |
| minimum value for any integer type. Each function is also available in an overflow- | |
| checking variant which raises an error if overflow would occur, suffixed ``_checked``. |
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we made a recommendation for general use it'd probably be _checked so that users become aware of overflow sooner and can be intentional about handling it
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| .. TODO: describe API and how to invoke compute functions | |
| The inputs and outputs of compute functions are of type :class:`arrow::Datum`, which are | |
| a discriminated union of several shapes of data, including :class:`arrow::Scalar`, | |
| :class:`arrow::Array`, and :class:`arrow::ChunkedArray`. | |
| Compute functions can be invoked by name using the :func:`arrow::compute::CallFunction` | |
| or via dedicated c++ convenience functions. | |
| .. code-block:: cpp | |
| std::shared_ptr<arrow::Int32Array> ints = ...; | |
| std::shared_ptr<arrow::Int32Scalar> increment = ...; | |
| ARROW_ASSIGN_OR_RAISE(arrow::Datum incremented, arrow::compute::CallFunction("add_checked", | |
| {arrow::Datum(ints), arrow::Datum(increment)})); | |
| arrow::ArithmeticOptions options; | |
| options.check_overflow = true; | |
| ARROW_ASSIGN_OR_RAISE(arrow::Datum equivalent, arrow::compute::Add( | |
| {arrow::Datum(ints), arrow::Datum(increment)}, arithmetic_options)); | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: it isn't necessary to use Datum(...) explicitly on most cases because of implicit conversions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is true, the explicit typing was for clarity of exposition
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great start, I mostly have nit picks about how the documentation for the functions is written (I personally loathe RST table format, and having duplicate entries for functions for different input types seems also quite tedious)
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: it isn't necessary to use Datum(...) explicitly on most cases because of implicit conversions
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've always found these reStructuredText tables to be immensely tedious. What would you say about putting the source of the documentation in e.g. a JSON file (which would be much easier to edit) and then generating the RST markup from the JSON? Then if we need to restructure the output in some way we won't have to tear our hair out manually editing these tables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't find JSON files easy to edit at all. I'd much rather keep the reST table format. It's not the most natural format to edit in a text editor, but it's still reasonable if your editor has a block selection mode.
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the rationale for having multiple lines for each function?
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here, having multiple lines per function seems really tedious
docs/source/cpp/compute.rst
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are metafunctions. Probably also want to list array_filter and array_take
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know. Does it help the user to know about the non-metafunctions? Personally, I wouldn't know what to do with them.
|
I think either CSV or List tables would be an improvement over the current https://docutils.sourceforge.io/docs/ref/rst/directives.html#tables |
|
I could give a try to list tables, but otherwise I think CSV or JSON would be a major PITA to edit later. |
|
Ok, list tables may be workable, but they don't make it easy to review docs simply by reading the source reST code. I'd rather keep the usual reST layout, unless you're using an editor that doesn't have a block selection mode at all. |
|
So there are three formats to choose from:
+--------------------------+------------+---------------------------------------------+---------------------+
| Function names | Arity | Input types | Output type |
+==========================+============+=============================================+=====================+
| equal, not_equal | Binary | Numeric, Temporal, Binary- and String-like | Boolean |
+--------------------------+------------+---------------------------------------------+---------------------+
| greater, greater_equal, | Binary | Numeric, Temporal, Binary- and String-like | Boolean |
| less, less_equal | | | |
+--------------------------+------------+---------------------------------------------+---------------------+
========================================= =========== ============================================== =================
Function names Arity Input types Output type
========================================= =========== ============================================== =================
equal, not_equal Binary Numeric, Temporal, Binary- and String-like Boolean
greater, greater_equal, less, less_equal Binary Numeric, Temporal, Binary- and String-like Boolean
========================================= =========== ============================================== =================
.. list-table::
:header-rows: 1
* - Function names
- Arity
- Input types
- Output type
* - equal, not_equal
- Binary
- Numeric, Temporal, Binary- and String-like
- Boolean
* - greater, greater_equal, less, less_equal
- Binary
- Numeric, Temporal, Binary- and String-like
- BooleanThe simplified table layout doesn't allow multi-line cells or cell fusion. It's also not much easier to edit than full table layout. The list table layout isn't easily reviewable in source format, you have to build the docs to see clearly what happens. And it will get quite unwieldy if you have 10 rows instead of 3 here. Personally, I would favour the full table layout. Mostly you need to get used to it. |
Also fix glaring bugs in arithmetic kernels (signed overflow detection was broken).
b24c17d to
f0962d0
Compare
Does that include tables which use the |
Do you mean you would like to edit a CSV table in a spreadsheet? |
Yeah I think that would be the idea |
|
I think that would be a rather terrible doc writing experience: each time you want to update those docs, you have to fire a different tool for some part of the page. Also, you would have a number of small CSV files to open, not a single one... |
|
I guess it's a matter of perspective. I don't feel at all comfortable editing the RST tables, whereas I would be fine editing a CSV file. Many text editors (e.g. emacs, vim) have a CSV editing mode so in many cases a separate tool is not needed |
|
The three options I would be comfortable with are those I proposed in my comment above. I think it would be clumsy to have to edit separate files using a spreadsheet editor to update a single page of the docs, though (especially as there are footnotes from the tables to the main text). I would also suggest we discuss this later, since this PR probably deserves to be in 1.0 :-) Someone may want to submit a later draft PR converting this doc to a different table format. |
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, yes let's deal with improving the doc-writing UX as a follow up. Will go ahead and merge this
Also fix glaring bugs in arithmetic kernels
(signed overflow detection was broken).