Skip to content

Conversation

@wesm
Copy link
Member

@wesm wesm commented Jun 1, 2020

A "metafunction" is one that dispatches to other functions based on the argument types. It does not contain any kernels.

Other stuff in this PR:

  • Set up to remove all but two versions of arrow::compute::Take. Other ones are still there but will be deprecated or removed after the GLib bindings are ported to use either CallFunction or the Take with Datum-Datum.
  • Make "take" and "filter" metafunctions that also deal with RecordBatch, Table arguments
  • Delete tons of now unnecessary binding code from Python and R. Hence the significant LOC reduction

There is one failing R test that I wasn't able to debug easily.

Connected JIRAs: ARROW-7009

@wesm
Copy link
Member Author

wesm commented Jun 1, 2020

@nealrichardson I will need your help on the one failing R test

@kou I removed 6 of the 8 Take APIs -- I think it would be better for GLib to use the Datum/Datum or CallFunction API if possible and perhaps net a code reduction like Python and R have achieved here. What do you think?

@github-actions
Copy link

github-actions bot commented Jun 1, 2020

@wesm
Copy link
Member Author

wesm commented Jun 1, 2020

I figured out the R issue, I'm trying to fix. It's this hack here

https://github.com/apache/arrow/blob/master/r/src/compute.cpp#L180

@wesm
Copy link
Member Author

wesm commented Jun 1, 2020

Fixed

@kou
Copy link
Member

kou commented Jun 1, 2020

I'll take a look this in a few days.

@wesm
Copy link
Member Author

wesm commented Jun 1, 2020

@kou thanks. I'll try to add deprecated APIs for the functions that were removed so that merging need not be blocked on the refactoring

@wesm wesm force-pushed the take-filter-metafunctions branch from 743fe38 to 78c3e3e Compare June 1, 2020 22:36
@wesm
Copy link
Member Author

wesm commented Jun 1, 2020

Done. So GLib can be refactored later.

@wesm wesm requested review from bkietz and jorisvandenbossche June 1, 2020 22:37
@kou
Copy link
Member

kou commented Jun 1, 2020

OK.
I'll work on the GLib part as a separated pull request and add ARROW_DEPRECATED in the pull request.

@wesm
Copy link
Member Author

wesm commented Jun 2, 2020

OK.
I'll work on the GLib part as a separated pull request and add ARROW_DEPRECATED in the pull request.

Perfect

@wesm
Copy link
Member Author

wesm commented Jun 2, 2020

The CI is passing except for this s390x issue

src/arrow/CMakeFiles/arrow_objlib.dir/vendored/uriparser/UriShorten.c.o  /usr/lib/s390x-linux-gnu/libcrypto.so  /usr/lib/s390x-linux-gnu/libssl.so  /usr/lib/s390x-linux-gnu/libbrotlienc.so  /usr/lib/s390x-linux-gnu/libbrotlidec.so  /usr/lib/s390x-linux-gnu/libbrotlicommon.so  orc_ep-install/lib/liborc.a  protobuf_ep-install/lib/libprotobuf.a  /usr/lib/s390x-linux-gnu/libglog.so  -ldl  jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a  -lrt  /usr/lib/s390x-linux-gnu/libcrypto.so  /usr/lib/s390x-linux-gnu/libssl.so  /usr/lib/s390x-linux-gnu/libbrotlienc.so  /usr/lib/s390x-linux-gnu/libbrotlidec.so  /usr/lib/s390x-linux-gnu/libbrotlicommon.so  /usr/lib/s390x-linux-gnu/libbz2.so  /usr/lib/s390x-linux-gnu/liblz4.so  /usr/lib/s390x-linux-gnu/libsnappy.so.1.1.8  /usr/lib/s390x-linux-gnu/libz.so  /usr/lib/s390x-linux-gnu/libzstd.so  orc_ep-install/lib/liborc.a  protobuf_ep-install/lib/libprotobuf.a  /usr/lib/s390x-linux-gnu/libglog.so  /usr/lib/s390x-linux-gnu/libcrypto.so  -pthread && :
/usr/bin/ld: final link failed: No space left on device
collect2: error: ld returned 1 exit status

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!
(some nice code removal in the python bindings indeed ;))

Added some small comments.

One more conceptual aspect I am wondering: compute functions on tables that basically just apply the function on each column (like take, but also if eg support for Table would be added to the sum registered function), wouldn't that start to be part of the (higher level) execution plan?

@wesm
Copy link
Member Author

wesm commented Jun 2, 2020

One more conceptual aspect I am wondering: compute functions on tables that basically just apply the function on each column (like take, but also if eg support for Table would be added to the sum registered function), wouldn't that start to be part of the (higher level) execution plan?

Perhaps, but it's not mutually exclusive. Some applications will just want to do CallFunction(name, args) while others might build an expression pipeline with multiple functions and then request that it be evaluated (once we have this capability, which will be relatively soon)

@wesm wesm force-pushed the take-filter-metafunctions branch from 2f518b4 to a7618ea Compare June 2, 2020 15:19
@wesm
Copy link
Member Author

wesm commented Jun 2, 2020

@kiszk the s390x machine on Travis appears to be in bad shape.

It would be good to get this merged sometime today, let me know if there are other comments on the C++ or R side

@kiszk
Copy link
Member

kiszk commented Jun 2, 2020

Oh, I saw https://travis-ci.org/github/apache/arrow/builds one hour ago. And, I got good feeling...

According to my experience, this type of disk full error may disappear when we will run the test after 30 min. or 1 hour.

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a clean refactor, thanks for doing this. Would you comment here or on the JIRA about other metafunctions (aside from take and filter)?

@wesm
Copy link
Member Author

wesm commented Jun 2, 2020

Would you comment here or on the JIRA about other metafunctions (aside from take and filter)?

These are the only ones I can think of aside from cast at the moment

wesm and others added 3 commits June 2, 2020 17:52
…Port R,

Python bindings modulo one failing R test

Fix Table__to_dataframe for ChunkedArray with zero chunks

Restore deprecated APIs so that GLib will compile and pass tests again

CI fixes

Try to fix R 4.0 toolchain issue again
@wesm wesm force-pushed the take-filter-metafunctions branch from a7618ea to 37753d1 Compare June 2, 2020 22:52
@wesm
Copy link
Member Author

wesm commented Jun 2, 2020

Will merge this once the build passes

@wesm
Copy link
Member Author

wesm commented Jun 3, 2020

@kou this test failure https://github.com/apache/arrow/pull/7318/checks?check_run_id=732734225 seems unrelated to this patch, thoughts? It seems like the test case may be malformed

https://github.com/apache/arrow/blob/master/c_glib/test/test-struct-array.rb#L41

@wesm
Copy link
Member Author

wesm commented Jun 3, 2020

Merging as it doesn't seem like the test failure could be related to this patch (the error showed up after rebasing)

@wesm wesm closed this in 199d089 Jun 3, 2020
@wesm wesm deleted the take-filter-metafunctions branch June 3, 2020 00:28
@kou
Copy link
Member

kou commented Jun 3, 2020

We can ignore the failure.
It's a test code problem. It's a GC related problem. We need to keep reference to the raw buffer data.
I'll fix it later.

nealrichardson added a commit to nealrichardson/arrow that referenced this pull request Jun 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants