-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-5893: [C++][Python][GLib][Ruby][MATLAB][R] Remove arrow::Column class #4841
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a nice simplification.
In practice, it doesn't seem that there are many places where you need to keep track of array / field combinations (what a Column class provided).
cpp/src/arrow/ipc/feather.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's maybe not relevant here (I don't know if feather supports non-nullable fields), but wondering in general: wherever you replace a field with name/type combo, should you also check nullability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I see that the feather.fbs has no notion of nullabilty, so that answers my comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche @wesm definitely out of scope for this PR, but: should user_metadata get into the read table's fields?
| fields.push_back(::arrow::field(GetColumnName(i), column->type())); | |
| fields.push_back(::arrow::field(GetColumnName(i), column->type(), true, {{"user_metadata"}, {user_metadata}})); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe so. Can you open a JIRA about this? The Feather format is effectively deprecated so not sure how much more energy we want to invest in it (versus having Feather be a wrapper for the primary IPC protocol)
python/pyarrow/table.pxi
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to keep this on the Array/ChunkedArray as well? (similarly to how you moved Column.cast to ChunkedArray.cast)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will do
|
@wesm - This looks like a nice API change. Since the MATLAB Either way, @rdmello and I are happy to help with making those changes. |
|
@kevingurney feel free to post a link to a branch with any changes and I can cherry-pick them into this PR |
|
Sounds good - thanks! |
|
@kou @shiro615 I have just performed the changes to the GLib C++ code, but it would be much quicker for one of you to fix the unit tests. I wasn't sure how to get the The Ruby library will have to have some minor modifications but it should not be too complicated I apologies for the disruption -- if you could say whether you support this change either here or in the mailing list thread that would be helpful. |
bkietz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a nice simplification
cpp/src/arrow/table.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems unnecessary to make this virtual; won't it always be schema_->field(i)? Maybe this method can be elided altogether
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed
cpp/src/arrow/table.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this method take an index? assert(i == schema_->num_fields())?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can add a column at any position in the table
cpp/src/parquet/arrow/reader.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| fields[idx] = ::arrow::field(old_schema.field(idx)->name(), columns[idx]->type()); | |
| fields[idx] = old_schema.field(idx)->WithType(columns[idx]->type()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
python/pyarrow/feather.pxi
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everywhere else you've replaced Column by ChunkedArray, should that apply here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
python/pyarrow/table.pxi
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Must pass this or names | |
| Schema for the created table. If not passed, names must be passed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
@wesm - Changes for removing the use of I also updated @rdmello - Can you take a quick look through these changes? |
|
@wesm No problem. I take over GLib related part. |
|
@kevingurney @wesm thanks for doing this. I have looked through the MATLAB changes and didn't see any issues. |
|
OK, I'll cherry-pick the changes here then |
|
I finished GLib part. I'll comment whether this change is reasonable in GLib and Ruby parts when I finish Ruby part work. |
|
Thanks @kou! And sorry for the disruption. I'll try to fix up the R package and get a passing build early in the week |
|
I've finished Ruby part. And I've also fixed remained problems in C++ part. I realized that column object is useful to build user friendly API in Ruby. So I implement column class in Ruby. Column object that has metadata (field) and data (chunked array) is useful to represent column information. We can represent column as an array table object and column index but creating a column object is natural in Ruby. Because Ruby is pure object oriented language and uses dynamic typing. But column object is needless in C++ and GLib. So I vote +1 to this change. @shiro615 What do you think about this? Here are changes what I did: GLib:
Ruby:
|
|
Thanks @kou! I agree that from binding to binding the Column concept may or may not make sense. Once we build an initial version of the Data Frame API I think that we will want users to use that instead. That will take some months still |
|
@kou This changes looks good to me. Thanks. |
haven't fixed unit tests yet [skip ci]
…recated StatusCode enum class values in handle_status.cc [skip ci]
They are confused.
c15fa9b to
020f52a
Compare
|
OK. I fixed up the R build -- my first time ever building the R package! To future-proof the Feather API, I changed |
|
+1. I'll wait for CI to run. Thanks everyone for your help on this, good team work =) I will use this PR as a change to test ARROW-5716 (attributing Co-authors in the GitHub UI when merging PRs) |
nealrichardson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One note but in general the R changes look sensible to me. @romainfrancois should take a look, if only to be aware of the change.
| num_columns = function() ipc___feather___TableReader__num_columns(self), | ||
| GetColumnName = function(i) ipc___feather___TableReader__GetColumnName(self, i), | ||
| GetColumn = function(i) shared_ptr(`arrow::Column`, ipc___feather___TableReader__GetColumn(self, i)), | ||
| GetColumn = function(i) shared_ptr(`arrow::Array`, ipc___feather___TableReader__GetColumn(self, i)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just an observation that every other change mapped Column to ChunkedArray but this one is just Array. Should this one also be ChunkedArray?
|
@wesm - I'll take a look at the |
|
Hmm. Python 2.7 is failing. I'll see if I can reproduce locally in a little while |
|
@wesm - I just took a closer look at your changes and they look good to me. I ran all the MATLAB |
…st for pyarrow.table. Fix usages of ChunkedArray.data and add unit test
|
Here's builds in progress on my fork Travis CI: https://travis-ci.org/wesm/arrow/builds/559726306 |
|
Forgot about the docs |
|
Hm a build ran on my Travis CI but it only compares one commit with the last. I just pushed a squashed build here to make sure all the builds run: Travis CI: https://travis-ci.org/wesm/arrow/builds/559761819 @nealrichardson it looks like the R Appveyor build is not taking into account changes in the C++ library, see that https://ci.appveyor.com/project/wesm/arrow/builds/26030853/job/7vn8q3l8e24t83jh?fullLog=true |
|
OK, the builds look good. I'm going to try to merge this with ARROW-5716 |
… class This completely removes `arrow::Column` from the C++ library and adapts the Python bindings to follow, to help assist with mailing list discussion. There are other places that this would touch: - [x] GLib - [x] Ruby - [x] MATLAB - [x] R You can see the evident reduction of boilerplate and simplification in many places -- it is particularly pronounced in parquet/arrow/arrow-reader-writer-test.cc. This refactor also exposed a bug where a Column's data type did not match its Field type. Closes #4841 from wesm/remove-column-class and squashes the following commits: b389d42 <Wes McKinney> Fix up Python/C++ docs to remove references to arrow::Column cc34548 <Wes McKinney> Fix non-deterministic Table.from_pydict behavior. Add unicode/utf8 test for pyarrow.table. Fix usages of ChunkedArray.data and add unit test 4be8d5a <Wes McKinney> UTF-8 py2/3 compatibility issues 9b5a61e <Wes McKinney> Try to fix unicode issue a5ee7b6 <Wes McKinney> Fix MATLAB code (I hope) 2c7c51e <Wes McKinney> code review comments 020f52a <Wes McKinney> Re-render R README fd4473e <Wes McKinney> Fix up R C++ code and unit tests 244d2a7 <Wes McKinney> Begin refactoring R library. Change Feather to return ChunkedArray from GetColumn d30a365 <Sutou Kouhei> Remove unused variable d89366a <Sutou Kouhei> Follow API change 983d378 <Sutou Kouhei> Follow API change 3099328 <Sutou Kouhei> Follow API change f092f1b <Sutou Kouhei> Add missing available annotations 41dd1fe <Sutou Kouhei> Revert needless style change f120a7d <Sutou Kouhei> Suppress a warning with MSVC 62f3649 <Sutou Kouhei> Remove unused lambda captures 0090051 <Sutou Kouhei> Add API index for 1.0.0 bd6002a <Sutou Kouhei> Remove entries 335d7b6 <Sutou Kouhei> Implement Arrow::Column in Ruby fb5ad1c <Sutou Kouhei> Add garrow_chunked_array_get_n_rows() for consistency 60dfb0c <Sutou Kouhei> Add garrow_schema_get_field_index() c4ff972 <Sutou Kouhei> Remove backward compatible column API from GArrowRecordBatch 179aa67 <Sutou Kouhei> Use "column_data" instead of "column" 795b6c1 <Sutou Kouhei> Follow arrow::Column remove ac6f372 <Kevin Gurney> Remove use of arrow::Column from feather_reader.cc. Remove use of deprecated StatusCode enum class values in handle_status.cc fcaaf8c <Wes McKinney> Add pyarrow.ChunkedArray.flatten method. Remove Column from glib, but haven't fixed unit tests yet f0d48cc <Wes McKinney> Adapt Python bindings c161a9a <Wes McKinney> Fix up Parquet, too 93e3cad <Wes McKinney> arrow-tests all passing again a7344a8 <Wes McKinney> Stage 1 of cutting away Column Lead-authored-by: Wes McKinney <wesm+git@apache.org> Co-authored-by: Sutou Kouhei <kou@clear-code.com> Co-authored-by: Kevin Gurney <kevin.p.gurney@gmail.com> Signed-off-by: Wes McKinney <wesm+git@apache.org>
… class This completely removes `arrow::Column` from the C++ library and adapts the Python bindings to follow, to help assist with mailing list discussion. There are other places that this would touch: - [x] GLib - [x] Ruby - [x] MATLAB - [x] R You can see the evident reduction of boilerplate and simplification in many places -- it is particularly pronounced in parquet/arrow/arrow-reader-writer-test.cc. This refactor also exposed a bug where a Column's data type did not match its Field type. Closes #4841 from wesm/remove-column-class and squashes the following commits: b389d42 <Wes McKinney> Fix up Python/C++ docs to remove references to arrow::Column cc34548 <Wes McKinney> Fix non-deterministic Table.from_pydict behavior. Add unicode/utf8 test for pyarrow.table. Fix usages of ChunkedArray.data and add unit test 4be8d5a <Wes McKinney> UTF-8 py2/3 compatibility issues 9b5a61e <Wes McKinney> Try to fix unicode issue a5ee7b6 <Wes McKinney> Fix MATLAB code (I hope) 2c7c51e <Wes McKinney> code review comments 020f52a <Wes McKinney> Re-render R README fd4473e <Wes McKinney> Fix up R C++ code and unit tests 244d2a7 <Wes McKinney> Begin refactoring R library. Change Feather to return ChunkedArray from GetColumn d30a365 <Sutou Kouhei> Remove unused variable d89366a <Sutou Kouhei> Follow API change 983d378 <Sutou Kouhei> Follow API change 3099328 <Sutou Kouhei> Follow API change f092f1b <Sutou Kouhei> Add missing available annotations 41dd1fe <Sutou Kouhei> Revert needless style change f120a7d <Sutou Kouhei> Suppress a warning with MSVC 62f3649 <Sutou Kouhei> Remove unused lambda captures 0090051 <Sutou Kouhei> Add API index for 1.0.0 bd6002a <Sutou Kouhei> Remove entries 335d7b6 <Sutou Kouhei> Implement Arrow::Column in Ruby fb5ad1c <Sutou Kouhei> Add garrow_chunked_array_get_n_rows() for consistency 60dfb0c <Sutou Kouhei> Add garrow_schema_get_field_index() c4ff972 <Sutou Kouhei> Remove backward compatible column API from GArrowRecordBatch 179aa67 <Sutou Kouhei> Use "column_data" instead of "column" 795b6c1 <Sutou Kouhei> Follow arrow::Column remove ac6f372 <Kevin Gurney> Remove use of arrow::Column from feather_reader.cc. Remove use of deprecated StatusCode enum class values in handle_status.cc fcaaf8c <Wes McKinney> Add pyarrow.ChunkedArray.flatten method. Remove Column from glib, but haven't fixed unit tests yet f0d48cc <Wes McKinney> Adapt Python bindings c161a9a <Wes McKinney> Fix up Parquet, too 93e3cad <Wes McKinney> arrow-tests all passing again a7344a8 <Wes McKinney> Stage 1 of cutting away Column Lead-authored-by: Wes McKinney <wesm+git@apache.org> Co-authored-by: Sutou Kouhei <kou@clear-code.com> Co-authored-by: Kevin Gurney <kevin.p.gurney@gmail.com> Signed-off-by: Wes McKinney <wesm+git@apache.org>
This completely removes
arrow::Columnfrom the C++ library and adapts the Python bindings to follow, to help assist with mailing list discussion.There are other places that this would touch:
You can see the evident reduction of boilerplate and simplification in many places -- it is particularly pronounced in parquet/arrow/arrow-reader-writer-test.cc. This refactor also exposed a bug where a Column's data type did not match its Field type.