Skip to content

Conversation

@wesm
Copy link
Member

@wesm wesm commented Jul 10, 2019

This completely removes arrow::Column from the C++ library and adapts the Python bindings to follow, to help assist with mailing list discussion.

There are other places that this would touch:

  • GLib
  • Ruby
  • MATLAB
  • R

You can see the evident reduction of boilerplate and simplification in many places -- it is particularly pronounced in parquet/arrow/arrow-reader-writer-test.cc. This refactor also exposed a bug where a Column's data type did not match its Field type.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a nice simplification.
In practice, it doesn't seem that there are many places where you need to keep track of array / field combinations (what a Column class provided).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's maybe not relevant here (I don't know if feather supports non-nullable fields), but wondering in general: wherever you replace a field with name/type combo, should you also check nullability?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I see that the feather.fbs has no notion of nullabilty, so that answers my comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche @wesm definitely out of scope for this PR, but: should user_metadata get into the read table's fields?

Suggested change
fields.push_back(::arrow::field(GetColumnName(i), column->type()));
fields.push_back(::arrow::field(GetColumnName(i), column->type(), true, {{"user_metadata"}, {user_metadata}}));

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe so. Can you open a JIRA about this? The Feather format is effectively deprecated so not sure how much more energy we want to invest in it (versus having Feather be a wrapper for the primary IPC protocol)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to keep this on the Array/ChunkedArray as well? (similarly to how you moved Column.cast to ChunkedArray.cast)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will do

@kevingurney
Copy link
Member

@wesm - This looks like a nice API change.

Since the MATLAB featherread/featherwrite interfaces currently use arrow::Column under the hood (as you pointed out), should we try to include the MATLAB code changes as part of this pull request, or as a separate pull request?

Either way, @rdmello and I are happy to help with making those changes.

@wesm
Copy link
Member Author

wesm commented Jul 10, 2019

@kevingurney feel free to post a link to a branch with any changes and I can cherry-pick them into this PR

@kevingurney
Copy link
Member

Sounds good - thanks!

@wesm
Copy link
Member Author

wesm commented Jul 10, 2019

@kou @shiro615 I have just performed the changes to the GLib C++ code, but it would be much quicker for one of you to fix the unit tests. I wasn't sure how to get the GARROW_AVAILABLE_IN_... macros to work

The Ruby library will have to have some minor modifications but it should not be too complicated

I apologies for the disruption -- if you could say whether you support this change either here or in the mailing list thread that would be helpful.

Copy link
Member

@bkietz bkietz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a nice simplification

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems unnecessary to make this virtual; won't it always be schema_->field(i)? Maybe this method can be elided altogether

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this method take an index? assert(i == schema_->num_fields())?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can add a column at any position in the table

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fields[idx] = ::arrow::field(old_schema.field(idx)->name(), columns[idx]->type());
fields[idx] = old_schema.field(idx)->WithType(columns[idx]->type());

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everywhere else you've replaced Column by ChunkedArray, should that apply here as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Must pass this or names
Schema for the created table. If not passed, names must be passed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@kevingurney
Copy link
Member

@wesm - Changes for removing the use of arrow::Column under the hood in the MATLAB featherread/featherwrite interfaces can be found in the branch linked below:

I also updated handle_status.cc because of a build failure I encountered caused by recent changes to arrow::StatusCode.

@rdmello - Can you take a quick look through these changes?

@kou
Copy link
Member

kou commented Jul 11, 2019

@wesm No problem. I take over GLib related part.

@rdmello
Copy link
Contributor

rdmello commented Jul 11, 2019

@kevingurney @wesm thanks for doing this. I have looked through the MATLAB changes and didn't see any issues.

@wesm
Copy link
Member Author

wesm commented Jul 11, 2019

OK, I'll cherry-pick the changes here then

@kou
Copy link
Member

kou commented Jul 11, 2019

I finished GLib part.
I'll work on Ruby part later.

I'll comment whether this change is reasonable in GLib and Ruby parts when I finish Ruby part work.

@wesm
Copy link
Member Author

wesm commented Jul 14, 2019

Thanks @kou! And sorry for the disruption. I'll try to fix up the R package and get a passing build early in the week

@kou
Copy link
Member

kou commented Jul 15, 2019

I've finished Ruby part. And I've also fixed remained problems in C++ part.

I realized that column object is useful to build user friendly API in Ruby. So I implement column class in Ruby. Column object that has metadata (field) and data (chunked array) is useful to represent column information. We can represent column as an array table object and column index but creating a column object is natural in Ruby. Because Ruby is pure object oriented language and uses dynamic typing.

But column object is needless in C++ and GLib. So I vote +1 to this change.

@shiro615 What do you think about this? Here are changes what I did:

GLib:

  • Changed *_get_column() to *_get_column_data(). Because "data" isn't enough information for column. Column needs metadata such as column name.
    • Removing *_get_column() is backward incompatible change. But we can do this for 1.0.0.
  • Removed *_get_columns(). Because users will be confused when *_get_columns() return column data instead of column.
  • Added some missing bindings such as garrow_schema_get_field_index() and garrow_chunked_array_get_n_rows() to implement column object in Ruby.

Ruby:

  • Implemented Arrow::Column in Ruby. It has similar API before but some APIs such as #initialize are backward incompatible.
  • Refactored some features to adjust Arrow::Column related changes.

@wesm wesm changed the title WIP ARROW-5893: [C++][Python] RFC: Remove arrow::Column class ARROW-5893: [C++][Python] RFC: Remove arrow::Column class Jul 15, 2019
@wesm wesm changed the title ARROW-5893: [C++][Python] RFC: Remove arrow::Column class ARROW-5893: [C++][Python][GLib][Ruby][MATLAB][R] Remove arrow::Column class Jul 15, 2019
@wesm
Copy link
Member Author

wesm commented Jul 15, 2019

Thanks @kou! I agree that from binding to binding the Column concept may or may not make sense.

Once we build an initial version of the Data Frame API I think that we will want users to use that instead. That will take some months still

@shiro615
Copy link
Contributor

@kou This changes looks good to me. Thanks.

@wesm wesm force-pushed the remove-column-class branch from c15fa9b to 020f52a Compare July 16, 2019 20:29
@wesm
Copy link
Member Author

wesm commented Jul 16, 2019

OK. I fixed up the R build -- my first time ever building the R package!

To future-proof the Feather API, I changed FeatherReader::GetColumn to return ChunkedArray instead of Array. I wasn't able to test the MATLAB change so please send a follow up PR if I broke it (sorry)

@wesm
Copy link
Member Author

wesm commented Jul 16, 2019

+1. I'll wait for CI to run. Thanks everyone for your help on this, good team work =) I will use this PR as a change to test ARROW-5716 (attributing Co-authors in the GitHub UI when merging PRs)

Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One note but in general the R changes look sensible to me. @romainfrancois should take a look, if only to be aware of the change.

num_columns = function() ipc___feather___TableReader__num_columns(self),
GetColumnName = function(i) ipc___feather___TableReader__GetColumnName(self, i),
GetColumn = function(i) shared_ptr(`arrow::Column`, ipc___feather___TableReader__GetColumn(self, i)),
GetColumn = function(i) shared_ptr(`arrow::Array`, ipc___feather___TableReader__GetColumn(self, i)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an observation that every other change mapped Column to ChunkedArray but this one is just Array. Should this one also be ChunkedArray?

@kevingurney
Copy link
Member

@wesm - I'll take a look at the ChunkedArray changes for the MATLAB code. I can include any changes in a separate PR like you suggested to avoid holding up merging in this PR.

@wesm
Copy link
Member Author

wesm commented Jul 16, 2019

Hmm. Python 2.7 is failing. I'll see if I can reproduce locally in a little while

@kevingurney
Copy link
Member

@wesm - I just took a closer look at your changes and they look good to me. I ran all the MATLAB featherread/featherwrite tests and they pass. Whenever Feather files with multiple chunks become a reality, then we will have to make some more changes to properly support them. Thanks!

@wesm
Copy link
Member Author

wesm commented Jul 17, 2019

Here's builds in progress on my fork

Travis CI: https://travis-ci.org/wesm/arrow/builds/559726306
Appveyor: https://ci.appveyor.com/project/wesm/arrow/builds/26030853

@wesm
Copy link
Member Author

wesm commented Jul 17, 2019

Forgot about the docs

@wesm
Copy link
Member Author

wesm commented Jul 17, 2019

Hm a build ran on my Travis CI but it only compares one commit with the last.

I just pushed a squashed build here to make sure all the builds run:

Travis CI: https://travis-ci.org/wesm/arrow/builds/559761819
Appveyor: https://ci.appveyor.com/project/wesm/arrow/builds/26032806

@nealrichardson it looks like the R Appveyor build is not taking into account changes in the C++ library, see that arrow::Column is in the logs here. I think it's building against master?

https://ci.appveyor.com/project/wesm/arrow/builds/26030853/job/7vn8q3l8e24t83jh?fullLog=true

opened https://issues.apache.org/jira/browse/ARROW-5963

@wesm
Copy link
Member Author

wesm commented Jul 17, 2019

OK, the builds look good. I'm going to try to merge this with ARROW-5716

@wesm wesm closed this Jul 17, 2019
wesm added a commit that referenced this pull request Jul 17, 2019
… class

This completely removes `arrow::Column` from the C++ library and adapts the Python bindings to follow, to help assist with mailing list discussion.

There are other places that this would touch:

- [x] GLib
- [x] Ruby
- [x] MATLAB
- [x] R

You can see the evident reduction of boilerplate and simplification in many places -- it is particularly pronounced in parquet/arrow/arrow-reader-writer-test.cc. This refactor also exposed a bug where a Column's data type did not match its Field type.

Closes #4841 from wesm/remove-column-class and squashes the following commits:

b389d42 <Wes McKinney> Fix up Python/C++ docs to remove references to arrow::Column
cc34548 <Wes McKinney> Fix non-deterministic Table.from_pydict behavior. Add unicode/utf8 test for pyarrow.table. Fix usages of ChunkedArray.data and add unit test
4be8d5a <Wes McKinney> UTF-8 py2/3 compatibility issues
9b5a61e <Wes McKinney> Try to fix unicode issue
a5ee7b6 <Wes McKinney> Fix MATLAB code (I hope)
2c7c51e <Wes McKinney> code review comments
020f52a <Wes McKinney> Re-render R README
fd4473e <Wes McKinney> Fix up R C++ code and unit tests
244d2a7 <Wes McKinney> Begin refactoring R library. Change Feather to return ChunkedArray from GetColumn
d30a365 <Sutou Kouhei>  Remove unused variable
d89366a <Sutou Kouhei>  Follow API change
983d378 <Sutou Kouhei>  Follow API change
3099328 <Sutou Kouhei>  Follow API change
f092f1b <Sutou Kouhei>  Add missing available annotations
41dd1fe <Sutou Kouhei>  Revert needless style change
f120a7d <Sutou Kouhei>  Suppress a warning with MSVC
62f3649 <Sutou Kouhei>  Remove unused lambda captures
0090051 <Sutou Kouhei>  Add API index for 1.0.0
bd6002a <Sutou Kouhei>  Remove entries
335d7b6 <Sutou Kouhei>  Implement Arrow::Column in Ruby
fb5ad1c <Sutou Kouhei>  Add garrow_chunked_array_get_n_rows() for consistency
60dfb0c <Sutou Kouhei>  Add garrow_schema_get_field_index()
c4ff972 <Sutou Kouhei>  Remove backward compatible column API from GArrowRecordBatch
179aa67 <Sutou Kouhei>  Use "column_data" instead of "column"
795b6c1 <Sutou Kouhei>  Follow arrow::Column remove
ac6f372 <Kevin Gurney> Remove use of arrow::Column from feather_reader.cc. Remove use of deprecated StatusCode enum class values in handle_status.cc
fcaaf8c <Wes McKinney> Add pyarrow.ChunkedArray.flatten method. Remove Column from glib, but haven't fixed unit tests yet
f0d48cc <Wes McKinney> Adapt Python bindings
c161a9a <Wes McKinney> Fix up Parquet, too
93e3cad <Wes McKinney> arrow-tests all passing again
a7344a8 <Wes McKinney> Stage 1 of cutting away Column

Lead-authored-by: Wes McKinney <wesm+git@apache.org>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Kevin Gurney <kevin.p.gurney@gmail.com>
Signed-off-by: Wes McKinney <wesm+git@apache.org>
kszucs pushed a commit that referenced this pull request Jul 22, 2019
… class

This completely removes `arrow::Column` from the C++ library and adapts the Python bindings to follow, to help assist with mailing list discussion.

There are other places that this would touch:

- [x] GLib
- [x] Ruby
- [x] MATLAB
- [x] R

You can see the evident reduction of boilerplate and simplification in many places -- it is particularly pronounced in parquet/arrow/arrow-reader-writer-test.cc. This refactor also exposed a bug where a Column's data type did not match its Field type.

Closes #4841 from wesm/remove-column-class and squashes the following commits:

b389d42 <Wes McKinney> Fix up Python/C++ docs to remove references to arrow::Column
cc34548 <Wes McKinney> Fix non-deterministic Table.from_pydict behavior. Add unicode/utf8 test for pyarrow.table. Fix usages of ChunkedArray.data and add unit test
4be8d5a <Wes McKinney> UTF-8 py2/3 compatibility issues
9b5a61e <Wes McKinney> Try to fix unicode issue
a5ee7b6 <Wes McKinney> Fix MATLAB code (I hope)
2c7c51e <Wes McKinney> code review comments
020f52a <Wes McKinney> Re-render R README
fd4473e <Wes McKinney> Fix up R C++ code and unit tests
244d2a7 <Wes McKinney> Begin refactoring R library. Change Feather to return ChunkedArray from GetColumn
d30a365 <Sutou Kouhei>  Remove unused variable
d89366a <Sutou Kouhei>  Follow API change
983d378 <Sutou Kouhei>  Follow API change
3099328 <Sutou Kouhei>  Follow API change
f092f1b <Sutou Kouhei>  Add missing available annotations
41dd1fe <Sutou Kouhei>  Revert needless style change
f120a7d <Sutou Kouhei>  Suppress a warning with MSVC
62f3649 <Sutou Kouhei>  Remove unused lambda captures
0090051 <Sutou Kouhei>  Add API index for 1.0.0
bd6002a <Sutou Kouhei>  Remove entries
335d7b6 <Sutou Kouhei>  Implement Arrow::Column in Ruby
fb5ad1c <Sutou Kouhei>  Add garrow_chunked_array_get_n_rows() for consistency
60dfb0c <Sutou Kouhei>  Add garrow_schema_get_field_index()
c4ff972 <Sutou Kouhei>  Remove backward compatible column API from GArrowRecordBatch
179aa67 <Sutou Kouhei>  Use "column_data" instead of "column"
795b6c1 <Sutou Kouhei>  Follow arrow::Column remove
ac6f372 <Kevin Gurney> Remove use of arrow::Column from feather_reader.cc. Remove use of deprecated StatusCode enum class values in handle_status.cc
fcaaf8c <Wes McKinney> Add pyarrow.ChunkedArray.flatten method. Remove Column from glib, but haven't fixed unit tests yet
f0d48cc <Wes McKinney> Adapt Python bindings
c161a9a <Wes McKinney> Fix up Parquet, too
93e3cad <Wes McKinney> arrow-tests all passing again
a7344a8 <Wes McKinney> Stage 1 of cutting away Column

Lead-authored-by: Wes McKinney <wesm+git@apache.org>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Kevin Gurney <kevin.p.gurney@gmail.com>
Signed-off-by: Wes McKinney <wesm+git@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants