Skip to content

Conversation

@westonpace
Copy link
Member

@westonpace westonpace commented May 17, 2022

This PR is the second PR in a series to merge in #12326. The first was #12872. Some of this is new functionality (from michalursa) and can be found in 7daafeb. The new behavior is mostly in RowTable and RowTableMerge. In addition to these new types this PR moves around, and renames, a lot of existing work revolving around KeyRowArray.

The primary goal of this refactor of old code was to improve the readability and clarity of the code base. I did not make any functional changes to the code and if any functional changes are suggested which modify existing code I will happily discuss them here but defer the changes themselves to follow-up PRs. I would very much appreciate any feedback on naming, making sure we have sufficient test coverage, and overall layout of the code.

It's a very large change but most of it is moving things around. What I'd like input on most:

  • KeyRowArray -> RowTableImpl, RowArray -> RowTable: This is the most meaningful name change. The old name made sense because this data is currently represented physically as an array of rows. However, the data is conceptually tabular. We are storing rows & columns. In particular, I found it confusing that KeyColumnArray was a 1D data structure while KeyRowArray was a 2D table structure.
  • Overall structure: I created a new folder arrow/compute/row and put all row-based utilities in here. Most of the files are now marked as _internal and the content in these files is not used outside of arrow/compute/row. The grouper had previously been alongside the kernel code and it didn't really belong there. It also relies very heavily on the internal structure of the row encoding.
  • Row structure: I documented the file arrow/compute/row/row_internal.h and would appreciate review of the content here.
  • Semi-external API: The "external" API now consists of arrow/compute/row/grouper.h, which is more or less unchanged, and arrow/compute/encode.h which is mostly new code from 7daafeb. We have quite a few tests exercising Grouper and I was unable to easily extract them from the aggregate kernel tests so I left the tests alone. I added tests to the new utilities in encode.h. These utilities are not yet properly external because they depend on RowTableImpl and I would need to migrate them to the PIMPL pattern to remove this dependency. Since that would be a functional change I think it would be best to wait.

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

michalursa and others added 2 commits May 18, 2022 05:36
…sts for encoding and decoding rows as well as a test for comparing column-major data with row-major data.
@westonpace westonpace force-pushed the feature/ARROW-16590--consolidate-row-major-utilities branch from da41628 to ba12892 Compare May 18, 2022 15:36
@westonpace westonpace requested a review from bkietz May 19, 2022 05:38
@westonpace westonpace marked this pull request as ready for review May 19, 2022 05:38
@westonpace
Copy link
Member Author

I'm still chasing a few things around with lint/format and there is a, possibly legitimate and existing, bug with big endian architectures, but I think this is close enough to start review.

@westonpace
Copy link
Member Author

CC @michalursa

@westonpace
Copy link
Member Author

Per an offline request from @bkietz I've split this PR in two. The first PR is #13218 and contains only the refactoring. The changes & new utilities will be put into a future PR.

@westonpace
Copy link
Member Author

westonpace commented May 23, 2022

The second half of this PR has been moved to #13220

@westonpace westonpace closed this May 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants