-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-16590: [C++] Consolidate files dealing with row-major storage #13218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-16590: [C++] Consolidate files dealing with row-major storage #13218
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename pull request title in the following format? or See also: |
|
CI failures appear unrelated. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to take a quick look at this.
| const FunctionOptions* options; | ||
| }; | ||
|
|
||
| Result<std::vector<const HashAggregateKernel*>> GetKernels( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to expose these APIs here, or can there be a separate header file for internal hash-aggregation APIs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC these are only used in grouped aggregation and in tests, so api_aggregate_internal.h would be appropriate to house anything which is in namespace internal here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes but api_..._internal feels a bit awkward. I created arrow/compute/exec/aggregate.h. This follows the same convention as things like arrow/compute/exec/hash_join.h which contains logic specific to the operators but unaware of the fact its being used in an exec plan. I think it makes sense for the aggregate tests to use this type. It's still using the internal namespace but that's because we need it in the hash kernels tests and at least this keeps the kernels folder cleaner.
Maybe a longer term fix would be to modify the hash aggregate tests to use the exec plan and an aggregate node?
| static void HashMultiColumn(const std::vector<KeyColumnArray>& cols, | ||
| KeyEncoder::KeyEncoderContext* ctx, uint32_t* out_hash); | ||
| static void HashMultiColumn(const std::vector<KeyColumnArray>& cols, LightContext* ctx, | ||
| uint32_t* out_hash); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record, is this a class with only static methods/attributes? This seems like an anti-pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is what it is. It is essentially a namespace to distinguish between 32bit and 64bit implementations. Hashing32::HashBatch will hash rows into uint32_t while Hashing64::HashBatch will hash rows into uint64_t. Would a namespace be an option? (e.g. arrow::compute::hash32::HashBatch)
Alternatively, I suppose we could rename all the functions (e.g. arrow::compute::HashBatch32 and arrow::compute::HashBatch64).
Or we could template all the functions (e.g. arrow::compute::HashBatch<uint32_t> and arrow::compute::HashBatch<uint64_t>)
Do we have a strong style preference here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a strong style preference here?
Hmm, I don't think so. If it's used for templating then I suppose the class is necessary.
| /// allows us to take advantage of these resources without coupling the logic with | ||
| /// the execution engine. | ||
| struct LightContext { | ||
| bool has_avx2() const { return (hardware_flags & arrow::internal::CpuInfo::AVX2) > 0; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this no using CpuInfo::IsSupported(CpuInfo::AVX2)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, the concept here was to be able to attach hardware flags to a specific context rather than needing to disable or enable for the whole library using CpuInfo::EnableFeature(). It and many other things are certainly candidates for follow up refactoring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving this alone for now.
| std::vector<uint32_t> batch_varbinary_cols_base_offsets_; | ||
| }; | ||
|
|
||
| class EncoderInteger { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these all have to be exposed in a .h?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some don't. Any of the encoders that have an AVX2 implemented method do I think. So if I was going to need an internal header anyways it seemed more consistent to just throw them all in. However, I can prune this down to just the encoders needed if that would be better.
| /// For a varying-length binary, size of all encoded fixed-length key columns, | ||
| /// including lengths of varying-length columns, rounded up to the multiple of string | ||
| /// alignment. | ||
| uint32_t fixed_length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are some sizes or quantities unsigned and other signed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if there is a particular reason.
| // Buffers can only expand during lifetime and never shrink. | ||
| std::unique_ptr<ResizableBuffer> null_masks_; | ||
| // Only used if the table has variable-length columns | ||
| // Stores the offsets into the binary data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is the binary data stored?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment but it's stored after the fixed-size fields. So, for example, if you had 2 int32 fields and a string field and 3 rows you might have something like...
| i1 | i2 | s1 |
|---|---|---|
| 1 | 3 | abc |
| 2 | 4 | xy |
// buffers_[1]
0x00000001 0x00000002 0x61 0x62 0x63 0x00000003 0x00000004 0x78 0x79
// offsets_
2, 5, 7, 9
I'm probably off on a few details in that example but that is the rough idea.
| // Called after resize to fix pointers | ||
| void update_buffer_pointers(); | ||
|
|
||
| static constexpr int64_t padding_for_vectors = 64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| static constexpr int64_t padding_for_vectors = 64; | |
| static constexpr int64_t kPaddingForVectors = 64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also add a comment explaining what this is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this change should be made but I'd recommend doing so in follow up; I'd prefer to keep this refactor move-only since it's large as it is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went ahead and did the rename. It's a private constant so the scope should be pretty minimal.
| // The number of bytes that can be stored in the table without resizing | ||
| int64_t bytes_capacity_; | ||
|
|
||
| // Mutable to allow lazy evaluation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these be atomic or is the row table not thread safe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The row table is not thread safe. I updated the class comment to mention this fact.
| const FunctionOptions* options; | ||
| }; | ||
|
|
||
| Result<std::vector<const HashAggregateKernel*>> GetKernels( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC these are only used in grouped aggregation and in tests, so api_aggregate_internal.h would be appropriate to house anything which is in namespace internal here
| // Called after resize to fix pointers | ||
| void update_buffer_pointers(); | ||
|
|
||
| static constexpr int64_t padding_for_vectors = 64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this change should be made but I'd recommend doing so in follow up; I'd prefer to keep this refactor move-only since it's large as it is
| /// allows us to take advantage of these resources without coupling the logic with | ||
| /// the execution engine. | ||
| struct LightContext { | ||
| bool has_avx2() const { return (hardware_flags & arrow::internal::CpuInfo::AVX2) > 0; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, the concept here was to be able to attach hardware flags to a specific context rather than needing to disable or enable for the whole library using CpuInfo::EnableFeature(). It and many other things are certainly candidates for follow up refactoring
9bef021 to
a8b4ae1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these a different thing than {null_masks_, offsets_, rows_)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once I've called AppendEmpty, what am I supposed to do?
65ebdaa to
2de80c9
Compare
wesm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. I rebased after #13364 and addressed some of the remaining code review comments. This conflicts with some of the ongoing refactoring to transition from ExecBatch to ExecSpan so I will merge this as soon as we have a green CI build
2de80c9 to
cbbae69
Compare
…to arrow/compute/row ARROW-16590: Moved GroupBy out of the kernels layer (api_aggregate) and into the exec layer (exec/aggregate). Added some comments and renamed a few fields to adhere to style conventions. ARROW-16590: Fix includes in benchmark to address GroupBy move
Add missing #pragma once
cbbae69 to
d4860c8
Compare
The primary goal of this refactor of old code was to improve the readability and clarity of the code base. I did not make any functional changes to the code and if any functional changes are suggested which modify existing code I will happily discuss them here but defer the changes themselves to follow-up PRs. I would very much appreciate any feedback on naming, making sure we have sufficient test coverage, and overall layout of the code.
KeyColumnArraywas a 1D data structure whileKeyRowArraywas a 2D table structure.