Skip to content

Conversation

@westonpace
Copy link
Member

The primary goal of this refactor of old code was to improve the readability and clarity of the code base. I did not make any functional changes to the code and if any functional changes are suggested which modify existing code I will happily discuss them here but defer the changes themselves to follow-up PRs. I would very much appreciate any feedback on naming, making sure we have sufficient test coverage, and overall layout of the code.

  • KeyRowArray -> RowTableImpl KeyEncoder -> RowTableEncoder: The old name made sense because this data is currently represented physically as an array of rows. However, the data is conceptually tabular. We are storing rows & columns. In particular, I found it confusing that KeyColumnArray was a 1D data structure while KeyRowArray was a 2D table structure.
  • KeyEncoder::Context -> LightContext: There's nothing particular to the key encoder here and I worry keeping it there may lead to fracturing into many different "context" objects.
  • Overall structure: I created a new folder arrow/compute/row and put all row-based utilities in here. Most of the files are now marked as _internal and the content in these files is not used outside of arrow/compute/row. The grouper had previously been alongside the kernel code and it didn't really belong there as it relies very heavily on the internal structure of the row encoding.
  • Row structure: I documented the file arrow/compute/row/row_internal.h

@westonpace westonpace requested a review from bkietz May 23, 2022 21:03
@github-actions
Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@westonpace
Copy link
Member Author

CI failures appear unrelated.

@pitrou pitrou self-requested a review May 30, 2022 15:48
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to take a quick look at this.

const FunctionOptions* options;
};

Result<std::vector<const HashAggregateKernel*>> GetKernels(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to expose these APIs here, or can there be a separate header file for internal hash-aggregation APIs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC these are only used in grouped aggregation and in tests, so api_aggregate_internal.h would be appropriate to house anything which is in namespace internal here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but api_..._internal feels a bit awkward. I created arrow/compute/exec/aggregate.h. This follows the same convention as things like arrow/compute/exec/hash_join.h which contains logic specific to the operators but unaware of the fact its being used in an exec plan. I think it makes sense for the aggregate tests to use this type. It's still using the internal namespace but that's because we need it in the hash kernels tests and at least this keeps the kernels folder cleaner.

Maybe a longer term fix would be to modify the hash aggregate tests to use the exec plan and an aggregate node?

static void HashMultiColumn(const std::vector<KeyColumnArray>& cols,
KeyEncoder::KeyEncoderContext* ctx, uint32_t* out_hash);
static void HashMultiColumn(const std::vector<KeyColumnArray>& cols, LightContext* ctx,
uint32_t* out_hash);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, is this a class with only static methods/attributes? This seems like an anti-pattern.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is what it is. It is essentially a namespace to distinguish between 32bit and 64bit implementations. Hashing32::HashBatch will hash rows into uint32_t while Hashing64::HashBatch will hash rows into uint64_t. Would a namespace be an option? (e.g. arrow::compute::hash32::HashBatch)

Alternatively, I suppose we could rename all the functions (e.g. arrow::compute::HashBatch32 and arrow::compute::HashBatch64).

Or we could template all the functions (e.g. arrow::compute::HashBatch<uint32_t> and arrow::compute::HashBatch<uint64_t>)

Do we have a strong style preference here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a strong style preference here?

Hmm, I don't think so. If it's used for templating then I suppose the class is necessary.

/// allows us to take advantage of these resources without coupling the logic with
/// the execution engine.
struct LightContext {
bool has_avx2() const { return (hardware_flags & arrow::internal::CpuInfo::AVX2) > 0; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this no using CpuInfo::IsSupported(CpuInfo::AVX2)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, the concept here was to be able to attach hardware flags to a specific context rather than needing to disable or enable for the whole library using CpuInfo::EnableFeature(). It and many other things are certainly candidates for follow up refactoring

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving this alone for now.

std::vector<uint32_t> batch_varbinary_cols_base_offsets_;
};

class EncoderInteger {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these all have to be exposed in a .h?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some don't. Any of the encoders that have an AVX2 implemented method do I think. So if I was going to need an internal header anyways it seemed more consistent to just throw them all in. However, I can prune this down to just the encoders needed if that would be better.

/// For a varying-length binary, size of all encoded fixed-length key columns,
/// including lengths of varying-length columns, rounded up to the multiple of string
/// alignment.
uint32_t fixed_length;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are some sizes or quantities unsigned and other signed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if there is a particular reason.

// Buffers can only expand during lifetime and never shrink.
std::unique_ptr<ResizableBuffer> null_masks_;
// Only used if the table has variable-length columns
// Stores the offsets into the binary data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the binary data stored?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment but it's stored after the fixed-size fields. So, for example, if you had 2 int32 fields and a string field and 3 rows you might have something like...

i1 i2 s1
1 3 abc
2 4 xy
// buffers_[1]
0x00000001 0x00000002 0x61 0x62 0x63 0x00000003 0x00000004 0x78 0x79
// offsets_
2, 5, 7, 9

I'm probably off on a few details in that example but that is the rough idea.

// Called after resize to fix pointers
void update_buffer_pointers();

static constexpr int64_t padding_for_vectors = 64;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static constexpr int64_t padding_for_vectors = 64;
static constexpr int64_t kPaddingForVectors = 64;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add a comment explaining what this is?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this change should be made but I'd recommend doing so in follow up; I'd prefer to keep this refactor move-only since it's large as it is

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and did the rename. It's a private constant so the scope should be pretty minimal.

// The number of bytes that can be stored in the table without resizing
int64_t bytes_capacity_;

// Mutable to allow lazy evaluation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these be atomic or is the row table not thread safe?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The row table is not thread safe. I updated the class comment to mention this fact.

const FunctionOptions* options;
};

Result<std::vector<const HashAggregateKernel*>> GetKernels(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC these are only used in grouped aggregation and in tests, so api_aggregate_internal.h would be appropriate to house anything which is in namespace internal here

// Called after resize to fix pointers
void update_buffer_pointers();

static constexpr int64_t padding_for_vectors = 64;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this change should be made but I'd recommend doing so in follow up; I'd prefer to keep this refactor move-only since it's large as it is

/// allows us to take advantage of these resources without coupling the logic with
/// the execution engine.
struct LightContext {
bool has_avx2() const { return (hardware_flags & arrow::internal::CpuInfo::AVX2) > 0; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, the concept here was to be able to attach hardware flags to a specific context rather than needing to disable or enable for the whole library using CpuInfo::EnableFeature(). It and many other things are certainly candidates for follow up refactoring

@westonpace westonpace force-pushed the feature/ARROW-16590--consolidate-row-major-utilities-2 branch from 9bef021 to a8b4ae1 Compare June 3, 2022 02:53
@westonpace westonpace requested a review from pitrou June 3, 2022 03:05
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these a different thing than {null_masks_, offsets_, rows_)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once I've called AppendEmpty, what am I supposed to do?

@wesm wesm force-pushed the feature/ARROW-16590--consolidate-row-major-utilities-2 branch from 65ebdaa to 2de80c9 Compare June 13, 2022 20:03
Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I rebased after #13364 and addressed some of the remaining code review comments. This conflicts with some of the ongoing refactoring to transition from ExecBatch to ExecSpan so I will merge this as soon as we have a green CI build

@wesm wesm force-pushed the feature/ARROW-16590--consolidate-row-major-utilities-2 branch from 2de80c9 to cbbae69 Compare June 13, 2022 20:32
westonpace and others added 3 commits June 13, 2022 21:02
…to arrow/compute/row

ARROW-16590: Moved GroupBy out of the kernels layer (api_aggregate) and into the exec layer (exec/aggregate).  Added some comments and renamed a few fields to adhere to style conventions.

ARROW-16590: Fix includes in benchmark to address GroupBy move
Add missing #pragma once
@wesm wesm force-pushed the feature/ARROW-16590--consolidate-row-major-utilities-2 branch from cbbae69 to d4860c8 Compare June 14, 2022 02:02
@wesm wesm changed the title ARROW-16590: [C++] Consolidate files dealing with row-major storage ARROW-16590: [C++] Consolidate files dealing with row-major storage Jun 14, 2022
@wesm wesm merged commit 5b859fd into apache:master Jun 14, 2022
@github-actions
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants