ARROW-9773: [C++] Implement Take kernel for ChunkedArray #13857
I'd welcome some design input on this. How should the output chunking be structured?
My knee-jerk reaction is to prefer the existing chunk layouts, just in case the chunking was specially chosen for hardware or data placement reasons.
So, use the chunking of the take indices? I suppose that makes sense.
My intuition would be not to chunk at all. This is what we usually do for vector kernels when there is no natural mapping from input chunking to output chunking.
I think the original motivation here, though, was for things that won't fit in a single chunk (not necessarily due to the number of rows, but due to things like the size of string data).
Possibly, though there's a more general performance concern (you don't want to concatenate the chunks of a large chunked array just to take 10 elements out of it).
(also this comment is about primitive arrays)
I'm kind of liking the idea of using the take indices; it potentially gives the user control over the output chunking of take.
Of course, for cases like string and binary data, we'll probably take a different approach and chunk based on what fits.
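As a very rough illustration of what "chunk based on what fits" could mean for string/binary output (an entirely hypothetical helper, not part of this PR): start a new output chunk whenever accumulating another value would overflow the 32-bit offsets of a StringArray.

#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical sketch: greedily split the take output into chunks so that each
// output chunk's character data stays under the int32 offset limit of StringArray.
// `element_sizes[i]` is the byte length of the i-th selected string.
std::vector<int64_t> PlanOutputChunkLengths(const std::vector<int64_t>& element_sizes) {
  constexpr int64_t kMaxChunkBytes = std::numeric_limits<int32_t>::max();
  std::vector<int64_t> chunk_lengths;
  int64_t current_bytes = 0;
  int64_t current_length = 0;
  for (int64_t size : element_sizes) {
    if (current_length > 0 && current_bytes + size > kMaxChunkBytes) {
      // Close the current output chunk before it overflows.
      chunk_lengths.push_back(current_length);
      current_bytes = 0;
      current_length = 0;
    }
    current_bytes += size;
    ++current_length;
  }
  if (current_length > 0) chunk_lengths.push_back(current_length);
  return chunk_lengths;
}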
Ah, I missed that for primitive arrays.
This discussion seems like it might be relevant actually: https://issues.apache.org/jira/browse/ARROW-2532
See arrow/cpp/src/arrow/compute/kernels/vector_sort.cc, lines 690 to 711 in 8474ee5:

// Preprocessed sort key.
struct ResolvedSortKey {
  ResolvedSortKey(const std::shared_ptr<Array>& array, SortOrder order)
      : type(GetPhysicalType(array->type())),
        owned_array(GetPhysicalArray(*array, type)),
        array(*owned_array),
        order(order),
        null_count(array->null_count()) {}

  using LocationType = int64_t;

  template <typename ArrayType>
  ResolvedChunk<ArrayType> GetChunk(int64_t index) const {
    return {&checked_cast<const ArrayType&>(array), index};
  }

  const std::shared_ptr<DataType> type;
  std::shared_ptr<Array> owned_array;
  const Array& array;
  SortOrder order;
  int64_t null_count;
};
arrow/cpp/src/arrow/compute/kernels/vector_sort.cc, lines 844 to 889 in 8474ee5:

// Preprocessed sort key.
struct ResolvedSortKey {
  ResolvedSortKey(const std::shared_ptr<DataType>& type, ArrayVector chunks,
                  SortOrder order, int64_t null_count)
      : type(GetPhysicalType(type)),
        owned_chunks(std::move(chunks)),
        chunks(GetArrayPointers(owned_chunks)),
        order(order),
        null_count(null_count) {}

  using LocationType = ::arrow::internal::ChunkLocation;

  template <typename ArrayType>
  ResolvedChunk<ArrayType> GetChunk(::arrow::internal::ChunkLocation loc) const {
    return {checked_cast<const ArrayType*>(chunks[loc.chunk_index]),
            loc.index_in_chunk};
  }

  // Make a vector of ResolvedSortKeys for the sort keys and the given table.
  // `batches` must be a chunking of `table`.
  static Result<std::vector<ResolvedSortKey>> Make(
      const Table& table, const RecordBatchVector& batches,
      const std::vector<SortKey>& sort_keys) {
    auto factory = [&](const SortField& f) {
      const auto& type = table.schema()->field(f.field_index)->type();
      // We must expose a homogenous chunking for all ResolvedSortKey,
      // so we can't simply pass `table.column(f.field_index)`
      ArrayVector chunks(batches.size());
      std::transform(batches.begin(), batches.end(), chunks.begin(),
                     [&](const std::shared_ptr<RecordBatch>& batch) {
                       return batch->column(f.field_index);
                     });
      return ResolvedSortKey(type, std::move(chunks), f.order,
                             table.column(f.field_index)->null_count());
    };
    return ::arrow::compute::internal::ResolveSortKeys<ResolvedSortKey>(
        *table.schema(), sort_keys, factory);
  }

  std::shared_ptr<DataType> type;
  ArrayVector owned_chunks;
  std::vector<const Array*> chunks;
  SortOrder order;
  int64_t null_count;
};
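The piece of this that carries over to Take is the logical-index resolution that the second ResolvedSortKey relies on. A minimal sketch of that resolution using the internal ChunkResolver (the helper name GetLogicalElement is hypothetical, and the header location may differ between Arrow versions):

#include "arrow/api.h"
#include "arrow/chunk_resolver.h"  // arrow::internal::ChunkResolver (internal API)

// Resolve a logical index over a ChunkedArray into (chunk index, index in chunk)
// and materialize the element as a Scalar.
arrow::Result<std::shared_ptr<arrow::Scalar>> GetLogicalElement(
    const arrow::ChunkedArray& values, int64_t logical_index) {
  arrow::internal::ChunkResolver resolver(values.chunks());
  const auto loc = resolver.Resolve(logical_index);
  // loc.chunk_index selects the chunk, loc.index_in_chunk the element within it.
  return values.chunk(static_cast<int>(loc.chunk_index))->GetScalar(loc.index_in_chunk);
}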
@ursabot please benchmark lang=C++
Benchmark runs are scheduled for baseline = e63a13a and contender = a43fa07. Results will be available as each benchmark for each run completes.
I started by working on the Take implementation for primitive values, so I could familiarize myself with how the Take kernels work. But based on the benchmark I added, it seems like I actually made performance much worse (except in the monotonic case)! I suspect this is because of having to use ChunkResolver …

Benchmark results
Baseline: …
Proposed: …
@wjones127 The numbers seem a bit low to be honest, but perhaps that's just me. I haven't looked at the implementation.
You are correct on that. Both are too low, in test and baseline, by about the same factor: I was creating too large a chunked array for the indices.

Benchmark results
Baseline: …
Proposed: …
After some more testing, it seems like concatenating buffers and indexing into that always wins over using ChunkResolver. From my quick test of Take on chunked string arrays:

…
Unless there is a more performant way, we might only have Take kernels specialized for ChunkedArrays for String / Binary / List (and also Struct, since it will then need to handle rechunking of child arrays). @edponce I know you looked at …

String ChunkedArray Take Benchmark Code

void BenchStringTest() {
  // Note: `args`, `rand`, `state`, and `concat_chunks` are assumed to come from
  // the surrounding benchmark fixture.
  // Create chunked string array
  int32_t string_min_length = 0, string_max_length = 32;
  const int64_t n_chunks = 10;
  const int64_t array_size = args.size / n_chunks;
  ArrayVector chunks;
  for (int64_t i = 0; i < n_chunks; ++i) {
    auto chunk = std::static_pointer_cast<StringArray>(
        rand.String(array_size, string_min_length, string_max_length, 0));
    chunks.push_back(chunk);
  }
  auto values = ChunkedArray(chunks);
  // Create indices
  auto indices =
      rand.Int32(values.length(), 0, static_cast<int32_t>(values.length() - 1), 0);
  for (auto _ : state) {
    TypedBufferBuilder<int32_t> offset_builder;
    TypedBufferBuilder<uint8_t> data_builder;
    const int32_t* indices_values = indices->data()->GetValues<int32_t>(1);
    if (concat_chunks) {
      // Concat the chunks
      ASSIGN_OR_ABORT(std::shared_ptr<Array> values_combined,
                      Concatenate(values.chunks()));
      // For StringArray, buffer 1 holds the int32 offsets and buffer 2 the character data.
      const int32_t* values_offsets = values_combined->data()->GetValues<int32_t>(1);
      const uint8_t* values_data = values_combined->data()->GetValues<uint8_t>(2);
      // for each value
      for (int i = 0; i < indices->length(); ++i) {
        int32_t index = indices_values[i];
        // get the offset and size
        int32_t offset = values_offsets[index];
        int64_t length = values_offsets[index + 1] - offset;
        // throw them on the builder
        data_builder.UnsafeAppend(values_data + offset, length);
      }
    } else {
      using arrow::internal::ChunkLocation;
      using arrow::internal::ChunkResolver;
      ChunkResolver resolver(values.chunks());
      std::vector<const uint8_t*> values_data(values.num_chunks());
      std::vector<const int32_t*> values_offsets(values.num_chunks());
      for (int i = 0; i < values.num_chunks(); ++i) {
        values_offsets[i] = values.chunks()[i]->data()->GetValues<int32_t>(1);
        values_data[i] = values.chunks()[i]->data()->GetValues<uint8_t>(2);
      }
      // for each index
      for (int i = 0; i < indices->length(); ++i) {
        // Resolve the location
        ChunkLocation location = resolver.Resolve(indices_values[i]);
        // Get the offset and size
        int32_t offset = values_offsets[location.chunk_index][location.index_in_chunk];
        int32_t length =
            values_offsets[location.chunk_index][location.index_in_chunk + 1] - offset;
        // throw them on the builder
        data_builder.UnsafeAppend(values_data[location.chunk_index] + offset, length);
      }
    }
  }
}
@wjones127 I think the benchmark test is not doing a fair comparison, as there is no need to copy the data and offsets into the temporary std::vectors; the first loop is not necessary. Nevertheless, there are only 10 chunks, so I wouldn't expect a significant penalty from it, but better to measure than assume. Also, IIRC the binary search in ChunkResolver …
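For reference, a sketch of that else branch without the temporaries, reading offsets and data straight from each resolved chunk's ArrayData (it assumes the benchmark's `values`, `indices`, `indices_values`, and `data_builder` from above, and chunk offsets of zero as in the generated data):

// Resolve each index and read the buffers directly from the chunk's ArrayData,
// skipping the up-front copy into std::vectors.
arrow::internal::ChunkResolver resolver(values.chunks());
for (int64_t i = 0; i < indices->length(); ++i) {
  const auto loc = resolver.Resolve(indices_values[i]);
  const auto& chunk_data = *values.chunk(static_cast<int>(loc.chunk_index))->data();
  const int32_t* offsets = chunk_data.GetValues<int32_t>(1);  // offsets buffer
  const uint8_t* data = chunk_data.GetValues<uint8_t>(2);     // character data buffer
  const int32_t offset = offsets[loc.index_in_chunk];
  const int32_t length = offsets[loc.index_in_chunk + 1] - offset;
  data_builder.UnsafeAppend(data + offset, length);
}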
Now, definitely a fixed array will be quicker to access (a simple offset calculation) than a binary search across allocations that are not necessarily contiguous in hardware and may even reside in different OS pages. I'd be curious how the benchmark compares when using a larger number of chunks: 10, 100, 1000, which is where the concatenation penalty becomes noticeable. Obviously, the sizes of the chunks also matter.
Reran with a variety of chunk counts and total array sizes. It seems like ChunkResolver is only better in the extreme case of very small chunks (1,000 chunks with a total size of 4,194 means each chunk has about 4 elements).
@wjones127 Are these numbers for the random or the monotonic use case?
@wjones127 Thanks for sharing these benchmarks. Are these results measured without the extra overhead of the temporary std::vectors?

It is reasonable that the smaller the chunk size, the better the ChunkResolver cases perform, due to chunk caching and the higher probability of hitting the same chunk consecutively.

Performance is a tricky business because it depends on the metrics you are evaluating for. Both approaches have advantages and disadvantages. If you have a large dataset and are somewhat memory constrained, the concatenation approach may not be adequate due to the extra storage. The ChunkResolver is the most general solution, with the least overhead on memory use and still reasonable performance. AFAIK, Arrow does not track memory statistics that would permit selecting which of these approaches should be used. Well, maybe we could add an option for the client code to decide, but that does not seem to follow Arrow's general design.
That's for random. Here it is including monotonic, which makes it more complex:

So it seems like it's better in the monotonic case, but worse in the random case. My benchmark code is available here: https://github.com/wjones127/arrow/blob/370d0870c68627224aedcfb79cfd7ceb7d0dfa99/cpp/src/arrow/compute/kernels/vector_selection_benchmark.cc#L206-L277
I hadn't removed it. I removed it in the test that I'm showing results for above.
In some cases it seems like it would be a serious regression, so I'm trying to figure out which cases those are so we can avoid using ChunkResolver in them.

It's hard to say if that extra memory usage is that significant. I feel like some extra memory usage will always happen within a compute function. This is large since it needs to operate on the entire chunked array, rather than just a chunk at a time. But also, with memory pools we can rapidly reuse memory; so I imagine, for example, if we are running …
From the results above, before performing the Take operation, what information do we know that could allow us to select the adequate strategy?

Now let's try to summarize, in a very hand-wavy fashion, some observations based on logical array size.

Random order: …

Monotonic order: …

Based on this, a general decision rule could be: if the indices are random, or the array size and number of chunks are not that large, use concat; otherwise use ChunkResolver. But if we can't identify the access order, then it does look like concat would be a better choice, assuming most use cases of Take make random accesses. An alternative approach could be to add a configuration flag in Arrow that states "optimize for speed" or "optimize for storage", and this could be used to select strategies throughout.
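To make that rule concrete, a sketch of what such a dispatch could look like (entirely hypothetical: the name, signature, and thresholds are made up for illustration):

#include <cstdint>
#include "arrow/api.h"

// Hypothetical heuristic: concatenate when indices are random (or order is unknown)
// and the take is large enough to amortize the concatenation cost; otherwise fall
// back to ChunkResolver. The threshold below is a placeholder.
bool ShouldConcatenateForTake(const arrow::ChunkedArray& values,
                              int64_t num_indices, bool indices_are_monotonic) {
  if (values.num_chunks() <= 1) return false;  // nothing to concatenate
  if (indices_are_monotonic) return false;     // ChunkResolver did well here
  // Random (or unknown) order: concatenate unless we take only a tiny fraction.
  return num_indices * 100 >= values.length();
}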
  if (values.IsValid(indices_data[position])) {
    // value is not null
-   out[position] = values_data[indices_data[position]];
+   out[position] = values.GetValue(indices_data[position]);
I'll note that in the IsValid -> GetValue sequence, the chunk resolution is called twice. The compiler might be able to optimize away the second call, but that's not guaranteed.
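A sketch of resolving the location once and reusing it for both the validity check and the value read (a fragment, not kernel code: `resolver`, `chunks`, `out`, `indices_data`, `position`, and `ValueArrayType` stand in for whatever the kernel actually has in scope):

// Resolve once, then do the null check and the value access on the same chunk.
const auto loc = resolver.Resolve(indices_data[position]);
const auto& chunk = arrow::internal::checked_cast<const ValueArrayType&>(
    *chunks[loc.chunk_index]);
if (chunk.IsValid(loc.index_in_chunk)) {
  // value is not null
  out[position] = chunk.Value(loc.index_in_chunk);
}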
…instead of TakeCA (#40206)

### Rationale for this change

`take` concatenates chunks when it's applied to a chunked `values` array, but when the `indices` array is also chunked it concatenates `values` more than once -- one `Concatenate` call with `values.chunks()` for every chunk in `indices`. This PR doesn't remove the concatenation, but ensures it's done only once instead of `indices.size()` times.

### What changes are included in this PR?

- Adding a return type to the `TakeXX` names (-> `TakeXXY`) to make the code easier to understand
- Adding benchmarks to `TakeCCC` (copied from #13857)
- Remove the concatenation from the loop body (!)

### Are these changes tested?

By existing tests.

### Are there any user-facing changes?

A faster compute kernel.

* GitHub Issue: #40207

Authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
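The fix described there amounts to hoisting the Concatenate call out of the per-indices-chunk loop; a rough sketch under that assumption (hypothetical function name, not the actual TakeCCC kernel code):

#include "arrow/api.h"
#include "arrow/array/concatenate.h"
#include "arrow/compute/api.h"

arrow::Result<std::shared_ptr<arrow::ChunkedArray>> TakeChunkedFromChunked(
    const arrow::ChunkedArray& values, const arrow::ChunkedArray& indices) {
  // Before the fix this Concatenate effectively ran once per indices chunk;
  // hoisting it here makes it run exactly once.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> flat_values,
                        arrow::Concatenate(values.chunks()));
  arrow::ArrayVector out_chunks;
  out_chunks.reserve(indices.num_chunks());
  for (const auto& indices_chunk : indices.chunks()) {
    // One output chunk per indices chunk, taken from the flattened values.
    ARROW_ASSIGN_OR_RAISE(arrow::Datum out,
                          arrow::compute::Take(flat_values, indices_chunk));
    out_chunks.push_back(out.make_array());
  }
  return std::make_shared<arrow::ChunkedArray>(std::move(out_chunks), values.type());
}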


No description provided.