GH-39398: [C++][Parquet] Use std::count in ColumnReader ReadLevels#39397
Conversation
This is not a MINOR change: https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes
The change looks good to me. I'll check some compiler optimization options for this out of curiosity; would you mind providing the compiler you're using?
Maybe I'll approve it if you relate this patch to an issue. Also, would you mind optimizing…
I use clang++ 17 with libstdc++, -O3 -march=x86-64-v3 -fno-omit-frame-pointer -std=c++2a, and jemalloc as the allocator.
mapleFU left a comment:
+1
I'll wait a few days for others to review it after the holiday.
    }
    int values_to_read =
        std::count(def_levels, def_levels + num_def_levels, this->max_def_level_);
    total_values = this->ReadValues(values_to_read, values);
I'm OK with this, but maybe this isn't where the optimization comes from: values_to_read is a local value here, so the compiler might well optimize it already (strict aliasing means an int16_t* and an int64_t* cannot alias).
Could you check the build failure on AppVeyor? https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/48852924#L1696
Aha, I think def_level is int16_t, but we actually need to increment an int64_t...
@ursabot please benchmark
Benchmark runs are scheduled for commit 23e7c4e. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.
Thanks for your patience. Conbench analyzed the 6 benchmarking runs that have been run so far on PR commit 23e7c4e. There were 3 benchmark results indicating a performance regression:
The full Conbench report has more details.
diff --git a/cpp/src/parquet/column_reader.cc b/cpp/src/parquet/column_reader.cc
index a49e58afb..aa11dc57e 100644
--- a/cpp/src/parquet/column_reader.cc
+++ b/cpp/src/parquet/column_reader.cc
@@ -1059,10 +1059,11 @@ class TypedColumnReaderImpl : public TypedColumnReader<DType>,
// If the field is required and non-repeated, there are no definition levels
if (this->max_def_level_ > 0 && def_levels != nullptr) {
- *num_def_levels = this->ReadDefinitionLevels(batch_size, def_levels);
+ int64_t num_def_levels_value = this->ReadDefinitionLevels(batch_size, def_levels);
// TODO(wesm): this tallying of values-to-decode can be performed with better
// cache-efficiency if fused with the level decoding.
- for (int64_t i = 0; i < *num_def_levels; ++i) {
+ *num_def_levels = num_def_levels_value;
+ for (int64_t i = 0; i < num_def_levels_value; ++i) {
if (def_levels[i] == this->max_def_level_) {
++(*values_to_read);
}
@@ -1909,7 +1910,8 @@ class TypedRecordReader : public TypedColumnReaderImpl<DType>,
// When reading dense we need to figure out number of values to read.
const int16_t* def_levels = this->def_levels();
- for (int64_t i = start_levels_position; i < levels_position_; ++i) {
+ int64_t levels_position = levels_position_;
+ for (int64_t i = start_levels_position; i < levels_position; ++i) {
if (def_levels[i] == this->max_def_level_) {
++(*values_to_read);
        }

I think code like this might be minimal, and I don't know if it would be OK in this case... Also, I don't know whether the regression is related; I'll wait some days. This patch runs faster on my macOS.
Does your code transform into AVX? Because you still write to *values_to_read.
Emm, since this is not a reduction, and the compiler finds there is no strict aliasing problem, I think it might not matter, but it needs further testing. We can wait for pitrou to come back from holiday and take a look.
It might be a benchmark coverage issue. @Hattonuri if this is showing improvements on your own benchmarks, it might make sense to add more parquet benchmarks in a separate PR so that regressions aren't introduced afterwards.
My benchmark is just a reader based on the low-level Parquet API, reading a single dataset prefix. As it's written using static reflection, it does not take much time. I then look at the perf report and inspect the asm of the long-running functions.
Maybe we can draft a read-level benchmark in parquet-column-reader-benchmark.
I think turning it into a real benchmark that we can monitor for regressions would be best; without that there is a higher risk the change gets reverted.
This sounds like a reasonable approach. Having benchmarks at different levels, for different data types, makes a lot of sense to me.
Hmmm @emkornfield
Maybe we should make ReadLevels public/visible for testing? Or at least, for posterity, can we post the before/after of the benchmark?
I'll draft a PR for the benchmark only and post the data there.
@emkornfield I've created a draft PR: #39486. On my M1 macOS with Release (-O2): Origin: New: Also cc @pitrou because you're the expert here.
This looks good to me; I will wait for other committers to approve, then merge it :-) I might go for abseil for benchmarking in #39403, since its caller is invoked much more than the code in this patch.
This benchmark shows a massive improvement. Thank you @Hattonuri!
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 3af69c8. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.
…els (apache#39397)

* Closes: apache#39398

Authored-by: Dmitry Stasenko <dmitry.stasenko@pinely.com>
Signed-off-by: mwish <maplewish117@gmail.com>
Rationale for this change
I've found that the for-loop here
arrow/cpp/src/parquet/column_reader.cc
Lines 1055 to 1073 in 7c3480e
transforms into
0xc0c2f0 <ReadLevels()+96> inc %rdx
0xc0c2f3 <ReadLevels()+99> cmp %rax,%rdx
0xc0c2f6 <ReadLevels()+102> jge 0xc0c30c <ReadLevels()+124>
0xc0c2f8 <ReadLevels()+104> cmp %cx,(%r14,%rdx,2)
0xc0c2fd <ReadLevels()+109> jne 0xc0c2f0 <ReadLevels()+96>
0xc0c2ff <ReadLevels()+111> incq 0x0(%rbp)
0xc0c303 <ReadLevels()+115> mov (%rbx),%rax
0xc0c306 <ReadLevels()+118> jmp 0xc0c2f0 <ReadLevels()+96>
That means it iterates element by element and increments through a reference with incq.
I think the reason is that values_to_read and num_def_levels are not marked restrict, so the compiler cannot optimize this into a more efficient form (for example, using SIMD).
On my flamegraph this part accounted for ~10% of the time spent.
In this file there are also some for-loops that could easily be changed to std::count, but they do not touch references, and I don't know why std::count was not used throughout the cpp/src/parquet/ directory, so I didn't change much.
What changes are included in this PR?
Using std::count in parquet/column_reader.cc to avoid the loop not being optimized.

Are these changes tested?

They are tested with the unit tests but not benchmarked, because I don't know which benchmark would show the performance gain here.
Are there any user-facing changes?