
GH-39398: [C++][Parquet] Use std::count in ColumnReader ReadLevels#39397

Merged
mapleFU merged 3 commits into apache:main from
Hattonuri:dstasenko/use_count_column_reader_read_levels
Jan 9, 2024

Conversation

@Hattonuri (Contributor) commented Dec 29, 2023

Rationale for this change

I've found that this for-loop here:

void ReadLevels(int64_t batch_size, int16_t* def_levels, int16_t* rep_levels,
                int64_t* num_def_levels, int64_t* values_to_read) {
  batch_size =
      std::min(batch_size, this->num_buffered_values_ - this->num_decoded_values_);
  // If the field is required and non-repeated, there are no definition levels
  if (this->max_def_level_ > 0 && def_levels != nullptr) {
    *num_def_levels = this->ReadDefinitionLevels(batch_size, def_levels);
    // TODO(wesm): this tallying of values-to-decode can be performed with better
    // cache-efficiency if fused with the level decoding.
    for (int64_t i = 0; i < *num_def_levels; ++i) {
      if (def_levels[i] == this->max_def_level_) {
        ++(*values_to_read);
      }
    }
  } else {
    // Required field, read all values
    *values_to_read = batch_size;
  }
}

transforms into

0xc0c2f0 <ReadLevels()+96>      inc    %rdx
0xc0c2f3 <ReadLevels()+99>      cmp    %rax,%rdx
0xc0c2f6 <ReadLevels()+102>     jge    0xc0c30c <ReadLevels()+124>
0xc0c2f8 <ReadLevels()+104>     cmp    %cx,(%r14,%rdx,2)
0xc0c2fd <ReadLevels()+109>     jne    0xc0c2f0 <ReadLevels()+96>
0xc0c2ff <ReadLevels()+111>     incq   0x0(%rbp)
0xc0c303 <ReadLevels()+115>     mov    (%rbx),%rax
0xc0c306 <ReadLevels()+118>     jmp    0xc0c2f0 <ReadLevels()+96>

This shows the loop iterates element by element and increments the counter through the pointer with incq on every match.
I think the reason is that values_to_read and num_def_levels are not marked restrict, so the compiler cannot prove they don't alias and cannot optimize this into a more efficient form (for example, using SIMD).

On my flamegraph this part accounted for ~10% of the time spent.

There are also other for-loops in this file that could easily be changed to std::count, but they do not write through references, and I don't know why std::count was not used throughout the cpp/src/parquet/ directory, so I didn't change much beyond this spot.

What changes are included in this PR?

Using std::count in parquet/column_reader.cc so that the counting loop is no longer left unoptimized.

Are these changes tested?

They are covered by existing unit tests but not benchmarked, because I don't know which benchmark would show the performance improvement here.

Are there any user-facing changes?

@Hattonuri Hattonuri requested a review from wgtmac as a code owner December 29, 2023 22:28
@github-actions bot added the 'awaiting review' label Dec 29, 2023
@kou kou changed the title MINOR: [C++] [Parquet] Use std::count in ColumnReader ReadLevels MINOR: [C++][Parquet] Use std::count in ColumnReader ReadLevels Dec 30, 2023
@kou (Member) commented Dec 30, 2023

This is not a MINOR change: https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes
Could you open a new issue for this?

@mapleFU (Member) commented Dec 30, 2023

The change looks good to me; I'll check some compiler optimization options for this.

Out of curiosity, would you mind telling us which compiler you're using?

@mapleFU (Member) commented Dec 30, 2023

I think that the reason is that values_to_read and num_def_levels are not set as restrict. So the compiler can not optimize this to a more efficient way(for example using simd)

Maybe max_def_level_ is also involved: values_to_read and max_def_level_ are hard for the compiler to analyze.

I'll approve it if you relate this patch to an issue. Also, would you mind optimizing ReadDenseForOptional as well?

@Hattonuri Hattonuri changed the title MINOR: [C++][Parquet] Use std::count in ColumnReader ReadLevels GH-39398: [C++] [Parquet] Use std::count in ColumnReader ReadLevels Dec 30, 2023
@github-actions

⚠️ GitHub issue #39398 has been automatically assigned in GitHub to PR creator.

@Hattonuri (Contributor, Author) commented Dec 30, 2023

Out of curiosity, would you mind telling us which compiler you're using?

I use clang++ 17 with libstdc++ and -O3 -march=x86-64-v3 -fno-omit-frame-pointer -std=c++2a, and jemalloc as the allocator.

@mapleFU (Member) left a comment

+1

I'll wait a few days for others to review it after the holiday.

}
int values_to_read =
std::count(def_levels, def_levels + num_def_levels, this->max_def_level_);
total_values = this->ReadValues(values_to_read, values);
Member (inline comment on the hunk above):
I'm OK with this, but maybe this doesn't change the optimization: values_to_read is a value here, so the compiler might well optimize it already (since strict aliasing cannot mix int16_t* and int64_t*).

@github-actions bot added the 'awaiting committer review' label and removed the 'awaiting review' label Dec 30, 2023
@kou (Member) commented Dec 30, 2023

Could you check the build failure on AppVeyor?

https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/48852924#L1696

C:/projects/arrow/cpp/src/parquet/column_reader.cc(1195): error C2220: the following warning is treated as an error
C:/projects/arrow/cpp/src/parquet/column_reader.cc(1165): note: while compiling class template member function 'int64_t parquet::`anonymous-namespace'::TypedColumnReaderImpl<parquet::BooleanType>::ReadBatchSpaced(int64_t,int16_t *,int16_t *,bool *,uint8_t *,int64_t,int64_t *,int64_t *,int64_t *)'
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\include\type_traits(325): note: see reference to class template instantiation 'parquet::`anonymous-namespace'::TypedColumnReaderImpl<parquet::BooleanType>' being compiled
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\include\memory(1440): note: see reference to class template instantiation 'std::is_convertible<_Yty *,_Ty *>' being compiled
        with
        [
            _Yty=parquet::`anonymous-namespace'::TypedColumnReaderImpl<parquet::BooleanType>,
            _Ty=parquet::ColumnReader
        ]
C:/projects/arrow/cpp/src/parquet/column_reader.cc(1290): note: see reference to class template instantiation 'std::_SP_pointer_compatible<parquet::`anonymous-namespace'::TypedColumnReaderImpl<parquet::BooleanType>,parquet::ColumnReader>' being compiled
C:/projects/arrow/cpp/src/parquet/column_reader.cc(1195): warning C4244: 'initializing': conversion from '__int64' to 'int', possible loss of data

@mapleFU (Member) commented Dec 31, 2023

Aha, I think def_levels is int16_t, but the count we store actually needs to be int64_t...

@mapleFU (Member) commented Jan 2, 2024

@ursabot please benchmark

@ursabot commented Jan 2, 2024

Benchmark runs are scheduled for commit 23e7c4e. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 6 benchmarking runs that have been run so far on PR commit 23e7c4e.

There were 3 benchmark results indicating a performance regression:

The full Conbench report has more details.

@mapleFU (Member) commented Jan 2, 2024

diff --git a/cpp/src/parquet/column_reader.cc b/cpp/src/parquet/column_reader.cc
index a49e58afb..aa11dc57e 100644
--- a/cpp/src/parquet/column_reader.cc
+++ b/cpp/src/parquet/column_reader.cc
@@ -1059,10 +1059,11 @@ class TypedColumnReaderImpl : public TypedColumnReader<DType>,
 
     // If the field is required and non-repeated, there are no definition levels
     if (this->max_def_level_ > 0 && def_levels != nullptr) {
-      *num_def_levels = this->ReadDefinitionLevels(batch_size, def_levels);
+      int64_t num_def_levels_value = this->ReadDefinitionLevels(batch_size, def_levels);
       // TODO(wesm): this tallying of values-to-decode can be performed with better
       // cache-efficiency if fused with the level decoding.
-      for (int64_t i = 0; i < *num_def_levels; ++i) {
+      *num_def_levels = num_def_levels_value;
+      for (int64_t i = 0; i < num_def_levels_value; ++i) {
         if (def_levels[i] == this->max_def_level_) {
           ++(*values_to_read);
         }
@@ -1909,7 +1910,8 @@ class TypedRecordReader : public TypedColumnReaderImpl<DType>,
 
     // When reading dense we need to figure out number of values to read.
     const int16_t* def_levels = this->def_levels();
-    for (int64_t i = start_levels_position; i < levels_position_; ++i) {
+    int64_t levels_position = levels_position_;
+    for (int64_t i = start_levels_position; i < levels_position; ++i) {
       if (def_levels[i] == this->max_def_level_) {
         ++(*values_to_read);
       }

I think a change like this might be minimal, but I don't know if it would be acceptable in this case...

Also, I don't know whether the regression is related; I'll wait some days. This patch runs faster on my macOS.

@Hattonuri (Contributor, Author) commented:

Does your code get transformed into AVX? Because you still write to *values_to_read.

@kou kou changed the title GH-39398: [C++] [Parquet] Use std::count in ColumnReader ReadLevels GH-39398: [C++][Parquet] Use std::count in ColumnReader ReadLevels Jan 2, 2024
@mapleFU (Member) commented Jan 2, 2024

Because you still write to *values_to_read

Hmm, since this is not a reduction, and the compiler finds there is no strict-aliasing problem, I think it might not matter, but it needs further testing. We can wait for pitrou to come back from holiday and take a look.

@emkornfield (Contributor) commented:

It might be a benchmark coverage issue. @Hattonuri if this is showing improvements on your own benchmarks, it might make sense to add more parquet benchmarks in a separate PR so that regressions aren't introduced afterwards.

@Hattonuri (Contributor, Author) commented:

My benchmark is just a reader based on the low-level Parquet API that reads a single dataset prefix. As it's written using static reflection, it doesn't take much time. I then look at the perf report and inspect the assembly of the long-running functions.

@mapleFU (Member) commented Jan 3, 2024

Maybe we can draft a read-level benchmark in parquet-column-reader-benchmark. In the current parquet-column-reader-benchmark we can see some of the optimization, but it's a bit indirect.

@emkornfield (Contributor) commented:

My benchmark is just a reader based on low-level parquet api reading a single dataset prefix. As it's written using static reflection - it does not use much time. And then i look on perf report and watch asm code of long functions

I think it's worth turning it into a real benchmark that we can monitor for regressions; without that, there is a higher risk the change gets reverted.

Maybe we can draft a read-level benchmark in parquet-column-reader-benchmark. in current parquet-column-reader-benchmark we can find some optimization, but it's a bit indirect

This sounds like a reasonable approach. Having benchmarks at different levels, for different data types makes a lot of sense to me.

@mapleFU (Member) commented Jan 4, 2024

Hmmm @emkornfield, Skip and Read in the parquet-column-reader benchmark get faster on my macOS. ReadLevels itself doesn't have any direct benchmark, since it's a bit hacky to reach because it's private :-(

@emkornfield (Contributor) commented:

Hmmm @emkornfield SKip and Read in parquet-column-reader benchmark get faster in my MacOS, ReadLevels itself doesn't have any benchmark directly, since it's a bit hacking to get it because it's private :-(

Maybe we should make ReadLevels public/visible for testing? Or, at least for posterity, can we post the before/after of the benchmark?

@mapleFU (Member) commented Jan 6, 2024

I'll draft a benchmark-only PR and post the data there.

@mapleFU (Member) commented Jan 6, 2024

@emkornfield I've created a draft PR: #39486

On my M1 macOS with a Release build (-O2):

Origin:

-------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
ColumnReaderReadLevels/Repetition:1/BatchSize:1000    5309956 ns      5145464 ns          138 bytes_per_second=505.577M/s
ColumnReaderReadLevels/Repetition:2/BatchSize:1000    6481672 ns      6169167 ns          114 bytes_per_second=449.652M/s

New:

-------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
ColumnReaderReadLevels/Repetition:1/BatchSize:1000     712139 ns       686560 ns         1087 bytes_per_second=3.70027G/s
ColumnReaderReadLevels/Repetition:2/BatchSize:1000    1705093 ns      1651455 ns          440 bytes_per_second=1.64035G/s

Also cc @pitrou because you're the expert here.

@Hattonuri (Contributor, Author) commented Jan 7, 2024

@mapleFU Could you please run your benchmark for #39403? (And for the variant with another if changed to an assert.)

@mapleFU (Member) commented Jan 8, 2024

This looks good to me; I'll wait for other committers to approve and merge it :-)

I might go for abseil for benchmarking #39403, since its callers are much more numerous than the code touched in this patch.

@github-actions bot added the 'awaiting merge' label and removed the 'awaiting committer review' label Jan 8, 2024
@mapleFU mapleFU requested a review from pitrou January 8, 2024 07:05
@pitrou (Member) commented Jan 8, 2024

This benchmark shows a massive improvement:
https://conbench.ursa.dev/compare/benchmark-results/0658e4fc116778c480000601eaada951...06593d73f0097fa88000c545dd84021f/

Thank you @Hattonuri !

@mapleFU mapleFU merged commit 3af69c8 into apache:main Jan 9, 2024
@mapleFU mapleFU removed the awaiting merge Awaiting merge label Jan 9, 2024
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 3af69c8.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
GH-39398: [C++][Parquet] Use std::count in ColumnReader ReadLevels (apache#39397)

* Closes: apache#39398

Authored-by: Dmitry Stasenko <dmitry.stasenko@pinely.com>
Signed-off-by: mwish <maplewish117@gmail.com>
Closes issue: [C++] [Parquet] Use std::count in parquet ColumnReader