GH-39398: [C++][Parquet] Use std::count in ColumnReader ReadLevels#39397
Conversation
This is not a MINOR change: https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes
The change looks good to me. I'll check some compiler optimization options for this out of curiosity; would you mind providing the compiler you're using?
Maybe I'll approve it if you relate this patch to an issue. Also, would you mind optimizing…
I use clang++ 17 with libstdc++, -O3 -march=x86-64-v3 -fno-omit-frame-pointer -std=c++2a, and jemalloc as the allocator.
mapleFU left a comment:
+1
I'll wait a few days for others to review it after the holiday.
    }
    int values_to_read =
        std::count(def_levels, def_levels + num_def_levels, this->max_def_level_);
    total_values = this->ReadValues(values_to_read, values);
I'm OK with this, but maybe this isn't where the optimization comes from: values_to_read is a local value here, so the compiler might well optimize it already (strict aliasing means an int16_t* and an int64_t* cannot alias).
Could you check the build failure on AppVeyor? https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/48852924#L1696
Aha, I think def_level is int16_t, but we actually need to increment an int64_t...
@ursabot please benchmark
Benchmark runs are scheduled for commit 23e7c4e. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.
Thanks for your patience. Conbench analyzed the 6 benchmarking runs that have been run so far on PR commit 23e7c4e. There were 3 benchmark results indicating a performance regression:
The full Conbench report has more details.
diff --git a/cpp/src/parquet/column_reader.cc b/cpp/src/parquet/column_reader.cc
index a49e58afb..aa11dc57e 100644
--- a/cpp/src/parquet/column_reader.cc
+++ b/cpp/src/parquet/column_reader.cc
@@ -1059,10 +1059,11 @@ class TypedColumnReaderImpl : public TypedColumnReader<DType>,
// If the field is required and non-repeated, there are no definition levels
if (this->max_def_level_ > 0 && def_levels != nullptr) {
- *num_def_levels = this->ReadDefinitionLevels(batch_size, def_levels);
+ int64_t num_def_levels_value = this->ReadDefinitionLevels(batch_size, def_levels);
// TODO(wesm): this tallying of values-to-decode can be performed with better
// cache-efficiency if fused with the level decoding.
- for (int64_t i = 0; i < *num_def_levels; ++i) {
+ *num_def_levels = num_def_levels_value;
+ for (int64_t i = 0; i < num_def_levels_value; ++i) {
if (def_levels[i] == this->max_def_level_) {
++(*values_to_read);
}
@@ -1909,7 +1910,8 @@ class TypedRecordReader : public TypedColumnReaderImpl<DType>,
// When reading dense we need to figure out number of values to read.
const int16_t* def_levels = this->def_levels();
- for (int64_t i = start_levels_position; i < levels_position_; ++i) {
+ int64_t levels_position = levels_position_;
+ for (int64_t i = start_levels_position; i < levels_position; ++i) {
if (def_levels[i] == this->max_def_level_) {
++(*values_to_read);
        }

I think code like this might be minimal, and I don't know if it would be OK in this case... Also, I don't know whether the regression is related; I'll wait some days. This patch runs faster on my macOS.
Does your code transform into AVX? Because you still write to *values_to_read.
Emm, since this is not a reduction, and the compiler finds there is no strict aliasing problem, I think it might not matter, but it needs further testing. We can wait for pitrou to come back from holiday and take a look.
It might be a benchmark coverage issue. @Hattonuri if this is showing improvements on your own benchmarks, it might make sense to add more parquet benchmarks in a separate PR so that regressions aren't introduced afterwards.
My benchmark is just a reader based on the low-level Parquet API, reading a single dataset prefix. As it's written using static reflection, it does not take much time. I then look at the perf report and inspect the asm of the long-running functions.
Maybe we can draft a read-level benchmark in parquet-column-reader-benchmark.
I think turning it into a real benchmark that we can monitor for regressions would be best; without that there is a higher risk the change gets reverted.
This sounds like a reasonable approach. Having benchmarks at different levels, for different data types, makes a lot of sense to me.
Hmmm @emkornfield
Maybe we should make ReadLevels public/visible for testing? Or at least, for posterity, can we post the before/after of the benchmark?
I'll draft a PR for the benchmark only and post the data there.
@emkornfield I've created a draft PR: #39486. On my M1 macOS with Release (-O2): Origin: New: Also cc @pitrou because you're the expert here.
This looks good to me; I will wait for other committers to approve, then merge it :-) I might go for abseil for benchmarking in #39403, since its caller is invoked much more than the code in this patch.
This benchmark shows a massive improvement. Thank you @Hattonuri!
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 3af69c8. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.
…els (apache#39397)

* Closes: apache#39398

Authored-by: Dmitry Stasenko <dmitry.stasenko@pinely.com>
Signed-off-by: mwish <maplewish117@gmail.com>
Rationale for this change
I've found that the for-loop here
arrow/cpp/src/parquet/column_reader.cc
Lines 1055 to 1073 in 7c3480e
transforms into
0xc0c2f0 <ReadLevels()+96> inc %rdx
0xc0c2f3 <ReadLevels()+99> cmp %rax,%rdx
0xc0c2f6 <ReadLevels()+102> jge 0xc0c30c <ReadLevels()+124>
0xc0c2f8 <ReadLevels()+104> cmp %cx,(%r14,%rdx,2)
0xc0c2fd <ReadLevels()+109> jne 0xc0c2f0 <ReadLevels()+96>
0xc0c2ff <ReadLevels()+111> incq 0x0(%rbp)
0xc0c303 <ReadLevels()+115> mov (%rbx),%rax
0xc0c306 <ReadLevels()+118> jmp 0xc0c2f0 <ReadLevels()+96>
That means it iterates element by element and increments through a reference with incq.
I think the reason is that values_to_read and num_def_levels are not marked restrict, so the compiler cannot optimize this into a more efficient form (for example, using SIMD).
On my flamegraph this part accounted for ~10% of the time spent.
In this file there are also some for-loops that could easily be changed to std::count, but they do not touch references, and I don't know why std::count was not used throughout the cpp/src/parquet/ directory, so I didn't change much.
What changes are included in this PR?
Using std::count in parquet/column_reader.cc to avoid the loop not being optimized.

Are these changes tested?

They are tested with the unit tests but not benchmarked, because I don't know which benchmark would show the performance gain here.
Are there any user-facing changes?