GH-39402: [C++] bit_util TrailingBits faster #39403
Hattonuri wants to merge 2 commits into apache:main
Conversation
Thanks @Hattonuri, and I wish you a happy new year!

From your godbolt output, it seems the current implementation might enable bzhi, while the second one does not.
The current:

```asm
mov rax, rdi
bzhi rax, rax, rsi
ret
```

Your impl:

```asm
shlx rax, rax, rsi
not rax
and rax, rdi
ret
```
I'm not an expert on this. Would it be faster in some cases? And maybe we should also test this on ARM. @cyb70289
```cpp
// Returns the 'num_bits' least-significant bits of 'v'.
static inline uint64_t TrailingBits(uint64_t v, int num_bits) {
  if (ARROW_PREDICT_FALSE(num_bits == 0)) return 0;
```
So generally, the num_bits == 0 check is not strongly related to the optimization?
I think it is related, because ((v >> 0) << 0) ^ v == v ^ v == 0, and I think that is the main reason for the performance increase.
> and i think that is the main reason for performance increase

As I benchmarked, without -march=skylake, removing the ARROW_PREDICT_FALSE(num_bits == 0) check makes it neither faster nor slower (quick-bench seems to run faster because shr is slow?). And as mentioned in #39403 (comment), it might affect the code generation later. Would you mind testing that?
I found similar logic in snappy: https://github.com/google/snappy/blob/main/snappy.cc#L1008. Should we do the same thing? (Also cc @pitrou, because this might be bmi2-related?)
Please, can you post actual benchmarks of reading Parquet files? Micro-benchmarks of a tiny helper function are not that interesting.
@ursabot please benchmark
Benchmark runs are scheduled for commit 7f736fb. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.
I found this with most compilers… With current: … After: … Would you mind testing which is faster on a CPU with avx2 and bmi2 enabled? @Hattonuri After I tried them, they generate the same code with clang 17.0.1 and the argument …
Oh, I think I've found the reason...

```cpp
uint64_t TrailingBits2(uint64_t v, int num_bits) {
  if (__builtin_expect(num_bits == 0, 0)) return 0;
  if (__builtin_expect(num_bits >= 64, 0)) return v;
  return ((v >> num_bits) << num_bits) ^ v;
}
```

also generates …
About benchmarks: I tested my program, in which the parquet library takes ~60% of the time.

```
dstasenko@bench-prod15:~$ for i in …
real 0m46.382s
real 0m46.416s
real 0m46.711s
real 0m46.406s
real 0m46.369s
real 0m47.161s
real 0m47.344s
real 0m47.233s
real 0m47.300s
real 0m48.072s
```
I tested the variant with the 0xffffffffffffffff mask and saw no difference from mine.
What do your compiler flags and the generated instructions look like? Maybe I can do some testing tomorrow. I'm mentioning this because I'm afraid this change might make some vendored builds even slower...
I also tried removing the inline keyword, but nothing changed.
Your options are the same as here: #39403 (comment). This is proven to optimize. I'll try using AVX2 tomorrow.

Emmm, you can force non-inline if you like...

Would you mind adding the …
Thanks for your patience. Conbench analyzed the 6 benchmarking runs that have been run so far on PR commit 7f736fb. There were 2 benchmark results indicating a performance regression. The full Conbench report has more details.
Aha, on some machines the performance even got worse...
```cpp
static inline uint64_t TrailingBits(uint64_t v, int num_bits) {
  if (ARROW_PREDICT_FALSE(num_bits == 0)) return 0;
  if (ARROW_PREDICT_FALSE(num_bits >= 64)) return v;
  return ((v >> num_bits) << num_bits) ^ v;
}
```
```diff
   if (ARROW_PREDICT_FALSE(num_bits >= 64)) return v;
-  int n = 64 - num_bits;
-  return (v << n) >> n;
+  return ((v >> num_bits) << num_bits) ^ v;
```
What about `return v & ~(-1ULL << num_bits);`? It enables gcc to optimize the code with bmi2 bzhi: https://godbolt.org/z/oq9zx4nhf

And it looks slightly faster even without bmi2: https://quick-bench.com/q/rgBQzUFls9IP48xm3JiRPtJoI_M
Aha, I've tried on clang 17.0.1 and it doesn't generate bzhi; compilers are so tricky...

I will try on Arm when I have time. This change looks reasonable to me.
What do you think about changing the check against 64 bits to an assertion?
This changes the code behaviour. I don't think we can do it.
You remind me that this might be because of port scheduling. The check might use the same ports as BMI... so in clang they might generate different instructions...
We can run the llvm-mca tool on godbolt. It looks like the non-bzhi code might be better (higher IPC, etc.).
By the way, with gcc under llvm-mca (https://godbolt.org/z/MKrEsdPxd) my variant shows the best "total cycles" and the best "IPC" score 🤔
As I understand it, IPC is higher, but we need to compare the "instructions" field: instructions = IPC * total cycles, and for the two compiled representations of snappy that product is the same, so the difference is only in instruction intensiveness. But why my implementation has fewer total cycles even in the non-optimized build remains a mystery...
https://www.agner.org/optimize/instruction_tables.pdf
Coming back to this PR: does any Arrow benchmark improve after this change? Are the two regressions related?
(Perhaps not; and per the post https://abseil.io/fast/39, maybe we should benchmark the callers of this function.)
I compared total cycles, not total instructions.

Perhaps for each …
Can we merge this? :)
I'm very busy these few days; maybe you can draft a micro-benchmark. Also cc @pitrou
I think your ReadLevels benchmark should work fine, because ReadLevels is the function that uses TrailingBits the most in the flamegraph in this PR.
@Hattonuri Could you rebase on git main?
I've merged a PR about ReadLevels. This patch didn't change the benchmark result on my M1 Mac. I'll try it on x86 tomorrow. Would you mind rebasing it?
Sorry, I forgot about the first ping :(
This PR decreases performance here (AMD Ryzen 9 3900X, gcc 12.3.0): …
However, it seems that another micro-benchmark becomes slightly faster: …
In both cases, the difference is rather minor (up to 10% on micro-benchmarks).
@ursabot please benchmark |
|
Benchmark runs are scheduled for commit 4722067. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.
```cpp
void DefLevelsToBitmap(const int16_t* def_levels, int64_t num_def_levels,
                       LevelInfo level_info, ValidityBitmapInputOutput* output) {
  // It is simpler to rely on rep_level here until PARQUET-1899 is done and the code
  // is deleted in a follow-up release.
  if (level_info.rep_level > 0) {
#if defined(ARROW_HAVE_RUNTIME_BMI2)
    if (CpuInfo::GetInstance()->HasEfficientBmi2()) {
      return DefLevelsToBitmapBmi2WithRepeatedParent(def_levels, num_def_levels,
                                                     level_info, output);
    }
#endif
    standard::DefLevelsToBitmapSimd</*has_repeated_parent=*/true>(
        def_levels, num_def_levels, level_info, output);
  } else {
    standard::DefLevelsToBitmapSimd</*has_repeated_parent=*/false>(
        def_levels, num_def_levels, level_info, output);
  }
}
```

@pitrou, is bmi2 enabled in your …
On my CPU, it shouldn't be, no.
Thanks for your patience. Conbench analyzed the 5 benchmarking runs that have been run so far on PR commit 4722067. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer.

Rationale for this change

The TrailingBits operation is called on every read operation for Parquet files, and it takes a significant amount of the time spent reading levels.

My change implements the same functionality but faster
What changes are included in this PR?
Are these changes tested?
https://quick-bench.com/q/3IbTOnH4rShshgE7pwcX6dCbJuY
https://godbolt.org/z/K6YToW7x3
I also tested the same for-loop (but with a higher upper limit), checking for equality.
Are there any user-facing changes?