
GH-39402: [C++] bit_util TrailingBits faster#39403

Closed
Hattonuri wants to merge 2 commits into apache:main from Hattonuri:faster_trailing_bits

Conversation

@Hattonuri
Contributor

@Hattonuri Hattonuri commented Jan 1, 2024

Rationale for this change

The TrailingBits operation is called on every read operation in Parquet files, and it takes a significant amount of time when reading levels:
[flamegraph image]

My change implements the same functionality but faster

What changes are included in this PR?

Are these changes tested?

https://quick-bench.com/q/3IbTOnH4rShshgE7pwcX6dCbJuY

https://godbolt.org/z/K6YToW7x3

I also tested the same for-loop (but with a higher upper limit), checking for equality.

Are there any user-facing changes?

@github-actions

github-actions bot commented Jan 1, 2024

⚠️ GitHub issue #39402 has been automatically assigned in GitHub to PR creator.

Member

@mapleFU mapleFU left a comment


Thanks @Hattonuri, and I wish you a happy new year!

From your godbolt link, it seems the compiler can emit bzhi for the current implementation, but not for your version. The generated code differs as follows.

The current:

        mov     rax, rdi
        bzhi    rax, rax, rsi
        ret

Your impl:

        shlx    rax, rax, rsi
        not     rax
        and     rax, rdi
        ret

I'm not an expert on this. Would it be faster in some cases? And maybe we should also test this on ARM @cyb70289


// Returns the 'num_bits' least-significant bits of 'v'.
static inline uint64_t TrailingBits(uint64_t v, int num_bits) {
if (ARROW_PREDICT_FALSE(num_bits == 0)) return 0;
Member


(So generally, the num_bits == 0 check is not closely related to the optimization?)

Contributor Author

@Hattonuri Hattonuri Jan 1, 2024


I think it is related, because ((v >> 0) << 0) ^ v == v ^ v == 0

Contributor Author


And I think that is the main reason for the performance increase.

Member

@mapleFU mapleFU Jan 1, 2024


And I think that is the main reason for the performance increase.

As I benchmarked: without -march=skylake, removing the ARROW_PREDICT_FALSE(num_bits == 0) check does not make it faster or slower (quick-bench seems to run faster because shr is slow?). And as mentioned in #39403 (comment), it might affect instruction generation later. Would you mind testing that?

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 1, 2024
@mapleFU
Member

mapleFU commented Jan 1, 2024

I found similar logic in snappy: https://github.com/google/snappy/blob/main/snappy.cc#L1008

Should we do a similar thing? (Also cc @pitrou because this might be bmi2-related?)

@pitrou
Member

pitrou commented Jan 1, 2024

Please, can you post actual benchmarks of reading Parquet files? Micro-benchmarks of a tiny helper function are not that interesting.

@pitrou
Member

pitrou commented Jan 1, 2024

@ursabot please benchmark

@ursabot

ursabot commented Jan 1, 2024

Benchmark runs are scheduled for commit 7f736fb. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@mapleFU
Member

mapleFU commented Jan 1, 2024

I found that with most compilers, TrailingBits3 compiles to exactly the same code as TrailingBits2.

uint64_t TrailingBits2(uint64_t v, int num_bits) {
  if (__builtin_expect(num_bits >= 64, 0)) return v;
  return ((v >> num_bits) << num_bits) ^ v;
}

uint64_t TrailingBits3(uint64_t v, int num_bits) {
  if (__builtin_expect(num_bits >= 64, 0))
    return v;
  uint64_t mask = 0xffffffffffffffff;
  return v & ~(mask << num_bits);
}

With -O3/-O2 and without -march=skylake, the generated code is as in https://quick-bench.com/q/3IbTOnH4rShshgE7pwcX6dCbJuY :

Current:

        neg     cl
        shl     rax, cl
        shr     rax, cl

after:

        not     rdx
        mov     rax, rdx
        and     rax, rdi

Would you mind testing which is faster on a CPU with avx2 and bmi2 enabled? @Hattonuri

After trying them, they generate the same code with clang 17.0.1 and the arguments -std=c++20 -O3 -march=skylake. The differences are listed here: #39403 (review) . The current code might use bzhi because it matches rule (d) here: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp#L3710-L3716

@mapleFU
Member

mapleFU commented Jan 1, 2024

Oh I think I've found the reason...

uint64_t TrailingBits2(uint64_t v, int num_bits) {
  if (__builtin_expect(num_bits == 0, 0)) return 0;
  if (__builtin_expect(num_bits >= 64, 0)) return v;
  return ((v >> num_bits) << num_bits) ^ v;
}

also generates bzhi. It seems we need to rethink the problem here: #39403 (comment)

@Hattonuri
Contributor Author

Hattonuri commented Jan 1, 2024

About benchmarks

I tested my program, in which the parquet library takes ~60% of the time.
The data is stored on a RAM fs to exclude any disk effects, and I did some warm-up runs beforehand to exclude things like "the binary code is warm in the cache".
The first set of runs is with my PR, the second is without it.
We can see that on average we get -1 sec in user time, which is a 2% performance increase,
and 2% / 0.6 ≈ 3.3% performance increase within the parquet library.

dstasenko@bench-prod15:~$ for i in `seq 5`; do echo $i; time ./parquet_playground ; done
1

real 0m46.382s
user 0m46.185s
sys 0m0.180s
2

real 0m46.416s
user 0m46.120s
sys 0m0.276s
3

real 0m46.711s
user 0m46.460s
sys 0m0.232s
4

real 0m46.406s
user 0m46.185s
sys 0m0.200s
5

real 0m46.369s
user 0m46.150s
sys 0m0.200s
dstasenko@bench-prod15:~$ for i in `seq 5`; do echo $i; time ./parquet_playground2 ; done
1

real 0m47.161s
user 0m46.967s
sys 0m0.176s
2

real 0m47.344s
user 0m47.083s
sys 0m0.244s
3

real 0m47.233s
user 0m47.009s
sys 0m0.204s
4

real 0m47.300s
user 0m47.060s
sys 0m0.240s
5

real 0m48.072s
user 0m47.109s
sys 0m0.944s

@Hattonuri
Contributor Author

I tested the variant with mask 0xffffffffffffffff and saw no difference from mine.
But I saw a strange thing: on raptorlake, none of the 3 variants is transformed into bzhi, even though the processor supports both avx2 and bmi2.
And I don't have a skylake benchmark :(
I test on a 13th Gen Intel Core i9.

@mapleFU
Member

mapleFU commented Jan 1, 2024

What are your compiler and instruction set like? Maybe I can do some testing tomorrow. I'm mentioning this because I'm afraid this change might make things even slower for some vendors...

@Hattonuri
Contributor Author

Hattonuri commented Jan 1, 2024

I use /usr/bin/twix-clang++-17 -march=raptorlake -O3 -g -fno-omit-frame-pointer -std=c++2a with jemalloc and libstdc++

And this is transformed into instructions like this (but on master it is transformed into two jumps, because of the two ifs):
[disassembly screenshot]

@Hattonuri
Contributor Author

I also tried to remove the inline keyword, but nothing changed.

@mapleFU
Member

mapleFU commented Jan 1, 2024

Your options are the same as here: #39403 (comment) . That case was shown to optimize.

I'll try using AVX2 tomorrow.

I also tried to remove the inline keyword, but nothing changed.

Emmm, you can force it not to inline if you like...

@mapleFU
Member

mapleFU commented Jan 1, 2024

And this is transformed into instructions like this
But on master it is transformed into two jumps (because of the two ifs)

Would you mind adding the if (ARROW_PREDICT_FALSE(...)) back and testing?

@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 6 benchmarking runs that have been run so far on PR commit 7f736fb.

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details.

@mapleFU
Member

mapleFU commented Jan 2, 2024

Aha, on some machines the performance even got worse...

@mapleFU
Member

mapleFU commented Jan 2, 2024

static inline uint64_t TrailingBits(uint64_t v, int num_bits) {
  if (ARROW_PREDICT_FALSE(num_bits == 0)) return 0;
  if (ARROW_PREDICT_FALSE(num_bits >= 64)) return v;
  return ((v >> num_bits) << num_bits) ^ v;
}

1. Would you mind changing to the code above, so I can rerun a benchmark?
2. Maybe I should wait for Yibo's idea on an ARM machine

if (ARROW_PREDICT_FALSE(num_bits >= 64)) return v;
int n = 64 - num_bits;
return (v << n) >> n;
return ((v >> num_bits) << num_bits) ^ v;
Contributor


What about return v & ~(-1ULL << num_bits); ?
It enables gcc to optimize the code with bmi2's bzhi:
https://godbolt.org/z/oq9zx4nhf
And it looks slightly faster even without bmi2:
https://quick-bench.com/q/rgBQzUFls9IP48xm3JiRPtJoI_M

Member

@mapleFU mapleFU Jan 2, 2024


Aha, I've tried clang 17.0.1 and it doesn't generate bzhi; compilers are so tricky...

Link: https://godbolt.org/z/h83oxvoWj

@cyb70289
Contributor

cyb70289 commented Jan 2, 2024

2. Maybe I should wait for Yibo's idea on ARM machine

I will try on Arm when I have time. This change looks reasonable to me.

@Hattonuri
Contributor Author

What do you think about changing the if on 64 bits to an assertion?

@cyb70289
Contributor

cyb70289 commented Jan 3, 2024

What do you think about changing the if on 64 bits to an assertion?

This changes the code behaviour. I don't think we can do it.

@mapleFU
Member

mapleFU commented Jan 3, 2024

What do you think about changing the if on 64 bits to an assertion?

You remind me that it might be because of port scheduling. The check might use the same ports as BMI... So in clang they might generate other instructions...

@cyb70289
Contributor

cyb70289 commented Jan 3, 2024

We can run the llvm-mca tool on godbolt. It looks like the non-bzhi code might be better (higher IPC, etc.):
https://godbolt.org/z/Kh1PascMs

@Hattonuri
Contributor Author

Hattonuri commented Jan 3, 2024

By the way, with gcc on llvm-mca https://godbolt.org/z/MKrEsdPxd my variant shows the best "total cycles" and the best "IPC" score 🤔

@Hattonuri
Contributor Author

As I understand it, IPC is higher, but we need to compare the "instructions" field rather than IPC alone, because instructions = IPC × total cycles, and for the two compiled representations of the snappy code that product is the same. So the difference is only in instruction intensiveness.

But why my implementation, in the non-optimized variant, has fewer total cycles remains unclear...

@mapleFU
Member

mapleFU commented Jan 4, 2024

https://www.agner.org/optimize/instruction_tables.pdf
Some instructions might be expensive

@cyb70289
Contributor

cyb70289 commented Jan 4, 2024

Coming back to this PR: does any Arrow benchmark improve after this change? Are the two regressions related?

@mapleFU
Member

mapleFU commented Jan 4, 2024

(Perhaps not, and per the post https://abseil.io/fast/39 , maybe we should benchmark the callers of this function)

@Hattonuri
Contributor Author

https://www.agner.org/optimize/instruction_tables.pdf
Some instructions might be expensive

I compared total cycles, not total instructions

@mapleFU
Member

mapleFU commented Jan 4, 2024

Perhaps it does a count for each ret :-(

@Hattonuri
Contributor Author

Can we merge this? :)

@mapleFU
Member

mapleFU commented Jan 11, 2024

I'm very busy these days; maybe you can draft a micro-benchmark.

Also cc @pitrou

@Hattonuri
Contributor Author

I think your ReadLevels benchmark should work fine, because ReadLevels is the function that uses TrailingBits the most on the flamegraph in this PR.

@pitrou
Member

pitrou commented Jan 11, 2024

@Hattonuri Could you rebase on git main?

@mapleFU
Member

mapleFU commented Feb 5, 2024

#39705

I've merged a PR about ReadLevels. This patch didn't change the benchmark results on my M1 Mac. I'll try it on x86 tomorrow. Would you mind rebasing it?

cc @Hattonuri @pitrou

@Hattonuri
Contributor Author

Sorry, I forgot about the first ping :(

@pitrou
Member

pitrou commented Feb 7, 2024

This PR decreases performance here (AMD Ryzen 9 3900X, gcc 12.3.0):

  • before:
--------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                  Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------------
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1              2325 ns         2326 ns       298740 bytes_per_second=6.48372Gi/s items_per_second=3.48092G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7              8060 ns         8059 ns        87115 bytes_per_second=1.87109Gi/s items_per_second=1.00453G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024            668 ns          670 ns      1046591 bytes_per_second=22.5109Gi/s items_per_second=12.0854G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1              2122 ns         2123 ns       330590 bytes_per_second=7.10169Gi/s items_per_second=3.81269G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1              1948 ns         1949 ns       356752 bytes_per_second=7.7368Gi/s items_per_second=4.15367G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1              1777 ns         1778 ns       395234 bytes_per_second=8.48193Gi/s items_per_second=4.5537G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7              7568 ns         7568 ns        91365 bytes_per_second=1.99251Gi/s items_per_second=1.06972G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          1252 ns         1257 ns       560376 bytes_per_second=11.9933Gi/s items_per_second=6.43886G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          1257 ns         1263 ns       556189 bytes_per_second=11.9407Gi/s items_per_second=6.41061G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024       1316 ns         1322 ns       550602 bytes_per_second=11.4078Gi/s items_per_second=6.12452G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          1300 ns         1306 ns       517873 bytes_per_second=11.5467Gi/s items_per_second=6.19908G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          1312 ns         1318 ns       456593 bytes_per_second=11.4407Gi/s items_per_second=6.14218G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          1250 ns         1256 ns       562519 bytes_per_second=12.0082Gi/s items_per_second=6.44683G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          1301 ns         1307 ns       519667 bytes_per_second=11.5369Gi/s items_per_second=6.19383G/s
  • after:
--------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                  Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------------
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1              2563 ns         2566 ns       273640 bytes_per_second=5.87793Gi/s items_per_second=3.15569G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7              8134 ns         8135 ns        85865 bytes_per_second=1.85376Gi/s items_per_second=995.23M/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024            725 ns          726 ns       956867 bytes_per_second=20.7575Gi/s items_per_second=11.1441G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1              2304 ns         2306 ns       302796 bytes_per_second=6.53916Gi/s items_per_second=3.51068G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1              2087 ns         2089 ns       334483 bytes_per_second=7.21913Gi/s items_per_second=3.87574G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1              1849 ns         1851 ns       377499 bytes_per_second=8.14629Gi/s items_per_second=4.37351G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7              8214 ns         8216 ns        85257 bytes_per_second=1.83534Gi/s items_per_second=985.339M/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          1370 ns         1372 ns       508348 bytes_per_second=10.9904Gi/s items_per_second=5.90044G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          1366 ns         1368 ns       507420 bytes_per_second=11.024Gi/s items_per_second=5.91848G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024       1367 ns         1369 ns       510455 bytes_per_second=11.0135Gi/s items_per_second=5.91285G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          1377 ns         1379 ns       507504 bytes_per_second=10.9347Gi/s items_per_second=5.87051G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          1367 ns         1369 ns       511657 bytes_per_second=11.0171Gi/s items_per_second=5.91476G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          1332 ns         1333 ns       524574 bytes_per_second=11.3119Gi/s items_per_second=6.07305G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          1373 ns         1374 ns       510605 bytes_per_second=10.9713Gi/s items_per_second=5.89016G/s

@pitrou
Member

pitrou commented Feb 7, 2024

However, it seems that another micro-benchmark becomes slightly faster:

  • before:
---------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------
BM_DefinitionLevelsToBitmapRepeatedAllMissing        2394 ns         2394 ns       289688 bytes_per_second=815.98Mi/s
BM_DefinitionLevelsToBitmapRepeatedAllPresent        4292 ns         4291 ns       163130 bytes_per_second=455.15Mi/s
BM_DefinitionLevelsToBitmapRepeatedMostPresent       4191 ns         4191 ns       167253 bytes_per_second=466.054Mi/s
  • after:
---------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------
BM_DefinitionLevelsToBitmapRepeatedAllMissing        2312 ns         2311 ns       303845 bytes_per_second=845.034Mi/s
BM_DefinitionLevelsToBitmapRepeatedAllPresent        4110 ns         4109 ns       170384 bytes_per_second=475.335Mi/s
BM_DefinitionLevelsToBitmapRepeatedMostPresent       4049 ns         4048 ns       173085 bytes_per_second=482.458Mi/s

@pitrou
Member

pitrou commented Feb 7, 2024

In both cases, the difference is rather minor (up to 10% on micro-benchmarks).

@pitrou
Member

pitrou commented Feb 7, 2024

@ursabot please benchmark

@ursabot

ursabot commented Feb 7, 2024

Benchmark runs are scheduled for commit 4722067. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@mapleFU
Member

mapleFU commented Feb 7, 2024

void DefLevelsToBitmap(const int16_t* def_levels, int64_t num_def_levels,
                       LevelInfo level_info, ValidityBitmapInputOutput* output) {
  // It is simpler to rely on rep_level here until PARQUET-1899 is done and the code
  // is deleted in a follow-up release.
  if (level_info.rep_level > 0) {
#if defined(ARROW_HAVE_RUNTIME_BMI2)
    if (CpuInfo::GetInstance()->HasEfficientBmi2()) {
      return DefLevelsToBitmapBmi2WithRepeatedParent(def_levels, num_def_levels,
                                                     level_info, output);
    }
#endif
    standard::DefLevelsToBitmapSimd</*has_repeated_parent=*/true>(
        def_levels, num_def_levels, level_info, output);
  } else {
    standard::DefLevelsToBitmapSimd</*has_repeated_parent=*/false>(
        def_levels, num_def_levels, level_info, output);
  }
}

@pitrou is bmi2 enabled in your DefinitionLevelsToBitmapRepeated benchmark?

@pitrou
Member

pitrou commented Feb 7, 2024

On my CPU, it shouldn't, no.

@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 5 benchmarking runs that have been run so far on PR commit 4722067.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

@github-actions

Thank you for your contribution. Unfortunately, this pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you do not have repository permissions to reopen the PR, please tag a maintainer.

@github-actions github-actions bot added the Status: stale-warning Issues and PRs flagged as stale which are due to be closed if no indication otherwise label Nov 18, 2025
@github-actions github-actions bot closed this Jan 1, 2026

Labels

awaiting committer review Awaiting committer review Component: C++ Status: stale-warning Issues and PRs flagged as stale which are due to be closed if no indication otherwise

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[C++] bit_util TrailingBits can be made much faster

5 participants