Optimize CRC16 using multi-byte LUT by Cuda-Chen · Pull Request #2790 · valkey-io/valkey

Cuda-Chen · 2025-10-31T12:09:29Z

Optimize CRC16 using multi-byte LUT.

See also the discussion in the previous attempt: #2691

Cuda-Chen · 2025-10-31T12:16:42Z

I would like to share some benchmark results of this PR.
If we are going to merge this PR, at least two tasks remain to be finished:

jump-table for reduction
code tidy (e.g., using macro)

Environment

CPU: Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
OS: Linux 5.15.0-157-generic
Every "N-byte reduction" variant steps down from the target width to 1 byte (e.g., 8-byte: 8 -> 4 -> 2 -> 1)

Benchmark Results

baseline (commit `f54818cc60597e9fe5dc03a52fd39ab944cd4932` on `unstable` branch)

# -t get,set
SET: 292902.97 requests per second, p50=1.431 msec                    
GET: 350557.41 requests per second, p50=1.199 msec

# -t get
GET: 454277.00 requests per second, p50=0.975 msec

2-byte reduction

# -t set,get
SET: 321502.06 requests per second, p50=1.351 msec                    
GET: 387236.69 requests per second, p50=1.151 msec

# -t get
GET: 453309.16 requests per second, p50=0.959 msec

4-byte reduction

# -t set,get
SET: 324191.12 requests per second, p50=1.335 msec                    
GET: 376364.31 requests per second, p50=1.135 msec

# -t get
GET: 470167.88 requests per second, p50=0.943 msec

8-byte reduction

# -t set,get
SET: 319550.06 requests per second, p50=1.375 msec                    
GET: 384349.31 requests per second, p50=1.143 msec

# -t get
GET: 469329.34 requests per second, p50=0.959 msec

16-byte reduction

# -t set,get
SET: 300210.12 requests per second, p50=1.431 msec                    
GET: 391880.25 requests per second, p50=1.143 msec 

# -t get
GET: 443066.03 requests per second, p50=0.943 msec

zuiderkwast · 2025-11-06T17:09:33Z

Interesting! But with a larger LUT, when it is loaded into L1 cache, other data that might be needed later is evicted, so whether it's actually faster may depend on the use case.

How much memory would the 2-byte reduction variant use? The current draft is for the 16-byte reduction, right?

Cuda-Chen · 2025-11-07T03:36:25Z

Hi @zuiderkwast ,

How much memory would the 2-byte reduction variant use?

It will use 1 KB of memory for the LUT (256 entries * 2 bytes per entry * 2 tables, one for each of the two bytes looked up at the same time = 1024 bytes = 1 KB).
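The arithmetic above can be written down as a tiny helper. This is only a sketch: `crc16_lut_bytes` is a hypothetical name, and it assumes that every wider variant keeps the same layout, one 256-entry table of 16-bit values per input byte consumed per step.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes of lookup table needed when `bytes_per_step` input bytes are
 * consumed per loop iteration, assuming one 256-entry table of 16-bit
 * entries per input byte (an assumption for the wider variants). */
size_t crc16_lut_bytes(size_t bytes_per_step) {
    return 256 * sizeof(uint16_t) * bytes_per_step;
}
```

Under that assumption, the 1-byte table is 512 bytes, the 2-byte variant is 1024 bytes, and a 16-byte variant would be 8192 bytes.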

The current draft is for the 16-byte reduction, right?

Yes, the current draft uses 16-byte reduction.

zuiderkwast · 2026-02-24T16:14:24Z

Thanks for keeping this up to date.

Which key length did you use for the benchmarking? Do you still have the valkey-benchmark command line you used?

the current draft uses 16-byte reduction

Can you change it to use only 2-byte LUT? Using 4-byte or more didn't seem to gain more and it may instead evict other data from the L1 cache that we'd rather keep in there. Preferably do it as an additional commit so the full 16-byte LUT implementation is still visible in the history of the PR.

Then, I hope we can find someone more to benchmark this with realistic traffic. We should also make sure we don't get a performance regression for very short keys, like 3 byte keys.

The key length varies for different users and there's also the possibility of using tags with curly braces within the keys. For example in a key named like "user:{123abc}:bla-bla-bla:some:stuff", we do CRC16 only on the part within curly braces, in this case "123abc", and when this style is used, the data for CRC16 is typically quite short.
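To make the curly-brace rule concrete, here is a minimal sketch of how the CRC16 input could be selected from a key. `hash_input_span` is a hypothetical helper written for this illustration, not the function in the Valkey source. It assumes the usual rule: the text between the first `{` and the next `}` is used if it is non-empty, otherwise the whole key is used.

```c
#include <stddef.h>

/* Finds the part of `key` that is fed to CRC16: the text between the
 * first '{' and the following '}' if that text is non-empty, otherwise
 * the whole key. Results are returned through `start` and `len`. */
void hash_input_span(const char *key, size_t keylen, size_t *start, size_t *len) {
    size_t open, close;

    *start = 0;
    *len = keylen;

    for (open = 0; open < keylen; open++)
        if (key[open] == '{') break;
    if (open == keylen) return; /* no '{': hash the whole key */

    for (close = open + 1; close < keylen; close++)
        if (key[close] == '}') break;
    if (close == keylen || close == open + 1) return; /* no '}' or empty tag */

    *start = open + 1;
    *len = close - open - 1;
}
```

For the key in the example above, the selected span is "123abc", so only 6 bytes go through CRC16 even though the key is 36 bytes long.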

Cuda-Chen · 2026-02-25T02:30:04Z

Hi @zuiderkwast ,

Do you still have the valkey-benchmark command line you used?

Yes. I use the commands provided in this PR's comments. For clarity, I repost them here:

$ cd src
$ rm dump.rdb nodes.conf # clean up any old data files (if they exist)
$ ./valkey-server --cluster-enabled yes --save '' &
(...)
2622423:M 07 Oct 2025 17:48:58.872 * Server initialized
2622423:M 07 Oct 2025 17:48:58.873 * Ready to accept connections tcp

$ ./valkey-cli cluster addslotsrange 0 16383
OK
$ ./valkey-cli cluster info | head -3
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384

$ ./valkey-benchmark --threads 3 -P 10 -n 10000000 -r 1000000 -t set,get -q

# clear the existing keys before the next benchmark run
$ ./valkey-cli flushall

Which key length did you use for the benchmarking?

I admit I have no idea. However, the key length should be whatever the set and get tests of valkey-benchmark use.

Can you change it to use only 2-byte LUT?

I will change my code to use only the 2-byte LUT in later commits.
What's more, I will run the benchmarks again, as my testing environment recently got a system upgrade.

Cuda-Chen · 2026-02-25T08:45:53Z

Hi @zuiderkwast ,

One more thing:

Preferably do it as an additional commit so the full 16-byte LUT implementation is still visible in the history of the PR.

I guess you mean:

# current commit
commit A: <description of multi-byte LUT>

# change into two commits
commit A: <description of 2-byte LUT>
commit B: <description of 4-byte and above LUT>

If I am wrong, just let me know.

zuiderkwast · 2026-02-25T12:41:11Z

Which key length did you use for the benchmarking?

I admit I have no idea. However, the key length should be whatever the set and get tests of valkey-benchmark use.

Right, I remember you first used --cluster and then I suggested you run without --cluster to make valkey-benchmark construct the keys differently. This is how the keys look in the get and set tests:

| | Without `--cluster` | With `--cluster` |
|---|---|---|
| Example | `key:000000000000` | `key{06S}:000000000000` |
| Key length | 16 | 21 |
| CRC16 input | 16 | 3 |

The number is zero-padded and varies with -r. The thing inside curly braces varies too but it's 3 bytes in general. Only that part in curly braces is passed to CRC16 if it exists.

I think both of these are realistic key lengths and patterns. 16 and 3 bytes are both good to test.

zuiderkwast · 2026-02-25T12:44:08Z

If I am wrong, just let me know.

Yeah, I meant just like you've done it, i.e.

"Optimize CRC16 using multi-byte consumption"
"Disable 4-byte and above LUT"
(later) Cleanup, delete unused code and tables

zuiderkwast · 2026-02-25T12:55:43Z

@Cuda-Chen Do you want to profile it? Using perf or similar, for example generate a flamegraph so we can see how much the server spends in the crc16 function.

IMO, only -t get would be enough to run. The server has no keys, so no data will be returned. This is the fastest command so we maximize the part spent in crc16.

zuiderkwast · 2026-02-25T13:25:39Z

+	for(; counter + 1 < len; counter += 2) {
+        /* explicitly get two bytes */
+        uint16_t a = buf[counter];
+        uint16_t b = buf[counter + 1];
+        uint16_t tmp = ((a << 8) | b);
+
+        crc ^= tmp;
+        // fit LITTLE-ENDIAN architecture
+        crc = crc16tab[1 * 256 + (uint8_t)(crc >> 8)] ^ crc16tab[0 * 256 + (uint8_t)(crc >> 0)];
+    }
+
+    // deal with leftover
+    for(; counter < len; counter++)
+        crc = (crc << 8) ^ crc16tab[((crc >> 8) ^ (uint8_t)buf[counter]) & 0x00FF];


Just so I understand correctly:

The code for 2-byte LUT still does 2 table lookups for every 2 bytes of input. Same as for 1-byte LUT.

The difference is that we do fewer shifts? Only one instead of two.

I find it hard to see how reducing a single instruction per input byte would give any benefit TBH. Maybe the benchmark results were just random, no significant differences at all?

The difference is that we do fewer shifts? Only one instead of two.

The difference is that the two loads and two table lookups can be done in parallel (Instruction Parallelism). So we can have a chance to improve the performance.

For example, a CRC32 implementation (provided by ¹) improves from 1.10 bits per cycle (1-byte LUT) to 1.60 bits per cycle (2-byte tabular).

What's more, multi-byte LUTs (especially the slicing-by-4 and slicing-by-8 techniques that originated at Intel ²) consume four or eight input bytes at the same time to improve performance further (in ³, the slicing-by-8 implementation reaches 4.80 bits per cycle).
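To illustrate the independent lookups, here is a self-contained sketch of a byte-wise CRC16 next to a 2-byte-per-step variant. It is not the PR's code: the function names are hypothetical, the tables are generated at runtime rather than hard-coded, and it assumes the CRC16-CCITT (XMODEM) parameters (polynomial 0x1021, initial value 0), whose check value for "123456789" is 0x31C3.

```c
#include <stddef.h>
#include <stdint.h>

/* tab[0] is the classic byte-wise table; tab[1][i] is the CRC of byte i
 * followed by one zero byte. */
static uint16_t tab[2][256];

/* Must be called once before using the functions below. */
void crc16_init_tables(void) {
    for (int i = 0; i < 256; i++) {
        uint16_t crc = (uint16_t)(i << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
        tab[0][i] = crc;
    }
    for (int i = 0; i < 256; i++)
        tab[1][i] = (uint16_t)((tab[0][i] << 8) ^ tab[0][tab[0][i] >> 8]);
}

/* One table lookup per byte; each lookup depends on the previous one. */
uint16_t crc16_bytewise(const char *buf, size_t len) {
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++)
        crc = (uint16_t)((crc << 8) ^ tab[0][((crc >> 8) ^ (uint8_t)buf[i]) & 0xFF]);
    return crc;
}

/* Two bytes per step; the two lookups in a step are independent. */
uint16_t crc16_two_bytes(const char *buf, size_t len) {
    uint16_t crc = 0;
    size_t i = 0;
    for (; i + 1 < len; i += 2) {
        uint16_t a = (uint8_t)buf[i];
        uint16_t b = (uint8_t)buf[i + 1];
        crc ^= (uint16_t)((a << 8) | b);
        crc = (uint16_t)(tab[1][crc >> 8] ^ tab[0][crc & 0xFF]);
    }
    for (; i < len; i++) /* odd-length leftover */
        crc = (uint16_t)((crc << 8) ^ tab[0][((crc >> 8) ^ (uint8_t)buf[i]) & 0xFF]);
    return crc;
}
```

In the byte-wise loop the index of each lookup needs the result of the previous lookup. In the 2-byte loop both indexes come from the same `crc` value, so the CPU may issue the two loads together. Whether that is measurable depends on the input length and on cache behavior.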

Footnotes

1. https://github.com/komrad36/CRC

2. https://create.stephan-brumme.com/crc32/#slicing-by-8-overview

3. https://github.com/komrad36/CRC?tab=readme-ov-file#performance-comparison-table

two table lookups can be done in parallel

Right, this is the point. Thanks for reminding me. :)

Cuda-Chen · 2026-02-25T15:21:47Z

@zuiderkwast

Do you want to profile it? Using perf or similar, for example generate a flamegraph so we can see how much the server spends in the crc16 function.

Yes. I will profile with/without setting the --cluster parameter.

IMO, only -t get would be enough to run. The server has no keys, so no data will be returned. This is the fastest command so we maximize the part spent in crc16.

I will remember to benchmark/profile with only -t get.

sarthakaggarwal97 · 2026-02-25T23:31:01Z

+        uint16_t a = buf[counter];
+        uint16_t b = buf[counter + 1];


Suggested change

-        uint16_t a = buf[counter];
-        uint16_t b = buf[counter + 1];
+        uint16_t a = (uint8_t)buf[counter];
+        uint16_t b = (uint8_t)buf[counter + 1];
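A short sketch of why the cast can matter, under the assumption that `buf` is a plain `char` pointer and that `char` is signed on the target (this is implementation-defined). The demo uses `signed char` explicitly so the effect is reproducible; both function names are hypothetical.

```c
#include <stdint.h>

/* Without the cast, a byte >= 0x80 is sign-extended before it is
 * converted to uint16_t, which sets the upper eight bits. */
uint16_t pack_without_cast(const signed char *buf) {
    uint16_t a = buf[0];
    uint16_t b = buf[1];
    return (uint16_t)((a << 8) | b);
}

/* With the cast, each byte is first reduced to its 8-bit value. */
uint16_t pack_with_cast(const signed char *buf) {
    uint16_t a = (uint8_t)buf[0];
    uint16_t b = (uint8_t)buf[1];
    return (uint16_t)((a << 8) | b);
}
```

For the bytes 0x12 and 0x80, the first version yields 0xFF80 because the second byte becomes 0xFF80 and overwrites the high byte, while the second version yields the intended 0x1280.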

Cuda-Chen · 2026-02-27T08:10:56Z

Benchmark Environment

OS: Linux 6.8.0-101-generic
CPU: Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz

Benchmark Procedures

Note

Each part was benchmarked three times.

Normal

As described in this previous comment.

Clustering Mode

$ ./utils/create-cluster/create-cluster start
$ ./utils/create-cluster/create-cluster create
$ perf record -g --pid <first primary node PID> -F 999

# ... open another terminal

$ ./src/valkey-benchmark --cluster -p <port of the first primary node> --threads 3 -P 10 -n 10000000 -r 1000000 -t get -q

# ... stop then tear down the cluster

$ ./utils/create-cluster/create-cluster stop
$ ./utils/create-cluster/create-cluster clean

Benchmark Results

Normal

RPS

| Base | This PR | Performance gain |
|---|---|---|
| 362040.906666667 | 351379.29 | -2.9448652 % |

latency (p50, measured in msec)

| Base | This PR | Performance gain |
|---|---|---|
| 1.161666667 | 1.124333333 | 3.2137734 % |

Detailed Performance Metrics

# base
GET: 383774.03 requests per second, p50=1.167 msec
GET: 366609.22 requests per second, p50=1.151 msec
GET: 335739.47 requests per second, p50=1.167 msec

# this PR
GET: 289293.25 requests per second, p50=1.143 msec
GET: 391236.31 requests per second, p50=1.111 msec
GET: 373608.31 requests per second, p50=1.119 msec
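For reference, the averaged RPS figures and the "performance gain" column can be reproduced from the three runs per side with a few lines of C. This is only a sketch: the helper names are hypothetical, and the spread measure, (max - min) / mean, is an addition for illustration rather than something reported in this comment.

```c
#include <stddef.h>

/* Arithmetic mean of n samples. */
double mean_of(const double *v, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum / (double)n;
}

/* Relative change of `candidate` against `base`, in percent. */
double percent_gain(double base, double candidate) {
    return (candidate - base) / base * 100.0;
}

/* Run-to-run spread, (max - min) / mean, in percent. */
double spread_percent(const double *v, size_t n) {
    double lo = v[0], hi = v[0];
    for (size_t i = 1; i < n; i++) {
        if (v[i] < lo) lo = v[i];
        if (v[i] > hi) hi = v[i];
    }
    return (hi - lo) / mean_of(v, n) * 100.0;
}
```

With the GET numbers above, the gain comes out at about -2.94 %, while the spread within each group is roughly 13 % (base) and 29 % (this PR). The difference between the means is therefore smaller than the run-to-run variation.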

Clustering Mode

RPS

| Base | This PR | Performance gain |
|---|---|---|
| 479107.686666667 | 484595.01 | 1.1453215 % |

latency (p50, measured in msec)

| Base | This PR | Performance gain |
|---|---|---|
| 0.737666667 | 0.719 | 2.5305016 % |

Detailed Performance Metrics

# base
GET: 553832.50 requests per second, p50=0.655 msec
GET: 454029.53 requests per second, p50=0.759 msec
GET: 429461.03 requests per second, p50=0.799 msec

# this PR
GET: 525983.62 requests per second, p50=0.671 msec
GET: 469946.91 requests per second, p50=0.735 msec
GET: 457854.50 requests per second, p50=0.751 msec

Cuda-Chen · 2026-02-27T08:41:20Z

FlameGraphs

Note

Check the matched percentage of keyHashSlot() (crc16() is inlined in this function).

Normal

| Base | This PR |
|---|---|
| 4.7% | 1.3% |

FlameGraphs (Remember to use Right-click to Download)

Base

This PR

Clustering Mode

| Base | This PR |
|---|---|
| 1.3% | 1.4% |

FlameGraphs (Remember to use Right-click to Download)

Base

This PR

Cuda-Chen · 2026-02-27T11:46:01Z

I think both of these are realistic key lengths and patterns. 16 and 3 bytes are both good to test.

Speaking of 3 bytes, I have an idea: instead of reduce-by-power-of-2, how about reduce-by-prime (just like FFTW does for FFT)?

zuiderkwast · 2026-02-27T23:35:47Z

Regarding reduce-by-prime, I don't really understand what you mean. Would we have tables for e.g. 5, 3, 2 and 1 byte?

I have a new idea: Can we get more memory-parallelization if we do crc16 on multiple strings in parallel? Commands often come in batches.

Cuda-Chen · 2026-02-28T03:09:16Z

Regarding reduce-by-prime, I don't really understand what you mean. Would we have tables for e.g. 5, 3, 2 and 1 byte?

Yes.
We prepare the tables and do an integer factorization (this can be done with another table, as the input length will not exceed 20). Then we run the matching reduction step.
For example, if the input length is 3, we first find that its largest prime factor is 3. Next, we run the reduce-by-3 part, something like this:

/* largest prime factor of an integer
 * Each index indicate the largest prime factor of the index.
 * E.g., fact[3] = 3 means the largest prime factor of 3 is 3.
 */
int fact[] = {0, 1, 2, 3, ...};

while(len > 0) {
  /* we can change switch-case to computed goto for potentially more performance */
  switch(fact[len]) {
    /* ... plenty of prime number cases */ 
    case 5:
      /* do 5-byte tabular CRC */
      len -= 5;
      break;
    case 3:
      /* do 3-byte tabular CRC */
      len -= 3;
      break;
    case 2:
      /* do 2-byte tabular CRC */
      len -= 2;
      break;
    case 1:
      /* do 1-byte tabular CRC */
      len -= 1;
      break;
  }
}
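To see which chunk sizes this loop would actually pick, here is a small sketch of the planning step only, without any CRC work. The table contents and the names `largest_prime_factor` and `plan_chunks` are assumptions made for this illustration; the 20-byte limit follows the statement above that the input length will not exceed 20.

```c
#include <stddef.h>

/* Largest prime factor of n for 1 <= n <= 20 (index 0 is unused).
 * By convention the entry for 1 is 1. */
static const unsigned char largest_prime_factor[21] = {
    0, 1, 2, 3, 2, 5, 3, 7, 2, 3, 5, 11, 3, 13, 7, 5, 2, 17, 3, 19, 5
};

/* Writes the chunk sizes the loop would consume for `len` input bytes
 * into `out` (capacity 20) and returns how many chunks there are.
 * Returns 0 for len == 0 or len > 20. */
size_t plan_chunks(size_t len, unsigned char *out) {
    size_t n = 0;
    if (len > 20) return 0;
    while (len > 0) {
        unsigned char step = largest_prime_factor[len];
        out[n++] = step;
        len -= step;
    }
    return n;
}
```

A 3-byte input is handled in one reduce-by-3 step. A 16-byte input becomes 2, 7, 7, because the remaining length is looked up again after every step, so this scheme would also need tables for larger primes such as 7.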

Cuda-Chen · 2026-02-28T03:12:40Z

I have a new idea: Can we get more memory-parallelization if we do crc16 on multiple strings in parallel? Commands often come in batches.

I would say yes, but I need to understand how Valkey does crc16 on multiple strings, as I am not familiar with this part.

Cuda-Chen · 2026-03-05T14:57:06Z

Speaking of 3 bytes, I have an idea: instead of reduce-by-power-of-2, how about reduce-by-prime (just like FFTW does for FFT)?

After implementing reduce-by-3, I see a significant regression in clustering mode, so I will drop the reduce-by-3 commit.

For the record, here are the benchmark results (measured the same way as in this previous comment).

Normal

RPS

| base | reduce-by-3 | performance gain |
|---|---|---|
| 362040.906666667 | 396712.146666667 | 9.5766084% |

latency

| base | reduce-by-3 | performance gain |
|---|---|---|
| 1.161666667 | 1.143 | 1.6068867% |

Cluster Mode

RPS

| base | reduce-by-3 | performance gain |
|---|---|---|
| 479107.686666667 | 427653.74 | -10.7395369% |

latency

| base | reduce-by-3 | performance gain |
|---|---|---|
| 0.737666667 | 0.793666667 | -7.5915047% |

Cuda-Chen · 2026-03-05T14:58:42Z

One more thing (for my own curiosity): does Valkey have performance issues when the server runs on an HDD? I borrowed a computer with an HDD, and clustering mode gives the same latency when benchmarking.

zuiderkwast · 2026-03-05T17:50:09Z

One more thing (for my own curiosity): does Valkey have performance issues when the server runs on an HDD? I borrowed a computer with an HDD, and clustering mode gives the same latency when benchmarking.

Hard disk? Snapshots to disk are asynchronous using a child process so they shouldn't affect the latency and RPS of commands.

After implementing reduce-by-3, I see a significant regression in clustering mode, so I will drop the reduce-by-3 commit.

OK, interesting. In the valkey-benchmark --cluster mode, the crc16 input is 3 bytes long, which I guessed would be optimal for the reduce-by-3 LUT. Maybe the larger LUT constantly gets evicted from CPU cache? That is my guess.

The smallest reduce-by-1 LUT still seems good enough.

I wanted to try computing multiple crc16 values in parallel. It covers scenarios where a client sends multiple commands and the server handles them in a batch, or the client sends commands that involve multiple keys (MGET, for example). I haven't really tested it properly yet. #3323
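As a rough illustration of the idea (not the code in #3323, which is not shown in this thread), two keys can be hashed in lockstep so that their table lookups are independent of each other. The names below are hypothetical, and the sketch assumes the CRC16-CCITT (XMODEM) parameters with a table generated at runtime.

```c
#include <stddef.h>
#include <stdint.h>

static uint16_t crc16_tab[256];

/* Builds the byte-wise table for polynomial 0x1021; call once. */
void crc16_tab_init(void) {
    for (int i = 0; i < 256; i++) {
        uint16_t crc = (uint16_t)(i << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
        crc16_tab[i] = crc;
    }
}

/* Advances one CRC state by one input byte. */
static uint16_t crc16_step(uint16_t crc, char byte) {
    return (uint16_t)((crc << 8) ^ crc16_tab[((crc >> 8) ^ (uint8_t)byte) & 0xFF]);
}

/* Reference: one key at a time. */
uint16_t crc16_one(const char *buf, size_t len) {
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) crc = crc16_step(crc, buf[i]);
    return crc;
}

/* Two keys in lockstep: the two steps per iteration do not depend on
 * each other. The longer key is finished in a tail loop. */
void crc16_pair(const char *a, size_t alen, const char *b, size_t blen,
                uint16_t *out_a, uint16_t *out_b) {
    uint16_t ca = 0, cb = 0;
    size_t common = alen < blen ? alen : blen;
    size_t i;
    for (i = 0; i < common; i++) {
        ca = crc16_step(ca, a[i]);
        cb = crc16_step(cb, b[i]);
    }
    for (size_t j = i; j < alen; j++) ca = crc16_step(ca, a[j]);
    for (size_t j = i; j < blen; j++) cb = crc16_step(cb, b[j]);
    *out_a = ca;
    *out_b = cb;
}
```

The pairing logic and the tail loops add instructions of their own, which may cancel out the benefit for short inputs.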

Cuda-Chen · 2026-03-06T05:59:06Z

After implementing reduce-by-3, I see a significant regression in clustering mode.

I found that I forgot to pass --cluster when benchmarking cluster mode. I will redo this part of the benchmark.

I wanted to try computing multiple crc16 values in parallel. It covers scenarios where a client sends multiple commands and the server handles them in a batch, or the client sends commands that involve multiple keys (MGET, for example). I haven't really tested it properly yet. #3323

I think I can merge this commit then benchmark.

zuiderkwast · 2026-03-06T17:10:19Z

I've benchmarked the reduce-by-3 and reduce-by-2 LUT locally. Over multiple runs, I can't see that any of these variants is consistently better than reduce-by-1. Therefore, I don't think it's worth merging it.

My draft, where I tried to do crc16 on multiple keys in parallel, also didn't give good results. Actually, it was a 1% degradation in RPS, probably because the extra logic means more instructions.

The initial idea that crc16 could be optimized using some SIMD instructions turns out not to hold. It's not so easy for this use case.

I don't think we should try further to optimize crc16. Of course, if you want, you are free to continue, but I think it's better to find some other code path to optimize. Anyway, thank you for the big effort to try out all of these ideas!


codecov · 2026-03-09T09:26:06Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.94%. Comparing base (75fb0e6) to head (c3602f1).
⚠️ Report is 30 commits behind head on unstable.

Additional details and impacted files

@@              Coverage Diff              @@
##           unstable    #2790       +/-   ##
=============================================
+ Coverage          0   74.94%   +74.94%     
=============================================
  Files             0      129      +129     
  Lines             0    71565    +71565     
=============================================
+ Hits              0    53634    +53634     
- Misses            0    17931    +17931

Files with missing lines	Coverage Δ
src/crc16.c	`100.00% <100.00%> (ø)`

... and 128 files with indirect coverage changes


Cuda-Chen · 2026-03-09T14:25:31Z

I am going to close this PR, as I agree that we have not found any significant performance improvement.

I would also like to express my gratitude to @zuiderkwast for assisting with the performance testing:
how to use the test scripts, what is being tested under the hood, and what I should pay attention to when using FlameGraph.

Anyway, many thanks to everyone in this thread and the community!

zuiderkwast · 2026-03-09T21:50:39Z

Now it's possible to run automated benchmarks in cluster mode. Let me reopen this just to run it.

github-actions · 2026-03-10T07:31:21Z

Cluster Benchmark ran on this commit: c3602f1

Benchmark Comparison: HEAD vs HEAD (averaged) - rps metrics

Run Summary:

HEAD: 80 total runs, 16 configurations (avg 5.00 runs per config)
HEAD: 80 total runs, 16 configurations (avg 5.00 runs per config)

Statistical Notes:

CI99%: 99% Confidence Interval - range where the true population mean is likely to fall
PI99%: 99% Prediction Interval - range where a single future observation is likely to fall
CV: Coefficient of Variation - relative variability (σ/μ × 100%)

Note: Values with (n=X, σ=Y, CV=Z%, CI99%=±W%, PI99%=±V%) indicate averages from X runs with standard deviation Y, coefficient of variation Z%, 99% confidence interval margin of error ±W% of the mean, and 99% prediction interval margin of error ±V% of the mean. CI bounds [A, B] and PI bounds [C, D] show the actual interval ranges.

Configuration:

architecture: aarch64
benchmark_mode: duration
clients: 1600
cluster_mode: True
data_size: 16
duration: 180
tls: False
valkey_benchmark_threads: 90
warmup: 30

Command	Metric	Pipeline	io_threads	HEAD	HEAD	Diff	% Change
GET	rps	1	1	217320.366 (n=5, σ=417.211, CV=0.19%, CI99%=±0.395%, PI99%=±0.968%, CI[216461.324, 218179.408], PI[215216.151, 219424.581])	216179.258 (n=5, σ=1410.800, CV=0.65%, CI99%=±1.344%, PI99%=±3.291%, CI[213274.401, 219084.115], PI[209063.840, 223294.676])	-1141.108	-0.525%
GET	rps	1	9	1488372.498 (n=5, σ=13968.604, CV=0.94%, CI99%=±1.932%, PI99%=±4.733%, CI[1459610.949, 1517134.047], PI[1417921.378, 1558823.618])	1489021.222 (n=5, σ=12956.537, CV=0.87%, CI99%=±1.792%, PI99%=±4.389%, CI[1462343.534, 1515698.910], PI[1423674.499, 1554367.945])	648.724	+0.044%
GET	rps	10	1	1034796.626 (n=5, σ=8988.540, CV=0.87%, CI99%=±1.789%, PI99%=±4.381%, CI[1016289.098, 1053304.154], PI[989462.627, 1080130.625])	1033891.326 (n=5, σ=3815.843, CV=0.37%, CI99%=±0.760%, PI99%=±1.861%, CI[1026034.452, 1041748.200], PI[1014645.993, 1053136.659])	-905.300	-0.087%
GET	rps	10	9	2345842.450 (n=5, σ=13962.230, CV=0.60%, CI99%=±1.226%, PI99%=±3.002%, CI[2317094.026, 2374590.874], PI[2275423.480, 2416261.420])	2340635.150 (n=5, σ=16159.395, CV=0.69%, CI99%=±1.422%, PI99%=±3.482%, CI[2307362.732, 2373907.568], PI[2259134.704, 2422135.596])	-5207.300	-0.222%
SET	rps	1	1	209610.464 (n=5, σ=2813.694, CV=1.34%, CI99%=±2.764%, PI99%=±6.770%, CI[203817.030, 215403.898], PI[195419.506, 223801.422])	209907.034 (n=5, σ=1973.259, CV=0.94%, CI99%=±1.936%, PI99%=±4.741%, CI[205844.066, 213970.002], PI[199954.836, 219859.232])	296.570	+0.141%
SET	rps	1	9	1359231.128 (n=5, σ=13816.738, CV=1.02%, CI99%=±2.093%, PI99%=±5.127%, CI[1330782.274, 1387679.982], PI[1289545.952, 1428916.304])	1362164.102 (n=5, σ=8091.207, CV=0.59%, CI99%=±1.223%, PI99%=±2.996%, CI[1345504.196, 1378824.008], PI[1321355.833, 1402972.371])	2932.974	+0.216%
SET	rps	10	1	907407.326 (n=5, σ=4502.097, CV=0.50%, CI99%=±1.022%, PI99%=±2.502%, CI[898137.447, 916677.205], PI[884700.852, 930113.800])	906814.476 (n=5, σ=2276.325, CV=0.25%, CI99%=±0.517%, PI99%=±1.266%, CI[902127.490, 911501.462], PI[895333.753, 918295.199])	-592.850	-0.065%
SET	rps	10	9	1714330.624 (n=5, σ=19364.201, CV=1.13%, CI99%=±2.326%, PI99%=±5.697%, CI[1674459.467, 1754201.781], PI[1616666.633, 1811994.615])	1728654.400 (n=5, σ=12806.560, CV=0.74%, CI99%=±1.525%, PI99%=±3.736%, CI[1702285.515, 1755023.285], PI[1664064.087, 1793244.713])	14323.776	+0.836%

Configuration:

architecture: aarch64
benchmark_mode: duration
clients: 1600
cluster_mode: True
data_size: 96
duration: 180
tls: False
valkey_benchmark_threads: 90
warmup: 30

Command	Metric	Pipeline	io_threads	HEAD	HEAD	Diff	% Change
GET	rps	1	1	208862.750 (n=5, σ=1847.790, CV=0.88%, CI99%=±1.822%, PI99%=±4.462%, CI[205058.124, 212667.376], PI[199543.358, 218182.142])	207128.030 (n=5, σ=4038.768, CV=1.95%, CI99%=±4.015%, PI99%=±9.834%, CI[198812.150, 215443.910], PI[186758.367, 227497.693])	-1734.720	-0.831%
GET	rps	1	9	1409337.776 (n=5, σ=12419.870, CV=0.88%, CI99%=±1.815%, PI99%=±4.445%, CI[1383765.093, 1434910.459], PI[1346697.750, 1471977.802])	1405667.700 (n=5, σ=16285.354, CV=1.16%, CI99%=±2.385%, PI99%=±5.843%, CI[1372135.931, 1439199.469], PI[1323531.975, 1487803.425])	-3670.076	-0.260%
GET	rps	10	1	979676.764 (n=5, σ=6572.178, CV=0.67%, CI99%=±1.381%, PI99%=±3.383%, CI[966144.558, 993208.970], PI[946529.765, 1012823.763])	985330.524 (n=5, σ=4710.855, CV=0.48%, CI99%=±0.984%, PI99%=±2.411%, CI[975630.807, 995030.241], PI[961571.168, 1009089.880])	5653.760	+0.577%
GET	rps	10	9	1953534.278 (n=5, σ=19046.359, CV=0.97%, CI99%=±2.007%, PI99%=±4.917%, CI[1914317.561, 1992750.995], PI[1857473.332, 2049595.224])	1953647.848 (n=5, σ=22983.445, CV=1.18%, CI99%=±2.422%, PI99%=±5.933%, CI[1906324.617, 2000971.079], PI[1837730.078, 2069565.618])	113.570	+0.006%
SET	rps	1	1	202143.826 (n=5, σ=839.888, CV=0.42%, CI99%=±0.856%, PI99%=±2.096%, CI[200414.485, 203873.167], PI[197907.823, 206379.829])	201932.202 (n=5, σ=1306.959, CV=0.65%, CI99%=±1.333%, PI99%=±3.264%, CI[199241.156, 204623.248], PI[195340.512, 208523.892])	-211.624	-0.105%
SET	rps	1	9	1380696.002 (n=5, σ=6793.117, CV=0.49%, CI99%=±1.013%, PI99%=±2.481%, CI[1366708.881, 1394683.123], PI[1346434.692, 1414957.312])	1373928.674 (n=5, σ=15653.075, CV=1.14%, CI99%=±2.346%, PI99%=±5.746%, CI[1341698.776, 1406158.572], PI[1294981.870, 1452875.478])	-6767.328	-0.490%
SET	rps	10	1	893102.922 (n=5, σ=6183.801, CV=0.69%, CI99%=±1.426%, PI99%=±3.492%, CI[880370.390, 905835.454], PI[861914.715, 924291.129])	892045.900 (n=5, σ=4787.770, CV=0.54%, CI99%=±1.105%, PI99%=±2.707%, CI[882187.816, 901903.984], PI[867898.625, 916193.175])	-1057.022	-0.118%
SET	rps	10	9	1640089.728 (n=5, σ=14054.599, CV=0.86%, CI99%=±1.764%, PI99%=±4.322%, CI[1611151.115, 1669028.341], PI[1569204.892, 1710974.564])	1632861.474 (n=5, σ=13634.091, CV=0.83%, CI99%=±1.719%, PI99%=±4.211%, CI[1604788.693, 1660934.255], PI[1564097.484, 1701625.464])	-7228.254	-0.441%

github-actions Bot assigned Cuda-Chen Oct 31, 2025

Cuda-Chen marked this pull request as ready for review December 17, 2025 02:51

Cuda-Chen force-pushed the crc16-multibyte-lut branch from 2f1c909 to d554219 Compare February 24, 2026 15:00

zuiderkwast reviewed Feb 25, 2026


sarthakaggarwal97 reviewed Feb 25, 2026


Cuda-Chen force-pushed the crc16-multibyte-lut branch 3 times, most recently from 754a7d5 to 95ebdea Compare February 27, 2026 11:17

Cuda-Chen requested review from sarthakaggarwal97 and zuiderkwast February 27, 2026 11:19

Cuda-Chen force-pushed the crc16-multibyte-lut branch from 95ebdea to 688b2f2 Compare February 27, 2026 12:31

zuiderkwast mentioned this pull request Mar 6, 2026

[NEW] Accelerate CRC16 with SIMD #2031

Closed

Optimize CRC16 using 3-byte tabular

c3602f1

Signed-off-by: Cuda-Chen <clh960524@gmail.com>

Cuda-Chen force-pushed the crc16-multibyte-lut branch from 9dac4d2 to c3602f1 Compare March 7, 2026 13:46

Cuda-Chen closed this Mar 9, 2026

zuiderkwast added the run-cluster-benchmark label Mar 9, 2026

zuiderkwast reopened this Mar 9, 2026

zuiderkwast marked this pull request as draft March 9, 2026 21:51

zuiderkwast mentioned this pull request Mar 9, 2026

Adds cluster benchmark support for benchmark-on-label #3338

Merged

github-actions Bot removed the run-cluster-benchmark label Mar 10, 2026

zuiderkwast closed this Mar 10, 2026
