
Optimize CRC16 using multi-byte LUT #2790

Closed
Cuda-Chen wants to merge 1 commit into valkey-io:unstable from Cuda-Chen:crc16-multibyte-lut

Conversation

@Cuda-Chen

@Cuda-Chen Cuda-Chen commented Oct 31, 2025


Optimize CRC16 using multi-byte LUT.

See also the discussion in the previous attempt: #2691

@Cuda-Chen

Cuda-Chen commented Oct 31, 2025

Author

Hi @zuiderkwast ,

I would like to share some benchmark results of this PR.
If we are going to merge this PR, at least two tasks remain:

  • jump-table for reduction
  • code tidy (e.g., using macro)

Environment

  • CPU: Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
  • OS: Linux 5.15.0-157-generic
  • For all "reduction", it reduces from the target length to 1-byte (e.g., 8-byte: 8 -> 4 -> 2 -> 1)

Benchmark Results

baseline (commit f54818cc60597e9fe5dc03a52fd39ab944cd4932 on the unstable branch)

# -t get,set
SET: 292902.97 requests per second, p50=1.431 msec                    
GET: 350557.41 requests per second, p50=1.199 msec

# -t get
GET: 454277.00 requests per second, p50=0.975 msec

2-byte reduction

# -t set,get
SET: 321502.06 requests per second, p50=1.351 msec                    
GET: 387236.69 requests per second, p50=1.151 msec

# -t get
GET: 453309.16 requests per second, p50=0.959 msec

4-byte reduction

# -t set,get
SET: 324191.12 requests per second, p50=1.335 msec                    
GET: 376364.31 requests per second, p50=1.135 msec

# -t get
GET: 470167.88 requests per second, p50=0.943 msec

8-byte reduction

# -t set,get
SET: 319550.06 requests per second, p50=1.375 msec                    
GET: 384349.31 requests per second, p50=1.143 msec

# -t get
GET: 469329.34 requests per second, p50=0.959 msec

16-byte reduction

# -t set,get
SET: 300210.12 requests per second, p50=1.431 msec                    
GET: 391880.25 requests per second, p50=1.143 msec 

# -t get
GET: 443066.03 requests per second, p50=0.943 msec

@zuiderkwast

Contributor

Interesting! But with a larger LUT, when it is loaded into the L1 cache, other data that might be needed later gets evicted, so whether it's actually faster may depend on the use case.

How much memory would the 2-byte reduction variant use? The current draft is for the 16-byte reduction, right?

@Cuda-Chen

Cuda-Chen commented Nov 7, 2025

Author

Hi @zuiderkwast ,

How much memory would the 2-byte reduction variant use?

It will use 1 KB of memory for the LUT: 256 entries × 2 bytes per entry × 2 tables (one per input byte looked up in the same step) = 1024 bytes = 1 KB.
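The arithmetic above generalizes to the other variants. The following is a minimal sketch, assuming each extra input byte consumed per step needs one more 256-entry table of 16-bit values; `crc16_lut_bytes` is a hypothetical helper name, not something from the PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes of lookup table needed when `slices` input
 * bytes are consumed per step, assuming one 256-entry table of 16-bit
 * values per slice (the same formula as in the comment above). */
static size_t crc16_lut_bytes(size_t slices) {
    assert(slices > 0);
    return 256 * sizeof(uint16_t) * slices;
}
```

Under this assumption the 1-byte table is 512 bytes, the 2-byte variant is 1024 bytes, and the 16-byte draft would be 8192 bytes.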

The current draft is for the 16-byte reduction, right?

Yes, the current draft uses 16-byte reduction.

@Cuda-Chen Cuda-Chen marked this pull request as ready for review December 17, 2025 02:51
@zuiderkwast

zuiderkwast commented Feb 24, 2026

Contributor

Thanks for keeping this up to date.

Which key length did you use for the benchmarking? Do you still have the valkey-benchmark command line you used?

the current draft uses 16-byte reduction

Can you change it to use only the 2-byte LUT? Using 4 bytes or more didn't seem to gain anything further, and it may instead evict other data from the L1 cache that we'd rather keep there. Preferably do it as an additional commit so the full 16-byte LUT implementation is still visible in the history of the PR.

Then, I hope we can find someone more to benchmark this with realistic traffic. We should also make sure we don't get a performance regression for very short keys, like 3 byte keys.

The key length varies for different users and there's also the possibility of using tags with curly braces within the keys. For example in a key named like "user:{123abc}:bla-bla-bla:some:stuff", we do CRC16 only on the part within curly braces, in this case "123abc", and when this style is used, the data for CRC16 is typically quite short.
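To illustrate why tagged keys give CRC16 such short inputs, here is a minimal sketch of selecting the bytes to hash. `hash_input` is a hypothetical helper, not Valkey's actual function; the fallback to the whole key when the braces are empty or unclosed is an assumption based on the usual cluster hash-tag rule and is not stated in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns a pointer to the bytes that would be passed
 * to CRC16 and stores their count in *out_len. If the key contains
 * "{...}" with at least one byte inside, only that part is selected.
 * Assumption: otherwise the whole key is hashed. */
static const char *hash_input(const char *key, size_t keylen, size_t *out_len) {
    assert(key != NULL && out_len != NULL);
    const char *open = memchr(key, '{', keylen);
    if (open != NULL) {
        size_t rest = keylen - (size_t)(open - key) - 1;
        const char *close = memchr(open + 1, '}', rest);
        if (close != NULL && close > open + 1) {
            *out_len = (size_t)(close - open - 1);
            return open + 1;
        }
    }
    *out_len = keylen;
    return key;
}
```

For the key in the example above, this selects the 6 bytes "123abc", so any multi-byte LUT has only three 2-byte steps to work with.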

@Cuda-Chen

Cuda-Chen commented Feb 25, 2026

Author

Hi @zuiderkwast ,

Do you still have the valkey-benchmark command line you used?

Yes. I used the commands provided in this PR's comments. For clarity, I repost them here:

$ cd src
$ rm dump.rdb nodes.conf # clean up any old data files (if they exist)
$ ./valkey-server --cluster-enabled yes --save '' &
(...)
2622423:M 07 Oct 2025 17:48:58.872 * Server initialized
2622423:M 07 Oct 2025 17:48:58.873 * Ready to accept connections tcp

$ ./valkey-cli cluster addslotsrange 0 16383
OK
$ ./valkey-cli cluster info | head -3
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384

$ ./valkey-benchmark --threads 3 -P 10 -n 10000000 -r 1000000 -t set,get -q

# clear the existing keys before the next benchmark run
$ ./valkey-cli flushall

Which key length did you use for the benchmarking?

I admit I have no idea. However, the key length should be whatever valkey-benchmark uses for its set and get tests.

Can you change it to use only 2-byte LUT?

I will change my code to use only the 2-byte LUT in later commits.
In addition, I will run the benchmark again, as my testing environment recently got a system upgrade.

@Cuda-Chen

Author

Hi @zuiderkwast ,

One more thing:

Preferably do it as an additional commit so the full 16-byte LUT implementation is still visible in the history of the PR.

I guess you mean:

# current commit
commit A: <description of multi-byte LUT>

# change into two commits
commit A: <description of 2-byte LUT>
commit B: <description of 4-byte and above LUT>

If I am wrong, just let me know.

@zuiderkwast

zuiderkwast commented Feb 25, 2026

Contributor

Which key length did you use for the benchmarking?

I admit I have no idea. However, the key length should be the settings of set and get tests provided to valkey-benchmark.

Right, I remember you first used --cluster and then I suggested you run without --cluster to make valkey-benchmark construct the keys differently. This is how the keys look in the get and set tests:

|  | Without --cluster | With --cluster |
| --- | --- | --- |
| Example | key:000000000000 | key{06S}:000000000000 |
| Key length (bytes) | 16 | 21 |
| CRC16 input (bytes) | 16 | 3 |

The number is zero-padded and varies with -r. The thing inside curly braces varies too but it's 3 bytes in general. Only that part in curly braces is passed to CRC16 if it exists.

I think both of these are realistic key lengths and patterns. 16 and 3 bytes are both good to test.

@zuiderkwast

zuiderkwast commented Feb 25, 2026

Contributor

If I am wrong, just let me know.

Yeah, I meant just like you've done it, i.e.

  1. "Optimize CRC16 using multi-byte consumption"
  2. "Disable 4-byte and above LUT"
  3. (later) Cleanup, delete unused code and tables

@zuiderkwast

Contributor

@Cuda-Chen Do you want to profile it? Using perf or similar, for example generate a flamegraph so we can see how much the server spends in the crc16 function.

IMO, only -t get would be enough to run. The server has no keys, so no data will be returned. This is the fastest command so we maximize the part spent in crc16.

Comment thread src/crc16.c Outdated
Comment on lines +679 to +692
for(; counter + 1 < len; counter += 2) {
    /* explicitly get two bytes */
    uint16_t a = buf[counter];
    uint16_t b = buf[counter + 1];
    uint16_t tmp = ((a << 8) | b);

    crc ^= tmp;
    // fit LITTLE-ENDIAN architecture
    crc = crc16tab[1 * 256 + (uint8_t)(crc >> 8)] ^ crc16tab[0 * 256 + (uint8_t)(crc >> 0)];
}

// deal with leftover
for(; counter < len; counter++)
    crc = (crc << 8) ^ crc16tab[((crc >> 8) ^ (uint8_t)buf[counter]) & 0x00FF];

Contributor


Just so I understand correctly:

The code for 2-byte LUT still does 2 table lookups for every 2 bytes of input. Same as for 1-byte LUT.

The difference is that we do fewer shifts? Only one instead of two.

I find it hard to see how reducing a single instruction per input byte would give any benefit TBH. Maybe the benchmark results were just random, no significant differences at all?

Author


The difference is that we do fewer shifts? Only one instead of two.

The difference is that the two loads and two table lookups can be done in parallel (instruction-level parallelism), so there is a chance to improve performance.

For example, a CRC32 implementation (provided by [1]) improves from 1.10 bits per cycle (1-byte LUT) to 1.60 bits per cycle (2-byte tabular).

What's more, the multi-byte LUT (especially the slicing-by-4 and slicing-by-8 variants that originated at Intel [2]) consumes four or eight input bytes at the same time to improve performance further (in [3], the slicing-by-8 implementation reaches 4.80 bits per cycle).

Footnotes

  1. https://github.com/komrad36/CRC

  2. https://create.stephan-brumme.com/crc32/#slicing-by-8-overview

  3. https://github.com/komrad36/CRC?tab=readme-ov-file#performance-comparison-table
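The two-lookup step discussed in this thread can be sketched in a self-contained form. This is an illustration, not the PR's code: it assumes the CRC16 variant with polynomial 0x1021 and initial value 0 (MSB first), builds both tables at startup instead of using precomputed constants, and uses hypothetical names (`tab`, `crc16_init_tables`, `crc16_bytewise`, `crc16_by2`). The second table is the first one advanced by one more zero byte, which is what lets the two lookups in the loop proceed independently of each other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* tab[0]: classic one-byte table; tab[1]: same entries advanced one byte. */
static uint16_t tab[2][256];

static void crc16_init_tables(void) {
    for (int i = 0; i < 256; i++) {
        uint16_t crc = (uint16_t)(i << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
        tab[0][i] = crc;
    }
    for (int i = 0; i < 256; i++)
        tab[1][i] = (uint16_t)((tab[0][i] << 8) ^ tab[0][tab[0][i] >> 8]);
}

/* Reference: one dependent lookup per input byte. */
static uint16_t crc16_bytewise(const unsigned char *buf, size_t len) {
    assert(buf != NULL || len == 0);
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++)
        crc = (uint16_t)((crc << 8) ^ tab[0][(crc >> 8) ^ buf[i]]);
    return crc;
}

/* Two input bytes per step; the two lookups do not depend on each other. */
static uint16_t crc16_by2(const unsigned char *buf, size_t len) {
    assert(buf != NULL || len == 0);
    uint16_t crc = 0;
    size_t i = 0;
    for (; i + 1 < len; i += 2) {
        crc ^= (uint16_t)((buf[i] << 8) | buf[i + 1]);
        crc = (uint16_t)(tab[1][crc >> 8] ^ tab[0][crc & 0xFF]);
    }
    for (; i < len; i++) /* leftover byte for odd lengths */
        crc = (uint16_t)((crc << 8) ^ tab[0][(crc >> 8) ^ buf[i]]);
    return crc;
}
```

Both functions return the same value for any input; whether the shorter dependency chain translates into a measurable gain for 3-byte or 16-byte inputs is exactly what the benchmarks in this thread try to establish.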

Contributor


two table lookups can be done in parallel

Right, this is the point. Thanks for reminding me. :)

@Cuda-Chen

Author

@zuiderkwast

Do you want to profile it? Using perf or similar, for example generate a flamegraph so we can see how much the server spends in the crc16 function.

Yes. I will profile with/without setting the --cluster parameter.

IMO, only -t get would be enough to run. The server has no keys, so no data will be returned. This is the fastest command so we maximize the part spent in crc16.

I will remember to benchmark/profile with only -t get.

Comment thread src/crc16.c Outdated
Comment on lines +681 to +682
uint16_t a = buf[counter];
uint16_t b = buf[counter + 1];

Contributor


Suggested change
- uint16_t a = buf[counter];
- uint16_t b = buf[counter + 1];
+ uint16_t a = (uint8_t)buf[counter];
+ uint16_t b = (uint8_t)buf[counter + 1];

@Cuda-Chen

Cuda-Chen commented Feb 27, 2026

Author

Benchmark Environment

  • OS: Linux 6.8.0-101-generic
  • CPU: Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz

Benchmark Procedures

Note

  • Each benchmark was run three times.

Normal

As described in this previous comment.

Clustering Mode

$ ./utils/create-cluster/create-cluster start
$ ./utils/create-cluster/create-cluster create
$ perf record -g --pid <first primary node PID> -F 999

# ... open another terminal

$ ./src/valkey-benchmark --cluster -p <port of the first primary node> --threads 3 -P 10 -n 10000000 -r 1000000 -t get -q

# ... stop then tear down the cluster

$ ./utils/create-cluster/create-cluster stop
$ ./utils/create-cluster/create-cluster clean

Benchmark Results

Normal

RPS

| Base | This PR | Performance gain |
| --- | --- | --- |
| 362040.906666667 | 351379.29 | -2.9448652 % |

p50 latency (measured in milliseconds)

| Base | This PR | Performance gain |
| --- | --- | --- |
| 1.161666667 | 1.124333333 | 3.2137734 % |
Detailed Performance Metrics
# base
GET: 383774.03 requests per second, p50=1.167 msec
GET: 366609.22 requests per second, p50=1.151 msec
GET: 335739.47 requests per second, p50=1.167 msec

# this PR
GET: 289293.25 requests per second, p50=1.143 msec
GET: 391236.31 requests per second, p50=1.111 msec
GET: 373608.31 requests per second, p50=1.119 msec 

Clustering Mode

RPS

| Base | This PR | Performance gain |
| --- | --- | --- |
| 479107.686666667 | 484595.01 | 1.1453215 % |

p50 latency (measured in milliseconds)

| Base | This PR | Performance gain |
| --- | --- | --- |
| 0.737666667 | 0.719 | 2.5305016 % |
Detailed Performance Metrics
# base
GET: 553832.50 requests per second, p50=0.655 msec
GET: 454029.53 requests per second, p50=0.759 msec
GET: 429461.03 requests per second, p50=0.799 msec

# this PR
GET: 525983.62 requests per second, p50=0.671 msec
GET: 469946.91 requests per second, p50=0.735 msec
GET: 457854.50 requests per second, p50=0.751 msec
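The summary figures appear to be plain means of the three runs, with the gain computed relative to the base. A minimal sketch of that arithmetic follows, assuming that interpretation; the helper names are hypothetical. Note that in the detailed metrics the run-to-run spread (for example 289293.25 to 391236.31 RPS for this PR in normal mode) is larger than the difference between the averages.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples. */
static double mean(const double *v, size_t n) {
    assert(v != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* Gain in percent for a metric where higher is better (RPS). */
static double gain_higher_is_better(double base, double pr) {
    return (pr - base) / base * 100.0;
}

/* Gain in percent for a metric where lower is better (latency). */
static double gain_lower_is_better(double base, double pr) {
    return (base - pr) / base * 100.0;
}
```

Feeding in the three normal-mode GET runs for base and for this PR reproduces the averaged RPS values and the roughly -2.94 % figure in the table.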

@Cuda-Chen

Author

FlameGraphs

Note

  • Check the percentage of samples attributed to keyHashSlot() (crc16() is inlined into this function).

Normal

| Base | This PR |
| --- | --- |
| 4.7% | 1.3% |
FlameGraphs (Remember to use Right-click to Download)

Base

valkey_orig

This PR

valkey_twolut

Clustering Mode

| Base | This PR |
| --- | --- |
| 1.3% | 1.4% |
FlameGraphs (Remember to use Right-click to Download)

Base

valkey_cluster_orig

This PR

valkey_cluster twolut

@Cuda-Chen Cuda-Chen force-pushed the crc16-multibyte-lut branch 3 times, most recently from 754a7d5 to 95ebdea Compare February 27, 2026 11:17
@Cuda-Chen

Author

I think both of these are realistic keys lengths and patterns. 16 and 3 bytes are both good to test.

Speaking of 3 bytes, I came up with an idea: instead of reduce-by-power-of-2, how about reduce-by-prime (just like FFTW does for FFT)?

@zuiderkwast

Contributor

Regarding reduce-by-prime, I don't really understand what you mean. Would we have tables for e.g. 5, 3, 2 and 1 byte?

I have a new idea: Can we get more memory-parallelization if we do crc16 on multiple strings in parallel? Commands often come in batches.

@Cuda-Chen

Cuda-Chen commented Feb 28, 2026

Author

Regarding reduce-by-prime, I don't really understand what you mean. Would we have tables for e.g. 5, 3, 2 and 1 byte?

Yes.
We prepare the tables and do an integer factorization (this can be done with another table, as the input length will not exceed 20). Then we run the matching reduction step.
For example, if the input length is 3, we first look up its largest prime factor, which is 3. Next, we run the reduce-by-3 step, something like this:

/* Largest prime factor of an integer.
 * Each entry holds the largest prime factor of its index.
 * E.g., fact[3] = 3 means the largest prime factor of 3 is 3.
 */
int fact[] = {0, 1, 2, 3, ...};

/* Loop until the whole input is consumed; `len > 0` (not `>= 0`) so the
 * loop terminates once len reaches 0. */
while (len > 0) {
  /* we can change switch-case to computed goto for potentially more performance */
  switch (fact[len]) {
    /* ... plenty of prime number cases */
    case 5:
      /* do 5-byte tabular CRC */
      len -= 5;
      break;
    case 3:
      /* do 3-byte tabular CRC */
      len -= 3;
      break;
    case 2:
      /* do 2-byte tabular CRC */
      len -= 2;
      break;
    case 1:
      /* do 1-byte tabular CRC */
      len -= 1;
      break;
  }
}
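To make the dispatch concrete, here is a small runnable sketch that only plans the chunk sizes, without doing any CRC work. It assumes the rule described above (always consume the largest prime factor of the remaining length) and the 20-byte upper bound mentioned in the comment; `MAX_LEN`, `lpf`, `lpf_init` and `plan_chunks` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LEN 20 /* assumed upper bound on the CRC16 input length */

/* lpf[n] = largest prime factor of n (with lpf[0] = 0 and lpf[1] = 1). */
static unsigned char lpf[MAX_LEN + 1];

static void lpf_init(void) {
    lpf[0] = 0;
    lpf[1] = 1;
    for (size_t n = 2; n <= MAX_LEN; n++) {
        size_t m = n, largest = 1;
        for (size_t p = 2; p <= m; p++) {
            while (m % p == 0) {
                largest = p;
                m /= p;
            }
        }
        lpf[n] = (unsigned char)largest;
    }
}

/* Writes the sequence of chunk sizes for an input of `len` bytes into `out`
 * (capacity MAX_LEN) and returns how many chunks were planned. */
static size_t plan_chunks(size_t len, size_t *out) {
    assert(len <= MAX_LEN && out != NULL);
    size_t count = 0;
    while (len > 0) {
        size_t step = lpf[len];
        out[count++] = step;
        len -= step;
    }
    return count;
}
```

With this rule a 3-byte input is handled in one 3-byte step, while a 16-byte input is planned as 2 + 7 + 7, so every prime up to 19 would need its own reduction step and table.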

@Cuda-Chen

Author

I have a new idea: Can we get more memory-parallelization if we do crc16 on multiple strings in parallel? Commands often come in batches.

I would say yes, but I need to understand how Valkey does crc16 on multiple strings, as I am not familiar with this part.

@Cuda-Chen

Author

Saying for 3 bytes, I come up with an idea: instead of reduce-by-power-of-2, how about reduce-by-prime (just like FFTW does for FFT)?

After implementing reduce-by-3, I found a significant regression in clustering mode, so I will drop the reduce-by-3 commit.

For the record, here are the benchmark results (the measurement procedure is the same as in this previous comment).

Normal

RPS

| base | reduce-by-3 | performance gain |
| --- | --- | --- |
| 362040.906666667 | 396712.146666667 | 9.5766084% |

latency

| base | reduce-by-3 | performance gain |
| --- | --- | --- |
| 1.161666667 | 1.143 | 1.6068867% |

Cluster Mode

RPS

| base | reduce-by-3 | performance gain |
| --- | --- | --- |
| 479107.686666667 | 427653.74 | -10.7395369% |

latency

| base | reduce-by-3 | performance gain |
| --- | --- | --- |
| 0.737666667 | 0.793666667 | -7.5915047% |

@Cuda-Chen

Author

One more thing (for my own curiosity): does Valkey have performance issues when the server runs on an HDD? I borrowed a computer with an HDD, and clustering mode gives the same latency when benchmarking.

@zuiderkwast

zuiderkwast commented Mar 5, 2026

Contributor

One more thing (for my own curiosity): does Valkey have performance issues when the server runs on an HDD? I borrowed a computer with an HDD, and clustering mode gives the same latency when benchmarking.

Hard disk? Snapshots to disk are asynchronous using a child process so they shouldn't affect the latency and RPS of commands.

So after implementing the reduce-by-3, I find a significant impact in clustering mode. So I will drop the commit of reduce-by-3.

OK, interesting. In valkey-benchmark --cluster mode, the crc16 input is 3 bytes long, which I guessed would be optimal for the reduce-by-3 LUT. Maybe the larger LUT constantly gets evicted from the CPU cache? That is my guess.

The smallest reduce-by-1 LUT still seems good enough.

I wanted to try computing multiple crc16 values in parallel. It covers scenarios where a client sends multiple commands and the server handles them in a batch, or where the client sends commands that involve multiple keys (MGET for example). I haven't really tested it properly yet. #3323
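One way to picture "multiple crc16 in parallel" is to interleave the byte-wise updates of two independent inputs in a single loop, so the table lookup for one key does not have to wait for the other. The sketch below is purely illustrative and is not taken from #3323; it assumes the polynomial 0x1021 with initial value 0, and all names (`crc16_tab`, `crc16_tab_init`, `crc16_step`, `crc16_one`, `crc16_two`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint16_t crc16_tab[256];

static void crc16_tab_init(void) {
    for (int i = 0; i < 256; i++) {
        uint16_t crc = (uint16_t)(i << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
        crc16_tab[i] = crc;
    }
}

/* One byte-wise table update. */
static uint16_t crc16_step(uint16_t crc, unsigned char byte) {
    return (uint16_t)((crc << 8) ^ crc16_tab[(crc >> 8) ^ byte]);
}

static uint16_t crc16_one(const unsigned char *buf, size_t len) {
    assert(buf != NULL || len == 0);
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++)
        crc = crc16_step(crc, buf[i]);
    return crc;
}

/* Two independent inputs advanced in the same loop; the two dependency
 * chains (ca and cb) never feed into each other. */
static void crc16_two(const unsigned char *a, size_t alen,
                      const unsigned char *b, size_t blen, uint16_t out[2]) {
    assert(a != NULL && b != NULL && out != NULL);
    uint16_t ca = 0, cb = 0;
    size_t common = alen < blen ? alen : blen;
    for (size_t i = 0; i < common; i++) {
        ca = crc16_step(ca, a[i]);
        cb = crc16_step(cb, b[i]);
    }
    for (size_t i = common; i < alen; i++)
        ca = crc16_step(ca, a[i]);
    for (size_t i = common; i < blen; i++)
        cb = crc16_step(cb, b[i]);
    out[0] = ca;
    out[1] = cb;
}
```

The extra bookkeeping for unequal lengths is the kind of added logic that can cancel out any gain from the overlapped lookups, especially for very short inputs.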

@Cuda-Chen

Author

So after implementing the reduce-by-3, I find a significant impact in clustering mode.

I found that I forgot to pass --cluster when benchmarking cluster mode. I will redo this part of the benchmark.

I wanted to try computing multiple crc16 values in parallel. It covers scenarios where a client sends multiple commands and the server handles them in a batch, or where the client sends commands that involve multiple keys (MGET for example). I haven't really tested it properly yet. #3323

I think I can merge this commit and then benchmark.

@zuiderkwast

Contributor

I've benchmarked the reduce-by-3 and reduce-by-2 LUTs locally. Over multiple runs, I can't see that either of these variants is consistently better than reduce-by-1. Therefore, I don't think it's worth merging.

My draft where I tried to do crc16 on multiple keys in parallel also didn't give good results. Actually, it was a 1% degradation in RPS, probably because the extra logic means more instructions.

The initial idea that it's possible to optimize crc16 using some SIMD instruction turned out not to be true. It's not so easy for this use case.

I don't think we should try further to optimize crc16. Of course, if you want, you are free to continue, but I think it's better to find some other code path to optimize. Anyway, thank you for the big effort to try out all of these ideas!

Signed-off-by: Cuda-Chen <clh960524@gmail.com>
@Cuda-Chen Cuda-Chen force-pushed the crc16-multibyte-lut branch from 9dac4d2 to c3602f1 Compare March 7, 2026 13:46
@codecov

codecov Bot commented Mar 9, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.94%. Comparing base (75fb0e6) to head (c3602f1).
⚠️ Report is 30 commits behind head on unstable.

Additional details and impacted files
@@              Coverage Diff              @@
##           unstable    #2790       +/-   ##
=============================================
+ Coverage          0   74.94%   +74.94%     
=============================================
  Files             0      129      +129     
  Lines             0    71565    +71565     
=============================================
+ Hits              0    53634    +53634     
- Misses            0    17931    +17931     
Files with missing lines Coverage Δ
src/crc16.c 100.00% <100.00%> (ø)

... and 128 files with indirect coverage changes


@Cuda-Chen

Author

I am going to close this PR, as I agree that we have not found any significant performance improvement.

I would also like to express my gratitude to @zuiderkwast for assisting with the performance testing:
how to use the test scripts, what the tests do under the hood, and what I should pay attention to when using FlameGraph.

Anyway, thank you very much to everyone in this thread and to the community!

@zuiderkwast

Contributor

Now it's possible to run automated benchmarks in cluster mode. Let me reopen this just to run it.

@github-actions


Cluster Benchmark ran on this commit: c3602f1

Benchmark Comparison: HEAD vs HEAD (averaged) - rps metrics

Run Summary:

  • HEAD: 80 total runs, 16 configurations (avg 5.00 runs per config)
  • HEAD: 80 total runs, 16 configurations (avg 5.00 runs per config)

Statistical Notes:

  • CI99%: 99% Confidence Interval - range where the true population mean is likely to fall
  • PI99%: 99% Prediction Interval - range where a single future observation is likely to fall
  • CV: Coefficient of Variation - relative variability (σ/μ × 100%)

Note: Values with (n=X, σ=Y, CV=Z%, CI99%=±W%, PI99%=±V%) indicate averages from X runs with standard deviation Y, coefficient of variation Z%, 99% confidence interval margin of error ±W% of the mean, and 99% prediction interval margin of error ±V% of the mean. CI bounds [A, B] and PI bounds [C, D] show the actual interval ranges.

Configuration:

  • architecture: aarch64
  • benchmark_mode: duration
  • clients: 1600
  • cluster_mode: True
  • data_size: 16
  • duration: 180
  • tls: False
  • valkey_benchmark_threads: 90
  • warmup: 30
| Command | Metric | Pipeline | io_threads | HEAD | HEAD | Diff | % Change |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GET | rps | 1 | 1 | 217320.366 (n=5, σ=417.211, CV=0.19%, CI99%=±0.395%, PI99%=±0.968%, CI[216461.324, 218179.408], PI[215216.151, 219424.581]) | 216179.258 (n=5, σ=1410.800, CV=0.65%, CI99%=±1.344%, PI99%=±3.291%, CI[213274.401, 219084.115], PI[209063.840, 223294.676]) | -1141.108 | -0.525% |
| GET | rps | 1 | 9 | 1488372.498 (n=5, σ=13968.604, CV=0.94%, CI99%=±1.932%, PI99%=±4.733%, CI[1459610.949, 1517134.047], PI[1417921.378, 1558823.618]) | 1489021.222 (n=5, σ=12956.537, CV=0.87%, CI99%=±1.792%, PI99%=±4.389%, CI[1462343.534, 1515698.910], PI[1423674.499, 1554367.945]) | 648.724 | +0.044% |
| GET | rps | 10 | 1 | 1034796.626 (n=5, σ=8988.540, CV=0.87%, CI99%=±1.789%, PI99%=±4.381%, CI[1016289.098, 1053304.154], PI[989462.627, 1080130.625]) | 1033891.326 (n=5, σ=3815.843, CV=0.37%, CI99%=±0.760%, PI99%=±1.861%, CI[1026034.452, 1041748.200], PI[1014645.993, 1053136.659]) | -905.300 | -0.087% |
| GET | rps | 10 | 9 | 2345842.450 (n=5, σ=13962.230, CV=0.60%, CI99%=±1.226%, PI99%=±3.002%, CI[2317094.026, 2374590.874], PI[2275423.480, 2416261.420]) | 2340635.150 (n=5, σ=16159.395, CV=0.69%, CI99%=±1.422%, PI99%=±3.482%, CI[2307362.732, 2373907.568], PI[2259134.704, 2422135.596]) | -5207.300 | -0.222% |
| SET | rps | 1 | 1 | 209610.464 (n=5, σ=2813.694, CV=1.34%, CI99%=±2.764%, PI99%=±6.770%, CI[203817.030, 215403.898], PI[195419.506, 223801.422]) | 209907.034 (n=5, σ=1973.259, CV=0.94%, CI99%=±1.936%, PI99%=±4.741%, CI[205844.066, 213970.002], PI[199954.836, 219859.232]) | 296.570 | +0.141% |
| SET | rps | 1 | 9 | 1359231.128 (n=5, σ=13816.738, CV=1.02%, CI99%=±2.093%, PI99%=±5.127%, CI[1330782.274, 1387679.982], PI[1289545.952, 1428916.304]) | 1362164.102 (n=5, σ=8091.207, CV=0.59%, CI99%=±1.223%, PI99%=±2.996%, CI[1345504.196, 1378824.008], PI[1321355.833, 1402972.371]) | 2932.974 | +0.216% |
| SET | rps | 10 | 1 | 907407.326 (n=5, σ=4502.097, CV=0.50%, CI99%=±1.022%, PI99%=±2.502%, CI[898137.447, 916677.205], PI[884700.852, 930113.800]) | 906814.476 (n=5, σ=2276.325, CV=0.25%, CI99%=±0.517%, PI99%=±1.266%, CI[902127.490, 911501.462], PI[895333.753, 918295.199]) | -592.850 | -0.065% |
| SET | rps | 10 | 9 | 1714330.624 (n=5, σ=19364.201, CV=1.13%, CI99%=±2.326%, PI99%=±5.697%, CI[1674459.467, 1754201.781], PI[1616666.633, 1811994.615]) | 1728654.400 (n=5, σ=12806.560, CV=0.74%, CI99%=±1.525%, PI99%=±3.736%, CI[1702285.515, 1755023.285], PI[1664064.087, 1793244.713]) | 14323.776 | +0.836% |

Configuration:

  • architecture: aarch64
  • benchmark_mode: duration
  • clients: 1600
  • cluster_mode: True
  • data_size: 96
  • duration: 180
  • tls: False
  • valkey_benchmark_threads: 90
  • warmup: 30
| Command | Metric | Pipeline | io_threads | HEAD | HEAD | Diff | % Change |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GET | rps | 1 | 1 | 208862.750 (n=5, σ=1847.790, CV=0.88%, CI99%=±1.822%, PI99%=±4.462%, CI[205058.124, 212667.376], PI[199543.358, 218182.142]) | 207128.030 (n=5, σ=4038.768, CV=1.95%, CI99%=±4.015%, PI99%=±9.834%, CI[198812.150, 215443.910], PI[186758.367, 227497.693]) | -1734.720 | -0.831% |
| GET | rps | 1 | 9 | 1409337.776 (n=5, σ=12419.870, CV=0.88%, CI99%=±1.815%, PI99%=±4.445%, CI[1383765.093, 1434910.459], PI[1346697.750, 1471977.802]) | 1405667.700 (n=5, σ=16285.354, CV=1.16%, CI99%=±2.385%, PI99%=±5.843%, CI[1372135.931, 1439199.469], PI[1323531.975, 1487803.425]) | -3670.076 | -0.260% |
| GET | rps | 10 | 1 | 979676.764 (n=5, σ=6572.178, CV=0.67%, CI99%=±1.381%, PI99%=±3.383%, CI[966144.558, 993208.970], PI[946529.765, 1012823.763]) | 985330.524 (n=5, σ=4710.855, CV=0.48%, CI99%=±0.984%, PI99%=±2.411%, CI[975630.807, 995030.241], PI[961571.168, 1009089.880]) | 5653.760 | +0.577% |
| GET | rps | 10 | 9 | 1953534.278 (n=5, σ=19046.359, CV=0.97%, CI99%=±2.007%, PI99%=±4.917%, CI[1914317.561, 1992750.995], PI[1857473.332, 2049595.224]) | 1953647.848 (n=5, σ=22983.445, CV=1.18%, CI99%=±2.422%, PI99%=±5.933%, CI[1906324.617, 2000971.079], PI[1837730.078, 2069565.618]) | 113.570 | +0.006% |
| SET | rps | 1 | 1 | 202143.826 (n=5, σ=839.888, CV=0.42%, CI99%=±0.856%, PI99%=±2.096%, CI[200414.485, 203873.167], PI[197907.823, 206379.829]) | 201932.202 (n=5, σ=1306.959, CV=0.65%, CI99%=±1.333%, PI99%=±3.264%, CI[199241.156, 204623.248], PI[195340.512, 208523.892]) | -211.624 | -0.105% |
| SET | rps | 1 | 9 | 1380696.002 (n=5, σ=6793.117, CV=0.49%, CI99%=±1.013%, PI99%=±2.481%, CI[1366708.881, 1394683.123], PI[1346434.692, 1414957.312]) | 1373928.674 (n=5, σ=15653.075, CV=1.14%, CI99%=±2.346%, PI99%=±5.746%, CI[1341698.776, 1406158.572], PI[1294981.870, 1452875.478]) | -6767.328 | -0.490% |
| SET | rps | 10 | 1 | 893102.922 (n=5, σ=6183.801, CV=0.69%, CI99%=±1.426%, PI99%=±3.492%, CI[880370.390, 905835.454], PI[861914.715, 924291.129]) | 892045.900 (n=5, σ=4787.770, CV=0.54%, CI99%=±1.105%, PI99%=±2.707%, CI[882187.816, 901903.984], PI[867898.625, 916193.175]) | -1057.022 | -0.118% |
| SET | rps | 10 | 9 | 1640089.728 (n=5, σ=14054.599, CV=0.86%, CI99%=±1.764%, PI99%=±4.322%, CI[1611151.115, 1669028.341], PI[1569204.892, 1710974.564]) | 1632861.474 (n=5, σ=13634.091, CV=0.83%, CI99%=±1.719%, PI99%=±4.211%, CI[1604788.693, 1660934.255], PI[1564097.484, 1701625.464]) | -7228.254 | -0.441% |
