Optimize CRC16 using multi-byte LUT#2790
Conversation
|
Hi @zuiderkwast , I would like to share some benchmark results of this PR.
Environment
Benchmark Resultsbaseline (commit
|
|
Interesting! But with a larger LUT, when it is loaded into L1 cache, other stuff are evicted that might be needed later, so it may depend on use case if it's actually faster in all cases. How much memory would the 2-byte reduction variant use? The current draft is for the 16-byte reduction, right? |
|
Hi @zuiderkwast ,
It will use 1 KB memory for LUT (256 entries * 2 bytes of each entry * 2 times for 2-byte lookup at the same time = 1024 bytes = 1 KB).
Yes, the current draft uses 16-byte reduction. |
2f1c909 to
d554219
Compare
|
Thanks for keeping this up to date. Which key length did you use for the benchmarking? Do you still have the valkey-benchmark command line you used?
Can you change it to use only 2-byte LUT? Using 4-byte or more didn't seem to gain more and it may instead evict other data from the L1 cache that we'd rather keep in there. Preferabley do it as an additional commit so the full 16-byte LUT implementation is still visible in the history of the PR. Then, I hope we can find someone more to benchmark this with realistic traffic. We should also make sure we don't get a performance regression for very short keys, like 3 byte keys. The key length varies for different users and there's also the possibility of using tags with curly braces within the keys. For example in a key named like |
|
Hi @zuiderkwast ,
Yes. I use the commands provided in this PR comments. For clearance, I re-post here:
I admit I have no idea. However, the key length should be the settings of
I will alter my code to use only 2-byte LUT in later commits. |
|
Hi @zuiderkwast , For one more thing:
I guess you mean: If I am wrong, just let me know. |
Right, I remember you first used
The number is zero-padded and varies with I think both of these are realistic keys lengths and patterns. 16 and 3 bytes are both good to test. |
Yeah, I meant just like you've done it, i.e.
|
|
@Cuda-Chen Do you want to profile it? Using perf or similar, for example generate a flamegraph so we can see how much the server spends in the crc16 function. IMO, only |
| for(; counter + 1 < len; counter += 2) { | ||
| /* explicitly get two bytes */ | ||
| uint16_t a = buf[counter]; | ||
| uint16_t b = buf[counter + 1]; | ||
| uint16_t tmp = ((a << 8) | b); | ||
|
|
||
| crc ^= tmp; | ||
| // fit LITTLE-ENDIAN architecture | ||
| crc = crc16tab[1 * 256 + (uint8_t)(crc >> 8)] ^ crc16tab[0 * 256 + (uint8_t)(crc >> 0)]; | ||
| } | ||
|
|
||
| // deal with leftover | ||
| for(; counter < len; counter++) | ||
| crc = (crc << 8) ^ crc16tab[((crc >> 8) ^ (uint8_t)buf[counter]) & 0x00FF]; |
There was a problem hiding this comment.
Just so I understand correctly:
The code for 2-byte LUT still does 2 table lookups for every 2 bytes of input. Same as for 1-byte LUT.
The difference is that we do fewer shifts? Only one instead of two.
I find it hard to see how reducing a single instruction per input byte would give any benefit TBH. Maybe the benchmark results were just random, no significant differences at all?
There was a problem hiding this comment.
The difference is that we do fewer shifts? Only one instead of two.
The difference is that the two loads and two table lookups can be done in parallel (Instruction Parallelism). So we can have a chance to improve the performance.
For example, a CRC32 implementation (provided by 1) can have an improvement from 1.10 bits per cycle (1-byte LUT) to 1.60 bits per cycle (2-byte Tabular).
What's more, the multi-byte LUT (esepcially the slicing-by-4 and slicing-by-8 originated from Intel 2) consumes four or eight input bytes in the same time to improve performance further (in 3, this technique can improve performance to 4.80 bits per cycle in slicing-by-8 implementation).
Footnotes
There was a problem hiding this comment.
two table lookups can be done in parallel
Right, this is the point. Thanks for reminding me. :)
Yes. I will profile with/without setting the
I will remind to benchmark/profile with only |
| uint16_t a = buf[counter]; | ||
| uint16_t b = buf[counter + 1]; |
There was a problem hiding this comment.
| uint16_t a = buf[counter]; | |
| uint16_t b = buf[counter + 1]; | |
| uint16_t a = (uint8_t)buf[counter]; | |
| uint16_t b = (uint8_t)buf[counter + 1]; |
Benchmark Environment
Benchmark ProceduresNote
NormalMention in this previous comment. Clustering ModeBenchmark ResultsNormalRPS
latency (measured in seconds)
Detailed Performance MetricsClustering ModeRPS
latency (measured in seconds)
Detailed Performance Metrics |
FlameGraphsNote
Normal
Clustering Mode
|
754a7d5 to
95ebdea
Compare
Saying for 3 bytes, I come up with an idea: instead of reduce-by-power-of-2, how about reduce-by-prime (just like FFTW does for FFT)? |
95ebdea to
688b2f2
Compare
|
Regarding reduce-by-prime, I don't really understand what you mean. Would we have tables for e.g. 5, 3, 2 and 1 byte? I have a new idea: Can we get more memory-parallelization if we do crc16 on multiple strings in parallel? Commands often come in batches. |
Yes. /* largest prime factor of an integer
* Each index indicate the largest prime factor of the index.
* E.g., fact[3] = 3 means the largest prime factor of 3 is 3.
*/
int fact[] = {0, 1, 2, 3, ...};
while(len >= 0) {
/* we can change switch-case to computed goto for potential more performance */
switch(fact[len]) {
/* ... plenty of prime number cases */
case 5:
/* do 5-byte tabular CRC */
len -= 5;
break;
case 3:
/* do 3-byte tabular CRC */
len -= 3;
break;
case 2:
/* do 2-byte tabular CRC */
len -= 2;
break;
case 1:
/* do 1-byte tabular CRC */
len -= 1;
break;
}
} |
I will say yes. But I need to realize how ValKey does crc16 on multiple strings as I am not familiar with this part. |
So after implementing the reduce-by-3, I find a significant impact in clustering mode. So I will drop the commit of reduce-by-3. For record, I paste the benchmark result (the measurement is as same as this previous comment). NormalRPS
latency
Cluster ModeRPS
latench
|
|
One more thing (for my own curiosity): does ValKey has performance issue when the server runs with HDD? I borrow a computer with HDD, and clustering mode gives the same latency when benchmarking. |
Hard disk? Snapshots to disk are asynchronous using a child process so they shouldn't affect the latency and RPS of commands.
OK, interesting. In the The smallest reduce-by-1 LUT still seems good enough. I wanted to try to make compute multiple crc16 in parallel. It covers some scenario when a client sends multiple commands and the servers handles them in a batch, or the cliend sends commands that involves multiple keys (MGET for example). I haven't really tested it properly yet. #3323 |
I find that I forget to put
I think I can merge this commit then benchmark. |
|
I've benchmarked the reduce-by-3 and reduce-by-2 LUT locally. Over multiple runs, I can't see that any of these variants is consistently better than reduce-by-1. Therefore, I don't think it's worth merging it. My draft where I tried to do crc16 on multiple keys in parallel, also didn't give good results. Actually it was 1% degradation in RPS. Probably because of the extra logic means more instructions. The first thoughts that it's possible to optimize crc16 using some SIMD instruction turns out not to be true. It's not so easy for this use case. I think we should try further to optimize crc16. Of course, if you want, you are free to continue, but I think it's better to find some other code path to optimize. Anyway, thank you for the big effort to try out all of these ideas! |
Signed-off-by: Cuda-Chen <clh960524@gmail.com>
9dac4d2 to
c3602f1
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## unstable #2790 +/- ##
=============================================
+ Coverage 0 74.94% +74.94%
=============================================
Files 0 129 +129
Lines 0 71565 +71565
=============================================
+ Hits 0 53634 +53634
- Misses 0 17931 +17931
🚀 New features to boost your workflow:
|
|
I am going to close this PR as I agree that we do not spot any crucial performance improvement. I would like to also give a great gratitude to @zuiderkwast for assisting the performance testing. Anyway, thanks this topic and community very much! |
|
Now it's possible to run automated benchmarks in cluster mode. Let me repoen this just to run it. |
|
Cluster Benchmark ran on this commit: Benchmark Comparison: HEAD vs HEAD (averaged) - rps metricsRun Summary:
Statistical Notes:
Note: Values with (n=X, σ=Y, CV=Z%, CI99%=±W%, PI99%=±V%) indicate averages from X runs with standard deviation Y, coefficient of variation Z%, 99% confidence interval margin of error ±W% of the mean, and 99% prediction interval margin of error ±V% of the mean. CI bounds [A, B] and PI bounds [C, D] show the actual interval ranges. Configuration:
Configuration:
|
Optimize CRC16 using multi-byte LUT.
See also the discussion in the previous attempt: #2691