More efficient RIB tree for BMP#2433
Merged
Merged
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2433 +/- ##
==========================================
- Coverage 85.38% 85.38% -0.01%
==========================================
Files 237 237
Lines 14844 14845 +1
==========================================
Hits 12675 12675
+ Misses 1435 1434 -1
- Partials 734 736 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
fa6a40d to
4ef7ac7
Compare
Member
Author
ddcb512 to
a4f9fd3
Compare
a4f9fd3 to
0c66127
Compare
Member
Author
|
Explanations in this blog post. |
This was detected by "perf c2c" on an Intel CPU, but I am unable to
"fix" it:
```
goos: linux
goarch: amd64
pkg: akvorado/outlet/routing/provider/bmp
cpu: AMD Ryzen 5 5600X 6-Core Processor
│ 1.txt │ 2.txt │
│ sec/read │ sec/read vs base │
RIBConcurrent/16_shards,_500000_routes,_8_writers,_32_readers-12 5.271µ ± 3% 5.304µ ± 10% ~ (p=0.734 n=6)
│ 1.txt │ 2.txt │
│ sec/write │ sec/write vs base │
RIBConcurrent/16_shards,_500000_routes,_8_writers,_32_readers-12 40.17µ ± 3% 40.10µ ± 5% ~ (p=0.485 n=6)
```
I keep it here as a reminder to check that. Here is how this is done:
```
make test-bench PKG=akvorado/outlet/routing/provider/bmp \
GOAMD64=v3 GOTEST_ARGS="-c"
```
You get `bmp.test`. On an Intel CPU:
```
perf c2c record -g -- ./bmp.test \
-test.run='^$' \
-test.bench='BenchmarkRIBConcurrent/16_shards,_500000_routes,_8_writers,_32_readers' \
-test.benchtime=10s -test.count=1
perf c2c report --stdio
```
Then, look at the most hit cache line and at the lines in
"akvorado/outlet/routing". There is some awk to add LclHitm% to find the
lines that have the higher percent. Then you get the line and the offset
and you have to guess the structure. For example:
```
0.00% 1.03% 0.00% 0.00% 0.00% 0x28 0 1 0xa45e02 0 369 401 250 35 [.] akvorado/outlet/routing/pr bmp.test rib.go:203 0
```
Line 203 is return int(h % uint32(len(r.shards))). So, the impacted
structure is like struct rib and offset 0x28 is the shards array. And
other lines show that's the mutex. So, there is false sharing between
them.
I should dig a bit more in that.
It should be noted that `ModifyPersist()` is not used as it would copy
the tree for nothing if the prefix already exists (we just need to
update the routes) and we add/update/remove. So, we use Get() and a
conditional insert. Likely, the prefix already exists and this is a win.
Partial bench looks OK:
```
goos: linux
goarch: amd64
pkg: akvorado/outlet/routing/provider/bmp
cpu: AMD Ryzen 5 5600X 6-Core Processor
│ 1.txt │ 3.txt │
│ sec/read │ sec/read vs base │
RIBConcurrent/16_shards,_500000_routes,_8_writers,_32_readers-12 5.271µ ± 3% 2.758µ ± 13% -47.68% (p=0.002 n=6)
│ 1.txt │ 3.txt │
│ sec/write │ sec/write vs base │
RIBConcurrent/16_shards,_500000_routes,_8_writers,_32_readers-12 40.17µ ± 3% 25.70µ ± 4% -36.02% (p=0.002 n=6)
```
This is consistent with the 20% HITM of perf c2c report.
If a prefix index is reused for another prefix while a lookup is in
progress, we return the wrong routes. This is an issue that was also
true before removing the global RW lock. We attach a generation for each
prefix index in a shard and store it along the prefix index in the tree.
If the generation does not match the index, we assume the prefix is gone
and return no route.
This increases the size of the tree. This also stores a list of prefix
indexes. With 1.2M routes, we get 5MB of data. This should be OK.
No impact on a partial bench:
```
goos: linux
goarch: amd64
pkg: akvorado/outlet/routing/provider/bmp
cpu: AMD Ryzen 5 5600X 6-Core Processor
│ 1.txt │ 3.txt │ 4.txt │
│ sec/read │ sec/read vs base │ sec/read vs base │
RIBConcurrent/16_shards,_500000_routes,_8_writers,_32_readers-12 5.271µ ± 3% 2.758µ ± 13% -47.68% (p=0.002 n=6) 2.789µ ± 11% -47.08% (p=0.002 n=6)
│ 1.txt │ 3.txt │ 4.txt │
│ sec/write │ sec/write vs base │ sec/write vs base │
RIBConcurrent/16_shards,_500000_routes,_8_writers,_32_readers-12 40.17µ ± 3% 25.70µ ± 4% -36.02% (p=0.002 n=6) 26.19µ ± 5% -34.79% (p=0.002 n=6)
```
Also commit benchmark data.
It's useful on collisions, but we compute the value even if there is no
collisions. The collisions should be rare. Store the hash value in place
of the previous link (only used if there are collisions).
Benchmarks are good:
```
goos: linux
goarch: amd64
pkg: akvorado/outlet/routing/provider/bmp
cpu: AMD Ryzen 5 5600X 6-Core Processor
│ 11.txt │ 22.txt │
│ %ins │ %ins vs base │
RIBInsertion/1000_routes,_1_peers-12 100.0 ± 0% 100.0 ± 0% ~ (p=1.000 n=6) ¹
RIBInsertion/1000_routes,_2_peers-12 100.0 ± 0% 100.0 ± 0% ~ (p=1.000 n=6) ¹
RIBInsertion/1000_routes,_5_peers-12 100.0 ± 0% 100.0 ± 0% ~ (p=1.000 n=6) ¹
RIBInsertion/10000_routes,_1_peers-12 99.98 ± 0% 99.98 ± 0% ~ (p=1.000 n=6) ¹
RIBInsertion/10000_routes,_2_peers-12 99.98 ± 0% 99.98 ± 0% ~ (p=1.000 n=6) ¹
RIBInsertion/10000_routes,_5_peers-12 99.98 ± 0% 99.98 ± 0% ~ (p=1.000 n=6) ¹
RIBInsertion/100000_routes,_1_peers-12 99.81 ± 0% 99.81 ± 0% ~ (p=1.000 n=6) ¹
RIBInsertion/100000_routes,_2_peers-12 99.81 ± 0% 99.81 ± 0% ~ (p=1.000 n=6) ¹
RIBInsertion/100000_routes,_5_peers-12 99.81 ± 0% 99.81 ± 0% ~ (p=1.000 n=6) ¹
geomean 99.93 99.93 +0.00%
¹ all samples are equal
│ 11.txt │ 22.txt │
│ bytes/route │ bytes/route vs base │
RIBInsertion/1000_routes,_1_peers-12 388.4 ± 2% 388.6 ± 2% ~ (p=0.909 n=6)
RIBInsertion/1000_routes,_2_peers-12 374.2 ± 1% 374.4 ± 1% ~ (p=0.747 n=6)
RIBInsertion/1000_routes,_5_peers-12 351.7 ± 0% 351.8 ± 0% ~ (p=0.732 n=6)
RIBInsertion/10000_routes,_1_peers-12 335.1 ± 0% 335.1 ± 0% ~ (p=0.773 n=6)
RIBInsertion/10000_routes,_2_peers-12 331.3 ± 0% 331.2 ± 0% ~ (p=0.545 n=6)
RIBInsertion/10000_routes,_5_peers-12 280.0 ± 0% 280.0 ± 0% ~ (p=1.000 n=6)
RIBInsertion/100000_routes,_1_peers-12 272.8 ± 0% 272.8 ± 0% ~ (p=1.000 n=6)
RIBInsertion/100000_routes,_2_peers-12 260.5 ± 0% 260.5 ± 0% ~ (p=1.000 n=6) ¹
RIBInsertion/100000_routes,_5_peers-12 305.5 ± 0% 305.7 ± 0% ~ (p=0.835 n=6)
geomean 319.3 319.4 +0.02%
¹ all samples are equal
│ 11.txt │ 22.txt │
│ sec/route │ sec/route vs base │
RIBInsertion/1000_routes,_1_peers-12 2.971µ ± 5% 2.998µ ± 6% ~ (p=0.563 n=6)
RIBInsertion/1000_routes,_2_peers-12 2.450µ ± 13% 2.393µ ± 1% -2.29% (p=0.002 n=6)
RIBInsertion/1000_routes,_5_peers-12 2.361µ ± 5% 2.186µ ± 1% -7.41% (p=0.002 n=6)
RIBInsertion/10000_routes,_1_peers-12 2.546µ ± 1% 2.450µ ± 3% -3.75% (p=0.009 n=6)
RIBInsertion/10000_routes,_2_peers-12 3.102µ ± 3% 2.937µ ± 2% -5.30% (p=0.002 n=6)
RIBInsertion/10000_routes,_5_peers-12 4.139µ ± 1% 4.064µ ± 2% ~ (p=0.058 n=6)
RIBInsertion/100000_routes,_1_peers-12 4.957µ ± 1% 4.966µ ± 2% ~ (p=0.818 n=6)
RIBInsertion/100000_routes,_2_peers-12 5.065µ ± 2% 5.011µ ± 1% -1.08% (p=0.041 n=6)
RIBInsertion/100000_routes,_5_peers-12 4.206µ ± 1% 4.158µ ± 1% -1.13% (p=0.026 n=6)
geomean 3.391µ 3.308µ -2.44%
│ 11.txt │ 22.txt │
│ ms/op │ ms/op vs base │
RIBFlush/1000_routes,_1_peers-12 185.6m ± 5% 170.5m ± 2% -8.16% (p=0.002 n=6)
RIBFlush/1000_routes,_2_peers-12 408.1m ± 1% 392.5m ± 2% -3.82% (p=0.002 n=6)
RIBFlush/1000_routes,_5_peers-12 882.3m ± 0% 861.0m ± 2% -2.42% (p=0.004 n=6)
RIBFlush/10000_routes,_1_peers-12 2.425 ± 23% 1.754 ± 6% -27.69% (p=0.002 n=6)
RIBFlush/10000_routes,_2_peers-12 5.489 ± 20% 4.953 ± 7% -9.76% (p=0.002 n=6)
RIBFlush/10000_routes,_5_peers-12 13.90 ± 1% 13.35 ± 2% -3.99% (p=0.002 n=6)
RIBFlush/100000_routes,_1_peers-12 40.94 ± 5% 36.28 ± 12% -11.37% (p=0.009 n=6)
RIBFlush/100000_routes,_2_peers-12 90.26 ± 5% 84.91 ± 4% -5.93% (p=0.015 n=6)
RIBFlush/100000_routes,_5_peers-12 209.4 ± 1% 231.0 ± 14% +10.32% (p=0.004 n=6)
geomean 5.966 5.520 -7.48%
```
0c66127 to
449555a
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
While writing a blog post about RIB sharding, I find myself looking a bit more at the code and was thinking the global lock looks very bad. Since we now have a concurrent benchmark, we can look again at using "persist" variants from bart package. It works well! -60% read, -30% write! There was a trick not used in the previous tentatives: not using ModifyPersist which triggers "false" modifications, but instead use
Get()+InsertPersist()orDeletePersist().The second change is unrelated as it fixes a bug that was already there: it was possible to fetch a stale index from the tree and retrieve the wrong routes if the index is recycled. The third commit of this PR fixes that by introducing a generation number to detect when a prefix is recycled. It does not seem to impact performance.
cc @bogi788 if you want to test (but I need to spend a bit more time to ensure everything is correct, so no hurry).