core: Change the default plugin for Ceph erasure coded pools from Jerasure to ISA-L#58052
Conversation
Signed-off-by: Jamie Pryde <jamiepry@uk.ibm.com>
Fascinating. My first thought: #3 sounds appealing, but it turned out to often be slower than #1. How do we handle the analogous situation today? My other question is how well vetted ISA-L is with Ceph. We have loads of runtime on Jerasure, but are we confident that there aren't gotchas with ISA-L in the context of Ceph?
ceph-erasure-code-tool validate-profile \
-    plugin=jerasure,technique=reed_sol_van,k=2,m=1
+    plugin=isa,technique=reed_sol_van,k=2,m=1
Are the technique details 1:1 transferable?
I think they probably are. I'm going to add a test that compares the encoding output for various K/M values to try to confirm this.
They are not 1:1 transferable. I've added a tool that prints out the hash of each chunk here: #58377
Here are chunks for ISA and Jerasure using reed_sol_van with k=3 m=3
[root@9d9c7969fd9a build]# cat plugin_comparison/output
489715dfc67930395b840e6f62f87310 plugin_comparison/plugin=isa stripe-width=256 k=3 m=3 technique=reed_sol_van/0
f2dfcedc0cbd3b8247267941e44ede2b plugin_comparison/plugin=isa stripe-width=256 k=3 m=3 technique=reed_sol_van/1
3b86c3580ddae1fd8a2bf7b0d94ef931 plugin_comparison/plugin=isa stripe-width=256 k=3 m=3 technique=reed_sol_van/2
3041babbcf88ebd486a4e103e7cdc4ae plugin_comparison/plugin=isa stripe-width=256 k=3 m=3 technique=reed_sol_van/3
72af377096a80419bc2d607031ad3404 plugin_comparison/plugin=isa stripe-width=256 k=3 m=3 technique=reed_sol_van/4
4d88897a89854d5732fd094b401d3107 plugin_comparison/plugin=isa stripe-width=256 k=3 m=3 technique=reed_sol_van/5
489715dfc67930395b840e6f62f87310 plugin_comparison/plugin=jerasure stripe-width=256 k=3 m=3 technique=reed_sol_van/0
f2dfcedc0cbd3b8247267941e44ede2b plugin_comparison/plugin=jerasure stripe-width=256 k=3 m=3 technique=reed_sol_van/1
3b86c3580ddae1fd8a2bf7b0d94ef931 plugin_comparison/plugin=jerasure stripe-width=256 k=3 m=3 technique=reed_sol_van/2
3041babbcf88ebd486a4e103e7cdc4ae plugin_comparison/plugin=jerasure stripe-width=256 k=3 m=3 technique=reed_sol_van/3
858b48eb0c113ee3f217e91a426fc4fd plugin_comparison/plugin=jerasure stripe-width=256 k=3 m=3 technique=reed_sol_van/4
b50e4bdf05f80ba7cab1164505cd6e87 plugin_comparison/plugin=jerasure stripe-width=256 k=3 m=3 technique=reed_sol_van/5
The first coding chunk is the same because we are using an XOR optimisation for that one, but the second and third coding chunks are different.
So if we change the default plugin, it should be for new pools only.
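The matching first coding chunk can be illustrated outside Ceph: with reed_sol_van the first parity row of the generator matrix is all ones, so the first coding chunk is just the byte-wise XOR of the k data chunks, which is the same regardless of plugin. A minimal standalone sketch (not Ceph code, illustrative data only):

```python
# Illustrative sketch: the first reed_sol_van coding chunk is a plain
# byte-wise XOR of the data chunks, so it matches between ISA and
# Jerasure even when the later coding chunks differ.
def xor_parity(chunks):
    """Byte-wise XOR of equal-length data chunks."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

data = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f"]  # k=3 toy "chunks"
print(xor_parity(data).hex())  # 1e2d
```

The later coding chunks involve plugin-specific Galois-field matrix choices, which is where the hashes diverge.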
The ISA-L library includes multiple implementations of the erasure code encode and decode functions, making use of the different instruction sets of modern CPUs. For x86 there are SSE, AVX, AVX2 and AVX512 implementations, as well as a standard C implementation for ancient CPUs without any SSE support (pre-2006?). For ARM there is a standard C implementation and a NEON implementation.

ISA-L chooses which implementation to use at runtime: the very first call to the encode/decode function queries the CPU capabilities to choose the latest supported implementation, then updates a function pointer so that subsequent calls don't repeat this test. There's a tiny overhead from dereferencing the function pointer, but this is insignificant compared to the time it takes to encode/decode the data buffers.

There is a newer ISA-L library that includes a new implementation of erasure coding using GFNI instructions. We tested that, but on Icelake hardware it doesn't seem to be any faster than AVX512, so we didn't think it was worthwhile updating to a newer ISA-L library just yet.
Thanks for the clarification. I hadn't seen anything before that actually uses AVX512, so I was intrigued. Sun clearly was testing repeatedly or something similar 🤦♂️
src/vstart.sh
Outdated
$(format_conf "${extra_conf}")
mon cluster log file = $CEPH_OUT_DIR/cluster.mon.\$id.log
-osd pool default erasure code profile = plugin=jerasure technique=reed_sol_van k=2 m=1 crush-failure-domain=osd
+osd pool default erasure code profile = plugin=isa technique=reed_sol_van k=2 m=2 crush-failure-domain=osd
Any reason for changing to m=2? If you are going to increase it you should probably make k and m based on the number of OSDs
A previous pull request changed the default in other places to k=2 m=2, so I wanted to make it consistent here. Checking the number of OSDs when making a cluster with vstart is a good idea, though.
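The suggestion of sizing k and m from the OSD count could look something like the helper below (hypothetical, not part of vstart.sh; the constraint is that with crush-failure-domain=osd a k+m profile needs at least k+m OSDs):

```python
# Hypothetical helper: pick k for a k+m profile so every shard can land
# on its own OSD (crush-failure-domain=osd requires k+m <= num_osds).
def pick_ec_profile(num_osds, m=2):
    if num_osds < m + 2:
        raise ValueError("need at least m+2 OSDs for a k>=2 profile")
    k = min(num_osds - m, 4)  # cap k; large k inflates stripe width
    return {"k": k, "m": m}

print(pick_ec_profile(4))   # {'k': 2, 'm': 2}
print(pick_ec_profile(12))  # {'k': 4, 'm': 2}
```

The cap of 4 is an arbitrary choice for a dev cluster; a real patch would make it configurable.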
Great testing here @jamiepryde! Agree with both @anthonyeleven's and @bill-scales's comments. Would like to get some coverage on EPYC as well.
The idea looks very promising. What I would love to see are numbers from tests involving OSDs. I'm worried that the actual crunching isn't the biggest overhead we have on our EC paths – @bill-scales has nailed down memcpy due to misalignment.
Thanks @markhpc. Here is a run of the benchmark on a system with an AMD EPYC 7763, which shows similar results.
@rzarzynski We did some testing with Jerasure and ISA-L on an older Intel system (Xeon E5-2667 v4 @ 3.20GHz) with 4 SATA SSDs as OSDs (2+2 profile) and there was no noticeable difference in I/O performance with either plugin. I'd like to get hold of something more recent and with more drives to test different k+m values, but I suspect that any improvements we see would still be very small. I think it would still be worth changing the default plugin, though, given that 1) Jerasure is no longer maintained, and 2) the benchmark results show that ISA-L probably has better optimisation.
@rzarzynski You are right: although there are some big performance improvements shown here, the time spent encoding/decoding is a small percentage of the overall I/O path, so we might only expect to see a couple of percent overall improvement. However, that's still worth having. Exactly what percentage of time is spent encoding/decoding is tricky to measure meaningfully because it depends so much on the test stand configuration and I/O workload; we've seen measurements in the 3%–10% range for CPU time spent encoding/decoding. The heaviest decode workload is likely to come from backfill when the K/M ratio is small (e.g. 2+2) and there are multiple drive failures, provided there are enough drives and network bandwidth to prevent those from becoming a bottleneck. Sequential write with a small K/M ratio will be the heaviest encode workload.
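The "couple of percent" estimate follows from Amdahl's law applied to the 3%–10% figures above. A back-of-envelope sketch, assuming (hypothetically) that ISA-L roughly halves the EC crunching time:

```python
# Amdahl's law: overall speedup when only the EC fraction of CPU time
# gets faster. ec_fraction and ec_speedup are assumed inputs, not
# measured Ceph numbers.
def overall_speedup(ec_fraction, ec_speedup):
    return 1 / ((1 - ec_fraction) + ec_fraction / ec_speedup)

for frac in (0.03, 0.10):
    s = overall_speedup(frac, 2.0)  # assume EC runs 2x faster
    print(f"EC={frac:.0%} of CPU -> {s:.3f}x overall ({s - 1:.1%} faster)")
```

With EC at 3%–10% of CPU time and a 2x EC speedup, the overall gain works out to roughly 1.5%–5%, consistent with the "couple of percent" expectation.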
See what EC plugins and techniques are reported via our telemetry. Sometimes plugins do not specify a technique (e.g. clay), hence the "empty" value.
 level: advanced
 desc: default erasure code profile for new erasure-coded pools
-default: plugin=jerasure technique=reed_sol_van k=2 m=2
+default: plugin=isa technique=reed_sol_van k=2 m=2
I'm afraid the change isn't going to be as simple as s/jerasure/isa/. The problem is we still need to test jerasure upstream for the sake of existing clusters. I think the same idiom as with the Filestore-to-BlueStore transition is needed.
I had the same suspicion.
It would be great if we could discuss this at the performance meeting. :) The community is very interested in these details!
@bill-scales This roughly falls in line with what I've seen too. Sooner or later I think we'll need to rethink denc (and absolutely the traditional encode/decode path), but there is other low-hanging fruit that I'm targeting first.
I've been doing some more testing and noticed there is a problem with the graphs: the performance numbers for Jerasure are lower than they should be. I eventually realised this is because Ceph was built without the -DCMAKE_BUILD_TYPE=RelWithDebInfo flag on do_cmake, so I was testing a debug build. This clearly has a notable impact on performance when using Jerasure; building with RelWithDebInfo shows better Jerasure performance.

New graphs from the tool on an Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz, and in a container on a MacBook Pro with an M1 Pro:

So on x86-64 we still see slightly better encoding performance using ISA. Decoding performance is very similar (the change in #58594 also slightly improves ISA decode performance where m=1 or erasures=1).
@apeters1971 FYI. Any comments on this?
Yeah, this has been an issue that's bitten multiple people since do_cmake.sh was changed years ago. I think I was the one who added the warning note when I got bitten by it. It's also semi-related to the Ubuntu package performance bug we found last winter where RocksDB wasn't being compiled with proper optimisations. Sorry you got hit by it, but glad you figured it out before we merged this PR. FWIW, the new numbers are much more in line with what I saw the last time I compared jerasure and isa, so I think the new numbers are likely to be correct.
We've raised a few related PRs that make small changes to performance when using ISA, and some PRs to try to improve the EC benchmark tool. #58594 and #59862 use ISA's own RAID XOR function as an optimisation for m=1 and single erasures, rather than using and maintaining our own XOR code. Latest graphs testing with all of the above PRs: Xeon Gold 6336Y
Macbook M1 pro linux container (aarch64)
jenkins test make check

jenkins test api

jenkins test dashboard

jenkins test dashboard cephadm

jenkins test make check

jenkins test make check
Signed-off-by: Jamie Pryde <jamiepry@uk.ibm.com>
…to isa Signed-off-by: Jamie Pryde <jamiepry@uk.ibm.com>
markhpc left a comment:
Given jerasure's lack of maintenance and the extensive testing done here, I think this is a valid change going forward so long as we are only applying it to new pools and leaving backwards compatibility in place.
Great work, and as @markhpc noted, performance aside, the fact that jerasure is basically not maintained anymore is almost reason enough in and of itself. I support moving to ISA-L as the default.
@jamiepryde - this seems to cause a failure in test_ceph_helpers.sh.
@ronen-fr Taking a look, thanks. |










This PR changes the default plugin for erasure coded pools from Jerasure to ISA-L. Until now, Jerasure has been Ceph's default plugin due to its flexibility and generality. However, the jerasure and gf-complete libraries no longer appear to be maintained, and they have not been updated to use modern CPU instructions such as AVX2 and AVX512; using these instructions for erasure coding can result in performance improvements. ISA-L, on the other hand, is still maintained and receives updates to take advantage of modern CPU features, and it offers comparable flexibility to Jerasure.
Testing with the ceph_erasure_code_benchmark tool shows the potential for performance improvements when using ISA-L instead of Jerasure.
This benchmark data was captured by compiling Ceph from source and then executing the benchmark tool using the versions of the EC libraries that are currently included in Ceph. The following command was used to run the tool.
TOTAL_SIZE=$((4 * 1024 * 1024 * 1024)) qa/workunits/erasure-code/bench.sh fplot | tee qa/workunits/erasure-code/bench.js

The first set of data was captured on an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz to show the potential for improved performance on x86-64 architecture.
The second set of data was captured on a Macbook Pro with an M1 Pro and 16GB RAM. Ceph is built in a CentOS 8 aarch64 container running with 8GB of RAM assigned to the podman machine VM. This is intended to show the potential for improved performance on ARM64/aarch64 architecture, as well as showing that ISA-L supports non-Intel platforms.