
core: Change the default plugin for Ceph erasure coded pools from Jerasure to ISA-L #58052

Merged
yuriw merged 6 commits into ceph:main from jamiepryde:isal-default on Jan 21, 2025

Conversation

@jamiepryde (Contributor)

This PR changes the default plugin for erasure coded pools from Jerasure to ISA-L. Until now, Jerasure has been Ceph's default plugin because of its flexibility and generality. However, the jerasure and gf-complete libraries no longer appear to be maintained, and they have not been updated to use modern CPU instructions such as AVX2 and AVX512, which can deliver significant erasure coding performance improvements. ISA-L, on the other hand, is still maintained, receives updates that take advantage of modern CPU features, and offers comparable flexibility to Jerasure.

Testing with the ceph_erasure_code_benchmark tool shows the potential for performance improvements when using ISA-L instead of Jerasure.

This benchmark data was captured by compiling Ceph from source and then executing the benchmark tool using the versions of the EC libraries that are currently included in Ceph. The following command was used to run the tool.

TOTAL_SIZE=$((4 * 1024 * 1024 * 1024)) qa/workunits/erasure-code/bench.sh fplot | tee qa/workunits/erasure-code/bench.js

The first set of data was captured on an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz to show the potential for improved performance on x86-64 architecture.

[benchmark graphs: Jerasure vs ISA-L on Xeon Gold 5218]

The second set of data was captured on a MacBook Pro with an M1 Pro and 16 GB RAM. Ceph was built in a CentOS 8 aarch64 container running with 8 GB of RAM assigned to the podman machine VM. This is intended to show the potential for improved performance on ARM64/aarch64, as well as to show that ISA-L supports non-Intel platforms.

[benchmark graphs: Jerasure vs ISA-L on M1 Pro aarch64 container]

Signed-off-by: Jamie Pryde <jamiepry@uk.ibm.com>
@anthonyeleven (Contributor)

Fascinating. My first thought when you mention AVX2 and AVX512 was to wonder if we can exploit those when present but also not fail when they aren't. I see the ARM test, but I'd love to understand how various x64 CPUs are handled. Back in the mists of time, a 68020 system could have an optional 68881 math coprocessor. With SunOS libs and compiler there were three compilation options:

  1. No 68881 instructions
  2. Only run if a 68881 is present
  3. Switch at runtime

#3 sounds appealing, but it turned out to often be slower than #1. How do we handle the analogous situation today?

My other question is how well vetted ISA-L is with Ceph. We have loads of runtime on jerasure, but are we confident that there aren't gotchas with ISA-L in the context of Ceph?


 ceph-erasure-code-tool validate-profile \
-    plugin=jerasure,technique=reed_sol_van,k=2,m=1
+    plugin=isa,technique=reed_sol_van,k=2,m=1
Contributor

Are the technique details 1:1 transferrable?

Contributor Author

I think they probably are. I'm going to add a test that compares the encoding output for various K/M values to try to confirm this.

Contributor Author

They are not 1:1 transferable. I've added a tool that prints out the hash of each chunk here: #58377

Here are chunks for ISA and Jerasure using reed_sol_van with k=3 m=3

[root@9d9c7969fd9a build]# cat plugin_comparison/output
489715dfc67930395b840e6f62f87310  plugin_comparison/plugin=isa stripe-width=256 k=3 m=3 technique=reed_sol_van/0
f2dfcedc0cbd3b8247267941e44ede2b  plugin_comparison/plugin=isa stripe-width=256 k=3 m=3 technique=reed_sol_van/1
3b86c3580ddae1fd8a2bf7b0d94ef931  plugin_comparison/plugin=isa stripe-width=256 k=3 m=3 technique=reed_sol_van/2
3041babbcf88ebd486a4e103e7cdc4ae  plugin_comparison/plugin=isa stripe-width=256 k=3 m=3 technique=reed_sol_van/3
72af377096a80419bc2d607031ad3404  plugin_comparison/plugin=isa stripe-width=256 k=3 m=3 technique=reed_sol_van/4
4d88897a89854d5732fd094b401d3107  plugin_comparison/plugin=isa stripe-width=256 k=3 m=3 technique=reed_sol_van/5

489715dfc67930395b840e6f62f87310  plugin_comparison/plugin=jerasure stripe-width=256 k=3 m=3 technique=reed_sol_van/0
f2dfcedc0cbd3b8247267941e44ede2b  plugin_comparison/plugin=jerasure stripe-width=256 k=3 m=3 technique=reed_sol_van/1
3b86c3580ddae1fd8a2bf7b0d94ef931  plugin_comparison/plugin=jerasure stripe-width=256 k=3 m=3 technique=reed_sol_van/2
3041babbcf88ebd486a4e103e7cdc4ae  plugin_comparison/plugin=jerasure stripe-width=256 k=3 m=3 technique=reed_sol_van/3
858b48eb0c113ee3f217e91a426fc4fd  plugin_comparison/plugin=jerasure stripe-width=256 k=3 m=3 technique=reed_sol_van/4
b50e4bdf05f80ba7cab1164505cd6e87  plugin_comparison/plugin=jerasure stripe-width=256 k=3 m=3 technique=reed_sol_van/5

The first coding chunk is the same because we are using an xor optimisation for that one, but the second and third coding chunks are different.

So if we change the default plugin, it should be for new pools only.
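Why the first coding chunk matches while the others differ can be sketched with a toy systematic code in GF(2^8) (illustrative only: the matrices below are made up, not the ones Jerasure or ISA-L actually build). When the first parity row of the generator matrix is all ones, parity 0 reduces to a plain XOR of the data chunks and is identical for any plugin; higher parity rows use plugin-specific coefficients, so those chunks can legitimately differ.

```python
from functools import reduce

def gf_mul(a, b, poly=0x11d):
    """Multiply two bytes in GF(2^8) with the usual 0x11d reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def encode(data_chunks, parity_rows):
    """Compute one parity chunk per coefficient row (systematic code)."""
    length = len(data_chunks[0])
    return [
        bytes(
            reduce(lambda x, y: x ^ y,
                   (gf_mul(coef, chunk[i])
                    for coef, chunk in zip(row, data_chunks)))
            for i in range(length))
        for row in parity_rows
    ]

data = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]

# Two hypothetical plugins: both use an all-ones first parity row (pure XOR),
# but choose different coefficients for the second row.
plugin_a = encode(data, [[1, 1, 1], [1, 2, 4]])
plugin_b = encode(data, [[1, 1, 1], [1, 3, 9]])

assert plugin_a[0] == plugin_b[0]  # first parity chunk: identical (plain XOR)
assert plugin_a[1] != plugin_b[1]  # second parity chunk: plugin-dependent
```

Since recovering data requires the same matrix that produced the parity chunks, chunks written under one plugin's matrices can't in general be repaired with the other's, which is why the default change has to apply to new pools only.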

@bill-scales (Contributor)

> Fascinating. My first thought when you mention AVX2 and AVX512 was to wonder if we can exploit those when present but also not fail when they aren't. I see the ARM test, but I'd love to understand how various x64 CPUs are handled.

The ISA-L library includes multiple implementations of the erasure code encode and decode functions, making use of the different instruction sets of modern CPUs. For x86 there are SSE, AVX, AVX2, and AVX512 implementations, as well as a plain C implementation for ancient CPUs without any SSE support (pre-2006?). For ARM there are a plain C implementation and a NEON implementation.

ISA-L chooses which implementation to use at runtime: the very first call to an encode/decode function queries the CPU capabilities, selects the most advanced supported implementation, and updates a function pointer so that subsequent calls don't repeat the test. There's a tiny overhead from dereferencing the function pointer, but it is insignificant compared to the time it takes to encode/decode the data buffers.

There is a newer ISA-L release that includes a new erasure coding implementation using GFNI instructions. We tested it, but on Ice Lake hardware it doesn't seem to be any faster than AVX512, so we didn't think it was worthwhile updating to a new ISA-L release just yet.

@anthonyeleven (Contributor)

Thanks for the clarification. I hadn't seen anything before that actually uses AVX512, so I was intrigued. Sun was clearly testing repeatedly, or something similar 🤦‍♂️

src/vstart.sh (outdated)

         $(format_conf "${extra_conf}")
         mon cluster log file = $CEPH_OUT_DIR/cluster.mon.\$id.log
-        osd pool default erasure code profile = plugin=jerasure technique=reed_sol_van k=2 m=1 crush-failure-domain=osd
+        osd pool default erasure code profile = plugin=isa technique=reed_sol_van k=2 m=2 crush-failure-domain=osd
Contributor

Any reason for changing to m=2? If you are going to increase it, you should probably base k and m on the number of OSDs.

Contributor Author

A previous pull request changed the default in other places to K=2 M=2 so I wanted to make it consistent here. Checking the number of OSDs when making a cluster with vstart is a good idea though.

@markhpc (Member)

markhpc commented Jun 20, 2024

Great testing here @jamiepryde! Agree with both @anthonyeleven and @bill-scales's comments. Would like to get some coverage on EPYC as well.

@rzarzynski (Contributor)

> This benchmark data was captured by compiling Ceph from source and then executing the benchmark tool using the versions of the EC libraries that are currently included in Ceph.

The idea looks very promising. What I would love to see are numbers from tests involving OSDs. I'm worried that the actual number crunching isn't the biggest overhead we have on our EC paths; @bill-scales has nailed down memcpy overhead due to misalignment.

@jamiepryde (Contributor Author)

> Great testing here @jamiepryde! Agree with both @anthonyeleven and @bill-scales's comments. Would like to get some coverage on EPYC as well.

Thanks @markhpc. Here is a run of the benchmark on a system with an AMD EPYC 7763, which shows similar results:

[benchmark graphs: Jerasure vs ISA-L on AMD EPYC 7763]

> This benchmark data was captured by compiling Ceph from source and then executing the benchmark tool using the versions of the EC libraries that are currently included in Ceph.

> The idea looks very promising. What I would love to see are numbers from tests involving OSDs. I'm worried that the actual crunching isn't the biggest overhead we have on our EC paths; @bill-scales has nailed down memcpy due to misalignment.

@rzarzynski We did some testing with Jerasure and ISA-L on an older Intel system (Xeon E5-2667 v4 @ 3.20GHz) with 4 SATA SSDs as OSDs (2+2 profile), and there was no noticeable difference in I/O performance with either plugin. I'd like to get hold of something more recent with more drives to test different k+m values, but I suspect any improvements would still be very small.

I think it would still be worth changing the default plugin though, given that 1) Jerasure is no longer maintained, and 2) the benchmark results show that ISA-L is probably better optimised.

@bill-scales (Contributor)

@rzarzynski You are right that although there are some big performance improvements shown here, the time spent encoding/decoding is a small fraction of the overall I/O path, so we might only expect to see a couple of percent improvement; however, that's still worth having.

Exactly what percentage of time is spent encoding/decoding is tricky to measure meaningfully because it depends so much on the test stand configuration and I/O workload; we've seen measurements in the 3%-10% range for CPU time spent encoding/decoding. The heaviest decode workload is likely to come from backfill when the K/M ratio is small (e.g. 2+2) and there are multiple drive failures, provided there are enough drives and enough network bandwidth that these don't become the bottleneck. Sequential write with a small K/M ratio will be the heaviest encode workload.

@yaarith (Contributor)

yaarith commented Jun 24, 2024

See what EC plugins and techniques are reported via our telemetry:
https://telemetry-public.ceph.com/d/b5664aff-721e-4c8a-b79a-7d3e2a8eaf07/ec-plugin?orgId=1

Sometimes plugins do not specify a technique (e.g. clay), hence the "empty" value.

 level: advanced
 desc: default erasure code profile for new erasure-coded pools
-default: plugin=jerasure technique=reed_sol_van k=2 m=2
+default: plugin=isa technique=reed_sol_van k=2 m=2
Contributor

I'm afraid the change isn't going to be as simple as s/jerasure/isa/. The problem is that we still need to test jerasure upstream for the sake of existing clusters. I think the same idiom as in the Filestore-to-BlueStore transition is needed.

Contributor

I had the same suspicion.

@markhpc (Member)

markhpc commented Jun 27, 2024

> This benchmark data was captured by compiling Ceph from source and then executing the benchmark tool using the versions of the EC libraries that are currently included in Ceph.
>
> The idea looks very promising. What I would love to see are numbers from tests involving OSDs. I'm worried that the actual crunching isn't the biggest overhead we have on our EC paths; @bill-scales has nailed down memcpy due to misalignment.

It would be great if we could discuss this at the performance meeting. :) The community is very interested in these details!

@markhpc (Member)

markhpc commented Jun 27, 2024

> @rzarzynski You are right that although there are some big performance improvements shown here, as an overall percentage of the I/O path the time spent encoding/decoding is small so overall we might only expect to see a couple of percent improvement; however, that's still worth having.
>
> Exactly what percentage of time is spent encoding/decoding is tricky to measure meaningfully because it depends so much on the test stand configuration and I/O workload; we've seen measurements in the 3%-10% range for CPU time spent encoding/decoding. The heaviest decode workload is likely to come from backfill when the K/M ratio is small (e.g. 2+2) and there are multiple drive failures. Sequential write with a small K/M ratio will be the heaviest encode workload.

@bill-scales This roughly falls in line with what I've seen too. Sooner or later I think we'll need to rethink denc (and absolutely the traditional encode/decode path), but there are other low hanging fruit that I'm targeting first.

@jamiepryde (Contributor Author)

I've been doing some more testing and noticed a problem with the graphs: the performance numbers for Jerasure were lower than they should be. I eventually realised this is because Ceph was built without -DCMAKE_BUILD_TYPE=RelWithDebInfo passed to do_cmake, so I was testing a debug build, which notably hurts Jerasure's performance. Building with RelWithDebInfo shows better performance when using Jerasure. New graphs from the tool on an Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz:

[benchmark graphs: Xeon Gold 6336Y, RelWithDebInfo build]

And in a container on a MacBook Pro with an M1 Pro:

[benchmark graphs: M1 Pro container, RelWithDebInfo build]

So on x86-64 we still see slightly better encoding performance using ISA. Decoding performance is very similar (the change in #58594 also slightly improves ISA decode performance where m=1 or erasures=1).
On ARM, we now see slightly better encoding and decoding performance when using Jerasure.
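For context, the m=1 / single-erasure case that #58594 optimises reduces to RAID-style XOR: the single coding chunk is just the XOR of the data chunks, and any one missing chunk is the XOR of the survivors. A minimal sketch of that property (illustrative only, not ISA-L's API):

```python
from functools import reduce

def xor_chunks(chunks):
    # Byte-wise XOR across equal-length chunks.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

data = [bytes([10, 20]), bytes([30, 40]), bytes([50, 60])]
parity = xor_chunks(data)  # m=1: the single coding chunk is plain XOR

# Recover a single erased data chunk from the survivors plus the parity.
recovered = xor_chunks([data[0], data[2], parity])
assert recovered == data[1]
```

Because this path needs no GF(2^8) multiplications at all, a well-tuned XOR routine (such as ISA-L's RAID XOR function) is all that's required for m=1 encode and single-erasure decode.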

@dvanders (Contributor)

@apeters1971 FYI. Any comments on this?

@markhpc (Member)

markhpc commented Jul 29, 2024

> I've been doing some more testing and noticed there is a problem with the graphs. The performance numbers for Jerasure are lower than they should be. I eventually realised this is because Ceph was built without the -DCMAKE_BUILD_TYPE=RelWithDebInfo flag on do_cmake, so I was testing a debug build. This clearly causes a notable impact to performance when using Jerasure. Building with RelWithDebInfo shows better performance when using Jerasure.

Yeah, this has been an issue that's bitten multiple people since do_cmake.sh was changed years ago. I think I was the one who added the warning note when I got bitten by it. It's also semi-related to the Ubuntu package performance bug we found last winter, where RocksDB wasn't being compiled with proper optimizations. Sorry you got hit by it, but I'm glad you figured it out before we merged this PR. FWIW, the new numbers are much more in line with what I saw the last time I compared jerasure and isa, so I think they are likely to be correct.

@afreen23 (Contributor) left a comment

dashboard LGTM

@jamiepryde (Contributor Author)

We've raised a few related PRs that make small performance improvements when using ISA, and some PRs to improve the EC benchmark tool.

#58594 and #59862 - Use ISA's own RAID XOR function as an optimisation for m=1 and single erasures, rather than using and maintaining our own XOR code.
#59486 and #60121 - Changes to the EC benchmark to try to get better measurements of a plugin's encode and decode performance, rather than spending time measuring buffer allocation.
#59513 - Build the avx512 versions of ISA functions.
#59679 - Additional EC configs in teuthology to expand test coverage.
#59881 - Additional LRC unit tests to test both Jerasure and ISA.
#60246 - Use 64-byte alignment for EC buffers instead of 32-byte alignment.

Latest graphs testing with all of the above PRs:

Xeon Gold 6336Y
[graph] (numbers: https://github.com/user-attachments/assets/34a028c5-ae9d-45a9-bc60-1299021209b0)
[graph] (numbers: https://github.com/user-attachments/assets/e7fa30bb-1126-4386-8868-7ceaefbb75ee)

MacBook M1 Pro Linux container (aarch64)
[graph] (numbers: https://github.com/user-attachments/assets/83dfdf96-2981-419e-8b6c-2373ab465cfd)
[graph] (numbers: https://github.com/user-attachments/assets/ee28e57d-9e12-48ab-b270-ba6c5e0a30b2)

@jamiepryde (Contributor Author)

jenkins test make check

jenkins test api

jenkins test dashboard

jenkins test dashboard cephadm

jenkins test make check

@markhpc (Member) left a comment

Given jerasure's lack of maintenance and the extensive testing done here, I think this is a valid change going forward so long as we are only applying it to new pools and leaving backwards compatibility in place.

@mmgaggle (Member)

Great work, and as @markhpc noted, performance aside, the fact that jerasure is basically no longer maintained is almost reason enough in itself. I support moving to ISA-L as the default.

@ronen-fr (Contributor)

ronen-fr commented Feb 2, 2025

@jamiepryde - this seems to cause a failure in test_ceph_helpers.sh.
See https://tracker.ceph.com/issues/69758

@jamiepryde (Contributor Author)

@ronen-fr Taking a look, thanks.
