
core: Change the default plugin for Ceph erasure coded pools from Jerasure to ISA-L #58052

Merged
yuriw merged 6 commits into ceph:main from jamiepryde:isal-default on Jan 21, 2025

Conversation

@jamiepryde (Contributor)

This PR changes the default plugin for erasure coded pools from Jerasure to ISA-L. Until now, Jerasure has been Ceph's default plugin because of its flexibility and generality. However, the jerasure and gf-complete libraries no longer appear to be maintained, and they have not been updated to use modern CPU instructions such as AVX2 and AVX512, which can deliver significant erasure coding performance improvements. ISA-L, on the other hand, is still maintained, receives updates that take advantage of modern CPU features, and offers comparable flexibility to Jerasure.

Testing with the ceph_erasure_code_benchmark tool shows the potential for performance improvements when using ISA-L instead of Jerasure.

This benchmark data was captured by compiling Ceph from source and then executing the benchmark tool using the versions of the EC libraries that are currently included in Ceph. The following command was used to run the tool.

TOTAL_SIZE=$((4 * 1024 * 1024 * 1024)) qa/workunits/erasure-code/bench.sh fplot | tee qa/workunits/erasure-code/bench.js

The first set of data was captured on an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz to show the potential for improved performance on x86-64 architecture.

[benchmark graphs: Jerasure vs ISA-L on Xeon Gold 5218]

The second set of data was captured on a MacBook Pro with an M1 Pro and 16 GB RAM. Ceph was built in a CentOS 8 aarch64 container running with 8 GB of RAM assigned to the podman machine VM. This is intended to show the potential for improved performance on ARM64/aarch64, as well as to show that ISA-L supports non-Intel platforms.

[benchmark graphs: Jerasure vs ISA-L on M1 Pro aarch64 container]

Signed-off-by: Jamie Pryde <jamiepry@uk.ibm.com>
@anthonyeleven (Contributor)

Fascinating. My first thought when you mention AVX2 and AVX512 was to wonder if we can exploit those when present but also not fail when they aren't. I see the ARM test, but I'd love to understand how various x64 CPUs are handled. Back in the mists of time, a 68020 system could have an optional 68881 math coprocessor. With SunOS libs and compiler there were three compilation options:

  1. No 68881 instructions
  2. Only run if a 68881 is present
  3. Switch at runtime

#3 sounds appealing, but it turned out to often be slower than #1. How do we handle the analogous situation today?

My other question is how well vetted ISA-L is with Ceph. We have loads of runtime on jerasure, but are we confident that there aren't gotchas with ISA-L in the context of Ceph?


 ceph-erasure-code-tool validate-profile \
-    plugin=jerasure,technique=reed_sol_van,k=2,m=1
+    plugin=isa,technique=reed_sol_van,k=2,m=1
Contributor

Are the technique details 1:1 transferrable?

Contributor Author

I think they probably are. I'm going to add a test that compares the encoding output for various K/M values to try to confirm this.

Contributor Author

They are not 1:1 transferable. I've added a tool that prints out the hash of each chunk here: #58377

Here are chunks for ISA and Jerasure using reed_sol_van with k=3 m=3

[root@9d9c7969fd9a build]# cat plugin_comparison/output
489715dfc67930395b840e6f62f87310  plugin_comparison/plugin=isa stripe-width=256 k=3 m=3 technique=reed_sol_van/0
f2dfcedc0cbd3b8247267941e44ede2b  plugin_comparison/plugin=isa stripe-width=256 k=3 m=3 technique=reed_sol_van/1
3b86c3580ddae1fd8a2bf7b0d94ef931  plugin_comparison/plugin=isa stripe-width=256 k=3 m=3 technique=reed_sol_van/2
3041babbcf88ebd486a4e103e7cdc4ae  plugin_comparison/plugin=isa stripe-width=256 k=3 m=3 technique=reed_sol_van/3
72af377096a80419bc2d607031ad3404  plugin_comparison/plugin=isa stripe-width=256 k=3 m=3 technique=reed_sol_van/4
4d88897a89854d5732fd094b401d3107  plugin_comparison/plugin=isa stripe-width=256 k=3 m=3 technique=reed_sol_van/5

489715dfc67930395b840e6f62f87310  plugin_comparison/plugin=jerasure stripe-width=256 k=3 m=3 technique=reed_sol_van/0
f2dfcedc0cbd3b8247267941e44ede2b  plugin_comparison/plugin=jerasure stripe-width=256 k=3 m=3 technique=reed_sol_van/1
3b86c3580ddae1fd8a2bf7b0d94ef931  plugin_comparison/plugin=jerasure stripe-width=256 k=3 m=3 technique=reed_sol_van/2
3041babbcf88ebd486a4e103e7cdc4ae  plugin_comparison/plugin=jerasure stripe-width=256 k=3 m=3 technique=reed_sol_van/3
858b48eb0c113ee3f217e91a426fc4fd  plugin_comparison/plugin=jerasure stripe-width=256 k=3 m=3 technique=reed_sol_van/4
b50e4bdf05f80ba7cab1164505cd6e87  plugin_comparison/plugin=jerasure stripe-width=256 k=3 m=3 technique=reed_sol_van/5

The first coding chunk is the same because we are using an xor optimisation for that one, but the second and third coding chunks are different.

So if we change the default plugin, it should be for new pools only.
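Why the first coding chunk matches while the others differ can be sketched with a toy systematic code in GF(2^8) (illustrative only: the matrices below are made up, not the ones Jerasure or ISA-L actually build). When the first parity row of the generator matrix is all ones, parity 0 reduces to a plain XOR of the data chunks and is identical for any plugin; higher parity rows use plugin-specific coefficients, so those chunks can legitimately differ.

```python
from functools import reduce

def gf_mul(a, b, poly=0x11d):
    """Multiply two bytes in GF(2^8) with the usual 0x11d reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def encode(data_chunks, parity_rows):
    """Compute one parity chunk per coefficient row (systematic code)."""
    length = len(data_chunks[0])
    return [
        bytes(
            reduce(lambda x, y: x ^ y,
                   (gf_mul(coef, chunk[i])
                    for coef, chunk in zip(row, data_chunks)))
            for i in range(length))
        for row in parity_rows
    ]

data = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]

# Two hypothetical plugins: both use an all-ones first parity row (pure XOR),
# but choose different coefficients for the second row.
plugin_a = encode(data, [[1, 1, 1], [1, 2, 4]])
plugin_b = encode(data, [[1, 1, 1], [1, 3, 9]])

assert plugin_a[0] == plugin_b[0]  # first parity chunk: identical (plain XOR)
assert plugin_a[1] != plugin_b[1]  # second parity chunk: plugin-dependent
```

Since recovering data requires the same matrix that produced the parity chunks, chunks written under one plugin's matrices can't in general be repaired with the other's, which is why the default change has to apply to new pools only.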

@bill-scales (Contributor)

> Fascinating. My first thought when you mention AVX2 and AVX512 was to wonder if we can exploit those when present but also not fail when they aren't. I see the ARM test, but I'd love to understand how various x64 CPUs are handled.

The ISA-L library includes multiple implementations of the erasure code encode and decode functions, making use of the different instruction sets of modern CPUs. For x86 there are SSE, AVX, AVX2, and AVX512 implementations, as well as a plain C implementation for ancient CPUs without any SSE support (pre-2006?). For ARM there are a plain C implementation and a NEON implementation.

ISA-L chooses which implementation to use at runtime: the very first call to an encode/decode function queries the CPU capabilities, selects the most advanced supported implementation, and updates a function pointer so that subsequent calls don't repeat the test. There's a tiny overhead from dereferencing the function pointer, but it is insignificant compared to the time it takes to encode/decode the data buffers.

There is a newer ISA-L release that includes a new erasure coding implementation using GFNI instructions. We tested it, but on Ice Lake hardware it doesn't seem to be any faster than AVX512, so we didn't think it was worthwhile updating to a new ISA-L release just yet.

@anthonyeleven (Contributor)

Thanks for the clarification. I hadn't seen anything before that actually uses AVX512, so I was intrigued. Sun was clearly testing repeatedly, or something similar 🤦‍♂️

src/vstart.sh (outdated)

         $(format_conf "${extra_conf}")
         mon cluster log file = $CEPH_OUT_DIR/cluster.mon.\$id.log
-        osd pool default erasure code profile = plugin=jerasure technique=reed_sol_van k=2 m=1 crush-failure-domain=osd
+        osd pool default erasure code profile = plugin=isa technique=reed_sol_van k=2 m=2 crush-failure-domain=osd
Contributor

Any reason for changing to m=2? If you are going to increase it, you should probably base k and m on the number of OSDs.

Contributor Author

A previous pull request changed the default in other places to K=2 M=2 so I wanted to make it consistent here. Checking the number of OSDs when making a cluster with vstart is a good idea though.

@markhpc (Member)

markhpc commented Jun 20, 2024

Great testing here @jamiepryde! Agree with both @anthonyeleven and @bill-scales's comments. Would like to get some coverage on EPYC as well.

@rzarzynski (Contributor)

> This benchmark data was captured by compiling Ceph from source and then executing the benchmark tool using the versions of the EC libraries that are currently included in Ceph.

The idea looks very promising. What I would love to see are numbers from tests involving OSDs. I'm worried that the actual number crunching isn't the biggest overhead we have on our EC paths; @bill-scales has nailed down memcpy overhead due to misalignment.

@jamiepryde (Contributor Author)

> Great testing here @jamiepryde! Agree with both @anthonyeleven and @bill-scales's comments. Would like to get some coverage on EPYC as well.

Thanks @markhpc. Here is a run of the benchmark on a system with an AMD EPYC 7763, which shows similar results:

[benchmark graphs: Jerasure vs ISA-L on AMD EPYC 7763]

> This benchmark data was captured by compiling Ceph from source and then executing the benchmark tool using the versions of the EC libraries that are currently included in Ceph.

> The idea looks very promising. What I would love to see are numbers from tests involving OSDs. I'm worried that the actual crunching isn't the biggest overhead we have on our EC paths; @bill-scales has nailed down memcpy due to misalignment.

@rzarzynski We did some testing with Jerasure and ISA-L on an older Intel system (Xeon E5-2667 v4 @ 3.20GHz) with 4 SATA SSDs as OSDs (2+2 profile), and there was no noticeable difference in I/O performance with either plugin. I'd like to get hold of something more recent with more drives to test different k+m values, but I suspect any improvements would still be very small.

I think it would still be worth changing the default plugin though, given that 1) Jerasure is no longer maintained, and 2) the benchmark results show that ISA-L is probably better optimised.

@bill-scales (Contributor)

@rzarzynski You are right that although there are some big performance improvements shown here, the time spent encoding/decoding is a small fraction of the overall I/O path, so we might only expect to see a couple of percent improvement; however, that's still worth having.

Exactly what percentage of time is spent encoding/decoding is tricky to measure meaningfully because it depends so much on the test stand configuration and I/O workload; we've seen measurements in the 3%-10% range for CPU time spent encoding/decoding. The heaviest decode workload is likely to come from backfill when the K/M ratio is small (e.g. 2+2) and there are multiple drive failures, provided there are enough drives and enough network bandwidth that these don't become the bottleneck. Sequential write with a small K/M ratio will be the heaviest encode workload.

@yaarith (Contributor)

yaarith commented Jun 24, 2024

See what EC plugins and techniques are reported via our telemetry:
https://telemetry-public.ceph.com/d/b5664aff-721e-4c8a-b79a-7d3e2a8eaf07/ec-plugin?orgId=1

Sometimes plugins do not specify a technique (e.g. clay), hence the "empty" value.

 level: advanced
 desc: default erasure code profile for new erasure-coded pools
-default: plugin=jerasure technique=reed_sol_van k=2 m=2
+default: plugin=isa technique=reed_sol_van k=2 m=2
Contributor

I'm afraid the change isn't going to be as simple as s/jerasure/isa/. The problem is that we still need to test jerasure upstream for the sake of existing clusters. I think the same idiom as in the Filestore-to-BlueStore transition is needed.

Contributor

I had the same suspicion.

@markhpc (Member)

markhpc commented Jun 27, 2024

> This benchmark data was captured by compiling Ceph from source and then executing the benchmark tool using the versions of the EC libraries that are currently included in Ceph.
>
> The idea looks very promising. What I would love to see are numbers from tests involving OSDs. I'm worried that the actual crunching isn't the biggest overhead we have on our EC paths; @bill-scales has nailed down memcpy due to misalignment.

It would be great if we could discuss this at the performance meeting. :) The community is very interested in these details!

@markhpc (Member)

markhpc commented Jun 27, 2024

> @rzarzynski You are right that although there are some big performance improvements shown here, as an overall percentage of the I/O path the time spent encoding/decoding is small so overall we might only expect to see a couple of percent improvement; however, that's still worth having.
>
> Exactly what percentage of time is spent encoding/decoding is tricky to measure meaningfully because it depends so much on the test stand configuration and I/O workload; we've seen measurements in the 3%-10% range for CPU time spent encoding/decoding. The heaviest decode workload is likely to come from backfill when the K/M ratio is small (e.g. 2+2) and there are multiple drive failures. Sequential write with a small K/M ratio will be the heaviest encode workload.

@bill-scales This roughly falls in line with what I've seen too. Sooner or later I think we'll need to rethink denc (and absolutely the traditional encode/decode path), but there are other low hanging fruit that I'm targeting first.

@jamiepryde (Contributor Author)

I've been doing some more testing and noticed a problem with the graphs: the performance numbers for Jerasure were lower than they should be. I eventually realised this is because Ceph was built without -DCMAKE_BUILD_TYPE=RelWithDebInfo passed to do_cmake, so I was testing a debug build, which notably hurts Jerasure's performance. Building with RelWithDebInfo shows better performance when using Jerasure. New graphs from the tool on an Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz:

[benchmark graphs: Xeon Gold 6336Y, RelWithDebInfo build]

And in a container on a MacBook Pro with an M1 Pro:

[benchmark graphs: M1 Pro container, RelWithDebInfo build]

So on x86-64 we still see slightly better encoding performance using ISA. Decoding performance is very similar (the change in #58594 also slightly improves ISA decode performance where m=1 or erasures=1).
On ARM, we now see slightly better encoding and decoding performance when using Jerasure.
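For context, the m=1 / single-erasure case that #58594 optimises reduces to RAID-style XOR: the single coding chunk is just the XOR of the data chunks, and any one missing chunk is the XOR of the survivors. A minimal sketch of that property (illustrative only, not ISA-L's API):

```python
from functools import reduce

def xor_chunks(chunks):
    # Byte-wise XOR across equal-length chunks.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

data = [bytes([10, 20]), bytes([30, 40]), bytes([50, 60])]
parity = xor_chunks(data)  # m=1: the single coding chunk is plain XOR

# Recover a single erased data chunk from the survivors plus the parity.
recovered = xor_chunks([data[0], data[2], parity])
assert recovered == data[1]
```

Because this path needs no GF(2^8) multiplications at all, a well-tuned XOR routine (such as ISA-L's RAID XOR function) is all that's required for m=1 encode and single-erasure decode.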

@dvanders (Contributor)

@apeters1971 FYI. Any comments on this?

@markhpc (Member)

markhpc commented Jul 29, 2024

> I've been doing some more testing and noticed there is a problem with the graphs. The performance numbers for Jerasure are lower than they should be. I eventually realised this is because Ceph was built without the -DCMAKE_BUILD_TYPE=RelWithDebInfo flag on do_cmake, so I was testing a debug build. This clearly causes a notable impact to performance when using Jerasure. Building with RelWithDebInfo shows better performance when using Jerasure.

Yeah, this has been an issue that's bitten multiple people since do_cmake.sh was changed years ago. I think I was the one who added the warning note when I got bitten by it. It's also semi-related to the Ubuntu package performance bug we found last winter, where RocksDB wasn't being compiled with proper optimizations. Sorry you got hit by it, but I'm glad you figured it out before we merged this PR. FWIW, the new numbers are much more in line with what I saw the last time I compared jerasure and isa, so I think they are likely to be correct.

@afreen23 (Contributor) left a comment

dashboard LGTM

@jamiepryde (Contributor Author)

We've raised a few related PRs that make small performance improvements when using ISA, and some PRs to improve the EC benchmark tool.

#58594 and #59862 - Use ISA's own RAID XOR function as an optimisation for m=1 and single erasures, rather than using and maintaining our own XOR code.
#59486 and #60121 - Changes to the EC benchmark to try to get better measurements of a plugin's encode and decode performance, rather than spending time measuring buffer allocation.
#59513 - Build the avx512 versions of ISA functions.
#59679 - Additional EC configs in teuthology to expand test coverage.
#59881 - Additional LRC unit tests to test both Jerasure and ISA.
#60246 - Use 64-byte alignment for EC buffers instead of 32-byte alignment.

Latest graphs testing with all of the above PRs:

Xeon Gold 6336Y
[graph] (numbers: https://github.com/user-attachments/assets/34a028c5-ae9d-45a9-bc60-1299021209b0)
[graph] (numbers: https://github.com/user-attachments/assets/e7fa30bb-1126-4386-8868-7ceaefbb75ee)

MacBook M1 Pro Linux container (aarch64)
[graph] (numbers: https://github.com/user-attachments/assets/83dfdf96-2981-419e-8b6c-2373ab465cfd)
[graph] (numbers: https://github.com/user-attachments/assets/ee28e57d-9e12-48ab-b270-ba6c5e0a30b2)

@jamiepryde (Contributor Author)

jenkins test make check

jenkins test api

jenkins test dashboard

jenkins test dashboard cephadm

jenkins test make check

@markhpc (Member) left a comment

Given jerasure's lack of maintenance and the extensive testing done here, I think this is a valid change going forward so long as we are only applying it to new pools and leaving backwards compatibility in place.

@mmgaggle (Member)

Great work, and as @markhpc noted, performance aside, the fact that jerasure is basically no longer maintained is almost reason enough in itself. I support moving to ISA-L as the default.

@ronen-fr (Contributor)

ronen-fr commented Feb 2, 2025

@jamiepryde - this seems to cause a failure in test_ceph_helpers.sh.
See https://tracker.ceph.com/issues/69758

@jamiepryde (Contributor Author)

@ronen-fr Taking a look, thanks.
