Skip to content

erasure-code/isa: fix cache collision causing buffer overflow#66894

Merged
NitzanMordhai merged 1 commit intoceph:mainfrom
tchaikov:wip-ec-isa-fix-cache-collision
Feb 15, 2026
Merged

erasure-code/isa: fix cache collision causing buffer overflow#66894
NitzanMordhai merged 1 commit intoceph:mainfrom
tchaikov:wip-ec-isa-fix-cache-collision

Conversation

@tchaikov
Copy link
Contributor

@tchaikov tchaikov commented Jan 13, 2026

This commit fixes a critical cache key collision bug in the ISA erasure code plugin that could lead to heap-buffer-overflow and silent data corruption.

Problem:

The decoding table cache was indexed only by matrix type and erasure signature (available/missing chunk pattern), but did NOT include the (k,m) erasure code configuration parameters. This caused different EC configurations with similar erasure patterns to collide in the cache, leading to incorrectly-sized cached buffers being reused.

AddressSanitizer Report:

==4904==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x5160001397b8 at pc 0x5de8e415296b bp 0x7ffc82260310 sp 0x7ffc8225fad0 READ of size 576 at 0x5160001397b8 thread T0
    #0 __asan_memcpy
    #1 ErasureCodeIsaTableCache::getDecodingTableFromCache()
       .../ErasureCodeIsaTableCache.cc:260:5
    #2 ErasureCodeIsaDefault::isa_decode()
       .../ErasureCodeIsa.cc:490:15

0x5160001397b8 is located 0 bytes after 568-byte region [0x516000139580,0x5160001397b8) allocated by:
    #0 posix_memalign
    #1 ceph::buffer::raw_combined::alloc_data_n_controlblock()
    #2 ErasureCodeIsaTableCache::putDecodingTableToCache()
       .../ErasureCodeIsaTableCache.cc:319:18

Root Cause:

Scenario illustrating the bug:

  1. First decode operation: k=2, m=1, erasure pattern "+0+2-1"
    • Creates cache entry with key "+0+2-1"
    • Allocates buffer: 2*(1+2)*32 = 192 bytes
  2. Second decode operation: k=3, m=3, same erasure pattern "+0+2-1"
    • Looks up cache with key "+0+2-1" → COLLISION
    • Retrieves 192-byte buffer but needs 3*(3+3)*32 = 576 bytes
    • Result: Heap-buffer-overflow (reads 384 bytes beyond allocation)

Worse scenario (silent corruption):

  1. First decode: k=3, m=3 → caches 576-byte table
  2. Second decode: k=2, m=1 → retrieves wrong table
    • Uses incorrect decoding matrix
    • Result: Silent data corruption during recovery

Solution:

Include k and m parameters in cache signature

  • Old format: "+0+2+3-1-4"
  • New format: "k3m2a+0+2+3e-1-4"

Test Fix:

Also fixes a buffer overflow in TestErasureCodePlugins.cc where hashes_bl offset was calculated using chunk_size instead of sizeof(uint32_t), causing reads beyond the CRC buffer.

Production Impact:

Backward Compatibility: FULLY COMPATIBLE

  • Cache is ephemeral (in-memory only, not persisted)
  • Cache cleared on process restart
  • Rolling upgrades safe - each OSD restart gets fixed code
  • Old cache entries automatically invalidated on upgrade
  • No wire protocol or on-disk format changes
  • No configuration changes required
  • No breaking changes

Data Integrity: SIGNIFICANTLY IMPROVED

  • Eliminates silent data corruption risk
  • Eliminates heap-buffer-overflow crashes
  • Cache now correctly isolated by (k,m) configuration
  • Correct decoding tables always used for recovery
  • No risk of corrupting user data from the fix itself

Performance: IMPROVED

  • Fewer memory allocations (single buffer vs string concatenation)
  • No regression: Cache effectiveness maintained, just with correct keys
  • Negligible overhead: Signature generation tiny compared to EC operations

Why Users Haven't Complained:

Several factors likely prevented widespread reports:

  1. Low probability conditions required:

    • Need multiple EC pools with DIFFERENT (k,m) configurations
    • Need similar erasure patterns across pools
    • Need cache collision to occur during actual recovery operations
    • Recovery operations are relatively rare in healthy clusters
  2. Crash vs silent corruption detection:

    • Buffer overflows (easier to detect) occur when k2,m2 > k1,m1
    • Silent corruption (harder to detect) occurs when k2,m2 < k1,m1
    • Crashes might be attributed to other causes
    • Data corruption only detected during scrub or data verification
  3. Common deployment patterns:

    • Many deployments use single EC configuration cluster-wide
    • Default EC configurations (k=2,m=1 or k=4,m=2) reduce collision space
    • Erasure pattern variety may be insufficient for collisions
  4. ISA plugin usage:

    • Not universally deployed (requires Intel ISA-L library)
    • Some sites use jerasure plugin instead
    • Plugin selection depends on hardware and configuration
  5. Detection difficulty:

    • ASan not enabled in production builds
    • Silent corruption only appears during:
      • Degraded reads with recovery
      • Scrub operations * Deep-scrub verification
    • Corrupted data might not be immediately accessed

Fixes: https://tracker.ceph.com/issues/74382

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands

You must only issue one Jenkins command per-comment. Jenkins does not understand
comments with more than one command.

@tchaikov tchaikov requested a review from a team as a code owner January 13, 2026 01:07
@tchaikov tchaikov force-pushed the wip-ec-isa-fix-cache-collision branch 2 times, most recently from 12069bf to 8d27247 Compare January 13, 2026 01:29
@tchaikov tchaikov mentioned this pull request Jan 13, 2026
14 tasks
@tchaikov tchaikov force-pushed the wip-ec-isa-fix-cache-collision branch 2 times, most recently from 351c1f5 to 044abd8 Compare January 13, 2026 07:28
@tchaikov tchaikov removed the request for review from a team January 13, 2026 11:49
@tchaikov tchaikov marked this pull request as draft January 13, 2026 11:49
This commit fixes a critical cache key collision bug in the ISA erasure
code plugin that could lead to heap-buffer-overflow and silent data
corruption.

Problem:
--------
The decoding table cache was indexed only by matrix type and erasure
signature (available/missing chunk pattern), but did NOT include the
(k,m) erasure code configuration parameters. This caused different EC
configurations with similar erasure patterns to collide in the cache,
leading to incorrectly-sized cached buffers being reused.

AddressSanitizer Report:
------------------------
==4904==ERROR: AddressSanitizer: heap-buffer-overflow on address
0x5160001397b8 at pc 0x5de8e415296b bp 0x7ffc82260310 sp 0x7ffc8225fad0
READ of size 576 at 0x5160001397b8 thread T0
    #0 __asan_memcpy
    #1 ErasureCodeIsaTableCache::getDecodingTableFromCache()
       .../ErasureCodeIsaTableCache.cc:260:5
    #2 ErasureCodeIsaDefault::isa_decode()
       .../ErasureCodeIsa.cc:490:15

0x5160001397b8 is located 0 bytes after 568-byte region
[0x516000139580,0x5160001397b8) allocated by:
    #0 posix_memalign
    #1 ceph::buffer::raw_combined::alloc_data_n_controlblock()
    #2 ErasureCodeIsaTableCache::putDecodingTableToCache()
       .../ErasureCodeIsaTableCache.cc:319:18

Root Cause:
-----------
Scenario illustrating the bug:
1. First decode operation: k=2, m=1, erasure pattern "+0+2-1"
   - Creates cache entry with key "+0+2-1"
   - Allocates buffer: 2*(1+2)*32 = 192 bytes
2. Second decode operation: k=3, m=3, same erasure pattern "+0+2-1"
   - Looks up cache with key "+0+2-1" → COLLISION
   - Retrieves 192-byte buffer but needs 3*(3+3)*32 = 576 bytes
   - Result: Heap-buffer-overflow (reads 384 bytes beyond allocation)

Worse scenario (silent corruption):
1. First decode: k=3, m=3 → caches 576-byte table
2. Second decode: k=2, m=1 → retrieves wrong table
   - Uses incorrect decoding matrix
   - Result: Silent data corruption during recovery

Solution:
---------
Include k and m parameters in cache signature
 - Old format: "+0+2+3-1-4"
 - New format: "k3m2a+0+2+3e-1-4"

Test Fix:
---------
Also fixes a buffer overflow in TestErasureCodePlugins.cc where
hashes_bl offset was calculated using chunk_size instead of sizeof(uint32_t),
causing reads beyond the CRC buffer.

Production Impact:
------------------

Backward Compatibility: FULLY COMPATIBLE
- Cache is ephemeral (in-memory only, not persisted)
- Cache cleared on process restart
- Rolling upgrades safe - each OSD restart gets fixed code
- Old cache entries automatically invalidated on upgrade
- No wire protocol or on-disk format changes
- No configuration changes required
- No breaking changes

Data Integrity:
- Eliminates silent data corruption risk
- Eliminates heap-buffer-overflow crashes
- Cache now correctly isolated by (k,m) configuration
- Correct decoding tables always used for recovery
- No risk of corrupting user data from the fix itself

Why Users Haven't Complained:
------------------------------

Several factors likely prevented widespread reports:

1. Low probability conditions required:
   - Need multiple EC pools with DIFFERENT (k,m) configurations
   - Need similar erasure patterns across pools
   - Need cache collision to occur during actual recovery operations
   - Recovery operations are relatively rare in healthy clusters

2. Crash vs silent corruption detection:
   - Buffer overflows (easier to detect) occur when k2,m2 > k1,m1
   - Silent corruption (harder to detect) occurs when k2,m2 < k1,m1
   - Crashes might be attributed to other causes
   - Data corruption only detected during scrub or data verification

3. Common deployment patterns:
   - Many deployments use single EC configuration cluster-wide
   - Default EC configurations (k=2,m=1 or k=4,m=2) reduce collision space
   - Erasure pattern variety may be insufficient for collisions

4. ISA plugin usage:
   - Not universally deployed (requires Intel ISA-L library)
   - Some sites use jerasure plugin instead
   - Plugin selection depends on hardware and configuration

5. Detection difficulty:
   - ASan not enabled in production builds
   - Silent corruption only appears during:
     * Degraded reads with recovery
     * Scrub operations
     * Deep-scrub verification
   - Corrupted data might not be immediately accessed

Fixes: https://tracker.ceph.com/issues/74382

Signed-off-by: Kefu Chai <k.chai@proxmox.com>
@tchaikov tchaikov force-pushed the wip-ec-isa-fix-cache-collision branch from 044abd8 to 7e66871 Compare January 14, 2026 02:12
@tchaikov
Copy link
Contributor Author

jenkins test api

@tchaikov
Copy link
Contributor Author

jenkins test make check

@tchaikov
Copy link
Contributor Author

jenkins test make check arm64

@tchaikov tchaikov marked this pull request as ready for review January 14, 2026 05:50
@tchaikov
Copy link
Contributor Author

jenkins test make check arm64

1 similar comment
@tchaikov
Copy link
Contributor Author

jenkins test make check arm64

@tchaikov tchaikov requested a review from a team January 15, 2026 06:55
@tchaikov
Copy link
Contributor Author

@ceph/core Hello maintainers, could you please review this PR when you have a chance? I believe this addresses a critical issue that may warrant prioritized attention. Thank you!

}
}

std::string erasure_signature(sig_buf);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's all about the uniqueness across instances of ErasureCodeIsaTableCache. The factory class ErasureCodePluginIsa has exactly one shared among all created ErasureCodeIsaDefault instances.

class ErasureCodePluginIsa : public ceph::ErasureCodePlugin {
public:
  ErasureCodeIsaTableCache tcache;

  int factory(const std::string &directory,
              ceph::ErasureCodeProfile &profile,
              ceph::ErasureCodeInterfaceRef *erasure_code,
              std::ostream *ss) override;
};
int ErasureCodePluginIsa::factory(const std::string &directory,
                                  ceph::ErasureCodeProfile &profile,
                                  ceph::ErasureCodeInterfaceRef *erasure_code,
                                  std::ostream *ss)
{
  ErasureCodeIsa *interface;
    std::string technique;
    technique = profile.find("technique")->second;
    std::string _m = profile.find("m")->second;
    if ((technique == "reed_sol_van")) {
      interface = new ErasureCodeIsaDefault(tcache,
                                            technique,
                                            ErasureCodeIsaDefault::kVandermonde,
                                            _m);
    } else if ((technique == "cauchy")) {
      interface = new ErasureCodeIsaDefault(tcache,
                                            technique,
                                            ErasureCodeIsaDefault::kCauchy,
                                            _m);
    } else {
      *ss << "technique=" << technique << " is not a valid coding technique. "
        << " Choose one of the following: "
        << "reed_sol_van,"
        << "cauchy" << std::endl;
      return -ENOENT;
    }

    int r = interface->init(profile, ss);
    if (r) {
      delete interface;
      return r;
    }
    *erasure_code = ceph::ErasureCodeInterfaceRef(interface);
    return 0;
}

The alternative solution could be to grow the number of cache instances available.
The bug seems to very old, dating back to at least 2014.
Other, non-Jerasure plugins might be affected as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bug seems to very old, dating back to at least 2014.

please see the tracker ticket, this issue was introduced in the day 0, so all branches are impacted.

Other, non-Jerasure plugins might be affected as well.

as you explained above and also see the tracker ticket, this caching machinery is specific to the ISA erasure code plugin. so only ISA erasure code plugin is impacted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this caching machinery is specific to the ISA erasure code plugin. so only ISA erasure code plugin is impacted.

Yeah, Shec also has a global tcache but the naming schema is different than for ISA.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed. what shec's ErasureCodeShecTableCache is using is a combination of techinique, k, m, c and w. so it's immune to this problem.

see

*encoding_table[technique][k][m][c][w] = ec_in_table;

@jamiepryde
Copy link
Contributor

I have been trying to trigger the data corruption bug on a vstart cluster, but have not succeeded yet. A couple of points I noticed that I think reduce the chance of this bug being hit:

  1. I think the value of K in the profiles must be the same for there to be a signature collision, as we always add K number of +%d to the signature. So e.g. a system with pools using 2+2 and 2+3 ISA-L profiles could hit the bug, but a system using 2+2 and 4+2 wouldn't. See
  for (i = 0, r = 0; i < k; i++, r++) {
    char id[128];
    while (erasure_contains(erasures, r))
      r++;

    decode_index[i] = r;

    snprintf(id, sizeof (id), "+%d", r);
    erasure_signature += id;
  }
  1. We only use the cached decode tables, and so can only hit the bug if M > 1 and number of errors is > 1 (or M > 1 and the error is in a parity shard other than the first one). If M or number of errors is 1 (and the first parity shard) then we use ISA-L xor optimization instead. See

  if ((m == 1) || 
      ((matrixtype == kVandermonde) && (nerrs == 1) && (erasures[0] < (k + 1)))) {
    // single parity decoding
    dout(20) << "isa_decode: reconstruct using xor_gen [" << erasures[0] << "]" << dendl;
    isa_xor(recover_buf, recover_buf[k], blocksize, k);
    return 0;
  }

  unsigned char d[k * (m + k)];
  unsigned char decode_tbls[k * (m + k)*32];
  unsigned char *p_tbls = decode_tbls;

  int decode_index[k];

  std::string erasure_signature;

@NitzanMordhai
Copy link
Contributor

@NitzanMordhai NitzanMordhai merged commit b549669 into ceph:main Feb 15, 2026
16 checks passed
@tchaikov tchaikov deleted the wip-ec-isa-fix-cache-collision branch February 15, 2026 15:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants