Bug #74382
ISA Erasure Code Cache Collision Causing Buffer Overflow and Data Corruption
Description
The ISA erasure code plugin has a critical cache key collision bug that can cause heap-buffer-overflow crashes and silent data corruption during recovery operations. The decoding table cache does not include the (k,m) parameters in the cache key, so different erasure code configurations can collide and reuse incorrectly sized cached buffers.
Problem
The ISA erasure code plugin caches decoding tables to improve performance during data recovery operations. The cache key is constructed from:
- Matrix type (Cauchy or Vandermonde)
- Erasure signature (pattern of available/missing chunks)
However, the cache key does not include the (k,m) erasure code configuration parameters. This allows different EC configurations with similar erasure patterns to collide in the cache, causing:
1. Buffer overflow crashes when a smaller cached buffer is accessed with a larger size
2. Silent data corruption when wrong decoding matrices are used for recovery
Root Cause
In ErasureCodeIsa.cc, the isa_decode() function constructs the cache signature as:
std::string erasure_signature;
for (i = 0, r = 0; i < k; i++, r++) {
  // ... adds "+X" for available chunks
}
for (int p = 0; p < nerrs; p++) {
  // ... adds "-Y" for missing chunks
}
The signature includes only the chunk availability pattern (e.g., "+0+2+3-1-4") but not the k,m values. Since the decoding table size is `k * (m + k) * 32` bytes, different (k,m) configurations produce different-sized tables.
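The signature shape and the size formula above can be sketched in a few lines. This is only an illustration: `build_signature` and `decode_table_bytes` are hypothetical helper names, not functions from the plugin, and the chunk lists are assumed to arrive already sorted.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper mirroring the signature shape described above:
// "+X" for each chunk used for decoding, then "-Y" for each missing chunk.
std::string build_signature(const std::vector<int>& available,
                            const std::vector<int>& missing) {
  std::string sig;
  for (int c : available) sig += "+" + std::to_string(c);
  for (int c : missing) sig += "-" + std::to_string(c);
  return sig;
}

// Hypothetical helper: decoding table size in bytes for a (k, m) profile,
// using the k * (m + k) * 32 formula quoted in this report.
constexpr std::size_t decode_table_bytes(std::size_t k, std::size_t m) {
  return k * (m + k) * 32;
}
```

Nothing in the resulting string depends on k or m, while the table size does, which is the mismatch the two scenarios below rely on.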
Exploit Scenario 1: Buffer Overflow (Crash)
- First decode operation with k=2, m=1:
  - Erasure pattern: "+0+2-1" (chunks 0,2 available, chunk 1 missing)
  - Cache key: "+0+2-1"
  - Buffer allocated: 2 * (1+2) * 32 = 192 bytes
- Second decode operation with k=3, m=3:
  - Same erasure pattern: "+0+2-1" (chunks 0,2 available, chunk 1 missing)
  - Cache key lookup: "+0+2-1" → COLLISION!
  - Retrieves 192-byte buffer
  - Attempts to copy: 3 * (3+3) * 32 = 576 bytes
  - Result: Heap-buffer-overflow, reads 384 bytes beyond allocation
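The arithmetic in this scenario can be checked with a toy model of a cache keyed by signature alone. The sketch below is an assumption-laden illustration: `SignatureOnlyCache`, `put`, and `overread_on_hit` are made-up names and do not correspond to the real `ErasureCodeIsaTableCache` interface. It only reports the size shortfall and never performs the out-of-bounds read.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy model only; not the real ErasureCodeIsaTableCache API.
using Table = std::vector<unsigned char>;

// Table size formula quoted in this report.
std::size_t table_bytes(int k, int m) {
  return static_cast<std::size_t>(k) * (m + k) * 32;
}

struct SignatureOnlyCache {
  std::map<std::string, Table> entries;  // key omits (k, m)

  // Store a table sized for the profile that first produced this signature.
  void put(const std::string& sig, int k, int m) {
    entries.emplace(sig, Table(table_bytes(k, m)));
  }

  // Bytes a caller with profile (k, m) would read past the cached buffer
  // on a hit. Returns 0 if the entry is large enough or does not exist.
  std::size_t overread_on_hit(const std::string& sig, int k, int m) const {
    auto it = entries.find(sig);
    if (it == entries.end()) return 0;
    const std::size_t wanted = table_bytes(k, m);
    return wanted > it->second.size() ? wanted - it->second.size() : 0;
  }
};
```

After `put("+0+2-1", 2, 1)`, calling `overread_on_hit("+0+2-1", 3, 3)` yields the 384-byte shortfall from the scenario, while the same lookup for k=2, m=1 yields 0.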
Exploit Scenario 2: Silent Data Corruption (Worse)
- First decode operation with k=3, m=3:
  - Cache key: "+0+2+3-1-4"
  - Stores 576-byte decoding table for k=3, m=3
- Second decode operation with k=2, m=1:
  - Same cache key: "+0+2+3-1-4" → COLLISION!
  - Retrieves decoding table for k=3, m=3
  - Uses incorrect matrix to decode k=2, m=1 data
  - Result: Silent data corruption, wrong data recovered
Test Case
A test case demonstrating the issue is available in `src/test/erasure-code/TestErasureCodePlugins.cc`. Running:
ctest -R unittest_erasure_code_plugins --verbose
With AddressSanitizer enabled produces:
==4904==ERROR: AddressSanitizer: heap-buffer-overflow on address
0x5160001397b8 at pc 0x5de8e415296b bp 0x7ffc82260310 sp 0x7ffc8225fad0
READ of size 576 at 0x5160001397b8 thread T0
    #0 __asan_memcpy
    #1 ErasureCodeIsaTableCache::getDecodingTableFromCache()
       src/erasure-code/isa/ErasureCodeIsaTableCache.cc:260:5
    #2 ErasureCodeIsaDefault::isa_decode()
       src/erasure-code/isa/ErasureCodeIsa.cc:490:15
0x5160001397b8 is located 0 bytes after 568-byte region
[0x516000139580,0x5160001397b8) allocated by:
    #0 posix_memalign
    #1 ceph::buffer::raw_combined::alloc_data_n_controlblock()
    #2 ErasureCodeIsaTableCache::putDecodingTableToCache()
       src/erasure-code/isa/ErasureCodeIsaTableCache.cc:319:18
Expected Behavior
- Each (k,m) configuration should have isolated cache entries
- Decoding tables should never be shared between different EC configurations
- Cache hits should always return correctly-sized buffers
- Data recovery should use correct decoding matrices for the target configuration
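One way to get this isolation is to make (k,m) part of the cache key itself. The sketch below is a hypothetical illustration and not necessarily how the merged fix does it: `DecodeKey` and its field names are invented here, and the integer encoding of the matrix type is an assumption.

```cpp
#include <map>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical composite key: the profile parameters travel with the
// erasure signature, so two profiles can never share a cache entry.
struct DecodeKey {
  int matrix_type;        // assumed encoding, e.g. 0 = Vandermonde, 1 = Cauchy
  int k;
  int m;
  std::string signature;  // e.g. "+0+2-1"

  // Lexicographic ordering over all four fields, as required by std::map.
  bool operator<(const DecodeKey& o) const {
    return std::tie(matrix_type, k, m, signature) <
           std::tie(o.matrix_type, o.k, o.m, o.signature);
  }
};

// Illustrative cache type: one decoding table per distinct DecodeKey.
using DecodeCache = std::map<DecodeKey, std::vector<unsigned char>>;
```

With this key, the entries for k=2, m=1 and k=3, m=3 under the signature "+0+2-1" are distinct. Checking only the buffer size on a cache hit would not be a substitute: under the `k * (m + k) * 32` formula, k=2, m=4 and k=3, m=1 both need 384 bytes, so a wrong table of the right size would still pass.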
Actual Behavior
- Different (k,m) configurations share cache entries when erasure patterns match
- Wrong-sized buffers cause buffer overflow crashes
- Wrong decoding matrices cause silent data corruption during recovery
- The cache, meant as a performance optimization, becomes a data integrity risk
User Impact
Affected Scenarios:
- Deployments with multiple EC pools using different (k,m) configurations
- Recovery operations when OSDs are down or PGs degraded
- Data scrub/deep-scrub verification operations
- Reads from degraded placement groups
Risk Factors:
- Risk increases with number of distinct EC configurations
- Higher risk during cluster rebalancing/recovery
- Production builds without ASan may experience silent corruption
- Corruption may not be immediately detected
This issue has been present since day 0 of the ISA EC plugin; see b7d0017d2398352937906b8a2777fafe313b47e7 (https://github.com/ceph/ceph/commit/b7d0017d2398352937906b8a2777fafe313b47e7).
Updated by Upkeep Bot about 1 month ago
- Status changed from Fix Under Review to Pending Backport
- Merge Commit set to b549669973b8186b6bb10f59a6108a571a3e44e1
- Fixed In set to v20.3.0-5335-gb549669973
- Upkeep Timestamp set to 2026-02-15T15:22:01+00:00
Updated by Upkeep Bot about 1 month ago
- Copied to Backport #74942: reef: ISA Erasure Code Cache Collision Causing Buffer Overflow and Data Corruption added
Updated by Upkeep Bot about 1 month ago
- Copied to Backport #74943: tentacle: ISA Erasure Code Cache Collision Causing Buffer Overflow and Data Corruption added
Updated by Upkeep Bot about 1 month ago
- Copied to Backport #74944: squid: ISA Erasure Code Cache Collision Causing Buffer Overflow and Data Corruption added
Updated by Upkeep Bot about 1 month ago
- Tags (freeform) set to backport_processed