Exclude slab_reclaimable from cgroups v2 memory tracking #100898
azat wants to merge 1 commit into ClickHouse:master
The `CgroupsV2Reader` sums `anon + sock + kernel` from `memory.stat` to feed `MemoryTracking`. But `kernel` includes `slab_reclaimable` (dentry/inode cache), a filesystem metadata cache that the kernel drops under memory pressure, which makes it functionally equivalent to page cache.

On one instance it accounts for a significant portion of memory:

Process RSS: 8.47 GiB
anon: 7.75 GiB
kernel: 4.22 GiB (of which slab_reclaimable = 3.85 GiB)
MemoryTracking: 11.99 GiB (inflated ~42% vs real RSS)

The 3.85 GiB of reclaimable slab was dominated by `ext4_inode_cache` (with only 56% usage). This inflated tracker causes premature `MEMORY_LIMIT_EXCEEDED` errors.

Cgroups v1 is not affected, because its `rss` field excludes kernel memory.

Refs: ClickHouse#82036, ClickHouse#83981
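The accounting described above can be sketched in a few lines. This is a minimal illustration, not the ClickHouse implementation: `parseMemoryStat` and `trackedBytes` are hypothetical names, the input is assumed to be the plain `key value` text of `memory.stat`, and missing keys default to zero here purely for brevity.

```cpp
#include <algorithm>
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical helper: parse "key value" lines of a cgroups v2 memory.stat dump.
std::unordered_map<std::string, uint64_t> parseMemoryStat(const std::string & text)
{
    std::unordered_map<std::string, uint64_t> metrics;
    std::istringstream in(text);
    std::string key;
    uint64_t value = 0;
    while (in >> key >> value)
        metrics[key] = value;
    return metrics;
}

// Hypothetical helper: anon + sock + kernel, minus the reclaimable slab
// that is already counted inside kernel.
uint64_t trackedBytes(const std::unordered_map<std::string, uint64_t> & metrics)
{
    // Assumption for this sketch only: a missing key counts as zero.
    auto get = [&](const char * key) -> uint64_t
    {
        auto it = metrics.find(key);
        return it == metrics.end() ? 0 : it->second;
    };

    const uint64_t kernel = get("kernel");
    // Clamp so the subtraction cannot underflow if the counters are inconsistent.
    const uint64_t reclaimable = std::min(get("slab_reclaimable"), kernel);
    return get("anon") + get("sock") + (kernel - reclaimable);
}
```

Applied to the figures above, subtracting the 3.85 GiB of reclaimable slab from the 11.99 GiB tracker value would leave roughly 8.14 GiB, which is much closer to the 8.47 GiB process RSS.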
Workflow [PR], commit [3743b4d] Summary: ❌
AI Review
Summary
This PR changes cgroups v2 memory accounting to subtract
Findings
Tests
ClickHouse Rules
Performance & Safety
The safety concern is not throughput but correctness under parser/format drift: silent missing-key fallback to zero can under-report actual memory pressure and delay protective exceptions.
Final Verdict
  std::lock_guard lock(mutex);
  stat_buf.rewind();
- return readMetricsFromStatFile(stat_buf, {"anon", "sock", "kernel"}, {"kernel"}, &warnings_printed);
+ auto metrics = readNamedMetricsFromStatFile(stat_buf, {"anon", "sock", "kernel", "slab_reclaimable"});
This change drops the validation behavior that `readMetricsFromStatFile` had (missing key / duplicate key logging). With `metrics["anon"]` / `metrics["sock"]`, missing keys are now silently treated as 0, so parser/schema regressions in `memory.stat` can undercount memory usage without any signal.
That is risky for memory-limit enforcement (possible late/absent `MEMORY_LIMIT_EXCEEDED`). Please keep the previous validation semantics for required keys (`anon`, `sock`), and preserve duplicate-key diagnostics in the new helper.
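One way to read this request is a helper that separates required from optional keys and reports both missing and duplicate keys instead of silently returning zero. The sketch below only illustrates that idea: `StatResult` and `readCheckedMetrics` are hypothetical names, the diagnostics are collected in a vector rather than logged, and keeping the first value of a duplicated key is an assumption, not the behavior of the existing ClickHouse helpers.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result type: parsed values plus human-readable diagnostics.
struct StatResult
{
    std::map<std::string, uint64_t> values;
    std::vector<std::string> warnings;
};

// Hypothetical helper: read only the requested keys and record every anomaly.
StatResult readCheckedMetrics(
    const std::string & text,
    const std::set<std::string> & required,
    const std::set<std::string> & optional)
{
    StatResult result;
    std::istringstream in(text);
    std::string key;
    uint64_t value = 0;

    while (in >> key >> value)
    {
        if (!required.count(key) && !optional.count(key))
            continue; // not a key we were asked for

        // emplace does not overwrite, so the first occurrence wins (assumption).
        if (!result.values.emplace(key, value).second)
            result.warnings.push_back("duplicate key: " + key);
    }

    // Missing keys are reported instead of being silently treated as zero.
    for (const auto & name : required)
        if (!result.values.count(name))
            result.warnings.push_back("missing required key: " + name);
    for (const auto & name : optional)
        if (!result.values.count(name))
            result.warnings.push_back("missing optional key: " + name);

    return result;
}
```

A caller could then treat a missing required key (`anon`, `sock`) as a hard error while tolerating an absent `slab_reclaimable`, for example on kernels whose `memory.stat` does not expose it.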
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Exclude `slab_reclaimable` from cgroups v2 memory tracking.