Subtract slab_reclaimable from kernel memory in cgroupv2 reader#100901
antonio2368 merged 3 commits into master
The cgroupv2 memory usage calculation now uses `anon + sock + (kernel - slab_reclaimable)` instead of `anon + sock + kernel`. The `slab_reclaimable` portion is reclaimed synchronously by the kernel under memory pressure before invoking the OOM killer, so it should not count against the application's memory budget. Also refactors `readMetricsFromStatFile` to populate a caller-owned map of individual metric values instead of returning a sum, allowing callers to apply their own aggregation logic. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
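The aggregation described in this commit can be sketched as follows. This is a minimal illustration, not the code from `MemoryWorker.cpp`: `StatMap`, `parseMemoryStat`, and `memoryUsageV2` are hypothetical names, and clamping the kernel term at zero is an assumption made here to keep the unsigned subtraction safe.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical type: one entry per "key value" line of memory.stat.
using StatMap = std::unordered_map<std::string, uint64_t>;

// Hypothetical parser: fills a map with every metric found in the text,
// leaving the choice of aggregation to the caller.
StatMap parseMemoryStat(const std::string & text)
{
    StatMap metrics;
    std::istringstream in(text);
    std::string key;
    uint64_t value = 0;
    while (in >> key >> value)
        metrics[key] = value;
    return metrics;
}

// Computes anon + sock + (kernel - slab_reclaimable).
// Missing keys count as 0 in this sketch.
uint64_t memoryUsageV2(const StatMap & metrics)
{
    auto get = [&](const char * key) -> uint64_t
    {
        auto it = metrics.find(key);
        return it == metrics.end() ? 0 : it->second;
    };

    const uint64_t kernel = get("kernel");
    const uint64_t reclaimable = get("slab_reclaimable");
    // Assumption: clamp at zero so the unsigned subtraction cannot wrap.
    const uint64_t non_reclaimable_kernel = kernel > reclaimable ? kernel - reclaimable : 0;

    return get("anon") + get("sock") + non_reclaimable_kernel;
}
```

For example, `memoryUsageV2(parseMemoryStat("anon 100\nsock 10\nkernel 50\nslab_reclaimable 30\n"))` yields 130, where the previous `anon + sock + kernel` formula would have given 160.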
|
Workflow [PR], commit [ea7a8e3] Summary: ❌
|
Use `const auto *` instead of `auto` for the iterator over `std::initializer_list<std::string_view>`, matching the existing style on line 88. https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100901&sha=b4ef919a6193782e42a3c463a9bc0a43c632ebb0&name_0=PR&name_1=Build%20%28arm_tidy%29 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@azat my comment on your PR is this PR, see if some of the changes make more sense. I can close this PR. |
src/Common/MemoryWorker.cpp
Outdated
| stat_buf.rewind(); | ||
| return readMetricsFromStatFile(stat_buf, {"anon", "sock", "kernel"}, {"kernel"}, &warnings_printed); | ||
| readMetricsFromStatFile( | ||
| stat_buf, metrics, {"anon", "sock", "kernel", "slab_reclaimable"}, {"kernel", "slab_reclaimable"}, &warnings_printed); |
AFAICS slab_reclaimable was there from the beginning
And TBH I don't think we need this separation for optional keys and mandatory keys, we already have flag to print warnings only once, so we can simply print it for any keys, that way we will also know that the kernel was old (or we have a bug)
All keys (`kernel`, `slab_reclaimable`) have been in cgroupv2 `memory.stat` from the beginning, so there is no need for an optional/mandatory distinction. Warn once on any missing key — this also surfaces old-kernel or buggy environments. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
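A rough sketch of the "warn once on any missing key" behavior, assuming a map of already-parsed metrics. `checkKeysPresent` is a hypothetical helper, and the flag is passed by reference here for brevity (the real call site passes `&warnings_printed`); the warning text is also made up.

```cpp
#include <cstdint>
#include <initializer_list>
#include <iostream>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical helper: returns true if every expected key is present.
// All keys are treated the same (no optional/mandatory split); missing keys
// are reported only on the first failing call.
bool checkKeysPresent(
    const std::unordered_map<std::string, uint64_t> & metrics,
    std::initializer_list<std::string_view> expected_keys,
    bool & warnings_printed)
{
    bool all_present = true;
    for (const auto * it = expected_keys.begin(); it != expected_keys.end(); ++it)
    {
        if (metrics.count(std::string(*it)) != 0)
            continue;

        all_present = false;
        if (!warnings_printed)
            std::cerr << "Cannot find '" << *it << "' in memory.stat\n";
    }

    // Latch the flag after the loop so every key missing in this call is reported once.
    if (!all_present)
        warnings_printed = true;
    return all_present;
}
```

With this shape, an old kernel or a parsing bug shows up in the log once instead of being silently ignored or repeated on every read.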
LLVM Coverage Report
Changed lines: 92.63% (88/95) · Uncovered code |
|
@antonio2368 let's also include details of how it can affect (you can get them from my PR) |
|
@azat do you mean to update PR description or the comment in code? |
|
PR description |
Cherry pick #100901 to 25.8: Subtract slab_reclaimable from kernel memory in cgroupv2 reader
Cherry pick #100901 to 26.1: Subtract slab_reclaimable from kernel memory in cgroupv2 reader
Cherry pick #100901 to 26.2: Subtract slab_reclaimable from kernel memory in cgroupv2 reader
Cherry pick #100901 to 26.3: Subtract slab_reclaimable from kernel memory in cgroupv2 reader
Backport #100901 to 26.3: Subtract slab_reclaimable from kernel memory in cgroupv2 reader
The cgroupv2 memory usage calculation now uses `anon + sock + (kernel - slab_reclaimable)` instead of `anon + sock + kernel`. The `slab_reclaimable` portion is reclaimed synchronously by the kernel under memory pressure before invoking the OOM killer, so it should not count against the application's memory budget. This gives a more accurate picture of actual memory pressure.

On one instance, @azat saw that it can take up a significant portion of memory:
Process RSS: 8.47 GiB
anon: 7.75 GiB
kernel: 4.22 GiB (of which slab_reclaimable = 3.85 GiB)
MemoryTracking: 11.99 GiB (inflated ~42% vs real RSS)
The 3.85 GiB of reclaimable slab was dominated by `ext4_inode_cache` (with only 56% usage).
This inflated tracker causes premature `MEMORY_LIMIT_EXCEEDED` errors.

Cgroups v1 is not affected: its `rss` field excludes kernel memory.
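The reported figures can be checked with a little arithmetic. The helpers below are hypothetical and only restate the numbers quoted in this description (in GiB). `anon + kernel` is 11.97 GiB, so the remaining ~0.02 GiB of `MemoryTracking` is presumably `sock` plus rounding, which is an assumption.

```cpp
#include <cmath>

// Hypothetical helper: how much the tracked value exceeds RSS, in percent.
double inflationPercent(double tracked_gib, double rss_gib)
{
    return (tracked_gib / rss_gib - 1.0) * 100.0;
}

// Hypothetical helper: tracked value after excluding reclaimable slab.
double adjustedTracking(double tracked_gib, double slab_reclaimable_gib)
{
    return tracked_gib - slab_reclaimable_gib;
}

// With the figures from the report:
//   inflationPercent(11.99, 8.47) is about 41.6, matching the quoted ~42%
//   adjustedTracking(11.99, 3.85) is 8.14, close to the 8.47 GiB process RSS
```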
Refs: #82036, #83981
Also refactors `readMetricsFromStatFile` to populate a caller-owned map of individual metric values instead of returning a sum, allowing callers to apply their own aggregation logic.

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Cgroupv2 memory tracking now excludes `slab_reclaimable` from kernel memory, giving a more accurate measure of non-reclaimable memory usage.

cc @al13n321 as you added kernel for page cache.
Documentation entry for user-facing changes