Skip to content

Exclude slab_reclaimable from cgroups v2 memory tracking#100898

Closed
azat wants to merge 1 commit intoClickHouse:masterfrom
azat:fix-cgroup-v2-slab-reclaimable
Closed

Exclude slab_reclaimable from cgroups v2 memory tracking#100898
azat wants to merge 1 commit intoClickHouse:masterfrom
azat:fix-cgroup-v2-slab-reclaimable

Conversation

@azat
Copy link
Copy Markdown
Member

@azat azat commented Mar 27, 2026

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Exclude slab_reclaimable from cgroups v2 memory tracking

The CgroupsV2Reader sums anon + sock + kernel from memory.stat to feed MemoryTracking. But kernel includes slab_reclaimable (dentry/inode cache) — a filesystem metadata cache the kernel drops under memory pressure, functionally equivalent to page cache.

On one of instances I can see that it can take significant portion of memory:

Process RSS: 8.47 GiB
anon: 7.75 GiB
kernel: 4.22 GiB (of which slab_reclaimable = 3.85 GiB)
MemoryTracking: 11.99 GiB (inflated ~42% vs real RSS)

The 3.85 GiB of reclaimable slab was dominated by ext4_inode_cache (with 56% usage only).

This inflated tracker causes premature MEMORY_LIMIT_EXCEEDED errors.

Cgroups v1 is not affected — its rss field excludes kernel memory.

Refs: #82036, #83981

The `CgroupsV2Reader` sums `anon + sock + kernel` from `memory.stat`
to feed `MemoryTracking`. But `kernel` includes `slab_reclaimable`
(dentry/inode cache) — a filesystem metadata cache the kernel drops
under memory pressure, functionally equivalent to page cache.

On one of instances I can see that it can take significant portion of
memory:

  Process RSS:      8.47 GiB
  anon:             7.75 GiB
  kernel:           4.22 GiB  (of which slab_reclaimable = 3.85 GiB)
  MemoryTracking:  11.99 GiB  (inflated ~42% vs real RSS)

The 3.85 GiB of reclaimable slab was dominated by `ext4_inode_cache`
(with 56% usage only).

This inflated tracker causes premature `MEMORY_LIMIT_EXCEEDED` errors.

Cgroups v1 is not affected — its `rss` field excludes kernel memory.

Refs: ClickHouse#82036, ClickHouse#83981
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 27, 2026

Workflow [PR], commit [3743b4d]

Summary:

job_name test_name status info comment
Fast test failure
Build ClickHouse failure
Build (arm_tidy) failure
Build ClickHouse failure cidb
Build (amd_debug) dropped
Build (amd_asan_ubsan) dropped
Build (amd_tsan) dropped
Build (amd_msan) dropped
Build (amd_binary) dropped
Build (arm_asan_ubsan) dropped
Build (arm_binary) dropped
Build (amd_release) dropped

AI Review

Summary

This PR changes cgroups v2 memory accounting to subtract slab_reclaimable from kernel when computing tracked usage, which addresses inflated memory tracking. The direction is good, but the new parser path introduces a correctness/robustness regression: required metrics are no longer validated, so missing keys can silently become zero and undercount memory usage.

Findings

⚠️ Majors

  • [src/Common/MemoryWorker.cpp:174] readNamedMetricsFromStatFile + metrics["anon"]/metrics["sock"] removes previous missing/duplicate-key diagnostics from readMetricsFromStatFile. Missing required keys are now silently interpreted as 0, which can undercount cgroup memory and weaken memory-limit enforcement.
    • Suggested fix: preserve validation semantics for required keys (anon, sock) and duplicate-key detection/logging in the new helper (or reuse readMetricsFromStatFile with explicit subtraction logic).
Tests
  • ⚠️ Add a gtest_cgroups_reader case where memory.stat is missing a required key (e.g. anon) and verify we still emit an error log / keep robust behavior instead of silently returning underestimated usage.
  • ⚠️ Add a duplicate-key case (anon repeated) to ensure diagnostics are preserved after this refactor.
ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout ⚠️ Silent metric-parse regressions can reduce memory-limit safety in production rollout
Compilation time
Performance & Safety

The safety concern is not throughput but correctness under parser/format drift: silent missing-key fallback to zero can under-report actual memory pressure and delay protective exceptions.

Final Verdict
  • Status: ⚠️ Request changes
  • Minimum required actions:
    • Restore required-key validation and duplicate-key diagnostics for cgroups v2 metric parsing.
    • Add tests for missing required keys and duplicate keys in memory.stat.

@clickhouse-gh clickhouse-gh bot added the pr-improvement Pull request with some product improvements label Mar 27, 2026
std::lock_guard lock(mutex);
stat_buf.rewind();
return readMetricsFromStatFile(stat_buf, {"anon", "sock", "kernel"}, {"kernel"}, &warnings_printed);
auto metrics = readNamedMetricsFromStatFile(stat_buf, {"anon", "sock", "kernel", "slab_reclaimable"});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change drops the validation behavior that readMetricsFromStatFile had (missing key / duplicate key logging). With metrics["anon"] / metrics["sock"], missing keys are now silently treated as 0, so parser/schema regressions in memory.stat can undercount memory usage without any signal.

That is risky for memory-limit enforcement (possible late/absent MEMORY_LIMIT_EXCEEDED). Please keep the previous validation semantics for required keys (anon, sock), and preserve duplicate-key diagnostics in the new helper.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-improvement Pull request with some product improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants