Skip test_overcommit_tracker/test_user_overcommit under MSan#100621
Skip test_overcommit_tracker/test_user_overcommit under MSan#100621alexey-milovidov merged 8 commits intomasterfrom
Conversation
The test relies on memory pressure to selectively kill queries based on `memory_overcommit_ratio_denominator`. Under MSan (~3x memory overhead), the test is too unreliable. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100404&sha=4e239671fdc3523725369f5ad6028b4a52c43b45&name_0=PR&name_1=Integration%20tests%20%28amd_msan%2C%204%2F6%29 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Workflow [PR], commit [9da63c5] Summary: ✅ AI ReviewSummaryThis PR updates ClickHouse Rules
Final Verdict
|
The test failed under ASan with "No route to host" because the server was killed by OOM during the memory pressure phase (ASan has ~2x memory overhead). This matches `test_global_overcommit_tracker` which already skips under TSan, ASan, MSan, and LLVM coverage. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100621&sha=edaffdae42cd6e7dfcefec21e766f15f479e24ff&name_0=PR&name_1=Integration%20tests%20%28amd_asan_ubsan%2C%20flaky%29 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
|
||
| def test_user_overcommit(): | ||
| if ( | ||
| node.is_built_with_thread_sanitizer() |
There was a problem hiding this comment.
The PR title/description says this should be skipped under MSan, but this condition now skips under TSan, ASan, and LLVM coverage as well. That broadens the reduction in coverage and can hide regressions outside the reported failure mode. Please either narrow this to is_built_with_memory_sanitizer() only, or add evidence that each additional mode is also consistently OOM/flaky for this test.
There was a problem hiding this comment.
@alexey-milovidov not related to the change, but am I correct to suspect that we never check overcommited_killed? The logic should be
assert A_is_killed and not B_is_killed
isn't it?
Co-authored-by: Arsen Muk <arsen.mukuchyan@gmail.com>
| if err == "": | ||
| finished = True | ||
|
|
||
| assert overcommited_killed, "overcommited is not killed" |
There was a problem hiding this comment.
Typo on a changed line: overcommited → overcommitted in the assertion message.
Suggested message: assert overcommited_killed, "overcommitted query is not killed".
Address review feedback: - Narrow the sanitizer skip to MSan only (where ~3x memory overhead makes the test consistently unreliable), removing unnecessary skips for ASan, TSan, and coverage builds. - Add retry logic (up to 3 attempts) for the nondeterministic memory pressure behavior that can cause sporadic failures on regular builds. - Fix typo in assertion message: "overcommited" -> "overcommitted". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| break | ||
|
|
||
| last_error = ( | ||
| f"attempt {attempt + 1}: overcommited_killed={overcommited_killed}, finished={finished}" |
There was a problem hiding this comment.
Minor typo in changed user-visible diagnostics: overcommited_killed should be overcommitted_killed in this last_error string for consistency/readability.
Suggested change:
last_error = (
f"attempt {attempt + 1}: overcommitted_killed={overcommited_killed}, finished={finished}"
)Address review feedback: fix the spelling of "overcommitted" in variable names and diagnostic messages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI shows the test also times out under ASan/UBSan (600s timeout). The memory overhead from sanitizers makes memory pressure tests unreliable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rcommit-tracker Skip test_overcommit_tracker/test_user_overcommit under MSan
| ) | ||
|
|
||
| assert finished, "all tasks are killed" | ||
| assert overcommitted_killed, f"overcommitted query is not killed ({last_error})" |
The test verifies that the memory overcommit tracker selectively kills queries with low memory_overcommit_ratio_denominator under user memory pressure. It became flaky after PR ClickHouse#100621 added the assert overcommitted_killed check, because slow CI builds (coverage, ARM, sanitizers) run fewer concurrent queries, so peak user memory rarely reached the old 2 GB limit. Root cause: 100 queries × ~20 MB each need 100+ concurrent overlap to hit 2 GB; slow builds achieve maybe 10-20 concurrent, well below the threshold. Changes: - Lower max_memory_usage_for_user from 2 GB to 300 MB — even 8 concurrent queries at 40 MB each now exceed the limit - Increase per-query memory from ~20 MB to ~40 MB (numbers(5000000)) - Reduce total queries from 100 to 40 (still plenty for the test) - Skip under ThreadSanitizer (joins MSan/ASan/coverage skip) — TSAN 5x memory overhead causes all queries to be killed including the "protected" ones, producing the opposite failure mode - Increase retry attempts from 3 to 5 for probabilistic robustness Fixes: ClickHouse#101318 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Skip the test under MSan instead of adding retries — simpler fix.
The test relies on memory pressure to selectively kill queries based on
memory_overcommit_ratio_denominator. Under MSan (~3x memory overhead), the test is too unreliable.Supersedes #100567
CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100404&sha=4e239671fdc3523725369f5ad6028b4a52c43b45&name_0=PR&name_1=Integration%20tests%20%28amd_msan%2C%204%2F6%29
Related issue: #45173
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
...
Documentation entry for user-facing changes