Skip to content

Skip test_overcommit_tracker/test_user_overcommit under MSan#100621

Merged
alexey-milovidov merged 8 commits intomasterfrom
no-msan-test-overcommit-tracker
Mar 30, 2026
Merged

Skip test_overcommit_tracker/test_user_overcommit under MSan#100621
alexey-milovidov merged 8 commits intomasterfrom
no-msan-test-overcommit-tracker

Conversation

@alexey-milovidov
Copy link
Copy Markdown
Member

Skip the test under MSan instead of adding retries — simpler fix.

The test relies on memory pressure to selectively kill queries based on memory_overcommit_ratio_denominator. Under MSan (~3x memory overhead), the test is too unreliable.

Supersedes #100567

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100404&sha=4e239671fdc3523725369f5ad6028b4a52c43b45&name_0=PR&name_1=Integration%20tests%20%28amd_msan%2C%204%2F6%29

Related issue: #45173

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

The test relies on memory pressure to selectively kill queries based on
`memory_overcommit_ratio_denominator`. Under MSan (~3x memory overhead),
the test is too unreliable.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100404&sha=4e239671fdc3523725369f5ad6028b4a52c43b45&name_0=PR&name_1=Integration%20tests%20%28amd_msan%2C%204%2F6%29

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 24, 2026

Workflow [PR], commit [9da63c5]

Summary:


AI Review

Summary

This PR updates tests/integration/test_overcommit_tracker/test.py to reduce flakiness by skipping test_user_overcommit on sanitizer builds with high memory overhead and by adding retry logic for nondeterministic memory-pressure behavior. After reviewing the final diff and existing inline discussions, I did not find additional correctness, safety, or performance issues that warrant new inline comments.

ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout
Compilation time
Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh bot added the pr-ci label Mar 24, 2026
The test failed under ASan with "No route to host" because the server
was killed by OOM during the memory pressure phase (ASan has ~2x memory
overhead). This matches `test_global_overcommit_tracker` which already
skips under TSan, ASan, MSan, and LLVM coverage.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100621&sha=edaffdae42cd6e7dfcefec21e766f15f479e24ff&name_0=PR&name_1=Integration%20tests%20%28amd_asan_ubsan%2C%20flaky%29

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

def test_user_overcommit():
if (
node.is_built_with_thread_sanitizer()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR title/description says this should be skipped under MSan, but this condition now skips under TSan, ASan, and LLVM coverage as well. That broadens the reduction in coverage and can hide regressions outside the reported failure mode. Please either narrow this to is_built_with_memory_sanitizer() only, or add evidence that each additional mode is also consistently OOM/flaky for this test.

@arsenmuk arsenmuk self-assigned this Mar 25, 2026
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexey-milovidov not related to the change, but am I correct to suspect that we never check overcommited_killed? The logic should be

assert A_is_killed and not B_is_killed

isn't it?

Co-authored-by: Arsen Muk <arsen.mukuchyan@gmail.com>
if err == "":
finished = True

assert overcommited_killed, "overcommited is not killed"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo on a changed line: overcommitedovercommitted in the assertion message.

Suggested message: assert overcommited_killed, "overcommitted query is not killed".

alexey-milovidov and others added 2 commits March 28, 2026 16:31
Address review feedback:
- Narrow the sanitizer skip to MSan only (where ~3x memory overhead
  makes the test consistently unreliable), removing unnecessary skips
  for ASan, TSan, and coverage builds.
- Add retry logic (up to 3 attempts) for the nondeterministic memory
  pressure behavior that can cause sporadic failures on regular builds.
- Fix typo in assertion message: "overcommited" -> "overcommitted".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
break

last_error = (
f"attempt {attempt + 1}: overcommited_killed={overcommited_killed}, finished={finished}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor typo in changed user-visible diagnostics: overcommited_killed should be overcommitted_killed in this last_error string for consistency/readability.

Suggested change:

last_error = (
    f"attempt {attempt + 1}: overcommitted_killed={overcommited_killed}, finished={finished}"
)

alexey-milovidov and others added 3 commits March 28, 2026 23:50
Address review feedback: fix the spelling of "overcommitted" in variable
names and diagnostic messages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI shows the test also times out under ASan/UBSan (600s timeout).
The memory overhead from sanitizers makes memory pressure tests unreliable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alexey-milovidov alexey-milovidov merged commit 1433e40 into master Mar 30, 2026
152 of 153 checks passed
@alexey-milovidov alexey-milovidov deleted the no-msan-test-overcommit-tracker branch March 30, 2026 03:26
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Mar 30, 2026
Desel72 pushed a commit to Desel72/ClickHouse that referenced this pull request Mar 30, 2026
…rcommit-tracker

Skip test_overcommit_tracker/test_user_overcommit under MSan
)

assert finished, "all tasks are killed"
assert overcommitted_killed, f"overcommitted query is not killed ({last_error})"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

groeneai added a commit to groeneai/ClickHouse that referenced this pull request Apr 2, 2026
The test verifies that the memory overcommit tracker selectively kills
queries with low memory_overcommit_ratio_denominator under user memory
pressure. It became flaky after PR ClickHouse#100621 added the assert
overcommitted_killed check, because slow CI builds (coverage, ARM,
sanitizers) run fewer concurrent queries, so peak user memory rarely
reached the old 2 GB limit.

Root cause: 100 queries × ~20 MB each need 100+ concurrent overlap to
hit 2 GB; slow builds achieve maybe 10-20 concurrent, well below the
threshold.

Changes:
- Lower max_memory_usage_for_user from 2 GB to 300 MB — even 8
  concurrent queries at 40 MB each now exceed the limit
- Increase per-query memory from ~20 MB to ~40 MB (numbers(5000000))
- Reduce total queries from 100 to 40 (still plenty for the test)
- Skip under ThreadSanitizer (joins MSan/ASan/coverage skip) — TSAN
  5x memory overhead causes all queries to be killed including the
  "protected" ones, producing the opposite failure mode
- Increase retry attempts from 3 to 5 for probabilistic robustness

Fixes: ClickHouse#101318

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants