Skip to content

Fix flaky test_overcommit_tracker/test_user_overcommit#100567

Closed
alexey-milovidov wants to merge 1 commit intomasterfrom
fix-flaky-test_overcommit_tracker
Closed

Fix flaky test_overcommit_tracker/test_user_overcommit#100567
alexey-milovidov wants to merge 1 commit intomasterfrom
fix-flaky-test_overcommit_tracker

Conversation

@alexey-milovidov
Copy link
Copy Markdown
Member

Fix flaky test_overcommit_tracker/test_user_overcommit under MSan.

The test is probabilistic: it relies on memory pressure to kill queries with low memory_overcommit_ratio_denominator while sparing those with high ratio. Under MSan (~3x memory overhead), all queries could get killed in a single attempt.

Changes:

  • Reduce numbers(2500000) to numbers(1000000) to lower per-query memory usage
  • Add retry loop (up to 5 attempts) since the test is inherently probabilistic

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100404&sha=4e239671fdc3523725369f5ad6028b4a52c43b45&name_0=PR&name_1=Integration%20tests%20%28amd_msan%2C%204%2F6%29

Related issue: #45173

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

The test is probabilistic: it relies on memory pressure to kill queries
with low `memory_overcommit_ratio_denominator` while sparing those with
high ratio. Under MSan (which has ~3x memory overhead), a single attempt
could kill all queries, causing the assertion to fail.

Two changes:
- Reduce `numbers(2500000)` to `numbers(1000000)` to lower per-query
  memory usage, making it more likely that B queries survive.
- Add a retry loop (up to 5 attempts) since the test is inherently
  probabilistic.

https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100404&sha=4e239671fdc3523725369f5ad6028b4a52c43b45&name_0=PR&name_1=Integration%20tests%20%28amd_msan%2C%204%2F6%29

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 24, 2026

Workflow [PR], commit [edf15b4]

Summary:


AI Review

Summary

This PR reduces memory load and adds retries in test_overcommit_tracker/test_user_overcommit to reduce sanitizer-related flakiness. The flakiness motivation is clear, but the change can make the test pass without validating the overcommit-selection behavior it is supposed to exercise. I recommend a small assertion update before merge.

Findings

⚠️ Majors

  • [tests/integration/test_overcommit_tracker/test.py:25] Lowering numbers(2500000) to numbers(1000000) while only asserting that at least one B query succeeds can let the test pass when no overcommit rejection happens at all. That weakens semantic coverage and may hide regressions in overcommit arbitration.
    • Suggested fix: in the successful attempt, also require at least one A query to fail with MEMORY_LIMIT_EXCEEDED (or another explicit signal that arbitration occurred).

ClickHouse Rules

Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout
Compilation time

Final Verdict

  • Status: ⚠️ Request changes
  • Minimum required action: ensure the test still verifies overcommit arbitration (not only eventual query success), e.g. by asserting at least one A query is killed in a successful attempt.

@clickhouse-gh clickhouse-gh bot added the pr-ci label Mar 24, 2026

USER_TEST_QUERY_A = "SELECT groupArray(number) FROM numbers(2500000) SETTINGS max_memory_usage_for_user=2000000000,memory_overcommit_ratio_denominator=1"
USER_TEST_QUERY_B = "SELECT groupArray(number) FROM numbers(2500000) SETTINGS max_memory_usage_for_user=2000000000,memory_overcommit_ratio_denominator=80000000"
USER_TEST_QUERY_A = "SELECT groupArray(number) FROM numbers(1000000) SETTINGS max_memory_usage_for_user=2000000000,memory_overcommit_ratio_denominator=1"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ This change lowers numbers(...) from 2500000 to 1000000, but the test still asserts only that at least one B query finishes. With lower memory pressure, it can pass even when overcommit selection never happens (e.g. all queries succeed), so it may stop validating the intended behavior.

Please also assert that at least one low-priority A query is rejected with MEMORY_LIMIT_EXCEEDED in a successful attempt (or otherwise prove overcommit arbitration happened).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant