Skip to content

Dynamic memory_limit_mb for TSan (should fix OOMs for TSan stateless jobs)#85745

Merged
azat merged 1 commit intoClickHouse:masterfrom
azat:ci-dynamic-tsan-memory_limit_mb
Aug 18, 2025
Merged

Dynamic memory_limit_mb for TSan (should fix OOMs for TSan stateless jobs)#85745
azat merged 1 commit intoClickHouse:masterfrom
azat:ci-dynamic-tsan-memory_limit_mb

Conversation

@azat
Copy link
Copy Markdown
Member

@azat azat commented Aug 16, 2025

Otherwise after using different type of instances (in particular VMs with 32GiB of RAM for sequential stateless runs), clickhouse-server may got KILLed by OOM killer 1, and it should be TSan internal memory.

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

@azat azat added the 🍃 green ci 🌿 Fixing flaky tests in CI label Aug 16, 2025
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Aug 16, 2025

Workflow [PR], commit [83d2266]

Summary:

@clickhouse-gh clickhouse-gh bot added the pr-not-for-changelog This PR should not be mentioned in the changelog label Aug 16, 2025
@azat azat force-pushed the ci-dynamic-tsan-memory_limit_mb branch from 89a13fc to 177b17a Compare August 16, 2025 15:48
tsan_memory_limit_mb=Utils.physical_memory() * 65 // 100 // 1024 // 1024 // replicas

env = os.environ.copy()
env["TSAN_OPTIONS"] = f"memory_limit_mb={tsan_memory_limit_mb}"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand correctly that sanitized builds allocate memory outside the memory tracker, which appears to be causing the issue? If so, can we cap memory usage for all sanitized builds?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand correctly that sanitized builds allocate memory outside the memory tracker, which appears to be causing the issue?

Yes

If so, can we cap memory usage for all sanitized builds?

Sadly, it is not possible

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not that advance, you can limit some internal things in TSan, but there will be other allocations anyway

@azat
Copy link
Copy Markdown
Member Author

azat commented Aug 18, 2025

CI failures all known

@azat azat force-pushed the ci-dynamic-tsan-memory_limit_mb branch 2 times, most recently from 1fc51b9 to a99e51c Compare August 18, 2025 12:30
@nickitat nickitat self-assigned this Aug 18, 2025
self.env_variables["ASAN_OPTIONS"] = "use_sigaltstack=0"
self.env_variables["TSAN_OPTIONS"] = "use_sigaltstack=0"
# In integration tests we spawn multiple servers, so let's aim to not more then 5GiB
self.env_variables["TSAN_OPTIONS"] = f"use_sigaltstack=0 memory_limit_mb=5120"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't it be too conservative?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is OK, in integration tests we spawn multiple servers, i.e. we can have 10+ in parallel running, this leaves us with ~6GiB per server, and 5GiB limt for TSan looks reasonable

…jobs)

Otherwise after using different type of instances (in particular VMs
with 32GiB of RAM for sequential stateless runs), clickhouse-server may
got KILLed by OOM killer [1], and it should be TSan internal memory.

  [1]: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=85698&sha=528856cc8dba02ab2b6ecb7024ac6e23b0e4fc8a&name_0=PR&name_1=Stateless%20tests%20%28amd_tsan%2C%20sequential%2C%201%2F2%29
@azat azat force-pushed the ci-dynamic-tsan-memory_limit_mb branch from a99e51c to 83d2266 Compare August 18, 2025 13:32
@azat azat enabled auto-merge August 18, 2025 13:47
@azat azat added this pull request to the merge queue Aug 18, 2025
Merged via the queue into ClickHouse:master with commit b6bf02e Aug 18, 2025
122 checks passed
@azat azat deleted the ci-dynamic-tsan-memory_limit_mb branch August 18, 2025 17:23
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Aug 18, 2025
@azat
Copy link
Copy Markdown
Member Author

azat commented Aug 25, 2025

And 03532_crash_in_aggregation_because_of_lost_exception became stable

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

🍃 green ci 🌿 Fixing flaky tests in CI pr-not-for-changelog This PR should not be mentioned in the changelog pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants