Dynamic memory_limit_mb for TSan (should fix OOMs for TSan stateless jobs) #85745
ci/jobs/scripts/clickhouse_proc.py
tsan_memory_limit_mb=Utils.physical_memory() * 65 // 100 // 1024 // 1024 // replicas

env = os.environ.copy()
env["TSAN_OPTIONS"] = f"memory_limit_mb={tsan_memory_limit_mb}"
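The sizing rule in the snippet above (65% of physical RAM, converted to MiB and split evenly between replicas) can be sketched in isolation as follows. The helper names `physical_memory_bytes`, `tsan_memory_limit_mb` and `tsan_env` are hypothetical stand-ins for illustration, not the CI's actual `Utils.physical_memory()`; the sketch assumes a Linux host where `os.sysconf` exposes page counts, and, unlike the snippet, it appends to any existing `TSAN_OPTIONS` rather than replacing them.

```python
import os


def physical_memory_bytes() -> int:
    """Total physical RAM in bytes (hypothetical stand-in for Utils.physical_memory())."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def tsan_memory_limit_mb(total_bytes: int, replicas: int, percent: int = 65) -> int:
    """Give each replica an equal share of `percent`% of RAM, expressed in MiB."""
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    return total_bytes * percent // 100 // 1024 // 1024 // replicas


def tsan_env(base_env: dict, limit_mb: int) -> dict:
    """Return a copy of base_env with memory_limit_mb added to TSAN_OPTIONS."""
    env = dict(base_env)
    option = f"memory_limit_mb={limit_mb}"
    existing = env.get("TSAN_OPTIONS", "")
    # Assumption: keep options that are already set instead of overwriting them.
    env["TSAN_OPTIONS"] = f"{existing} {option}".strip()
    return env


if __name__ == "__main__":
    limit = tsan_memory_limit_mb(physical_memory_bytes(), replicas=1)
    print(tsan_env(os.environ, limit)["TSAN_OPTIONS"])
```

For a 32GiB host this gives 21299 MiB with one replica and 10649 MiB per server with two replicas.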
Do I understand correctly that sanitized builds allocate memory outside the memory tracker, which appears to be causing the issue? If so, can we cap memory usage for all sanitized builds?
> Do I understand correctly that sanitized builds allocate memory outside the memory tracker, which appears to be causing the issue?

Yes.

> If so, can we cap memory usage for all sanitized builds?

Sadly, it is not possible.
It is not that advanced: you can limit some internal things in TSan, but there will be other allocations anyway.
CI failures are all known.
self.env_variables["ASAN_OPTIONS"] = "use_sigaltstack=0"
self.env_variables["TSAN_OPTIONS"] = "use_sigaltstack=0"
# In integration tests we spawn multiple servers, so let's aim for no more than 5GiB
self.env_variables["TSAN_OPTIONS"] = f"use_sigaltstack=0 memory_limit_mb=5120"
I think it is OK. In integration tests we spawn multiple servers, i.e. we can have 10+ running in parallel. This leaves us with ~6GiB per server, so a 5GiB limit for TSan looks reasonable.
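The arithmetic behind that estimate can be checked with a small sketch. The thread does not state the host's RAM size; the 64GiB figure below is purely an assumption chosen so that 10 servers end up with roughly 6GiB each, and the function names are hypothetical.

```python
def per_server_budget_mib(host_ram_gib: int, servers: int) -> int:
    """Even share of host RAM per server, in MiB."""
    return host_ram_gib * 1024 // servers


def headroom_mib(host_ram_gib: int, servers: int, tsan_limit_mb: int = 5120) -> int:
    """MiB left per server for allocations outside TSan's limit (negative if over budget)."""
    return per_server_budget_mib(host_ram_gib, servers) - tsan_limit_mb


if __name__ == "__main__":
    # Assumed values: 64GiB host, 10 parallel servers.
    print(per_server_budget_mib(64, 10), headroom_mib(64, 10))
```

Under these assumed values each server gets 6553 MiB, which leaves 1433 MiB of headroom above the 5120 MiB TSan limit for allocations that the limit does not cover.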
…jobs)

Otherwise, after switching to a different type of instance (in particular, VMs with 32GiB of RAM for sequential stateless runs), clickhouse-server may get killed by the OOM killer [1], and the culprit should be TSan internal memory.

[1]: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=85698&sha=528856cc8dba02ab2b6ecb7024ac6e23b0e4fc8a&name_0=PR&name_1=Stateless%20tests%20%28amd_tsan%2C%20sequential%2C%201%2F2%29
Otherwise, after switching to a different type of instance (in particular, VMs with 32GiB of RAM for sequential stateless runs), clickhouse-server may get killed by the OOM killer [1], and the culprit should be TSan internal memory.
Changelog category (leave one):