
ci: fix MEMORY_LIMIT_EXCEEDED for UBsan jobs (due to 00980_alter_settings_race)#76143

Merged
azat merged 1 commit into ClickHouse:master from azat:ci/fix-ubsan-oom
Feb 15, 2025

Conversation

@azat
Member

@azat azat commented Feb 14, 2025

The problem is that 00980_alter_settings_race spawns 100 clients in parallel, which in the case of UBSan is very heavy (RSS in KiB):

  • ubsan RSS: 655088
  • tsan RSS: 395624
  • no-san RSS: 206160

And with RSS syncing from the cgroup (which includes all processes), some tests running in parallel will get the following error:

```
DB::Exception: (total) memory limit exceeded: would use 1.35 GiB (attempt to allocate chunk of 16777384 bytes), current RSS 55.65 GiB, maximum: 55.59 GiB
```

When I was debugging this on one of the workers, the clickhouse-server RSS was around 30 GiB, so we had about 34 GiB left, which is 0.34 GiB per process for 100 clients, while each UBSan client takes ~650 MiB, so this can obviously lead to the error.
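The arithmetic above can be sketched as follows (a rough back-of-the-envelope check; the figures come from this report, with the cgroup `max_usage_in_bytes` shown below used as the machine total):

```python
KiB, GiB = 1024, 1024 ** 3

total_memory = 63_324_766_208    # cgroup memory.max_usage_in_bytes (~59 GiB)
server_rss   = 30 * GiB          # clickhouse-server RSS observed on the worker
ubsan_client = 655088 * KiB      # RSS of one UBSan clickhouse-client (~0.62 GiB)
n_clients    = 100               # 00980_alter_settings_race spawns 100 clients

headroom   = total_memory - server_rss   # what is left for everything else
per_client = headroom / n_clients        # fair share per client process

print(f"headroom:   {headroom / GiB:.2f} GiB")
print(f"per client: {per_client / GiB:.2f} GiB")
print(f"needed:     {ubsan_client / GiB:.2f} GiB")
# A single UBSan client needs roughly twice its fair share, so 100 of
# them running together easily push the cgroup past the server's limit.
```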

I also looked through a few other reports, and the same tests were running in parallel when this error happened.

So let's simply use the HTTP interface + cURL.
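The shape of the change can be sketched like this (a hypothetical sketch, not the exact diff; the function name and default URL are assumptions). Instead of spawning one sanitizer-instrumented clickhouse-client per query, the test sends queries over the HTTP interface with curl, which costs only a few MiB per process:

```shell
# Hypothetical sketch: a curl process is tiny compared to a
# UBSan-instrumented clickhouse-client (~650 MiB RSS each).
CLICKHOUSE_URL="${CLICKHOUSE_URL:-http://localhost:8123/}"

run_query() {
    # POST the query body to the server's HTTP interface
    curl -sS "${CLICKHOUSE_URL}" --data-binary "$1"
}

# Usage (assumes a running server; table name is illustrative):
#   run_query "ALTER TABLE t MODIFY SETTING parts_to_throw_insert = 100"
```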

TL;DR:

<details>

<summary>cgroup</summary>

```
root@81ce66b61641:/# cat /sys/fs/cgroup/memory/memory.max_usage_in_bytes
63324766208
root@81ce66b61641:/# cat /sys/fs/cgroup/memory/memory.memsw.max_usage_in_bytes
63324766208
root@81ce66b61641:/# cat /sys/fs/cgroup/memory/memory.kmem.max_usage_in_bytes
1040412672
root@81ce66b61641:/# cat /sys/fs/cgroup/memory/memory.stat
cache 6795468800
rss 36551241728
rss_huge 0
shmem 28672
mapped_file 1227235328
dirty 23379968
writeback 0
workingset_refault_anon 0
workingset_refault_file 469295
swap 0
pgpgin 3222775972
pgpgout 3212193274
pgfault 3663593839
pgmajfault 5687
inactive_anon 348160
active_anon 36550852608
inactive_file 5145579520
active_file 1074384896
unevictable 575365120
hierarchical_memory_limit 9223372036854771712
hierarchical_memsw_limit 9223372036854771712
total_cache 6795468800
total_rss 36551241728
total_rss_huge 0
total_shmem 28672
total_mapped_file 1227235328
total_dirty 23379968
total_writeback 0
total_workingset_refault_anon 0
total_workingset_refault_file 469295
total_swap 0
total_pgpgin 3222775972
total_pgpgout 3212193274
total_pgfault 3663593839
total_pgmajfault 5687
total_inactive_anon 348160
total_active_anon 36550852608
total_inactive_file 5145579520
total_active_file 1074384896
total_unevictable 575365120
```

</details>
<details>

<summary>atop</summary>

```
$ atop -r /var/log/atop/atop_20250213

PRC| sys   19m00s |               | user  29m16s |              |              | #proc    291  |              | #trun      2 |              | #tslpi  2639  | #tslpu   106 |              | #zombie    0 |               | clones 136e3 |              |              |  #exit >62415
NPROCS               SYSCPU                USRCPU                 VSIZE                 RSIZE                 PSIZE                LOCKSZ               SWAPSZ                 RDDSK                 WRDSK                 RNET                 SNET                  MEM                CMD        1/4
     1                1m55s                15m49s                 57.5G                 30.2G                 30.1G                548.7M                   0B                  1.0G                  2.0G                    0                    0                  49%                clickhouse-ser
     1                0.06s                 0.04s                  5.1G                  2.5G                  2.5G                  0.0K                   0B                  4.2M                    0B                    0                    0                   4%                gdb
     1                2.57s                 9.31s                  2.0G                502.6M                504.6M                  0.0K                   0B                 34.3M                 49.3M                    0                    0                   1%                minio
```

</details>
Here you can see that #exit > 62415, so obviously lots of short-lived processes were running in parallel.

<details>

<summary>CGroupMemoryUsed</summary>

```
select event_time, value from asynchronous_metric_log_11790954822280091442 where event_date in ('2025-02-11', '2025-02-12') and pull_request_number = 74834 and check_name = 'Stateless tests (ubsan)' and metric = 'CGroupMemoryUsed' order by 2

2860. │ 2025-02-12 00:33:56 │ 52544606208 │
2861. │ 2025-02-12 00:06:20 │ 54021611520 │
2862. │ 2025-02-12 00:06:21 │ 58855071744 │
2863. │ 2025-02-12 00:06:22 │ 61053251584 │
2864. │ 2025-02-12 00:06:22 │ 61433298944 │
      └──────────event_time─┴───────value─┘

2864 rows in set. Elapsed: 0.076 sec. Processed 49.14 thousand rows, 322.29 KB (642.89 thousand rows/s., 4.22 MB/s.)
Peak memory usage: 14.64 MiB.
```

</details>
<details>

<summary>OSThreadsRunnable</summary>

```
SELECT
    event_time,
    value
FROM asynchronous_metric_log_11790954822280091442
WHERE (event_date IN ('2025-02-11', '2025-02-12')) AND (pull_request_number = 74834) AND (check_name = 'Stateless tests (ubsan)') AND ((event_time >= '2025-02-12 00:06:00') AND (event_time <= '2025-02-12 00:06:30')) AND (metric = 'OSThreadsRunnable')
ORDER BY 1 ASC

    ┌──────────event_time─┬─value─┐
 1. │ 2025-02-12 00:06:01 │    13 │
 2. │ 2025-02-12 00:06:01 │    12 │
 3. │ 2025-02-12 00:06:02 │    13 │
 4. │ 2025-02-12 00:06:03 │    12 │
 5. │ 2025-02-12 00:06:04 │    12 │
 6. │ 2025-02-12 00:06:06 │    13 │
 7. │ 2025-02-12 00:06:06 │    12 │
 8. │ 2025-02-12 00:06:08 │    18 │
 9. │ 2025-02-12 00:06:08 │     9 │
10. │ 2025-02-12 00:06:09 │     9 │
11. │ 2025-02-12 00:06:10 │    10 │
12. │ 2025-02-12 00:06:11 │    10 │
13. │ 2025-02-12 00:06:12 │    10 │
14. │ 2025-02-12 00:06:13 │    10 │
15. │ 2025-02-12 00:06:14 │     9 │
16. │ 2025-02-12 00:06:15 │     8 │
17. │ 2025-02-12 00:06:17 │     8 │
18. │ 2025-02-12 00:06:17 │    10 │
19. │ 2025-02-12 00:06:18 │     8 │
20. │ 2025-02-12 00:06:20 │   166 │
21. │ 2025-02-12 00:06:21 │   162 │
22. │ 2025-02-12 00:06:22 │   162 │
23. │ 2025-02-12 00:06:22 │   158 │
24. │ 2025-02-12 00:06:24 │    13 │
25. │ 2025-02-12 00:06:24 │     9 │
26. │ 2025-02-12 00:06:25 │    10 │
27. │ 2025-02-12 00:06:27 │    10 │
28. │ 2025-02-12 00:06:28 │     8 │
29. │ 2025-02-12 00:06:28 │     9 │
30. │ 2025-02-12 00:06:30 │     8 │
    └─────────────────────┴───────┘
```

</details>
<details>

<summary>parallel tests</summary>

```
SELECT DISTINCT log_comment
FROM query_log_14189064888146039029
WHERE (event_date IN ('2025-02-11', '2025-02-12')) AND (pull_request_number = 74834) AND (check_name = 'Stateless tests (ubsan)') AND ((event_time >= '2025-02-12 00:06:18') AND (event_time <= '2025-02-12 00:06:20')) AND (log_comment LIKE '%.sh')

   ┌─log_comment─────────────────────────────────────────────────┐
1. │ 02124_insert_deduplication_token_multiple_blocks_replica.sh │
2. │ 02473_optimize_old_parts.sh                                 │
3. │ 02766_connection_string.sh                                  │
4. │ 00980_alter_settings_race.sh                                │
5. │ 01825_type_json_7.sh                                        │
6. │ 02534_parquet_fixed_binary_array.sh                         │
   └─────────────────────────────────────────────────────────────┘
```

</details>

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Fixes: #75966
Fixes: #45372

@azat azat added the 🍃 green ci 🌿 Fixing flaky tests in CI label Feb 14, 2025
@clickhouse-gh
Contributor

clickhouse-gh bot commented Feb 14, 2025

Workflow [PR], commit [bd6fa32]

@clickhouse-gh clickhouse-gh bot added the pr-ci label Feb 14, 2025
@yariks5s yariks5s self-assigned this Feb 14, 2025
@azat azat changed the title "ci: fix MEMORY_LIMIT_EXCEEDED due to 00980_alter_settings_race" to "ci: fix MEMORY_LIMIT_EXCEEDED for UBsan jobs (due to 00980_alter_settings_race)" Feb 14, 2025
@azat
Member Author

azat commented Feb 15, 2025

CI failures are unrelated (the perf tests will be fixed, the FreeBSD build will be fixed, and there is one logical error that already has an issue), and this PR is a prerequisite for making CI more stable, so I will merge it.

@azat azat added this pull request to the merge queue Feb 15, 2025
Merged via the queue into ClickHouse:master with commit 4e58521 Feb 15, 2025
112 of 116 checks passed
@azat azat deleted the ci/fix-ubsan-oom branch February 15, 2025 07:55
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Feb 15, 2025
@azat azat mentioned this pull request Jun 18, 2025
1 task
azat added a commit to azat/ClickHouse that referenced this pull request Jun 22, 2025
Jobs with sanitizers show lots of OOMs; last time (ClickHouse#76143) it happened because a test spawned tons of clients, and a sanitizer-built binary takes ~650 MiB of RAM each

Refs: ClickHouse#82036
azat added a commit to azat/ClickHouse that referenced this pull request Jun 22, 2025
Jobs with sanitizers show lots of OOMs; last time (ClickHouse#76143) it happened because a test spawned tons of clients, and a sanitizer-built binary takes ~650 MiB of RAM each

Refs: ClickHouse#82036
Resubmit of: ClickHouse#76388
azat added a commit to azat/ClickHouse that referenced this pull request Jun 22, 2025
Jobs with sanitizers show lots of OOMs; last time (ClickHouse#76143) it happened because a test spawned tons of clients, and a sanitizer-built binary takes ~650 MiB of RAM each

This is a resubmit of ClickHouse#76388, which did not work because it used an address-space limit, which is a very different thing, especially for binaries built with sanitizers

Refs: ClickHouse#82036
azat added a commit to azat/ClickHouse that referenced this pull request Aug 16, 2025
In the case of sanitizers the binary is pretty heavy (and can even lead to OOM, see ClickHouse#76143); it is better to send queries via HTTP with curl.
azat added a commit to azat/ClickHouse that referenced this pull request Jan 1, 2026
Jobs with sanitizers show lots of OOMs; last time (ClickHouse#76143) it happened because a test spawned tons of clients, and a sanitizer-built binary takes ~650 MiB of RAM each

This is a resubmit of ClickHouse#76388, which did not work because it used an address-space limit, which is a very different thing, especially for binaries built with sanitizers

Refs: ClickHouse#82036

v2: use cgroup v2
