Hopelessly fix test_userspace_page_cache flakiness#93653

Merged
alexey-milovidov merged 12 commits into master from pcflak
Feb 18, 2026

@al13n321
Member

@al13n321 al13n321 commented Jan 8, 2026

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Closes #92761

@clickhouse-gh
Contributor

clickhouse-gh bot commented Jan 8, 2026

Workflow [PR], commit [732d355]

Summary:

@clickhouse-gh clickhouse-gh bot added the pr-ci label Jan 8, 2026
@alexey-milovidov
Member

@al13n321

File: test_userspace_page_cache/test.py:206 - in test_size_adjustment
assert initial_cache_size > 50e6
E assert 33559712 > 50000000.0

@al13n321
Member Author

I don't understand the output in CI. https://s3.amazonaws.com/clickhouse-test-reports/PRs/93653/1322a027cf43343fdaec5765054d63bf127fd9e4//integration_tests_amd_asan_targeted/job.log says test_size_adjustment was SKIPPED 10 times, then got some docker ERROR at teardown:

[2026-01-13 23:53:51] ==================================== ERRORS ====================================
[2026-01-13 23:53:51] _______________ ERROR at teardown of test_size_adjustment[10-10] _______________
[2026-01-13 23:53:51] [gw2] linux -- Python 3.10.12 /usr/bin/python3
[2026-01-13 23:53:51] helpers/cluster.py:3930: in shutdown
[2026-01-13 23:53:51]     run_and_check(self.base_cmd + ["stop"])
[2026-01-13 23:53:51] helpers/cluster.py:170: in run_and_check
[2026-01-13 23:53:51]     res = subprocess.run(
[2026-01-13 23:53:51] /usr/lib/python3.10/subprocess.py:505: in run
[2026-01-13 23:53:51]     stdout, stderr = process.communicate(input, timeout=timeout)
[2026-01-13 23:53:51] /usr/lib/python3.10/subprocess.py:1154: in communicate
[2026-01-13 23:53:51]     stdout, stderr = self._communicate(input, endtime, timeout)
[2026-01-13 23:53:51] /usr/lib/python3.10/subprocess.py:2022: in _communicate
[2026-01-13 23:53:51]     self._check_timeout(endtime, orig_timeout, stdout, stderr)
[2026-01-13 23:53:51] /usr/lib/python3.10/subprocess.py:1198: in _check_timeout
[2026-01-13 23:53:51]     raise TimeoutExpired(
[2026-01-13 23:53:51] E   subprocess.TimeoutExpired: Command '['docker', 'compose', '--env-file', '/home/ubuntu/actions-runner/_work/ClickHouse/ClickHouse/tests/integration/test_userspace_page_cache/_instances-gw2/.env', '--project-name', 'roottestuserspacepagecache-gw2', '--file', '/home/ubuntu/actions-runner/_work/ClickHouse/ClickHouse/tests/integration/test_userspace_page_cache/_instances-gw2/node1/docker-compose.yml', '--file', '/home/ubuntu/actions-runner/_work/ClickHouse/ClickHouse/tests/integration/helpers/../../../tests/integration/compose/docker_compose_minio.yml', '--file', '/home/ubuntu/actions-runner/_work/ClickHouse/ClickHouse/tests/integration/test_userspace_page_cache/_instances-gw2/node_smol/docker-compose.yml', 'stop']' timed out after 300 seconds
[2026-01-13 23:53:51] 
[2026-01-13 23:53:51] During handling of the above exception, another exception occurred:
[2026-01-13 23:53:51] test_userspace_page_cache/test.py:32: in started_cluster
[2026-01-13 23:53:51]     cluster.shutdown()
[2026-01-13 23:53:51] helpers/cluster.py:3936: in shutdown
[2026-01-13 23:53:51]     run_and_check(self.base_cmd + ["kill"])
[2026-01-13 23:53:51] helpers/cluster.py:191: in run_and_check
[2026-01-13 23:53:51]     raise Exception(
[2026-01-13 23:53:51] E   Exception: Command [docker compose --env-file /home/ubuntu/actions-runner/_work/ClickHouse/ClickHouse/tests/integration/test_userspace_page_cache/_instances-gw2/.env --project-name roottestuserspacepagecache-gw2 --file /home/ubuntu/actions-runner/_work/ClickHouse/ClickHouse/tests/integration/test_userspace_page_cache/_instances-gw2/node1/docker-compose.yml --file /home/ubuntu/actions-runner/_work/ClickHouse/ClickHouse/tests/integration/helpers/../../../tests/integration/compose/docker_compose_minio.yml --file /home/ubuntu/actions-runner/_work/ClickHouse/ClickHouse/tests/integration/test_userspace_page_cache/_instances-gw2/node_smol/docker-compose.yml kill] return non-zero code 1:  Container roottestuserspacepagecache-gw2-proxy2-1  Killing
[2026-01-13 23:53:51] E    Container roottestuserspacepagecache-gw2-proxy2-1  Error while Killing
[2026-01-13 23:53:51] E   Error response from daemon: cannot kill container: 800c7a00b263d4a1f380b4e9231a50a463dcef5f97a08e9fb438454d5d5eea7b: container 800c7a00b263d4a1f380b4e9231a50a463dcef5f97a08e9fb438454d5d5eea7b is not running
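For readers following the traceback above: it shows a two-stage teardown, where a `stop` command hits its timeout and the fallback `kill` then fails as well. The pattern can be sketched roughly as follows. This is only an illustration under assumptions, not the code in `helpers/cluster.py`; the function name `stop_then_kill` and its parameters are hypothetical.

```python
import subprocess
import sys


def stop_then_kill(stop_cmd, kill_cmd, stop_timeout):
    """Try a graceful stop; if it times out, fall back to a forced kill.

    Hypothetical helper: the commands are passed in as argument lists so the
    sketch does not depend on docker being installed.
    """
    try:
        subprocess.run(stop_cmd, check=True, timeout=stop_timeout)
        return "stopped"
    except subprocess.TimeoutExpired:
        # The fallback can fail too (for example, if the container is already
        # gone). check=True turns a non-zero exit code into an exception, so
        # the teardown error stays visible.
        subprocess.run(kill_cmd, check=True)
        return "killed"
```

In this sketch a failing fallback raises `subprocess.CalledProcessError` while the timeout is being handled, which produces the same "During handling of the above exception, another exception occurred" chain seen in the log.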

Why skipped? What causes the error? On my machine python -m ci.praktika run 'Integration tests (amd_binary, 5/5)' --test test_userspace_page_cache/test.py::test_size_adjustment succeeded 100/100 times.

@maxknv does this look familiar by any chance?

@maxknv
Member

maxknv commented Jan 14, 2026

> Why skipped? What causes the error? On my machine python -m ci.praktika run 'Integration tests (amd_binary, 5/5)' --test test_userspace_page_cache/test.py::test_size_adjustment succeeded 100/100 times.

skipped because of this


This job basically just starts the cluster at the test suite setup and stops it on teardown. Stopping the cluster hangs for some reason, so the failure is not exactly in the test case.

No problems in the server logs? You can try running the job locally: python -m ci.praktika run 'Integration tests (amd_asan, targeted)' --test test_userspace_page_cache

@alexey-milovidov alexey-milovidov changed the title from "Hopefully fix test_userspace_page_cache flakiness" to "Hopelessly fix test_userspace_page_cache flakiness" Jan 28, 2026
@alexey-milovidov
Member

@al13n321, is there some hope?

alexey-milovidov and others added 7 commits February 15, 2026 09:20
Don't create `node_smol` container in sanitizer builds at all.
Previously the test was skipped inside the test body, but the container
was still started and had to be shut down. Under ASan the leak checker
at exit is very slow with high `max_server_memory_usage`, causing
`docker compose stop` to hang.

Detect sanitizer builds early using `clickhouse local` (no server needed)
and conditionally skip adding the instance before `cluster.start`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
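The idea in the commit message above, deciding before `cluster.start` whether `node_smol` exists at all, might look roughly like the sketch below. `FakeCluster` is a hypothetical stand-in for the integration test cluster helper so that the example runs on its own; the instance settings used by the real test are omitted.

```python
class FakeCluster:
    """Hypothetical stand-in for the integration test cluster helper."""

    def __init__(self):
        self.instances = {}

    def add_instance(self, name, **settings):
        # Only registers the instance; nothing is started here.
        self.instances[name] = settings
        return name


def build_cluster(sanitizer_build: bool) -> FakeCluster:
    """Register instances, leaving out node_smol in sanitizer builds."""
    cluster = FakeCluster()
    cluster.add_instance("node1")
    if not sanitizer_build:
        # Never registered in sanitizer builds, so there is no container
        # to start or to shut down at teardown.
        cluster.add_instance("node_smol")
    return cluster
```

A test that needs `node_smol` would then be skipped whenever the instance was not registered, instead of being skipped inside the test body after the container has already been started.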
The `_is_sanitizer_build` function used non-existent `hasAddressSanitizer()`
function, causing `clickhouse local` to fail and always return False.
This meant sanitizer builds were never detected, so `node_smol` was always
created and `test_size_adjustment` was never skipped in ASan/TSan runs.

Fix by querying `system.build_options` for `CXX_FLAGS` containing
`-fsanitize=`, which is the same approach used by the integration test
framework's `is_built_with_sanitizer` method.
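A minimal sketch of that check, assuming the binary is reachable as `clickhouse` on `PATH`. The helper names below are illustrative and are not the ones used in the test. Failures of `clickhouse local` are allowed to raise instead of being mapped to `False`, so that a broken query cannot silently disable the detection.

```python
import subprocess

BUILD_FLAGS_QUERY = (
    "SELECT value FROM system.build_options WHERE name = 'CXX_FLAGS'"
)


def flags_indicate_sanitizer(cxx_flags: str) -> bool:
    """Return True if the compiler flags enable any sanitizer."""
    return "-fsanitize=" in cxx_flags


def is_sanitizer_build(binary: str = "clickhouse") -> bool:
    """Ask `clickhouse local` for the build flags; no server is needed.

    Raises if the binary is missing or the query fails (check=True).
    """
    result = subprocess.run(
        [binary, "local", "--query", BUILD_FLAGS_QUERY],
        capture_output=True,
        text=True,
        check=True,
        timeout=60,
    )
    return flags_indicate_sanitizer(result.stdout)
```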

Also add proper cleanup: drop table at test start (to handle previous
failed runs in the flaky runner) and use try/finally for cleanup to
prevent cascading "Table already exists" failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
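The cleanup described in this commit message (drop the table up front, then drop it again in a `finally` block) can be sketched as follows. `RecordingNode`, `run_with_table`, the table name and the table schema are all hypothetical, chosen only so the example runs without a server.

```python
class RecordingNode:
    """Hypothetical stand-in for a cluster node; it only records queries."""

    def __init__(self):
        self.queries = []

    def query(self, sql):
        self.queries.append(sql)


def run_with_table(node, body, table="page_cache_demo"):
    """Run body(node) with a fresh table and always drop it afterwards."""
    # Handles leftovers from a previous failed run on the same node.
    node.query(f"DROP TABLE IF EXISTS {table}")
    try:
        node.query(
            f"CREATE TABLE {table} (x UInt64) ENGINE = MergeTree ORDER BY x"
        )
        body(node)
    finally:
        # Runs even if body raises, so a repeated run does not hit
        # "Table already exists".
        node.query(f"DROP TABLE IF EXISTS {table}")
```

If the test body raises, the exception still propagates, but the final `DROP TABLE IF EXISTS` has already run by then.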
…hing type

When `FunctionVariantAdaptor` executes a function on a Variant column
that contains a single variant (e.g., `Array(Nothing)` from an empty
array literal `[]`), the nested function can return a result of type
`Nothing`. The code only checked for `Nullable(Nothing)` via `onlyNull`
but missed plain `Nothing`, causing a failed cast to the expected
`Variant(...)` result type and a `LOGICAL_ERROR` exception.

Add `isNothing` checks alongside existing `onlyNull` checks in all
three execution paths (single variant no NULLs, single variant with
NULLs, multiple variants) to treat `Nothing` results as defaults/NULLs.

https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=42be5daa2cfd617b45ee36eeec6d72fd405fba41&name_0=MasterCI&name_1=AST%20fuzzer%20%28amd_debug%29

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rom test

The setting is obsolete (always true). The test already has `SET enable_analyzer = 1`
which is needed for Variant type inference in UNION ALL.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@alexey-milovidov alexey-milovidov added this pull request to the merge queue Feb 18, 2026
Merged via the queue into master with commit a2c3e8b Feb 18, 2026
147 checks passed
@alexey-milovidov alexey-milovidov deleted the pcflak branch February 18, 2026 10:54
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Feb 18, 2026
Development

Successfully merging this pull request may close these issues.

Flaky test: test_userspace_page_cache/test.py::test_size_adjustment

4 participants