Replace SANITIZE_COVERAGE with LLVM source-based coverage for per-test coverage collection by fm4v · Pull Request #99513 · ClickHouse/ClickHouse

fm4v · 2026-03-14T22:06:01Z

Motivation

The existing nightly coverage pipeline used -fsanitize-coverage=trace-pc-guard,pc-table (SANITIZE_COVERAGE), which required custom __sanitizer_cov_* callbacks, a 6 GB unstripped binary as a build artifact, offline DWARF queries via clickhouse local, and a complex two-stage export involving Python symbol normalization with Pool.imap and sort -u.

Test selection based on coverage queried checks_coverage_inverted2 using normalized C++ symbol names, which required a 200-line normalization function to strip return types, argument lists, and template arguments — and still had false positives due to basename-only DWARF matching.

Client-side coverage via CLICKHOUSE_WRITE_COVERAGE was broken in practice: the dumped addresses were virtual addresses without the binary load base subtracted, so addressToSymbol could not resolve them and the symbol arrays were always empty.

What changed

Replace SANITIZE_COVERAGE with LLVM's standard source-based coverage (-fprofile-instr-generate -fcoverage-mapping, WITH_COVERAGE).

In-process coverage mapping reader. At server startup, readLLVMCoverageMapping("/proc/self/exe") parses the binary's own __llvm_covfun and __llvm_covmap ELF sections. It builds a map from NameRef (MD5 hash of the mangled function name, from __llvm_profile_data) to (file, line_start, line_end) using LLVM's coverage mapping format. No build artifacts or external tools are required at runtime.
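As a rough illustration of the lookup described above, the following Python sketch builds a NameRef-keyed map and resolves covered functions to line ranges. It assumes NameRef is the low 64 bits of the MD5 digest read little-endian, which is how LLVM's profile name hash is commonly described; the mangled names, file paths, and the `name_ref` / `resolve` helpers are hypothetical and not taken from the PR.

```python
import hashlib


def name_ref(mangled_name: str) -> int:
    """Low 64 bits of the MD5 digest, read little-endian (assumed hash layout)."""
    digest = hashlib.md5(mangled_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


# Hypothetical mapping records: (mangled name, file, line_start, line_end).
records = [
    ("_Z3foov", "src/example.cpp", 5, 19),
    ("_Z3bazv", "src/other.cpp", 40, 62),
]

# NameRef -> (file, line_start, line_end), built once at startup.
coverage_map = {name_ref(n): (f, s, e) for n, f, s, e in records}


def resolve(covered_refs):
    """Translate NameRefs with non-zero counters into source line ranges."""
    return sorted(coverage_map[r] for r in covered_refs if r in coverage_map)
```

NameRefs that are absent from the map are skipped rather than raising, which keeps the sketch tolerant of functions that have counters but no mapping record.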

New SQL command SYSTEM SET COVERAGE TEST 'name'. Before each test, the test runner calls this command, which atomically: flushes the previous test's coverage into system.coverage_log, resets LLVM profiling counters via __llvm_profile_reset_counters, and arms the new test name. Replaces the old 3-step RESET → test → INSERT sequence.
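The flush, reset, and arm sequence can be modelled with a small in-memory class. This is only a sketch of the semantics described above, not the server code: `CoverageCollector`, its dictionary of counters, and the list standing in for system.coverage_log are hypothetical, and it assumes an empty name means "flush and leave nothing armed".

```python
import threading


class CoverageCollector:
    """Toy model of SYSTEM SET COVERAGE TEST: flush, reset, arm under one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current_test = ""
        self._counters = {}  # name_ref -> hit count (stand-in for LLVM counters)
        self.log = []        # stand-in for system.coverage_log rows

    def hit(self, name_ref):
        with self._lock:
            self._counters[name_ref] = self._counters.get(name_ref, 0) + 1

    def set_test(self, name):
        with self._lock:
            # 1. Flush the previous test's coverage, if a test was armed.
            if self._current_test:
                covered = sorted(r for r, c in self._counters.items() if c > 0)
                self.log.append((self._current_test, covered))
            # 2. Reset counters so the next test starts from zero.
            self._counters.clear()
            # 3. Arm the new test name (an empty name arms nothing).
            self._current_test = name
```

Holding a single lock across all three steps is what makes the operation atomic in this model: no hit can land between the flush and the reset.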

system.coverage_log new schema. Columns files Array(String), line_starts Array(UInt32), line_ends Array(UInt32) replace coverage Array(UInt64) and symbol Array(String).

Simplified export. A single ARRAY JOIN-based INSERT INTO FUNCTION remoteSecure(...) replaces the two-stage Python export (raw symbols + normalized symbols). Target table: checks_coverage_lines (new CIDB table, keyed by (check_start_time, file, line_start)).

Line-based test selection. find_tests.py now queries checks_coverage_lines with endsWith(file, basename) AND line_start <= N AND line_end >= N for each changed line in the PR diff. Tests are ranked by how many changed lines they cover — tests covering more of the diff appear first. Eliminates find_symbols.py, normalize_symbol, and the DWARF binary query (was 85–180 s per run).
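The ranking rule can be sketched in a few lines of Python over in-memory rows. The row shape mirrors the columns named above, but `rank_tests` and the sample data are hypothetical; the real script queries CIDB instead of iterating a list.

```python
from collections import Counter


def rank_tests(rows, changed):
    """Rank tests by the number of changed lines they cover.

    rows:    iterable of (test, file, line_start, line_end)
    changed: iterable of (basename, line) taken from the diff
    """
    hits = Counter()
    for basename, line in changed:
        # Same predicate as the query: endsWith(file, basename)
        # AND line_start <= N AND line_end >= N.
        covering = {
            test
            for test, file, start, end in rows
            if file.endswith(basename) and start <= line <= end
        }
        hits.update(covering)  # each test counts at most once per changed line
    # Most covered lines first; ties broken by name for stable output.
    return sorted(hits, key=lambda t: (-hits[t], t))
```

Because matching is by basename suffix, two files with the same name in different directories would both match; the sketch keeps that behavior rather than hiding it.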

Validated locally

Ran 100 stateless tests with per-test coverage collection:

100 rows in system.coverage_log, one per test
0 inverted line ranges; line numbers in range [5, 10900], avg span ~14 lines
Test discrimination confirmed: the files covered by SELECT 1 are a strict subset of those covered by ALTER TABLE (1372 files are unique to the ALTER TABLE test); hot-path files (ProfileEvents.cpp, MergeTreeBackgroundExecutor.cpp) appear in all 100 tests
InterpreterAlterQuery.cpp line 100 is covered by exactly one test: 00030_alter_table

What disappears

SANITIZE_COVERAGE cmake option and __sanitizer_cov_* callbacks
6 GB build artifact (binary no longer needed at test time)
Python symbol normalization (200 lines + 61 test cases)
find_symbols.py (DWARF query taking 85–180 s)
checks_coverage_inverted and checks_coverage_inverted2 CIDB tables (can be dropped after one month)
Client-side coverage file dump via CLICKHOUSE_WRITE_COVERAGE (was broken anyway — dumped virtual addresses without subtracting the binary load base, so addressToSymbol could not resolve them and symbol arrays were always empty)

Changelog category (leave one):

Build/Testing/Packaging Improvement

Changelog entry:

Replace SANITIZE_COVERAGE (custom sanitizer callbacks, symbol-level granularity) with LLVM source-based coverage (WITH_COVERAGE, -fprofile-instr-generate -fcoverage-mapping) for the nightly per-test coverage pipeline. The server now reads its own coverage mapping from ELF sections at startup and collects (file, line_start, line_end) tuples per test via a new SYSTEM SET COVERAGE TEST 'name' command. Test selection in targeted CI checks uses line-range queries against a new checks_coverage_lines CIDB table and ranks candidate tests by how many changed diff lines they cover.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

clickhouse-gh · 2026-03-14T22:08:04Z

Workflow [PR], commit [52e50a8]

Summary: ⏳

job_name	test_name	status	info
Stateless tests (amd_tsan, flaky check)		failure
	02916_another_move_partition_inactive_replica	FAIL	cidb
Integration tests (amd_llvm_coverage, 2/5)		failure
	test_overcommit_tracker/test.py::test_user_overcommit	FAIL	cidb
Integration tests (amd_asan_ubsan, targeted)		error

AI Review

Summary

This PR migrates per-test coverage collection from SANITIZE_COVERAGE symbol-based flow to LLVM source-based line coverage, adds SYSTEM SET COVERAGE TEST, and rewires CI targeting logic. The direction is good, but there are correctness issues that can corrupt or silently degrade collected coverage data, so this should not merge as-is.

Findings

❌ Blockers

[base/base/coverage.cpp:338] getCurrentIndirectCalls resets node->count before storing it, then pushes node->count; emitted call_count becomes zero, breaking indirect-call weighting/diagnostics.
Suggested fix: save node->count into a local variable before reset and write that saved value.

⚠️ Majors

[src/Parsers/ParserSystemQuery.cpp:396-398] SYSTEM SET COVERAGE TEST accepts missing string literal (no return false on parse failure), so malformed SYSTEM SET COVERAGE TEST silently triggers flush/reset with empty test name instead of syntax exception.
Suggested fix: make string literal mandatory for this branch.
[tests/clickhouse-test:2907-2911] Failure of SYSTEM SET COVERAGE TEST is swallowed (print + continue). In per-test coverage mode this can produce stale/incomplete coverage while the run still appears successful.
Suggested fix: fail fast in coverage mode (re-raise or return failing TestResult).

Tests

⚠️ Add/extend parser tests to assert SYSTEM SET COVERAGE TEST without a string literal fails with syntax exception.
⚠️ Add a regression test validating system.coverage_indirect_calls.call_count is non-zero for a known indirect-call scenario (ensures count is captured before reset).
⚠️ Add a runner-level test for per-test coverage mode that verifies a failed SYSTEM SET COVERAGE TEST call fails the job/test run rather than continuing silently.

ClickHouse Rules

Item	Status	Notes
Deletion logging	➖
Serialization versioning	➖
Core-area scrutiny	✅
No test removal	✅
Experimental gate	➖
No magic constants	✅
Backward compatibility	✅
`SettingsChangesHistory.cpp`	➖
PR metadata quality	✅
Safe rollout	⚠️	Coverage collection can silently degrade when `SYSTEM SET COVERAGE TEST` fails in runner.
Compilation time	✅

Final Verdict

Status: ⚠️ Request changes
Minimum required actions:
- Fix call_count capture in getCurrentIndirectCalls.
- Make SYSTEM SET COVERAGE TEST require a string literal in parser.
- Stop swallowing SYSTEM SET COVERAGE TEST failures in per-test coverage mode in tests/clickhouse-test.

ci/jobs/scripts/find_tests.py

ci/jobs/scripts/functional_tests/export_coverage.py

ci/jobs/scripts/find_tests.py

fm4v · 2026-03-16T07:24:58Z

FIRST IMPLEMENTATION WITH SYMBOLS NORMALIZATION (REMOVED)
Intermediate results, see a couple more opportunities to improve it


PR #99481
     Only in OLD (11) — noise removed by new (non-text-index tests):
       - 00282_merging.sql ← NOISE
       - 01167_isolation_hermitage.sh ← NOISE
       - 01169_alter_partition_isolation_stress.sh ← NOISE
       - 01171_mv_select_insert_isolation_long.sh ← NOISE
       - 01174_select_insert_isolation.sh ← NOISE
       - 02074_http_flush_compressed_buffers_on_cancel.sh ← NOISE
       - 02122_4letter_words_stress_zookeeper.sh ← NOISE
       - 02479_race_condition_between_insert_and_droppin_mv.sh ← NOISE
       - 02561_sorting_constants_and_distinct_crash.sql ← NOISE
       - 02951_parallel_parsing_json_compact_each_row.sh ← NOISE
       - 03642_system_instrument_stress.sh ← NOISE

     Only in NEW (7) — extra coverage gained:
       + 03151_unload_index_race.sh ← other
       + 03911_context_getAccess_race.sh ← other
       + 04024_fold_utf8.sql ← other
       + 04034_autopr_dataflow_cache_reuse_between_different_queries.sql ← other
       + 04035_text_index_map_values_in.sql ← text-index
       + 04036_text_index_map_keys_values_in.sql ← text-index
       + 04037_text_index_map_empty_values_in.sql ← text-index

  Summary: New implementation is more precise (removes 11 noise tests) while slightly improving recall (adds 3 genuine text-index map tests). The 5% frequency threshold correctly filters out
  the hot-path noise.



PR 99481
     Only in OLD (2) — potentially missed by new:
       - 03221_key_condition_bug.sql
       - 04039_text_index_direct_read_select_alias.sql

     Only in NEW (11) — extra coverage gained:
       + 01542_dictionary_load_exception_race.sh
       + 02346_text_index_bug89605.sql
       + 02346_text_index_function_hasAnyAllTokens_partially_materialized.sql
       + 02346_text_index_mark_file_compatibility.sql
       + 02346_text_index_match_predicate.sql
       + 02346_text_index_queries.sql
       + 02479_race_condition_between_insert_and_droppin_mv.sh
       + 03640_skip_indexes_data_types_with_or.sql
       + 04024_fold_utf8.sql
       + 04035_text_index_map_values_in.sql
       + 04038_text_index_preprocessor_type_validation.sql

  The new implementation finds 9 more relevant tests (72 vs 63, about 14% more) while missing 2 that were likely present due to CIDB data differences (date cutoff / missing test from snapshot). The quality
  is comparable with slightly better recall.



● PR 99354 "Fix UInt64 overflow in ParserSampleRatio for large denominators" — parser fix, should match query parsing tests:
     Only in OLD (7) — removed by new:
       - 00938_fix_rwlock_segfault_long.sh ← NOISE
       - 02165_auto_format_by_file_extension.sh ← NOISE
       - 02555_davengers_rename_chain.sh ← NOISE
       - 02703_keeper_map_concurrent_create_drop.sh ← NOISE
       - 02751_protobuf_ipv6.sh ← NOISE
       - 03008_local_plain_rewritable.sh ← NOISE
       - 04038_sample_offset_large_denominator.sql ← sampling-related

     Only in NEW (6) — gained by new:
       + 00825_protobuf_format_splitted_nested.sh ← sampling/related
       + 01542_dictionary_load_exception_race.sh ← other
       + 03151_unload_index_race.sh ← sampling/related
       + 03320_insert_close_connection_on_error.sh ← sampling/related
       + 03833_server_ast_fuzzer.sql ← sampling/related
       + 04034_buffer_sample_no_sampling_key.sql ← sampling/related


…rkaround

LLVM_COVERAGE_BUILD (which works) has no toolchain file and builds compiler-rt from source, so the profile runtime is available. AMD_COVERAGE used the x86_64 toolchain which sets USE_SYSTEM_COMPILER_RT, causing the linker to look for libclang_rt.profile.a at a system path that does not exist in the CI Docker image.

Drop the toolchain file from AMD_COVERAGE to match LLVM_COVERAGE_BUILD. Also revert the sanitize.cmake --print-file-name workaround introduced in the previous commit since it is no longer needed.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

AMD_COVERAGE is an x86 build; it was misconfigured to run on ARM_LARGE. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

system.build_options uses @FULL_CXX_FLAGS_NORMALIZED@ for the CXX_FLAGS row, not @CMAKE_CXX_FLAGS@. This normalized variable does not contain the -DWITH_COVERAGE=1 preprocessor define we set in sanitize.cmake. Query the dedicated WITH_COVERAGE row instead, which stores @WITH_COVERAGE@ and returns 'ON' when the coverage build option is enabled. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Temporary print to confirm WITH_COVERAGE is detected correctly in CI. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- Print build_flags + coverage_active status early in clickhouse-test so the next CI run shows whether WITH_COVERAGE is detected.
- Use verbose=True in CoverageExporter so clickhouse local output (including row counts and errors) appears in the job log.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

After creating system.coverage_log, run a quick self-test: SYSTEM SET COVERAGE TEST '_selftest' → SELECT 1 → SYSTEM SET COVERAGE TEST '' then check if any rows were inserted. Prints 'OK' or 'EMPTY - coverage mapping not working' to surface the issue in the job log. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Every invocation logs: test name received, number of covered NameRefs, coverage map size, and resolved file/line count. Makes the CI server log show exactly what's happening at each stage. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…age label

- Replace AMD_COVERAGE build with LLVM_COVERAGE_BUILD in coverage_build_jobs: no sysroot, compiler-rt built from source (no libclang_rt.profile.a issues).
- Add per_test_coverage label to functional_tests_jobs_coverage parameters: Stateless tests (llvm_coverage_build, per_test_coverage, N/8)
- In functional_tests.py: per_test_coverage → is_coverage=True (new per-test SYSTEM SET COVERAGE TEST path); amd_llvm_coverage still routes to old profraw.
- Add INFO logging to SYSTEM SET COVERAGE TEST and CoverageCollection so CI server logs show NameRef count, map size, and resolved file count.
- Add coverage self-test and build_flags dump in clickhouse-test startup.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

clickhouse-gh · 2026-03-18T23:49:46Z

src/Functions/coverage.cpp

+        if (kind == Kind::Files)
+            return std::make_shared<DataTypeArray>(std::make_shared<DataTypeString>());
+        return std::make_shared<DataTypeArray>(std::make_shared<DataTypeUInt32>());
    }


coverageCurrentFiles / coverageCurrentLineStarts / coverageCurrentLineEnds are documented as returning live coverage data, but this implementation always returns empty arrays.

Because these SQL functions are still registered and user-visible in WITH_COVERAGE builds, this becomes a functional regression (the API appears to work but yields no data). Please either wire these functions to the new NameRef -> (file,line) mapping path or remove/feature-gate them until implemented.

…erage_log stats

- Register per_test_coverage as a known option so functional_tests.py doesn't assert on unknown option.
- After all tests complete, print system.coverage_log row count and distinct test count so the job log shows whether data was collected.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

clickhouse-gh · 2026-03-19T01:00:58Z

src/Parsers/ParserSystemQuery.cpp

+        case Type::SET_COVERAGE_TEST:
+        {
+            ASTPtr ast;
+            if (ParserStringLiteral{}.parse(pos, ast, expected))


SYSTEM SET COVERAGE TEST currently accepts a missing test name because the parser does not fail when ParserStringLiteral doesn't match. This means malformed input like SYSTEM SET COVERAGE TEST silently executes with an empty name and triggers a coverage flush/reset path instead of returning a syntax exception.

Please make the argument mandatory here:

    ASTPtr ast;
    if (!ParserStringLiteral{}.parse(pos, ast, expected))
        return false;
    res->coverage_test_name = ast->as<ASTLiteral &>().value.safeGet<String>();

Keeping explicit empty string support (SYSTEM SET COVERAGE TEST '') still works with this change.
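The intended behavior (a missing literal is a syntax error, an explicit empty string is accepted) can be illustrated with a toy Python parser. This is not the ClickHouse parser: `parse_set_coverage_test` and its regular expression are hypothetical, and the function returns the raw text between the quotes without unescaping.

```python
import re

# The quoted literal is mandatory; its contents may be empty.
_SET_COVERAGE_TEST = re.compile(
    r"\s*SYSTEM\s+SET\s+COVERAGE\s+TEST\s+'((?:[^'\\]|\\.)*)'\s*;?\s*\Z",
    re.IGNORECASE,
)


def parse_set_coverage_test(query: str) -> str:
    """Return the test name, or raise SyntaxError when the literal is missing."""
    match = _SET_COVERAGE_TEST.match(query)
    if match is None:
        raise SyntaxError("expected a string literal after SYSTEM SET COVERAGE TEST")
    return match.group(1)
```

With this shape the caller never sees an implicit empty name: either the query carries a literal, possibly empty, or parsing fails before any flush or reset can run.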

Reduce all remaining ThreadFuzzer parameters:
- Migrate probabilities: 1.0 → 0.5 (halve thread migration on mutex ops)
- Mutex sleep probabilities: 0.001 → 0.0005
- Mutex sleep time max: 10ms → 1ms

…lf as often)

…r overhead

…-quality

# Conflicts:
# .github/workflows/master.yml
# .github/workflows/pull_request.yml
# ci/defs/defs.py
# ci/jobs/build_clickhouse.py

…query

Exclude (commit_sha, check_name) pairs with >= 20 test failures; these indicate a broken build or environment issue rather than genuine per-test flakiness, and would flood the result with noise. The filter was removed earlier; re-adding it as a subquery IN-filter so it composes cleanly with the flat group-by and recency-weighted ordering.
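The exclusion rule can be sketched over in-memory rows as follows. The tuple layout and the `drop_broken_runs` helper are hypothetical; the actual change applies the same idea as a subquery IN-filter in SQL.

```python
from collections import Counter

BROKEN_RUN_THRESHOLD = 20  # threshold taken from the commit message


def drop_broken_runs(failures, threshold=BROKEN_RUN_THRESHOLD):
    """Drop failures from runs that look broken as a whole.

    failures: list of (commit_sha, check_name, test_name) rows.
    """
    per_run = Counter((sha, check) for sha, check, _ in failures)
    broken = {run for run, count in per_run.items() if count >= threshold}
    return [row for row in failures if (row[0], row[1]) not in broken]
```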

…d of a PR

Usage:

    # All uncommitted changes vs HEAD
    PYTHONPATH=ci:. python3 ci/jobs/scripts/find_tests.py --local
    # All branch changes vs master
    PYTHONPATH=ci:. python3 ci/jobs/scripts/find_tests.py --local --base origin/master
    # Last commit only
    PYTHONPATH=ci:. python3 ci/jobs/scripts/find_tests.py --local --base HEAD~1

The --local flag:
- Runs `git diff <base>` (default base: HEAD) to get the diff
- Detects changed test files from the local diff (not gh pr diff)
- Implies --coverage-only (no PR number → previously-failed pass is skipped)
- Makes the `pr` positional argument optional

The PR number remains required for normal (non-local) usage.

…e instead of a PR" This reverts commit 498df6f.

The HTTP URL query parameter has a server-side size limit, causing "Field value too long" errors for large queries (e.g. IN-lists with 1000+ test names in find_tests.py indirect callee pass). ClickHouse HTTP interface accepts the query equally in the POST body, which has no practical size limit. Move the query from params= to data= to fix large-query failures for all CIDB callers.
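A minimal sketch of sending the query in the POST body, using only the standard library rather than whatever HTTP client the CI helpers actually use. `build_cidb_request` and the example URL are hypothetical; `default_format` is used here only as an example of a small setting that can stay in the URL.

```python
import urllib.parse
import urllib.request


def build_cidb_request(url: str, query: str) -> urllib.request.Request:
    """Build a POST request that carries the SQL text in the body."""
    # Small, fixed-size settings stay in the URL ...
    params = urllib.parse.urlencode({"default_format": "JSONEachRow"})
    # ... while the potentially huge query text goes into the request body.
    return urllib.request.Request(
        f"{url}?{params}",
        data=query.encode("utf-8"),
        method="POST",
    )
```

The request is only constructed here, not sent; passing it to `urllib.request.urlopen` would perform the call. The URL length stays constant no matter how long the IN-list in the query grows.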

…callee seeds

Per-hunk SQL ranges for .cpp files:
- Replace bounding-box (min_hunk to max_hunk) with per-hunk OR conditions ±1
- Merge hunks ≤ 5 lines apart to bridge inter-hunk gaps
- Extend .h files to also use per-hunk ranges (same as .cpp) instead of fetching all regions in the file; this prevents massive template headers like FunctionsConversion.h from flooding results with unrelated instantiations

Indirect callee pass:
- Jaccard threshold: lower only for rc < 20 (sparse files); rc ≥ 20 keeps 70% (was: 9% for rc=78, admitting tests sharing only generic callees)
- INDIRECT_LIMIT: inversely proportional to seed count (200/n_seeds×5) so files with many direct hits get fewer indirect additions
- min_depth ≤ 3 as additional seed criterion for high-rc files
- MAX_TESTS_PER_LINE: raise from 150 to 500 for seed selection

Output:
- MAX_OUTPUT_TESTS: 300 → 250
- Always run supplementary keyword pass (was suppressed at >150 tests)
- Keyword guarantee: inject top keyword tests into output tail, replacing lowest-scoring items to ensure domain-specific tests always appear

get_changed_tests: reuse _diff_text when available to detect new test files from pre-fetched diff without extra GitHub API call

Ultra-broad tier-3 expansion: query rc 8001–30000 when both primary and broad-tier2 return zero results
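The per-hunk range idea can be sketched as below. It assumes "≤ 5 lines apart" means the next hunk starts at most 5 lines after the previous one ends, and that ±1 widens each merged range by one line on both sides; `hunk_ranges` and `ranges_to_sql` are hypothetical helper names, not functions from find_tests.py.

```python
def hunk_ranges(hunks, merge_gap=5, pad=1):
    """Merge nearby hunks and widen each merged range by `pad` lines.

    hunks: list of inclusive (start, end) changed-line ranges in one file.
    """
    merged = []
    for start, end in sorted(hunks):
        if merged and start - merged[-1][1] <= merge_gap:
            # Close enough to the previous hunk: bridge the gap.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(max(1, s - pad), e + pad) for s, e in merged]


def ranges_to_sql(ranges):
    """OR together one overlap condition per range instead of one bounding box."""
    return " OR ".join(
        f"(line_start <= {hi} AND line_end >= {lo})" for lo, hi in ranges
    )
```

For hunks at lines 10-12, 15-18, and 100, the bounding box would span lines 10 to 100, while the per-hunk form only matches regions overlapping 9-19 or 99-101.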

clickhouse-gh · 2026-03-31T00:19:42Z

ci/jobs/scripts/find_tests.py

+            kw_extra = [t for t in keyword_guarantee if t not in ranked_set]
+            if kw_extra:
+                # Replace tail items to stay within MAX_OUTPUT_TESTS
+                n = min(len(kw_extra), MAX_OUTPUT_TESTS)


n = min(len(kw_extra), MAX_OUTPUT_TESTS) can discard all coverage-ranked tests when ranked is below the cap and kw_extra is large.

Concrete trace:

len(ranked)=80, len(kw_extra)=300, MAX_OUTPUT_TESTS=250

n=250

ranked[: MAX_OUTPUT_TESTS - n] becomes ranked[:0] (empty)

result is only kw_extra[:250]

So the keyword guarantee can replace strong coverage-backed hits instead of only filling/replacing the tail. Please cap replacement by available tail slots, or split into append-when-below-cap and bounded-tail-replace-when-at-cap.
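One way to implement the suggested split (append when below the cap, bounded tail replacement when at the cap) is sketched below. `apply_keyword_guarantee` and the `max_tail_replace` bound are hypothetical; the review does not specify how many tail items may be replaced.

```python
def apply_keyword_guarantee(ranked, kw_extra, cap, max_tail_replace=10):
    """Merge keyword tests into a coverage-ranked list without evicting its head.

    ranked:   coverage-ranked tests, best first
    kw_extra: keyword tests to guarantee, best first
    """
    result = list(ranked[:cap])
    seen = set(result)
    extra = [t for t in kw_extra if t not in seen]

    free = cap - len(result)
    if free > 0:
        # Below the cap: fill the free slots, never touch coverage-ranked tests.
        result.extend(extra[:free])
        return result

    # At the cap: replace only a bounded number of lowest-ranked tail items.
    n = min(len(extra), max_tail_replace, cap)
    if n > 0:
        result[cap - n:] = extra[:n]
    return result
```

Applied to the trace above (80 ranked tests, 300 keyword tests, cap 250), all 80 coverage-ranked tests survive and the remaining 170 slots are filled with keyword tests.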

clickhouse-gh · 2026-03-31T00:19:51Z

ci/jobs/scripts/find_tests.py

+        # Broad-tier2 guarantee: if the cap cut off high-cov_regions broad-tier2 tests,
+        # append the top few (by cov_regions) that didn't make it — but only up to the cap.
+        broad_guarantee = getattr(self, '_broad_tier2_guarantee', [])
+        if broad_guarantee and len(ranked) < MAX_OUTPUT_TESTS:


The broad-tier2 guarantee block currently runs only when len(ranked) < MAX_OUTPUT_TESTS, but the failure mode described above it is specifically when the cap already cut off high-coverage broad-tier2 tests.

With len(ranked) == MAX_OUTPUT_TESTS, this block is skipped and the guarantee is not applied, so high-cov_regions broad-tier2 tests can still be dropped.

Please apply a bounded tail replacement when at cap so the guarantee actually addresses the capped-list case.
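A sketch of a guarantee that also works at the cap follows. It assumes `ranked` has already been truncated to the cap and that the guarantee list is ordered by cov_regions, highest first; `apply_broad_guarantee` and `top_k` are hypothetical names.

```python
def apply_broad_guarantee(ranked, guarantee, cap, top_k=5):
    """Ensure the top broad-tier2 tests appear, replacing the tail if needed.

    ranked:    output list, already truncated to `cap`, best first
    guarantee: broad-tier2 tests ordered by cov_regions, highest first
    """
    present = set(ranked)
    missing = [t for t in guarantee if t not in present][: min(top_k, cap)]
    # Below the cap this keeps everything and appends; at the cap it
    # drops just enough tail items to make room for the missing tests.
    keep = min(len(ranked), cap - len(missing))
    return list(ranked[:keep]) + missing
```

The same expression covers both cases, so there is no `len(ranked) < MAX_OUTPUT_TESTS` guard that could skip the block when the list is full.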

clickhouse-gh · 2026-03-31T02:34:12Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	84.20%	84.00%	-0.20%
Functions	90.90%	90.90%	+0.00%
Branches	76.70%	76.60%	-0.10%

Changed lines: 35.00% (21/60) · Uncovered code

Full report · Diff report

…onflict

- .claude/docs/relevant_tests.md: update with latest algorithm details (per-hunk ranges, adaptive Jaccard, keyword guarantee, quality metrics)
- .claude/docs/relevant_test_quality.md: new doc summarizing test selection quality from 53-PR analysis (DWARF FPs, ultra-broad exclusions, structural limitations, improvements applied)
- tests/queries/0_stateless/03222_datetime64_small_value_const.sql: fix parallel-run conflict (TABLE_ALREADY_EXISTS on shard_0.dt64_03222). Use {CLICKHOUSE_DATABASE}/{CLICKHOUSE_DATABASE_1} with need-query-parameters tag, test_shard_localhost cluster, single database for all shard data.

…add 20 concrete examples

clickhouse-gh · 2026-03-31T09:56:37Z

base/base/coverage.cpp

+                if (node->value < load_base)
+                    continue;
+                const uint64_t offset = node->value - load_base;
+                result.push_back({data->NameRef, data->FuncHash, offset, node->count});


getCurrentIndirectCalls zeroes node->count before storing it, so every emitted row writes call_count = 0:

    node->count = 0;
    ...
    result.push_back({..., node->count});

This silently destroys the value-profile signal and makes system.coverage_indirect_calls.call_count unusable for weighting or diagnostics.

Please save the counter value first, then reset:

    const uint64_t call_count = node->count;
    node->count = 0;
    ...
    result.push_back({data->NameRef, data->FuncHash, offset, call_count});
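The capture-then-reset order can be checked with a small standalone model, written in Python rather than C++ so it runs without the server. `Node` and `drain_indirect_calls` are hypothetical stand-ins for the value-profile nodes and for getCurrentIndirectCalls; the load-base handling mirrors the diff snippet above.

```python
from dataclasses import dataclass


@dataclass
class Node:
    value: int  # absolute address of the indirect-call target
    count: int  # number of times the target was called


def drain_indirect_calls(nodes, load_base):
    """Return (offset, call_count) rows and reset every counter."""
    rows = []
    for node in nodes:
        call_count = node.count  # capture the value before resetting it
        node.count = 0
        if node.value < load_base:
            continue  # address below the load base cannot belong to the binary
        rows.append((node.value - load_base, call_count))
    return rows
```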

clickhouse-gh · 2026-03-31T10:24:09Z

tests/clickhouse-test

            client_options = self.add_effective_settings(client_options)

+            if args.collect_per_test_coverage and BuildFlags.WITH_COVERAGE_DEPTH in args.build_flags:
+                try:


⚠️ In per-test coverage mode this except path only prints and continues, so a failed SYSTEM SET COVERAGE TEST call silently degrades the whole run (coverage may accumulate under stale test names or be dropped), while the job can still appear green.

Since this code runs specifically when --collect-per-test-coverage and WITH_COVERAGE_DEPTH are enabled, it should fail fast (or at least mark the run failed) if the command cannot be armed. Otherwise targeted test selection gets built from corrupted/incomplete data.

Could you re-raise here (or return a failing TestResult) in coverage mode?
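A fail-fast variant could look like the sketch below. It is not the clickhouse-test code: `arm_coverage`, `CoverageArmError`, and the `run_query` callable are hypothetical, and the function is assumed to be called only when per-test coverage mode is enabled.

```python
class CoverageArmError(RuntimeError):
    """Raised when the coverage test name could not be armed."""


def arm_coverage(run_query, test_name: str) -> None:
    """Arm per-test coverage and propagate any failure to the caller.

    run_query: callable that sends one SQL statement to the server
               (hypothetical stand-in for the runner's client call).
    """
    escaped = test_name.replace("\\", "\\\\").replace("'", "\\'")
    try:
        run_query(f"SYSTEM SET COVERAGE TEST '{escaped}'")
    except Exception as exc:
        # Do not print and continue: stale or missing coverage must fail the run.
        raise CoverageArmError(
            f"could not arm coverage for test {test_name!r}"
        ) from exc
```

The runner can then convert `CoverageArmError` into a failing result for that test, so a green job implies that every test was armed successfully.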

clickhouse-gh bot added the pr-ci label Mar 14, 2026

clickhouse-gh bot reviewed Mar 14, 2026
