Replace SANITIZE_COVERAGE with LLVM source-based coverage for per-test coverage collection#99513
Replace SANITIZE_COVERAGE with LLVM source-based coverage for per-test coverage collection#99513
Conversation
|
Workflow [PR], commit [52e50a8] Summary: ⏳
AI ReviewSummaryThis PR migrates per-test coverage collection from Findings❌ Blockers
Tests
ClickHouse Rules
Final Verdict
|
|
FIRST IMPLEMENTATION WITH SYMBOLS NORMALIZATION (REMOVED) |
a2b8d7f to
52c3b10
Compare
…rkaround LLVM_COVERAGE_BUILD (which works) has no toolchain file and builds compiler-rt from source — the profile runtime is available as a result. AMD_COVERAGE used the x86_64 toolchain which sets USE_SYSTEM_COMPILER_RT, causing the linker to look for libclang_rt.profile.a at a system path that does not exist in the CI Docker image. Drop the toolchain file from AMD_COVERAGE to match LLVM_COVERAGE_BUILD. Also revert the sanitize.cmake --print-file-name workaround introduced in the previous commit since it is no longer needed. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
AMD_COVERAGE is an x86 build; it was misconfigured to run on ARM_LARGE. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
system.build_options uses @FULL_CXX_FLAGS_NORMALIZED@ for the CXX_FLAGS row, not @CMAKE_CXX_FLAGS@. This normalized variable does not contain the -DWITH_COVERAGE=1 preprocessor define we set in sanitize.cmake. Query the dedicated WITH_COVERAGE row instead, which stores @WITH_COVERAGE@ and returns 'ON' when the coverage build option is enabled. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Temporary print to confirm WITH_COVERAGE is detected correctly in CI. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Print build_flags + coverage_active status early in clickhouse-test so the next CI run shows whether WITH_COVERAGE is detected. - Use verbose=True in CoverageExporter so clickhouse local output (including row counts and errors) appears in the job log. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
After creating system.coverage_log, run a quick self-test: SYSTEM SET COVERAGE TEST '_selftest' → SELECT 1 → SYSTEM SET COVERAGE TEST '' then check if any rows were inserted. Prints 'OK' or 'EMPTY - coverage mapping not working' to surface the issue in the job log. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Every invocation logs: test name received, number of covered NameRefs, coverage map size, and resolved file/line count. Makes the CI server log show exactly what's happening at each stage. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…age label - Replace AMD_COVERAGE build with LLVM_COVERAGE_BUILD in coverage_build_jobs: no sysroot, compiler-rt built from source (no libclang_rt.profile.a issues). - Add per_test_coverage label to functional_tests_jobs_coverage parameters: Stateless tests (llvm_coverage_build, per_test_coverage, N/8) - In functional_tests.py: per_test_coverage → is_coverage=True (new per-test SYSTEM SET COVERAGE TEST path); amd_llvm_coverage still routes to old profraw. - Add INFO logging to SYSTEM SET COVERAGE TEST and CoverageCollection so CI server logs show NameRef count, map size, and resolved file count. - Add coverage self-test and build_flags dump in clickhouse-test startup. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
| if (kind == Kind::Files) | ||
| return std::make_shared<DataTypeArray>(std::make_shared<DataTypeString>()); | ||
| return std::make_shared<DataTypeArray>(std::make_shared<DataTypeUInt32>()); | ||
| } |
There was a problem hiding this comment.
coverageCurrentFiles / coverageCurrentLineStarts / coverageCurrentLineEnds are documented as returning live coverage data, but this implementation always returns empty arrays.
Because these SQL functions are still registered and user-visible in WITH_COVERAGE builds, this becomes a functional regression (the API appears to work but yields no data). Please either wire these functions to the new NameRef -> (file,line) mapping path or remove/feature-gate them until implemented.
…erage_log stats - Register per_test_coverage as a known option so functional_tests.py doesn't assert on unknown option. - After all tests complete, print system.coverage_log row count and distinct test count so the job log shows whether data was collected. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
| case Type::SET_COVERAGE_TEST: | ||
| { | ||
| ASTPtr ast; | ||
| if (ParserStringLiteral{}.parse(pos, ast, expected)) |
There was a problem hiding this comment.
SYSTEM SET COVERAGE TEST currently accepts a missing test name because the parser does not fail when ParserStringLiteral doesn't match. This means malformed input like SYSTEM SET COVERAGE TEST silently executes with an empty name and triggers a coverage flush/reset path instead of returning a syntax exception.
Please make the argument mandatory here:
ASTPtr ast;
if (!ParserStringLiteral{}.parse(pos, ast, expected))
return false;
res->coverage_test_name = ast->as<ASTLiteral &>().value.safeGet<String>();Keeping explicit empty string support (SYSTEM SET COVERAGE TEST '') still works with this change.
Reduce all remaining ThreadFuzzer parameters: - Migrate probabilities: 1.0 → 0.5 (halve thread migration on mutex ops) - Mutex sleep probabilities: 0.001 → 0.0005 - Mutex sleep time max: 10ms → 1ms
…-quality # Conflicts: # .github/workflows/master.yml # .github/workflows/pull_request.yml # ci/defs/defs.py # ci/jobs/build_clickhouse.py
…query Exclude (commit_sha, check_name) pairs with >= 20 test failures — these indicate a broken build or environment issue rather than genuine per-test flakiness, and would flood the result with noise. The filter was removed earlier; re-adding it as a subquery IN-filter so it composes cleanly with the flat group-by and recency-weighted ordering.
…d of a PR Usage: # All uncommitted changes vs HEAD PYTHONPATH=ci:. python3 ci/jobs/scripts/find_tests.py --local # All branch changes vs master PYTHONPATH=ci:. python3 ci/jobs/scripts/find_tests.py --local --base origin/master # Last commit only PYTHONPATH=ci:. python3 ci/jobs/scripts/find_tests.py --local --base HEAD~1 The --local flag: - Runs `git diff <base>` (default base: HEAD) to get the diff - Detects changed test files from the local diff (not gh pr diff) - Implies --coverage-only (no PR number → previously-failed pass is skipped) - Makes the `pr` positional argument optional The PR number remains required for normal (non-local) usage.
…e instead of a PR" This reverts commit 498df6f.
The HTTP URL query parameter has a server-side size limit, causing "Field value too long" errors for large queries (e.g. IN-lists with 1000+ test names in find_tests.py indirect callee pass). ClickHouse HTTP interface accepts the query equally in the POST body, which has no practical size limit. Move the query from params= to data= to fix large-query failures for all CIDB callers.
Per-hunk SQL ranges for .cpp files: - Replace bounding-box (min_hunk to max_hunk) with per-hunk OR conditions ±1 - Merge hunks ≤ 5 lines apart to bridge inter-hunk gaps - Extend .h files to also use per-hunk ranges (same as .cpp) instead of fetching all regions in the file — prevents massive template headers like FunctionsConversion.h from flooding results with unrelated instantiations Indirect callee pass: - Jaccard threshold: lower only for rc < 20 (sparse files); rc ≥ 20 keeps 70% (was: 9% for rc=78, admitting tests sharing only generic callees) - INDIRECT_LIMIT: inversely proportional to seed count (200/n_seeds×5) so files with many direct hits get fewer indirect additions - min_depth ≤ 3 as additional seed criterion for high-rc files - MAX_TESTS_PER_LINE: raise from 150 to 500 for seed selection Output: - MAX_OUTPUT_TESTS: 300 → 250 - Always run supplementary keyword pass (was suppressed at >150 tests) - Keyword guarantee: inject top keyword tests into output tail, replacing lowest-scoring items to ensure domain-specific tests always appear get_changed_tests: reuse _diff_text when available to detect new test files from pre-fetched diff without extra GitHub API call Ultra-broad tier-3 expansion: query rc 8001–30000 when both primary and broad-tier2 return zero results
| kw_extra = [t for t in keyword_guarantee if t not in ranked_set] | ||
| if kw_extra: | ||
| # Replace tail items to stay within MAX_OUTPUT_TESTS | ||
| n = min(len(kw_extra), MAX_OUTPUT_TESTS) |
There was a problem hiding this comment.
n = min(len(kw_extra), MAX_OUTPUT_TESTS) can discard all coverage-ranked tests when ranked is below the cap and kw_extra is large.
Concrete trace:
len(ranked)=80,len(kw_extra)=300,MAX_OUTPUT_TESTS=250n=250ranked[: MAX_OUTPUT_TESTS - n]becomesranked[:0](empty)- result is only
kw_extra[:250]
So the keyword guarantee can replace strong coverage-backed hits instead of only filling/replacing the tail. Please cap replacement by available tail slots, or split into append-when-below-cap and bounded-tail-replace-when-at-cap.
| # Broad-tier2 guarantee: if the cap cut off high-cov_regions broad-tier2 tests, | ||
| # append the top few (by cov_regions) that didn't make it — but only up to the cap. | ||
| broad_guarantee = getattr(self, '_broad_tier2_guarantee', []) | ||
| if broad_guarantee and len(ranked) < MAX_OUTPUT_TESTS: |
There was a problem hiding this comment.
The broad-tier2 guarantee block currently runs only when len(ranked) < MAX_OUTPUT_TESTS, but the failure mode described above it is specifically when the cap already cut off high-coverage broad-tier2 tests.
With len(ranked) == MAX_OUTPUT_TESTS, this block is skipped and the guarantee is not applied, so high-cov_regions broad-tier2 tests can still be dropped.
Please apply a bounded tail replacement when at cap so the guarantee actually addresses the capped-list case.
LLVM Coverage Report
Changed lines: 35.00% (21/60) · Uncovered code |
…onflict
- .claude/docs/relevant_tests.md: update with latest algorithm details
(per-hunk ranges, adaptive Jaccard, keyword guarantee, quality metrics)
- .claude/docs/relevant_test_quality.md: new doc summarizing test selection
quality from 53-PR analysis (DWARF FPs, ultra-broad exclusions, structural
limitations, improvements applied)
- tests/queries/0_stateless/03222_datetime64_small_value_const.sql:
fix parallel-run conflict (TABLE_ALREADY_EXISTS on shard_0.dt64_03222).
Use {CLICKHOUSE_DATABASE}/{CLICKHOUSE_DATABASE_1} with need-query-parameters
tag, test_shard_localhost cluster, single database for all shard data.
…add 20 concrete examples
| if (node->value < load_base) | ||
| continue; | ||
| const uint64_t offset = node->value - load_base; | ||
| result.push_back({data->NameRef, data->FuncHash, offset, node->count}); |
There was a problem hiding this comment.
getCurrentIndirectCalls zeroes node->count before storing it, so every emitted row writes call_count = 0:
node->count = 0;
...
result.push_back({..., node->count});This silently destroys the value-profile signal and makes system.coverage_indirect_calls.call_count unusable for weighting or diagnostics.
Please save the counter value first, then reset:
const uint64_t call_count = node->count;
node->count = 0;
...
result.push_back({data->NameRef, data->FuncHash, offset, call_count});| client_options = self.add_effective_settings(client_options) | ||
|
|
||
| if args.collect_per_test_coverage and BuildFlags.WITH_COVERAGE_DEPTH in args.build_flags: | ||
| try: |
There was a problem hiding this comment.
except path only prints and continues, so a failed SYSTEM SET COVERAGE TEST call silently degrades the whole run (coverage may accumulate under stale test names or be dropped), while the job can still appear green.
Since this code runs specifically when --collect-per-test-coverage and WITH_COVERAGE_DEPTH are enabled, it should fail fast (or at least mark the run failed) if the command cannot be armed. Otherwise targeted test selection gets built from corrupted/incomplete data.
Could you re-raise here (or return a failing TestResult) in coverage mode?
Motivation
The existing nightly coverage pipeline used
-fsanitize-coverage=trace-pc-guard,pc-table(SANITIZE_COVERAGE), which required custom__sanitizer_cov_*callbacks, a 6 GB unstripped binary as a build artifact, offline DWARF queries viaclickhouse local, and a complex two-stage export involving Python symbol normalization withPool.imapandsort -u.Test selection based on coverage queried
checks_coverage_inverted2using normalized C++ symbol names, which required a 200-line normalization function to strip return types, argument lists, and template arguments — and still had false positives due to basename-only DWARF matching.Client-side coverage via
CLICKHOUSE_WRITE_COVERAGEwas broken in practice: the dumped addresses were virtual addresses without the binary load base subtracted, soaddressToSymbolcould not resolve them and the symbol arrays were always empty.What changed
Replace
SANITIZE_COVERAGEwith LLVM's standard source-based coverage (-fprofile-instr-generate -fcoverage-mapping,WITH_COVERAGE).In-process coverage mapping reader. At server startup,
readLLVMCoverageMapping("/proc/self/exe")parses the binary's own__llvm_covfunand__llvm_covmapELF sections. It builds a map fromNameRef(MD5 hash of the mangled function name, from__llvm_profile_data) to(file, line_start, line_end)using LLVM's coverage mapping format. No build artifacts or external tools are required at runtime.New SQL command
SYSTEM SET COVERAGE TEST 'name'. Before each test, the test runner calls this command, which atomically: flushes the previous test's coverage intosystem.coverage_log, resets LLVM profiling counters via__llvm_profile_reset_counters, and arms the new test name. Replaces the old 3-stepRESET → test → INSERTsequence.system.coverage_lognew schema. Columnsfiles Array(String),line_starts Array(UInt32),line_ends Array(UInt32)replacecoverage Array(UInt64)andsymbol Array(String).Simplified export. A single
ARRAY JOIN-basedINSERT INTO FUNCTION remoteSecure(...)replaces the two-stage Python export (raw symbols + normalized symbols). Target table:checks_coverage_lines(new CIDB table, keyed by(check_start_time, file, line_start)).Line-based test selection.
find_tests.pynow querieschecks_coverage_lineswithendsWith(file, basename) AND line_start <= N AND line_end >= Nfor each changed line in the PR diff. Tests are ranked by how many changed lines they cover — tests covering more of the diff appear first. Eliminatesfind_symbols.py,normalize_symbol, and the DWARF binary query (was 85–180 s per run).Validated locally
Ran 100 stateless tests with per-test coverage collection:
system.coverage_log, one per testSELECT 1covers ⊂ALTER TABLE(1372 unique files unique to alter); hot-path files (ProfileEvents.cpp,MergeTreeBackgroundExecutor.cpp) in all 100 testsInterpreterAlterQuery.cppline 100 is covered by exactly one test:00030_alter_tableWhat disappears
SANITIZE_COVERAGEcmake option and__sanitizer_cov_*callbacksfind_symbols.py(DWARF query taking 85–180 s)checks_coverage_invertedandchecks_coverage_inverted2CIDB tables (can be dropped after one month)CLICKHOUSE_WRITE_COVERAGE(was broken anyway — dumped virtual addresses without subtracting the binary load base, soaddressToSymbolcould not resolve them and symbol arrays were always empty)Changelog category (leave one):
Changelog entry:
Replace
SANITIZE_COVERAGE(custom sanitizer callbacks, symbol-level granularity) with LLVM source-based coverage (WITH_COVERAGE,-fprofile-instr-generate -fcoverage-mapping) for the nightly per-test coverage pipeline. The server now reads its own coverage mapping from ELF sections at startup and collects(file, line_start, line_end)tuples per test via a newSYSTEM SET COVERAGE TEST 'name'command. Test selection in targeted CI checks uses line-range queries against a newchecks_coverage_linesCIDB table and ranks candidate tests by how many changed diff lines they cover.Documentation entry for user-facing changes