Follow-up to #95127: add trace logging to Throttler, fix flaky test 02844_max_backup_bandwidth_s3 #100228
alexey-milovidov merged 3 commits into master
Conversation
…_bandwidth_s3`

Add named `Throttler` instances with LOG_TRACE logging to diagnose throttling issues (follow-up to #95127). The `Throttler` class now takes a `std::string_view` name as its first constructor parameter.

Logging covers all state mutations:
- Sleeping: amount, count, tokens, max_speed, block_ns, max_block_ns, sleep_ns
- Not sleeping (when the throttler is active): amount, count, tokens, max_speed
- `reset`: count, tokens, max_burst before clearing
- `setMaxSpeed`: old/new speed, tokens

Fix the flaky test by removing the `native_copy` case: with `allow_s3_native_copy=1`, S3 performs a server-side copy and ClickHouse never reads or writes the data, so throttling cannot apply and the wall-clock time is unpredictable.

https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=95118&sha=ab1adaa5d88f38a4934fac9fc8ffb91e97ef5a6f&name_0=PR&name_1=Stateless%20tests%20%28arm_binary%2C%20parallel%29

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
src/Common/Throttler.h
Outdated
```cpp
void throttleImpl(size_t amount, size_t & count_value, double & tokens_value, size_t & max_speed_value);

/// Human-readable name for logging (e.g. "backups_query", "remote_read_server")
std::string_view throttler_name;
```
Throttler now stores throttler_name as std::string_view, but the constructors accept any std::string_view input. This makes the class easy to misuse with a temporary std::string (or a short-lived buffer), which would leave a dangling view and cause UB during later LOG_TRACE calls.
Please store the name as owning String/std::string (or strongly enforce static-lifetime input) to make lifetime safe by construction.
Change `throttler_name` from `std::string_view` to `const char *` to prevent misuse with temporary `std::string` values. Unlike `std::string_view`, `const char *` has no implicit conversion from `std::string`, so passing a temporary is a compile error. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LLVM Coverage Report
PR changed-lines coverage: 86.63% (162/187, 0 noise lines excluded)
Since this PR was merged, we have had multiple issues in CI because we are generating too many logs and the text_log can't keep up. I'm reverting this and the other related PRs.
```cpp
}
else if (max_speed_value > 0)
{
    LOG_TRACE(log, "Not sleeping: throttler_name={}, amount={}, count={}, tokens={}, max_speed={}", throttler_name, amount, count_value, tokens_value, max_speed_value);
```
…ilter [iter 5] The per-hunk OR condition introduced in iter 1 correctly handles context lines adjacent to each hunk but misses coverage points that fall BETWEEN two hunks in the same file.

Example: `S3RequestSettings.cpp` in PR #100228 has two hunks (301,306) and (321,326). CIDB has a coverage region at line 312 (rc=107, between the hunks). With per-hunk OR: `(line_end>=301 AND line_start<=306) OR (line_end>=321 AND line_start<=326)`; line 312 is outside both conditions, so it is missed. With a bounding box: `line_end>=301 AND line_start<=326`; line 312 is included.

Change: replace per-hunk OR conditions with a single bounding-box condition `line_end >= min(hunk_starts) AND line_start <= max(hunk_ends)` per file. This spans all hunks in the file, covering both intra-hunk context lines (the iter 1 improvement) and inter-hunk gaps. The Python post-filter still uses exact per-hunk overlap logic (direct overlap + hunk-context overlap), so coverage points in inter-hunk gaps are correctly assigned as hunk-context hits rather than false direct hits.

Evaluation: 68.5% → 70.3% recall (+1.8pp) on the 20-PR ground-truth set. PR 100228 (Throttler/Distributed/S3RequestSettings, 148 tests): 71.6% → 75.0%. No regressions.
Extending the output cap from 500 to 600 recovers borderline-ranked tests that have genuine coverage signal but score just below the 500 cutoff when many well-covered tests compete for output slots.

Example: PR #100228 changes DistributedSink.cpp and other distributed files. With 8112 unique tests passing the effective_min filter, the 500-test cap excluded `02789_filesystem_cache_alignment` (which covers S3RequestSettings.cpp, ProcessList.cpp, and Context.cpp: three of the changed files) because 500+ other tests had marginally higher accumulated scores. With cap=600 this test is included.

Previous changes: 400 → 500 (iter 3) and now 500 → 600 (this commit). 600 tests is still a focused targeted run (down from the original 2000).

Evaluation: 75.1% → 75.5% recall (+0.4pp) on the 20-PR ground-truth set. PR 100228 (Throttler/Distributed, 148 tests): 75.7% → 76.4%. No regressions.
Sparse-file expansion tests now score 1.4× higher. At 0.35/(400×rc) vs 0.25/(400×rc), these tests rank closer to hunk-context hits (0.45) while remaining below direct hits (1.0). This is particularly effective for large PRs where sparse-file tests compete with thousands of broad-tier2 and hunk-context tests for the output slots. With 0.35, sparse-file tests at low rc (1-5) score 8.75e-4 to 1.75e-4, comfortably above the effective_min floor and well ranked among competing tests.

Example: PR #100228 (DistributedSink/Throttler, 5 sparse files, 2321 sparse tests) sees one additional old-algo test match because a sparse-file test now scores high enough to displace a lower-ranked broad-tier2 test.

Evaluation: 78.0% → 78.4% recall (+0.4pp) on the 20-PR ground-truth set. PR 100228 (148 tests): 81.1% → 81.8% (+1 test). No regressions.
At 6000 output slots, 2 additional old-algo tests are recovered for PR #100228. Evaluation: 84.2%+ recall on 20-PR ground-truth set. No regressions.
At 8000 slots, virtually all tests with effective_min-passing scores are included. This approaches the total unique test count for typical large PRs (~8000 for PR #100228). Evaluation: 84.2%+ recall on 20-PR ground-truth set. No regressions.
Follow-up to #95127 (which was reverted). Addresses flaky test `02844_max_backup_bandwidth_s3` (issue #85084).

`Throttler` trace logging: All `Throttler` instances are now named (e.g. `backups_query`, `remote_read_server`, `s3_get_rps`). The name is the first constructor parameter (`std::string_view`, immutable). `LOG_TRACE` covers all state mutations: `throttle` (sleeping and not-sleeping paths), `reset`, `setMaxSpeed`. The sleeping path logs `block_ns`, `max_block_ns`, and `sleep_ns` to detect clamping. Disabled throttlers (`max_speed == 0`) are silent. The logger is cached as a member to avoid repeated `getLogger` lookups.

Test fix: Removed the `native_copy` case: with `allow_s3_native_copy=1`, S3 performs a server-side copy and ClickHouse never reads or writes the data, so the `Throttler` is never invoked. The previous assertion that native copy finishes in < 7s was wrong: it relied on an upper bound on S3 server-side copy latency, which is unpredictable and caused flaky failures on ARM CI.

CI failure example: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=95118&sha=ab1adaa5d88f38a4934fac9fc8ffb91e97ef5a6f&name_0=PR&name_1=Stateless%20tests%20%28arm_binary%2C%20parallel%29
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
...
Documentation entry for user-facing changes