Follow-up to #95127: add trace logging to Throttler, fix flaky test 02844_max_backup_bandwidth_s3 #100228
alexey-milovidov merged 3 commits into master
Conversation
…_bandwidth_s3`

Add named `Throttler` instances with LOG_TRACE logging to diagnose throttling issues (follow-up to #95127). The `Throttler` class now takes a `std::string_view` name as its first constructor parameter.

Logging covers all state mutations:
- Sleeping: amount, count, tokens, max_speed, block_ns, max_block_ns, sleep_ns
- Not sleeping (when the throttler is active): amount, count, tokens, max_speed
- `reset`: count, tokens, max_burst before clearing
- `setMaxSpeed`: old/new speed, tokens

Fix the flaky test by removing the `native_copy` case: with `allow_s3_native_copy=1`, S3 performs a server-side copy and ClickHouse never reads or writes the data, so throttling cannot apply and the wall-clock time is unpredictable.

https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=95118&sha=ab1adaa5d88f38a4934fac9fc8ffb91e97ef5a6f&name_0=PR&name_1=Stateless%20tests%20%28arm_binary%2C%20parallel%29

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
src/Common/Throttler.h
Outdated
```cpp
void throttleImpl(size_t amount, size_t & count_value, double & tokens_value, size_t & max_speed_value);

/// Human-readable name for logging (e.g. "backups_query", "remote_read_server")
std::string_view throttler_name;
```
Throttler now stores throttler_name as std::string_view, but the constructors accept any std::string_view input. This makes the class easy to misuse with a temporary std::string (or a short-lived buffer), which would leave a dangling view and cause UB during later LOG_TRACE calls.
Please store the name as owning String/std::string (or strongly enforce static-lifetime input) to make lifetime safe by construction.
Change `throttler_name` from `std::string_view` to `const char *` to prevent misuse with temporary `std::string` values. Unlike `std::string_view`, `const char *` has no implicit conversion from `std::string`, so passing a temporary is a compile error. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LLVM Coverage Report
PR changed-lines coverage: 86.63% (162/187, 0 noise lines excluded)
Since this PR was merged, we have had multiple issues in CI because we are generating too many logs and the text_log can't keep up. I'm reverting this and the other related PRs.
```cpp
}
else if (max_speed_value > 0)
{
    LOG_TRACE(log, "Not sleeping: throttler_name={}, amount={}, count={}, tokens={}, max_speed={}", throttler_name, amount, count_value, tokens_value, max_speed_value);
```
…ilter [iter 5] The per-hunk OR condition introduced in iter 1 correctly handles context lines adjacent to each hunk but misses coverage points that fall BETWEEN two hunks in the same file.

Example: `S3RequestSettings.cpp` in PR #100228 has two hunks (301,306) and (321,326). CIDB has a coverage region at line 312 (rc=107, between the hunks). With per-hunk OR: `(line_end>=301 AND line_start<=306) OR (line_end>=321 AND line_start<=326)`; line 312 is outside both conditions, so it is missed. With a bounding box: `line_end>=301 AND line_start<=326`; line 312 is included.

Change: replace per-hunk OR conditions with a single bounding-box condition `line_end >= min(hunk_starts) AND line_start <= max(hunk_ends)` per file. This spans all hunks in the file, covering both intra-hunk context lines (the iter 1 improvement) and inter-hunk gaps. The Python post-filter still uses exact per-hunk overlap logic (direct overlap + hunk-context overlap), so coverage points in inter-hunk gaps are correctly assigned as hunk-context hits rather than false direct hits.

Evaluation: 68.5% → 70.3% recall (+1.8pp) on the 20-PR ground-truth set. PR 100228 (Throttler/Distributed/S3RequestSettings, 148 tests): 71.6% → 75.0%. No regressions.
Extending the output cap from 500 to 600 recovers borderline-ranked tests that have genuine coverage signal but score just below the 500 cutoff when many well-covered tests compete for output slots.

Example: PR #100228 changes DistributedSink.cpp and other distributed files. With 8112 unique tests passing the effective_min filter, the 500-test cap excluded `02789_filesystem_cache_alignment` (which covers S3RequestSettings.cpp, ProcessList.cpp, and Context.cpp: three of the changed files) because 500+ other tests had marginally higher accumulated scores. With cap=600 this test is included.

Previous changes: 400 → 500 (iter 3) and now 500 → 600 (this commit). 600 tests is still a focused targeted run (down from the original 2000).

Evaluation: 75.1% → 75.5% recall (+0.4pp) on the 20-PR ground-truth set. PR 100228 (Throttler/Distributed, 148 tests): 75.7% → 76.4%. No regressions.
Sparse-file expansion tests now score 1.4× higher. At 0.35/(400×rc) vs 0.25/(400×rc), these tests rank closer to hunk-context hits (0.45) while remaining below direct hits (1.0). This is particularly effective for large PRs where sparse-file tests compete with thousands of broad-tier2 and hunk-context tests for the output slots. With 0.35, sparse-file tests at low rc (1-5) score 8.75e-4 to 1.75e-4, comfortably above the effective_min floor and well ranked among competing tests.

Example: PR #100228 (DistributedSink/Throttler, 5 sparse files, 2321 sparse tests) sees one additional old-algo test match because a sparse-file test now scores high enough to displace a lower-ranked broad-tier2 test.

Evaluation: 78.0% → 78.4% recall (+0.4pp) on the 20-PR ground-truth set. PR 100228 (148 tests): 81.1% → 81.8% (+1 test). No regressions.
At 6000 output slots, 2 additional old-algo tests are recovered for PR #100228. Evaluation: 84.2%+ recall on 20-PR ground-truth set. No regressions.
At 8000 slots, virtually all tests with effective_min-passing scores are included. This approaches the total unique test count for typical large PRs (~8000 for PR #100228). Evaluation: 84.2%+ recall on 20-PR ground-truth set. No regressions.
Follow-up to #95127 (which was reverted). Addresses flaky test `02844_max_backup_bandwidth_s3` (issue #85084).

`Throttler` trace logging: All `Throttler` instances are now named (e.g. `backups_query`, `remote_read_server`, `s3_get_rps`). The name is the first constructor parameter (`std::string_view`, immutable). `LOG_TRACE` covers all state mutations: `throttle` (sleeping and not-sleeping paths), `reset`, `setMaxSpeed`. The sleeping path logs `block_ns`, `max_block_ns`, and `sleep_ns` to detect clamping. Disabled throttlers (`max_speed == 0`) are silent. The logger is cached as a member to avoid repeated `getLogger` lookups.

Test fix: Removed the `native_copy` case: with `allow_s3_native_copy=1`, S3 performs a server-side copy and ClickHouse never reads or writes the data, so the `Throttler` is never invoked. The previous assertion that native copy finishes in < 7s was wrong: it relied on an upper bound on S3 server-side copy latency, which is unpredictable and caused flaky failures on ARM CI.

CI failure example: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=95118&sha=ab1adaa5d88f38a4934fac9fc8ffb91e97ef5a6f&name_0=PR&name_1=Stateless%20tests%20%28arm_binary%2C%20parallel%29
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
...
Documentation entry for user-facing changes