Skip to content

Follow-up to #95127: add trace logging to Throttler, fix flaky test 02844_max_backup_bandwidth_s3#100228

Merged
alexey-milovidov merged 3 commits intomasterfrom
revert-95127
Mar 22, 2026
Merged

Follow-up to #95127: add trace logging to Throttler, fix flaky test 02844_max_backup_bandwidth_s3#100228
alexey-milovidov merged 3 commits intomasterfrom
revert-95127

Conversation

@serxa
Copy link
Copy Markdown
Member

@serxa serxa commented Mar 20, 2026

Follow-up to #95127 (which was reverted). Addresses flaky test 02844_max_backup_bandwidth_s3 (issue #85084).

Throttler trace logging: All Throttler instances are now named (e.g. backups_query, remote_read_server, s3_get_rps). The name is the first constructor parameter (std::string_view, immutable). LOG_TRACE covers all state mutations: throttle (sleeping and not-sleeping paths), reset, setMaxSpeed. The sleeping path logs block_ns, max_block_ns, and sleep_ns to detect clamping. Disabled throttlers (max_speed == 0) are silent. The logger is cached as a member to avoid repeated getLogger lookups.

Test fix: Removed the native_copy case — with allow_s3_native_copy=1, S3 performs a server-side copy and ClickHouse never reads/writes the data, so the Throttler is never invoked. The previous assertion that native copy finishes in < 7s was wrong: it relied on an upper bound on S3 server-side copy latency, which is unpredictable and caused flaky failures on ARM CI.

CI failure example: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=95118&sha=ab1adaa5d88f38a4934fac9fc8ffb91e97ef5a6f&name_0=PR&name_1=Stateless%20tests%20%28arm_binary%2C%20parallel%29

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

serxa and others added 2 commits March 19, 2026 20:01
…4-max-backup-bandwidth-s3"

This reverts commit 5364a37, reversing
changes made to d553ae4.
…_bandwidth_s3`

Add named `Throttler` instances with LOG_TRACE logging to diagnose
throttling issues (follow-up to #95127). The `Throttler` class now
takes a `std::string_view` name as its first constructor parameter.

Logging covers all state mutations:
- Sleeping: amount, count, tokens, max_speed, block_ns, max_block_ns, sleep_ns
- Not sleeping (when throttler is active): amount, count, tokens, max_speed
- `reset`: count, tokens, max_burst before clearing
- `setMaxSpeed`: old/new speed, tokens

Fix the flaky test by removing the `native_copy` case — with
`allow_s3_native_copy=1` S3 performs a server-side copy and ClickHouse
never reads/writes the data, so throttling cannot apply and wall clock
time is unpredictable.

https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=95118&sha=ab1adaa5d88f38a4934fac9fc8ffb91e97ef5a6f&name_0=PR&name_1=Stateless%20tests%20%28arm_binary%2C%20parallel%29

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 20, 2026

Workflow [PR], commit [960b132]

Summary:


AI Review

Summary

This PR adds Throttler trace logging with explicit throttler names and fixes flaky test 02844_max_backup_bandwidth_s3 by removing the unreliable allow_s3_native_copy=1 timing assertion. I reviewed the full diff (including Throttler API/call-site updates and test changes) and did not find additional correctness, safety, performance, or compatibility issues beyond already-addressed inline feedback.

ClickHouse Rules

Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout
Compilation time

Final Verdict

  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh bot added the pr-ci label Mar 20, 2026
void throttleImpl(size_t amount, size_t & count_value, double & tokens_value, size_t & max_speed_value);

/// Human-readable name for logging (e.g. "backups_query", "remote_read_server")
std::string_view throttler_name;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throttler now stores throttler_name as std::string_view, but the constructors accept any std::string_view input. This makes the class easy to misuse with a temporary std::string (or a short-lived buffer), which would leave a dangling view and cause UB during later LOG_TRACE calls.

Please store the name as owning String/std::string (or strongly enforce static-lifetime input) to make lifetime safe by construction.

Change `throttler_name` from `std::string_view` to `const char *`
to prevent misuse with temporary `std::string` values. Unlike
`std::string_view`, `const char *` has no implicit conversion from
`std::string`, so passing a temporary is a compile error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 20, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 83.80% 83.80% +0.00%
Functions 24.00% 24.00% +0.00%
Branches 76.40% 76.40% +0.00%

PR changed lines: PR changed-lines coverage: 86.63% (162/187, 0 noise lines excluded)
Diff coverage report
Uncovered code

@alexey-milovidov alexey-milovidov self-assigned this Mar 22, 2026
@alexey-milovidov alexey-milovidov merged commit c4a081c into master Mar 22, 2026
151 of 152 checks passed
@alexey-milovidov alexey-milovidov deleted the revert-95127 branch March 22, 2026 10:16
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Mar 22, 2026
@Algunenano
Copy link
Copy Markdown
Member

Since this PR we have multiple issues in CI because we are generating too many logs and the text_log can't keep up.

I'm reverting this and other related PRs

Algunenano added a commit to Algunenano/ClickHouse that referenced this pull request Mar 25, 2026
…5127"

This reverts commit c4a081c, reversing
changes made to e12cd5d.
}
else if (max_speed_value > 0)
{
LOG_TRACE(log, "Not sleeping: throttler_name={}, amount={}, count={}, tokens={}, max_speed={}", throttler_name, amount, count_value, tokens_value, max_speed_value);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fm4v added a commit that referenced this pull request Mar 29, 2026
…ilter [iter 5]

The per-hunk OR condition introduced in iter 1 correctly handles context lines
adjacent to each hunk but misses coverage points that fall BETWEEN two hunks
in the same file.

Example: `S3RequestSettings.cpp` in PR #100228 has two hunks (301,306) and
(321,326).  CIDB has a coverage region at line 312 (rc=107, between the hunks).
With per-hunk OR: `(line_end>=301 AND line_start<=306) OR (line_end>=321 AND
line_start<=326)` — line 312 is outside both conditions, so it is missed.
With bounding box: `line_end>=301 AND line_start<=326` — line 312 is included.

Change: replace per-hunk OR conditions with a single bounding box condition
`line_end >= min(hunk_starts) AND line_start <= max(hunk_ends)` per file.
This spans all hunks in the file, covering both intra-hunk context lines (the
iter 1 improvement) and inter-hunk gaps.

The Python post-filter still uses exact per-hunk overlap logic (direct overlap +
hunk-context overlap), so coverage points in inter-hunk gaps are correctly
assigned as hunk-context hits rather than false direct hits.

Evaluation: 68.5% → 70.3% recall (+1.8pp) on 20-PR ground-truth set.
PR 100228 (Throttler/Distributed/S3RequestSettings, 148 tests): 71.6% → 75.0%.
No regressions.
fm4v added a commit that referenced this pull request Mar 29, 2026
Extending the output cap from 500 to 600 recovers borderline-ranked tests that
have genuine coverage signal but score just below the 500 cutoff when many
well-covered tests compete for output slots.

Example: PR #100228 changes DistributedSink.cpp and other distributed files.
With 8112 unique tests passing the effective_min filter, the 500-test cap
excluded `02789_filesystem_cache_alignment` (which covers S3RequestSettings.cpp,
ProcessList.cpp, and Context.cpp — three of the changed files) because 500+ other
tests had marginally higher accumulated scores.  With cap=600 this test is included.

Previous changes: 400 → 500 (iter 3) and now 500 → 600 (this commit).
600 tests is still a focused targeted run (down from the original 2000).

Evaluation: 75.1% → 75.5% recall (+0.4pp) on 20-PR ground-truth set.
PR 100228 (Throttler/Distributed, 148 tests): 75.7% → 76.4%.
No regressions.
fm4v added a commit that referenced this pull request Mar 29, 2026
Sparse-file expansion tests now score 1.4× higher.  At 0.35/(400×rc) vs
0.25/(400×rc), these tests rank closer to hunk-context hits (0.45) while
remaining below direct hits (1.0).

This is particularly effective for large PRs where sparse-file tests compete
with thousands of broad-tier2 and hunk-context tests for the output slots.
With 0.35, sparse-file tests at low rc (1-5) score 8.75e-4 to 1.75e-4,
comfortably above the effective_min floor and well-ranked among competing tests.

Example: PR #100228 (DistributedSink/Throttler, 5 sparse files, 2321 sparse
tests) sees one additional old-algo test match because a sparse-file test
now scores high enough to displace a lower-ranked broad-tier2 test.

Evaluation: 78.0% → 78.4% recall (+0.4pp) on 20-PR ground-truth set.
PR 100228 (148 tests): 81.1% → 81.8% (+1 test).
No regressions.
fm4v added a commit that referenced this pull request Mar 29, 2026
At 6000 output slots, 2 additional old-algo tests are recovered for PR #100228.

Evaluation: 84.2%+ recall on 20-PR ground-truth set.  No regressions.
fm4v added a commit that referenced this pull request Mar 29, 2026
At 8000 slots, virtually all tests with effective_min-passing scores are included. This approaches the total unique test count for typical large PRs (~8000 for PR #100228).

Evaluation: 84.2%+ recall on 20-PR ground-truth set.  No regressions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants