Skip to content

kvserver: add logging to TestReadLoadMetricAccounting#141599

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
tbg:TestReadLoadMetricAccounting
Feb 20, 2025
Merged

kvserver: add logging to TestReadLoadMetricAccounting#141599
craig[bot] merged 1 commit intocockroachdb:masterfrom
tbg:TestReadLoadMetricAccounting

Conversation

@tbg
Copy link
Copy Markdown
Member

@tbg tbg commented Feb 17, 2025

The test failed in #141586. It looks like a write snuck in,
but it's hard to prove or reproduce.

Add logging to the test that tracks all commands evaluated
on that range, so if it fails again, the failure is likely
actionable.

Closes #141586.

Epic: none
Release note: None

The test failed in cockroachdb#141586. It looks like a write snuck in,
but it's hard to prove or reproduce.

Add logging to the test that tracks all commands evaluated
on that range, so if it fails again, the failure is likely
actionable.

Closes cockroachdb#141586.

Epic: none
Release note: None
@tbg tbg requested a review from a team as a code owner February 17, 2025 16:43
@tbg tbg requested a review from kvoli February 17, 2025 16:43
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Contributor

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @tbg)

@tbg
Copy link
Copy Markdown
Member Author

tbg commented Feb 20, 2025

TFTR!

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Feb 20, 2025

@craig craig bot merged commit a260378 into cockroachdb:master Feb 20, 2025
23 of 24 checks passed
craig bot pushed a commit that referenced this pull request Dec 19, 2025
159877: kvserver: deflake TestReadLoadMetricAccounting r=tbg a=tbg

`TestReadLoadMetricAccounting` has a history of flaking due to lease-related writes interfering with load metric measurements.

Issue #141716 (and #141586) identified the same failure signature:
```
Error: Max difference between 0 and 85 allowed is 4, but difference was -85
```

The root cause was identified by `@pav-kv:` an "unexpected" leader lease upgrade write was interfering with the test's write bytes measurements. PR #141843 added `tc.MaybeWaitForLeaseUpgrade()` to wait for lease upgrades before starting measurements.

**The fix from #141843 IS present** in the failing SHA. However, the test still flaked with the same error signature (85 write bytes when expecting 0).

The logs show:
1. AddSSTableRequest evaluated (test setup)
2. Many LeaseInfoRequest polls (from MaybeWaitForLeaseUpgrade)
3. RequestLeaseRequest (the lease upgrade write)
4. More LeaseInfoRequest polls
5. "lease is now of type: LeaseLeader" - **upgrade complete**
6. "test #1" - test loop begins
7. GetRequest evaluated (the actual test request)
8. **Assertion fails** - 85 write bytes observed

The race condition is subtle: `MaybeWaitForLeaseUpgrade()` waits until `FindRangeLeaseEx()` reports the lease is upgraded, but it does **not** guarantee that the write bytes have been recorded to load stats. This is because stats are recorded "awkwardly late" on the client goroutine (`SendWithWriteBytes`).

The fix:
1. Wraps each test case iteration in `SucceedsSoon`
2. Resets load stats, sends the request, checks results
3. If any stat doesn't match (due to background activity like lease upgrades), returns an error to trigger retry
4. Adds a comment noting that test cases must be idempotent (they are—all reads)

## Related Issues/PRs

| Issue/PR | Status | Relevance |
|----------|--------|-----------|
| #159719 | OPEN | Current failure |
| #141716 | CLOSED | Duplicate, Feb 2025 |
| #141586 | CLOSED | Original issue, Feb 2025 |
| #141843 | MERGED | Deflake attempt (wait for lease upgrade) |
| #141599 | MERGED | Added logging to help debug |
| #141905 | CLOSED | Duplicate |
| #134799 | CLOSED | Older occurrence |

This is more robust than trying to synchronize with specific background operations because it handles **any** source of interference, not just lease upgrades.

Epic: none
Closes #159719.


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kv/kvserver: TestReadLoadMetricAccounting failed

3 participants