kvserver: add tracing to TestStoreRangeSplitBackpressureWrites #160459
Conversation
tbg
left a comment
given how rare this is, I'm fine with closing this issue as part of this PR.
But let's also skip this test under duress, since we know it has a timing-based vulnerability.
@tbg made 2 comments.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @stevendanna).
pkg/kv/kvserver/client_split_test.go line 1785 at r1 (raw file):
```go
	close(blockSplits)
	t.Fatalf("delete was not blocked on split, returned err %v", err)
case <-time.After(100 * time.Millisecond):
```
So the thinking is that we hit this branch before the put/del even got anywhere, and so then we unblock the split prematurely, and get no error from the <-resCh checks below, right?
Annoying problem. We could have a callback via a testing knob that confirms that these are blocking.
stevendanna
left a comment
@stevendanna made 1 comment.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg).
pkg/kv/kvserver/client_split_test.go line 1785 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
So the thinking is that we hit this branch before the put/del even got anywhere, and so then we unblock the split prematurely, and get no error from the `<-resCh` checks below, right? Annoying problem. We could have a callback via a testing knob that confirms that these are blocking.
Exactly. I started looking at a testing knob, but none of the existing ones seemed like they would be called at the right time (TestingRequestFilter is called after we block on backpressure, and moving it feels like it would be more disruptive to other tests than it is worth). It didn't seem worth adding a knob for this single test, which is otherwise pretty stable.
In cockroachdb#160141 we observed this test fail. This test is rather stable but does depend on two goroutines sending batches to the KV server within a 100ms timeout. One suspicion is that very rarely we are too slow and this time-based coordination is not enough. Adding more sophisticated coordination would further complicate an already complicated test. Here, I add some tracing that will hopefully allow us to confirm the timing-based theory if this ever happens again.

Fixes cockroachdb#160141

Release note: None
5faead5 to a1d38b8
bors r=tbg
159819: cmd: add ghsearchdump and ghweeklysummary r=wenyihu6 a=wenyihu6

Epic: none
Release note: none

---

**ghsearchdump**

---

**cmd/ghweeklysummary: add tool for generating weekly PR summaries**

```
Example usage:

# Last 1 week
go run ./pkg/cmd/ghweeklysummary --author wenyihu6 --weeks 1

# Specific date range
go run ./pkg/cmd/ghweeklysummary --author wenyihu6 --after 2025-01-01 --before 2025-02-01

# Save to file
go run ./pkg/cmd/ghweeklysummary --author wenyihu6 > weekly_summary.md
```

Release Notes: None
Epic: None

---

**cmd: move kv dev tools under pkg/cmd/kv/**

Move ghsearchdump and ghweeklysummary under pkg/cmd/kv/ for clearer ownership. This seems okay given we already have pkg/cmd/allocsim.

```
Example usage:

❯ go run ./pkg/cmd/ghweeklysummary --author wenyihu6 --weeks 1

❯ go run ./pkg/cmd/kv/ghsearchdump \
    --search "repo:cockroachdb/cockroach author:wenyihu6" \
    --contribution-after 2025-12-12
tracking contributions from: [wenyihu6]
```

160459: kvserver: add tracing to TestStoreRangeSplitBackpressureWrites r=tbg a=stevendanna

In #160141 we observed this test fail. This test is rather stable but does depend on two goroutines sending batches to the KV server within a 100ms timeout. One suspicion is that very rarely we are too slow and this time-based coordination is not enough. Adding more sophisticated coordination would further complicate an already complicated test. Here, I add some tracing that will hopefully allow us to confirm the timing-based theory if this ever happens again.

Fixes #160141

Release note: None

160499: pgwire: handle negative scale r=drewkimball,yuzefovich a=dils2k

The node could crash if it received a decimal with a negative scale value. This commit adds a check and returns an error instead.

Epic: CRDB-57539
Fixes: #158884
Release note (bug fix): CockroachDB could previously crash when handling decimals with negative scales via the extended PGWire protocol; this has been fixed (an error is returned, similar to PG).

160526: sql: fix TestUnsplitRanges to work with external test tenants r=rafiss a=rafiss

Previously, TestUnsplitRanges was skipped in external test tenant mode because it scanned meta ranges directly and performed AdminSplit/AdminUnsplit operations that external tenants cannot do.

This commit fixes the test by:
1. Using the system layer's DB for meta range operations (scanning meta ranges, checking sticky bits, splitting ranges), since tenants cannot access meta ranges directly.
2. Using the application layer's DB for table data operations, which tenants can access.
3. Granting the CanAdminUnsplit capability to the external tenant so the GC job can unsplit ranges after dropping tables/indexes.

Resolves: #142388
Epic: CRDB-48944
Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: wenyihu6 <wenyi@cockroachlabs.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: dils2k <dils.matchanov@gmail.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Build failed (retrying...)