Skip to content

sql: trace meta2 scans made by crdb_internal.probe_ranges#81107

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
joshimhoff:tweak_crdb_internal_probe
May 12, 2022
Merged

sql: trace meta2 scans made by crdb_internal.probe_ranges#81107
craig[bot] merged 1 commit intocockroachdb:masterfrom
joshimhoff:tweak_crdb_internal_probe

Conversation

@joshimhoff
Copy link
Copy Markdown
Collaborator

sql: trace meta2 scans made by crdb_internal.probe_ranges

With this commit, CRDB traces the meta2 scan made by
crdb_internal.probe_ranges. If the scan fails, the trace is attached to the
error returned to the SQL client. This way, outages involving a down meta2
range are observable.

According to my testing, meta2 is the only system range that can make
crdb_internal.probe_ranges unavailable. As a result, with this commit, we
should get a trace of a query to the unavailable range, regardless of which
range is unavailable, either in the SQL table output of
crdb_internal.probe_ranges or in an error string.

Release note. None.


This is based on some thinking & testing I did at #79546 (comment).

With this commit, we get:

root@:26257/defaultdb> select range_id, error from crdb_internal.probe_ranges(INTERVAL '1s', 'write');
ERROR:      0.000ms      0.000ms    === operation:meta2scan _verbose:1 client:127.0.0.1:49588 node:1 user:root
     0.010ms      0.010ms        === operation:txn coordinator send _verbose:1 client:127.0.0.1:49588 node:1 txnID:be5903f5-1817-481b-8dcf-2e2bc8b0f052 user:root
     0.020ms      0.010ms            === operation:dist sender send _verbose:1 client:127.0.0.1:49588 node:1 txn:be5903f5 user:root
     0.063ms      0.043ms            event:kv/kvclient/kvcoord/range_iter.go:175 [n1,client=127.0.0.1:49588,user=root,txn=be5903f5] querying next range at /Meta2/"\x00"
     0.122ms      0.059ms            event:kv/kvclient/kvcoord/dist_sender.go:2032 [n1,client=127.0.0.1:49588,user=root,txn=be5903f5] r1: sending batch 1 Scan to (n1,s1):1
     0.132ms      0.010ms            event:rpc/nodedialer/nodedialer.go:161 [n1,client=127.0.0.1:49588,user=root,txn=be5903f5] sending request to local client
     0.136ms      0.004ms                === operation:/cockroach.roachpb.Internal/Batch _verbose:1 node:1 span.kind:server
     0.144ms      0.008ms                event:server/node.go:996 [n1] node received request: 1 Scan
     0.174ms      0.030ms                event:kv/kvserver/store_send.go:157 [n1,s1] executing Scan [/Meta2/"\x00",/Meta2/Max), [txn: be5903f5], [can-forward-ts]
     0.335ms      0.161ms                event:server/node.go:1009 [n1] error from stores.Send: boom: "sql txn" meta={id=be5903f5 key=/Min pri=0.02292157 epo=0 ts=1651868092.726275000,0 min=1651868092.726275000,0 seq=0} lock=false stat=PENDING rts=1651868092.726275000,0 wto=false gul=1651868093.226275000,0
     0.384ms      0.252ms            event:kv/kvclient/kvcoord/dist_sender.go:2144 [n1,client=127.0.0.1:49588,user=root,txn=be5903f5] boom: "sql txn" meta={id=be5903f5 key=/Min pri=0.02292157 epo=0 ts=1651868092.726275000,0 min=1651868092.726275000,0 seq=0} lock=false stat=PENDING rts=1651868092.726275000,0 wto=false gul=1651868093.226275000,0
     0.442ms      0.058ms            event:kv/kvclient/kvcoord/dist_sender.go:1674 [n1,client=127.0.0.1:49588,user=root,txn=be5903f5] reply error Scan [/Meta2/"\x00",/Meta2/Max), [txn: be5903f5], [can-forward-ts]: boom: "sql txn" meta={id=be5903f5 key=/Min pri=0.02292157 epo=0 ts=1651868092.726275000,0 min=1651868092.726275000,0 seq=0} lock=false stat=PENDING rts=1651868092.726275000,0 wto=false gul=1651868093.226275000,0
: boom

If run a test like this:

~/g/s/g/c/cockroach [tweak_crdb_internal_probe] $ git diff
diff --git a/pkg/kv/kvserver/replica_send.go b/pkg/kv/kvserver/replica_send.go
index 3295886e91..ff7a1ed4df 100644
--- a/pkg/kv/kvserver/replica_send.go
+++ b/pkg/kv/kvserver/replica_send.go
@@ -12,7 +12,10 @@ package kvserver
+var sigCh = func() chan os.Signal {
+       ch := make(chan os.Signal, 1)
+       signal.Notify(ch, syscall.SIGHUP)
+       return ch
+}()
+
+var weDead atomic.Int32
+
 // sendWithoutRangeID used to be called sendWithRangeID, accepted a `_forStacks
 // roachpb.RangeID` argument, and had the description below. Ever since Go
 // switched to the register-based calling convention though, this stopped
@@ -126,6 +138,16 @@ func (r *Replica) sendWithoutRangeID(
 ) (_ *roachpb.BatchResponse, rErr *roachpb.Error) {
        var br *roachpb.BatchResponse
 
+       select {
+       case <-sigCh:
+               log.Warningf(ctx, "time to drop incoming raft messages for range")
+               weDead.Store(1)
+       default:
+       }
+       if weDead.Load() == 1 && ba.RangeID >= 1 {
+               return nil, roachpb.NewError(errors.New("boom"))
+       }
+
        if r.leaseholderStats != nil && ba.Header.GatewayNodeID != 0 {
                r.leaseholderStats.RecordCount(r.getBatchRequestQPS(ctx, ba), ba.Header.GatewayNodeID)
        }

Without this commit, we just get:

root@:26257/?> select range_id, error from crdb_internal.probe_ranges(INTERVAL '1s', 'write');
ERROR: boom

@joshimhoff joshimhoff requested review from a team, Santamaura and tbg May 6, 2022 20:17
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

// Trace the query to meta2. Return it as part of the error string if the query fails.
// This improves observability into a meta2 outage. We expect crdb_internal.probe_range
// to be available, unless meta2 is down.
meta2Ctx, sp := tracing.EnsureChildSpan(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preference to rewrite as follows, that way there is only ever one ctx in scope and it's the right one:

var ranges []WhateverTypeThisIs
{
  ctx, sp := tracing.EnsureChildSpan(...)
  // ...
  var err error
  ranges, err = kvclient.ScanMetaKVs(meta2Ctx, ...)
  if err != nil { ... }
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@joshimhoff joshimhoff force-pushed the tweak_crdb_internal_probe branch from 5579cdc to f6a55bb Compare May 12, 2022 13:38
@joshimhoff
Copy link
Copy Markdown
Collaborator Author

TTFR!

@joshimhoff
Copy link
Copy Markdown
Collaborator Author

bors r+

With this commit, CRDB traces the meta2 scan made by
crdb_internal.probe_ranges. If the scan fails, the trace is attached to the
error returned to the SQL client. This way, outages involving a down meta2
range are observable.

According to my testing, meta2 is the only system range that can make
crdb_internal.probe_ranges unavailable. As a result, with this commit, we
should get a trace of a query to the unavailable range, regardless of which
range is unavailable, either in the SQL table output of
crdb_internal.probe_ranges or in an error string.

Release note. None.
@joshimhoff joshimhoff force-pushed the tweak_crdb_internal_probe branch from f6a55bb to fec45b8 Compare May 12, 2022 14:18
@craig
Copy link
Copy Markdown
Contributor

craig bot commented May 12, 2022

Canceled.

@joshimhoff
Copy link
Copy Markdown
Collaborator Author

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented May 12, 2022

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants