sql: improve stack trace for get-user-timeout timeouts#95797
sql: improve stack trace for get-user-timeout timeouts#95797craig[bot] merged 1 commit intocockroachdb:masterfrom ecwall:sql_get-user-timeout_stacktrace
Conversation
|
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
ecwall
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @knz, and @rafiss)
pkg/util/contextutil/context.go line 89 at r1 (raw file):
err := ctx.Context.Err() if errors.Is(err, context.DeadlineExceeded) || errors.Is(err, context.Canceled) { return errors.WithStack(err)
The stack trace can change depending on when the timeout occurs (ctx.Err() returns non-nil), but it will have additional information like this:
internal error while retrieving user account: context deadline exceeded
(1) attached stack trace
-- stack trace:
| github.com/cockroachdb/cockroach/pkg/sql.retrieveSessionInitInfoWithCache
| /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:242
| github.com/cockroachdb/cockroach/pkg/sql.GetUserSessionInitInfo.func2
| /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:123
| github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout
| /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:104
| github.com/cockroachdb/cockroach/pkg/sql.getUserInfoRunFn.func2
| /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:196
| github.com/cockroachdb/cockroach/pkg/sql.GetUserSessionInitInfo
| /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:120
| github.com/cockroachdb/cockroach/pkg/server.(*authenticationServer).verifyPasswordDBConsole
| /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/server/authentication.go:425
| github.com/cockroachdb/cockroach/pkg/server.TestVerifyPasswordDBConsole.func1
| /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/server/authentication_test.go:278
| testing.tRunner
| /Users/ewall/go/go1.17.13/src/testing/testing.go:1259
| runtime.goexit
| /Users/ewall/go/go1.17.13/src/runtime/asm_amd64.s:1581
Wraps: (2) internal error while retrieving user account
Wraps: (3) context deadline exceeded
| -- cause hidden behind barrier
| context deadline exceeded
| (1) attached stack trace
| -- stack trace:
| | github.com/cockroachdb/cockroach/pkg/util/contextutil.(*ctxWithStacktrace).Err
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:89
| | github.com/cockroachdb/cockroach/pkg/sql/sessioninit.(*Cache).loadValueOutsideOfCache
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/sessioninit/cache.go:226
| | github.com/cockroachdb/cockroach/pkg/sql/sessioninit.(*Cache).GetAuthInfo
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/sessioninit/cache.go:163
| | github.com/cockroachdb/cockroach/pkg/sql.retrieveSessionInitInfoWithCache.func1
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:210
| | github.com/cockroachdb/cockroach/pkg/sql.retrieveSessionInitInfoWithCache
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:237
| | github.com/cockroachdb/cockroach/pkg/sql.GetUserSessionInitInfo.func2
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:123
| | github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:104
| | github.com/cockroachdb/cockroach/pkg/sql.getUserInfoRunFn.func2
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:196
| | github.com/cockroachdb/cockroach/pkg/sql.GetUserSessionInitInfo
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/user.go:120
| | github.com/cockroachdb/cockroach/pkg/server.(*authenticationServer).verifyPasswordDBConsole
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/server/authentication.go:425
| | github.com/cockroachdb/cockroach/pkg/server.TestVerifyPasswordDBConsole.func1
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/server/authentication_test.go:278
| | testing.tRunner
| | /Users/ewall/go/go1.17.13/src/testing/testing.go:1259
| | runtime.goexit
| | /Users/ewall/go/go1.17.13/src/runtime/asm_amd64.s:1581
| Wraps: (2) context deadline exceeded
| Error types: (1) *withstack.withStack (2) context.deadlineExceededError
Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *barriers.barrierErr
which contains these additional lines under Wraps: (3) context deadline exceeded:
| | github.com/cockroachdb/cockroach/pkg/util/contextutil.(*ctxWithStacktrace).Err
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:89
| | github.com/cockroachdb/cockroach/pkg/sql/sessioninit.(*Cache).loadValueOutsideOfCache
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/sessioninit/cache.go:226
| | github.com/cockroachdb/cockroach/pkg/sql/sessioninit.(*Cache).GetAuthInfo
| | /Users/ewall/go/src/github.com/cockroachdb/cockroach/pkg/sql/sessioninit/cache.go:163
| | github.com/cockroachdb/cockroach/pkg/sql.retrieveSessionInitInfoWithCache.func1
yuzefovich
left a comment
There was a problem hiding this comment.
Looks good to me, but I'll defer to @knz for approval.
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @ecwall, @jordanlewis, @knz, and @rafiss)
-- commits line 24 at r1:
nit: I'd probably not include any release note for this given it's such a low level detail only CRDB engineers will look at.
knz
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ecwall, @jordanlewis, and @rafiss)
pkg/util/contextutil/context.go line 86 at r1 (raw file):
} func (ctx *ctxWithStacktrace) Err() error {
This design is elegant. I like it. But I don't understand the complexity here -- why did you not simplify down to
func (ctx *ctxWithStacktrace) Err() error {
return errors.WithStack(ctx.Context.Err())
}
knz
left a comment
There was a problem hiding this comment.
May I also suggest an errors.WithStack at the end of runTxn() in pkg/kv/db.go, and another at the end of sendAndFill().
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ecwall, @jordanlewis, and @rafiss)
ecwall
left a comment
There was a problem hiding this comment.
Added to those places also.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @knz, @rafiss, and @yuzefovich)
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: I'd probably not include any release note for this given it's such a low level detail only CRDB engineers will look at.
Fixed
pkg/util/contextutil/context.go line 86 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
This design is elegant. I like it. But I don't understand the complexity here -- why did you not simplify down to
func (ctx *ctxWithStacktrace) Err() error { return errors.WithStack(ctx.Context.Err()) }
Yeah there is not really a reason to make it more complex so I simplified it to what you suggested.
ecwall
left a comment
There was a problem hiding this comment.
It looks like some tests started failing so I'm taking those out to investigate.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @knz, @rafiss, and @yuzefovich)
Fixes #95794 The cause of the `get-user-timeout errors` is unknown. Part of the problem is that the stack trace gets cut off at ``` | | github.com/cockroachdb/cockroach/pkg/sql.retrieveSessionInitInfoWithCache | | github.com/cockroachdb/cockroach/pkg/sql/user.go:238 ``` which does not explain what is actually being blocked. The reason that the stack trace is cut off is that the timeout is initiated by `contextutil.RunWithTimeout` which results in a "simple" (no stack trace) `context.DeadlineExceeded` error. `retrieveSessionInitInfoWithCache` is the first line in the stack trace because it calls `errors.Wrap` on `context.DeadlineExceeded`. To get a fuller stack trace, `context.DeadlineExceeded` must be wrapped immediately (`errors.Wrap` or `errors.WithStack`) before it bubbles up. Release note: None
|
bors r=knz |
|
Build failed (retrying...): |
|
Build succeeded: |
|
This seems to have caused an issue with gRPC error propagation: #96167 |
89752: jobs/cdc: add metrics for paused jobs r=miretskiy a=jayshrivastava This change adds new metrics to count paused jobs for every job type. For example, the metric for paused changefeed jobs is `jobs.changefeed.currently_paused`. These metrics are counted at an interval defined by the cluster setting `jobs.metrics.interval.poll`. This is implemented by a job which periodically queries `system.jobs` to count the number of paused jobs. This job is of the newly added type `jobspb.TypePollJobsStats`. When a node starts it's job registry, it will create an adoptable stats polling job if it does not exist already using a transaction. This change adds a test which pauses and resumes changefeeds while asserting the value of the `jobs.changefeed.currently_paused` metric. It also adds a logictest to ensure one instance of the stats polling job is created in a cluster. Resolves: #85467 Release note (general change): This change adds new metrics to count paused jobs for every job type. For example, the metric for paused changefeed jobs is `jobs.changefeed.currently_paused`. These metrics are updated at an interval defined by the cluster setting `jobs.metrics.interval.poll`, which is defauled to 10 seconds. Epic: None 94633: kvserver: document reproposals r=nvanbenschoten a=tbg Reproposals are a deep rabbit hole and an area in which past changes were all related to subtle bugs. Write it all up and in particular make some simplifications that ought to be possible if my understanding is correct: - have proposals always enter `(*Replica).propose` without a MaxLeaseIndex or prior encoded command set, i.e. `propose` behaves the same for reproposals as for first proposals. - assert that after a failed call to tryReproposeWithNewLeaseIndex, the command is not in the proposals map, i.e. check absence of a leak. - replace code that should be impossible to reach (and had me confused for a good amount of time) with an assertion. - add long comment on `r.mu.proposals`. This commit also moves `tryReproposeWithNewLeaseIndex` off `(*Replica)`, which is possible due to recent changes[^1]. In doing so, I realized there was a (small) data race (now fixed): when returning a `NotLeaseholderError` from that method, we weren't acquiring `r.mu`. It may have looked as though we were holding it already since we're accessing `r.mu.propBuf`, however that field has special semantics - it wraps `r.mu` and acquires it when needed. [^1]: The "below raft" test mentioned in the previous comment was changed in #93785 and no longer causes a false positive. Epic: CRDB-220 Release note: None 96650: kvserver: extract kvstorage.DestroyReplica r=pavelkalinnikov a=tbg This series of commits extracts `(*Replica).preDestroyRaftMuLocked` into a free-standing method `kvstorage.DestroyReplica`. In doing so, it separates the in-memory and on-disk steps that are a part of replica removal, and makes the on-disk steps unit testable. Touches #93241. Epic: CRDB-220 Release note: None 96659: sql: wrap stacktraceless errors with errors.Wrap r=andreimatei a=ecwall Fixes #95794 This replaces the previous attempt to add logging here #95797. The context itself cannot be augmented to add a stack trace to errors because it interferes with grpc timeout logic - gRPC compares errors directly without checking causes https://github.com/grpc/grpc-go/blob/v1.46.0/rpc_util.go#L833. Although the method signature allows it, `Context.Err()` should not be overriden to customize the error: ``` // If Done is not yet closed, Err returns nil. // If Done is closed, Err returns a non-nil error explaining why: // Canceled if the context was canceled // or DeadlineExceeded if the context's deadline passed. // After Err returns a non-nil error, successive calls to Err return the same error. Err() error ``` Additionally, a child context of the augmented context may end up being used which will circumvent the stack trace capture. This change instead wraps `errors.Wrap` in a few places that might end up helping debug the original problem: 1) Where we call `Context.Err()` directly. 2) Where gRPC returns an error after possibly calling `Context.Err()` internally or returns an error that does not have a stack trace. Release note: None 96770: storage: don't modify the given cfg.Opts r=RaduBerinde a=RaduBerinde This change improves the `NewPebble` code to not modify the given `cfg.Opts`. Such behavior is surprising and can trip up tests that reuse the same config. Also, `ResolveEncryptedEnvOptions` and `wrapFilesystemMiddleware` no longer modify the `Options` directly; and `CheckNoRegistryFile` is now a standalone function. Release note: None Epic: none 96793: kvserver: de-flake TestReplicaProbeRequest r=pavelkalinnikov a=tbg Chanced upon this failure mode in unrelated PR #96781. Epic: none Release note: None Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com> Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com> Co-authored-by: Evan Wall <wall@cockroachlabs.com> Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
Fixes #95794
The cause of the
get-user-timeouterrors is unknown. Part of the problem is that the stack trace gets cut off atwhich does not explain what is actually being blocked.
The reason that the stack trace is cut off is that the timeout is initiated by
contextutil.RunWithTimeoutwhich results in a "simple" (no stack trace)context.DeadlineExceedederror.retrieveSessionInitInfoWithCacheis the first line in the stack trace because it callserrors.Wraponcontext.DeadlineExceeded.To get a fuller stack trace,
context.DeadlineExceededmust be wrapped immediately (errors.Wraporerrors.WithStack) before it bubbles up.Release note: None