rpc: improve detection of onlyOnceDialer redials#96781
Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom Feb 8, 2023
Merged
rpc: improve detection of onlyOnceDialer redials#96781craig[bot] merged 1 commit intocockroachdb:masterfrom
craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
While looking into cockroachdb#96543, I wasn't 100% sure we weren't accidentally redialing a connection internally. This improved logging and the test makes it more obvious that things are working as intended. Touches cockroachdb#96543. Epic: none Release note: None
Member
erikgrinaker
approved these changes
Feb 8, 2023
tbg
added a commit
to tbg/cockroach
that referenced
this pull request
Feb 8, 2023
Chanced upon this failure mode in unrelated PR cockroachdb#96781. Epic: none Release note: None
Member
Author
|
bors r=erikgrinaker |
Contributor
|
Build succeeded: |
craig bot
pushed a commit
that referenced
this pull request
Feb 8, 2023
89752: jobs/cdc: add metrics for paused jobs r=miretskiy a=jayshrivastava This change adds new metrics to count paused jobs for every job type. For example, the metric for paused changefeed jobs is `jobs.changefeed.currently_paused`. These metrics are counted at an interval defined by the cluster setting `jobs.metrics.interval.poll`. This is implemented by a job which periodically queries `system.jobs` to count the number of paused jobs. This job is of the newly added type `jobspb.TypePollJobsStats`. When a node starts it's job registry, it will create an adoptable stats polling job if it does not exist already using a transaction. This change adds a test which pauses and resumes changefeeds while asserting the value of the `jobs.changefeed.currently_paused` metric. It also adds a logictest to ensure one instance of the stats polling job is created in a cluster. Resolves: #85467 Release note (general change): This change adds new metrics to count paused jobs for every job type. For example, the metric for paused changefeed jobs is `jobs.changefeed.currently_paused`. These metrics are updated at an interval defined by the cluster setting `jobs.metrics.interval.poll`, which is defauled to 10 seconds. Epic: None 94633: kvserver: document reproposals r=nvanbenschoten a=tbg Reproposals are a deep rabbit hole and an area in which past changes were all related to subtle bugs. Write it all up and in particular make some simplifications that ought to be possible if my understanding is correct: - have proposals always enter `(*Replica).propose` without a MaxLeaseIndex or prior encoded command set, i.e. `propose` behaves the same for reproposals as for first proposals. - assert that after a failed call to tryReproposeWithNewLeaseIndex, the command is not in the proposals map, i.e. check absence of a leak. - replace code that should be impossible to reach (and had me confused for a good amount of time) with an assertion. - add long comment on `r.mu.proposals`. This commit also moves `tryReproposeWithNewLeaseIndex` off `(*Replica)`, which is possible due to recent changes[^1]. In doing so, I realized there was a (small) data race (now fixed): when returning a `NotLeaseholderError` from that method, we weren't acquiring `r.mu`. It may have looked as though we were holding it already since we're accessing `r.mu.propBuf`, however that field has special semantics - it wraps `r.mu` and acquires it when needed. [^1]: The "below raft" test mentioned in the previous comment was changed in #93785 and no longer causes a false positive. Epic: CRDB-220 Release note: None 96650: kvserver: extract kvstorage.DestroyReplica r=pavelkalinnikov a=tbg This series of commits extracts `(*Replica).preDestroyRaftMuLocked` into a free-standing method `kvstorage.DestroyReplica`. In doing so, it separates the in-memory and on-disk steps that are a part of replica removal, and makes the on-disk steps unit testable. Touches #93241. Epic: CRDB-220 Release note: None 96659: sql: wrap stacktraceless errors with errors.Wrap r=andreimatei a=ecwall Fixes #95794 This replaces the previous attempt to add logging here #95797. The context itself cannot be augmented to add a stack trace to errors because it interferes with grpc timeout logic - gRPC compares errors directly without checking causes https://github.com/grpc/grpc-go/blob/v1.46.0/rpc_util.go#L833. Although the method signature allows it, `Context.Err()` should not be overriden to customize the error: ``` // If Done is not yet closed, Err returns nil. // If Done is closed, Err returns a non-nil error explaining why: // Canceled if the context was canceled // or DeadlineExceeded if the context's deadline passed. // After Err returns a non-nil error, successive calls to Err return the same error. Err() error ``` Additionally, a child context of the augmented context may end up being used which will circumvent the stack trace capture. This change instead wraps `errors.Wrap` in a few places that might end up helping debug the original problem: 1) Where we call `Context.Err()` directly. 2) Where gRPC returns an error after possibly calling `Context.Err()` internally or returns an error that does not have a stack trace. Release note: None 96770: storage: don't modify the given cfg.Opts r=RaduBerinde a=RaduBerinde This change improves the `NewPebble` code to not modify the given `cfg.Opts`. Such behavior is surprising and can trip up tests that reuse the same config. Also, `ResolveEncryptedEnvOptions` and `wrapFilesystemMiddleware` no longer modify the `Options` directly; and `CheckNoRegistryFile` is now a standalone function. Release note: None Epic: none 96793: kvserver: de-flake TestReplicaProbeRequest r=pavelkalinnikov a=tbg Chanced upon this failure mode in unrelated PR #96781. Epic: none Release note: None Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com> Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com> Co-authored-by: Evan Wall <wall@cockroachlabs.com> Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
katmayb
pushed a commit
to katmayb/cockroach
that referenced
this pull request
Feb 9, 2023
96768: server: remove stale comment r=andreimatei a=andreimatei Release note: None Epic: None 96781: rpc: improve detection of onlyOnceDialer redials r=erikgrinaker a=tbg While looking into cockroachdb#96543, I wasn't 100% sure we weren't accidentally redialing a connection internally. This improved logging and the test makes it more obvious that things are working as intended. Touches cockroachdb#96543. Epic: none Release note: None 96782: roachtest: verbose gRPC logging for admission-control/tpcc-olap r=irfansharif a=tbg Closes cockroachdb#96543. Next time we'll know more. Epic: none Release note: None Co-authored-by: Andrei Matei <andrei@cockroachlabs.com> Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
While looking into #96543, I wasn't 100% sure we weren't accidentally
redialing a connection internally. This improved logging and the test
makes it more obvious that things are working as intended.
Touches #96543.
Epic: none
Release note: None