kv: convert uni-directional network partitions to bi-directional #94778
Conversation
pkg/rpc/context.go
Outdated
```go
// We want this connection to block on connection so we verify it
// succeeds.
dialOpts = append(dialOpts, grpc.WithBlock())
conn, err := grpc.DialContext(dialCtx, target, dialOpts...)
```
As we discussed elsewhere, we should go via grpcDialNodeInternal() such that we'll retain this connection for future use and join any in-flight connection attempts.
For an initial inbound connection, this will naïvely result in a circular dependency since we'll be attempting to ping each other in either direction, both waiting on the other's ping response. However, for the dialback connection attempt we don't have to wait for the ping response itself, we can simply wait for the underlying gRPC connection to be established asynchronously, e.g. via ClientConn.WaitForStateChange.
We may want to limit this to only initial ping attempts, such that we later verify that we're actually receiving pings in both directions too, but this is a lesser concern.
pkg/server/server.go
Outdated
```go
// This is somewhat tangled due to the need to use gossip to resolve the
// address. If we are not able to resolve, we simply skip this verification.
rpcContext.PingResolver = func(ctx context.Context, nodeID roachpb.NodeID) string {
```
We can avoid this tangle by not passing rpcContext in to gossip.New. Gossip only uses the RPC context once it starts making outbound connections, and that only happens once we call Gossip.Start() far below here. The cleanest approach is probably to remove the Gossip.rpcContext field entirely, and thread the RPC context via Gossip.Start() down into bootstrap() and manage(), but alternatively we can set the rpcContext field during Gossip.Start() as long as we properly synchronize access to it. This should preferably be done as a separate commit.
When we do that, we can set up gossip before the RPC context and just pass a nodedialer.AddressResolver to rpc.NewContext() (if we get away with it without any dependency cycles, alternatively use some other corresponding type).
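To illustrate the shape of this proposal (all names below are hypothetical stand-ins, not the actual `nodedialer.AddressResolver` signature), passing a resolver closure lets the RPC context defer the gossip lookup until it actually needs an address, which is only after `Gossip.Start()`:

```go
package main

import (
	"errors"
	"fmt"
)

// resolver mirrors the shape of an AddressResolver-style closure. The
// closure captures the lookup source lazily: nothing is resolved at
// construction time, only when a dial actually happens.
type resolver func(nodeID int) (string, error)

func makeResolver(lookup func(int) (string, bool)) resolver {
	return func(nodeID int) (string, error) {
		addr, ok := lookup(nodeID)
		if !ok {
			// If we cannot resolve, the caller simply skips verification.
			return "", errors.New("address not resolved")
		}
		return addr, nil
	}
}

func main() {
	addrs := map[int]string{1: "10.0.0.1:26257"}
	r := makeResolver(func(id int) (string, bool) { a, ok := addrs[id]; return a, ok })
	addr, err := r(1)
	fmt.Println(addr, err == nil) // resolved
	_, err = r(2)
	fmt.Println(err != nil) // unresolved: verification would be skipped
}
```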
pkg/rpc/context.go
Outdated
```go
func (rpcCtx *Context) VerifyDialback(ctx context.Context, nodeID roachpb.NodeID) error {
	log.Errorf(ctx, "Verifying health of connection to n%d", nodeID)
	if nodeID == 0 {
		// FIXME: at startup, the nodeID might not be set. Unfortunately
```
This only happens when bootstrapping a new node, right? Once the node has joined a cluster, this will always be set even on the initial connection attempts? If so, this seems totally fine, since the node can't participate in consensus or acquire any leases before it's allocated a node ID.
pkg/rpc/heartbeat.go
Outdated
```go
// established. If there is already a healthy connection set up, it will
// simply return immediately, however if not, it attempts to establish a
// "dummy" connection which is never used to send messages on.
verifyDialback func(context.Context, roachpb.NodeID) error
```
We already have onHandlePing which effectively does the same thing, should we hook into that instead? We can't go via ContextOptions.OnIncomingPing because that's necessarily constructed before the RPC context itself, but we can inject a handler during RPC context construction before setting onHandlePing. Not particularly important, and I'm not sure if it's really making things more clear or less, so take it or leave it.
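A minimal sketch of the handler-injection idea (names here are illustrative, not the real `ContextOptions` API): during RPC context construction, wrap whatever `OnIncomingPing`-style hook the caller configured with the dialback check, so both run on each ping.

```go
package main

import "fmt"

// pingHandler mirrors the shape of an onHandlePing-style hook.
type pingHandler func(ping string) error

// chain wraps an existing handler with an injected check that runs
// first; the injected check can reject the ping before the caller's
// handler ever sees it.
func chain(existing, injected pingHandler) pingHandler {
	return func(ping string) error {
		if err := injected(ping); err != nil {
			return err
		}
		if existing != nil {
			return existing(ping)
		}
		return nil
	}
}

func main() {
	var order []string
	h := chain(
		func(string) error { order = append(order, "user"); return nil },
		func(string) error { order = append(order, "dialback"); return nil },
	)
	_ = h("ping")
	fmt.Println(order) // [dialback user]
}
```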
erikgrinaker
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)
pkg/rpc/context.go line 2593 at r3 (raw file):
but some methods like ConnHealthTryDial is really broken today. If there is not already a connection, it will establish a new connection, but then call Health on it immediately which is basically guaranteed to return ErrNotHeartbeated
Right. I originally implemented ConnHealthTryDial back in #70017, carrying over the old behavior of ConnHealth for backwards compatibility. I think the immediate motivation here was that if ConnHealthTryDial doesn't kick off a dial attempt then it's possible that no one else will try to dial the node either, and so DistSQL (the only caller) may never be able to reach the remote node and always fall back to local processing. But it may not be better for ConnHealthTryDial to block until connected, since that would increase latencies in the face of network outages (it would have to wait for a timeout instead of scheduling it elsewhere), but it's possible that the RPC circuit breakers will take care of this -- in any case, it really shouldn't be DistSQL's responsibility to maintain RPC connections in the first place.
pkg/rpc/context.go line 2589 at r6 (raw file):
As mentioned above:
We shouldn't block on it, we should accept the ping until we have a dial result. Otherwise we may as well just always use blocking.
These delays will [...] count towards the ping timeout budget (which is not as high as it seems since pings are subject to head of line blocking and can already have been sitting around in an RPC buffer for 4 seconds before we get it).
In fact, now that we use GRPCDialNode for the non-blocking return dial, we may need 6 network roundtrips for this to succeed (one for the outbound connection and one for the blocking return connection).
I think this basically means that we should check ConnHealth(), and allow the ping on ErrNotHeartbeated but reject it on any other error (with the assumption that the setup and heartbeat will eventually succeed or error out).
pkg/rpc/context.go line 2594 at r6 (raw file):
```go
// Check in our regular connection map to see if we are healthy.
err = rpcCtx.ConnHealth(target, nodeID, DefaultClass)
```
Shouldn't this use the same connection class as the incoming connection?
Likely addresses #54762 as well. The dialback will fail due to the reverse ping not working. This may have also been addressed by other changes, but I will test and mark this as fixed once this PR merges.
andrewbaptist
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/rpc/context.go line 2557 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Yeah no, I agree, it makes sense to verify it here.
sounds good!
pkg/rpc/context.go line 2593 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
but some methods like ConnHealthTryDial is really broken today. If there is not already a connection, it will establish a new connection, but then call Health on it immediately which is basically guaranteed to return ErrNotHeartbeated
Right. I originally implemented `ConnHealthTryDial` back in #70017, carrying over the old behavior of `ConnHealth` for backwards compatibility. I think the immediate motivation here was that if `ConnHealthTryDial` doesn't kick off a dial attempt then it's possible that no one else will try to dial the node either, and so DistSQL (the only caller) may never be able to reach the remote node and always fall back to local processing. But it may not be better for `ConnHealthTryDial` to block until connected, since that would increase latencies in the face of network outages (it would have to wait for a timeout instead of scheduling it elsewhere), but it's possible that the RPC circuit breakers will take care of this -- in any case, it really shouldn't be DistSQL's responsibility to maintain RPC connections in the first place.
Makes sense - as we discussed, we don't want to change this behavior as part of this PR, but at a minimum, this method is poorly named. I added comments to avoid confusion for the next person reading this and will make a separate PR to refactor this code.
pkg/rpc/context.go line 2589 at r6 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
As mentioned above:
We shouldn't block on it, we should accept the ping until we have a dial result. Otherwise we may as well just always use blocking.
These delays will [...] count towards the ping timeout budget (which is not as high as it seems since pings are subject to head of line blocking and can already have been sitting around in an RPC buffer for 4 seconds before we get it).
In fact, now that we use `GRPCDialNode` for the non-blocking return dial, we may need 6 network roundtrips for this to succeed (one for the outbound connection and one for the blocking return connection).

I think this basically means that we should check `ConnHealth()`, and allow the ping on `ErrNotHeartbeated` but reject it on any other error (with the assumption that the setup and heartbeat will eventually succeed or error out).
Assume we have healthy connections from A -> B and B -> A. If the reverse connection (B->A) gets broken, the following sequence happens.
- T0: A -> B - send Ping
- T0 + RTT/2 - B -> A - send successful Pong, but initiate GRPCDialNode
- T0 + RTT/2 + 2RTT - A (success case) - receives ping and responds immediately since it has a healthy connection. (the 2RTT is to handle all the network setup).
- T0 + 3*RTT - B is aware that the connection is established.
- T0 + ?? - A (fail case) - connection attempt fails/times out. This
- T0 + 3s - A sleep delay between pings (currently 3 seconds)
- T0 + 3s - A -> B - send next Ping
- T0 + 3s + RTT/2 - B -> A - checks the status of the connection attempt from 3 sec ago - if successful, it continues; if it failed, it sends an err Pong.
So the assumption is that the reverse connection can be set up in less than one PingInterval. I understand we want to lower it further, but it will always have to be at least 3x the network RTT, so I don't see this as a problem as long as that assumption holds. I can't imagine we want to set it lower than that, as we wouldn't be able to deal with much latency. So assuming we support up to 500ms RTT, this allows setting PingTime down to 1.5s.
Let's discuss in more detail. If you are OK with this I will add some more comments to that code to make it clear why this dialback is fine.
pkg/rpc/context.go line 2594 at r6 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Shouldn't this use the same connection class as the incoming connection?
I had considered this, but I'm not exactly sure if this is right. The connection classes are just used for "separating" traffic, not for prioritization as there is no mechanism for doing this.
If we did this it would have two implications. First, it would require passing this over the PingRequest (or at least pass it in the case where there is a DialBack type set). But second, the Default class is the most likely to "stay connected" since it is used for most traffic, so it is the most likely to be an accurate indicator of the health of the reverse connection.
Let's discuss this a little more.
pkg/rpc/heartbeat.proto line 74 at r3 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
I left this for now since we are treating NONE as UNKNOWN. I need to do some compatibility testing on this.
This works fine with mixed-version clusters and CLIs from different versions. I will do one more pass to make sure I didn't miss anything, but it seems best to not add a new flag.
andrewbaptist
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/rpc/context.go line 2589 at r6 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
Assume we have healthy connections from A -> B and B -> A. If the reverse connection (B->A) gets broken, the following sequence happens.
- T0: A -> B - send Ping
- T0 + RTT/2 - B -> A - send successful Pong, but initiate GRPCDialNode
- T0 + RTT/2 + 2RTT - A (success case) - receives ping and responds immediately since it has a healthy connection. (the 2RTT is to handle all the network setup).
- T0 + 3*RTT - B is aware that the connection is established.
- T0 + ?? - A (fail case) - connection attempt fails/times out. This
- T0 + 3s - A sleep delay between pings (currently 3 seconds)
- T0 + 3s - A -> B - send next Ping
- T0 + 3s + RTT/2 - B -> A - checks the status of the connection attempt from 3 sec ago - if successful, it continues; if it failed, it sends an err Pong.
So the assumption is that the reverse connection can be set up in less than one PingInterval. I understand we want to lower it further, but it will always have to be at least 3x the network RTT, so I don't see this as a problem as long as that assumption holds. I can't imagine we want to set it lower than that, as we wouldn't be able to deal with much latency. So assuming we support up to 500ms RTT, this allows setting PingTime down to 1.5s.
Let's discuss in more detail. If you are OK with this I will add some more comments to that code to make it clear why this dialback is fine.
Changed to wait until the dialback connection is completed before the ping is verified, rather than verifying it one ping later.
pkg/rpc/context.go line 2594 at r6 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
I had considered this, but I'm not exactly sure if this is right. The connection classes are just used for "separating" traffic, not for prioritization as there is no mechanism for doing this.
If we did this it would have two implications. First, it would require passing this over the PingRequest (or at least pass it in the case where there is a DialBack type set). But second, the Default class is the most likely to "stay connected" since it is used for most traffic, so it is the most likely to be an accurate indicator of the health of the reverse connection.
Let's discuss this a little more.
Updated to always use the system class and added comments around it.
erikgrinaker
left a comment
Thanks for getting this across the line, I appreciate it was a longer slog than expected.
We should mark this as resolving #84289, and put the GA blocker label on that issue (labels on PRs don't count).
Reviewed 1 of 4 files at r5, 10 of 10 files at r14, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)
pkg/rpc/context.go line 2589 at r6 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
Changed to wait until the dialback connection is completed before the ping is verified, rather than verifying it one ping later.
Thanks, appreciated -- blocking pings didn't really buy us anything, and only had downsides, so I think this is much better.
As for the RTT considerations, if we do want to drop the ping timeout, we would likely differentiate the initial ping timeout and subsequent timeouts to account for the additional dial RTTs, since we only have to account for the additional RTTs on the blocking ping. That would allow us to aggressively lower the ping timeout on low-latency (regional) clusters. There's also head-of-line blocking to consider.
pkg/rpc/context.go line 2249 at r14 (raw file):
```go
// so we ignore it.
err := rpcCtx.runHeartbeat(ctx, conn, target)
log.Health.Infof(ctx, "connection heartbeat loop ended with err: %v", err)
```
nit: will this result in duplicate logging? The error is stored in conn.err and propagated to callers via Connect() and Health() where they will presumably log it or propagate it further up the stack. If we should log it, shouldn't it be logged as an error?
pkg/rpc/context.go line 2297 at r14 (raw file):
```go
"rpc.dialback.enabled",
"if true, require bidirectional RPC connections between nodes to prevent one-way network unavailability",
true,
```
Should this be TenantReadOnly? I think we generally default to that unless we have a very good reason to use SystemOnly, since changes otherwise aren't visible to tenants and they'll always use the default value regardless of the host's value.
pkg/rpc/context.go line 2578 at r14 (raw file):
```go
// not the node. In that case, we can't look up if we have a connection to the
// node and instead need to always try dialback.
var err error
```
nit: we can drop this declaration and use a locally scoped err := inside the branch, since we don't need to keep the result around for later.
pkg/rpc/context.go line 2600 at r14 (raw file):
```go
// Clear out the ServerTime on our response so the receiver does not use
// this response in its latency calculations.
response.ServerTime = 0
```
Did you consider checking this on the sender side instead, where we update the clock/latency measurements? They're only affected when we set BLOCKING on the request, and this is controlled by the sender, so it can locally choose to ignore the measurement in that case. That avoids plumbing through the response here, keeps the logic in one place, and also avoids any subtle ordering dependencies where e.g. the OnPing has to be called after ServerTime has been populated.
Also, in clusters with clock skew, this will prevent us from detecting the skew on the initial connection attempt, thus we'll run with known clock skew for up to several seconds, potentially violating linearizability in the meanwhile. Maybe this isn't that big of a deal since these checks are always going to be best-effort anyway, but it seems a bit unfortunate that we're relaxing clock skew protections here. Do you have any thoughts on this? Afaict, we're forced to relax either clock skew protection or asymmetric partition detection here.
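A sketch of the sender-side alternative (`pingRequest` and `pingResponse` are illustrative stand-ins, not the actual heartbeat.proto types): since only the sender sets BLOCKING, it can locally skip the clock/latency update for those pings instead of having the server zero out `ServerTime`.

```go
package main

import "fmt"

// Minimal stand-ins for the request/response types under discussion;
// the field names are illustrative, not the real proto definitions.
type pingRequest struct{ Blocking bool }
type pingResponse struct{ ServerTime int64 }

// useForClockMeasurement keeps the decision on the sender: a response to
// a BLOCKING ping is excluded from clock/latency bookkeeping, with no
// need to plumb anything through the server-side handler.
func useForClockMeasurement(req pingRequest, resp pingResponse) bool {
	return !req.Blocking
}

func main() {
	fmt.Println(useForClockMeasurement(pingRequest{Blocking: true}, pingResponse{ServerTime: 42}))  // false
	fmt.Println(useForClockMeasurement(pingRequest{Blocking: false}, pingResponse{ServerTime: 42})) // true
}
```

The trade-off noted above remains: skipping these measurements also skips clock-skew detection on the initial connection attempt.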
pkg/rpc/context.go line 2645 at r14 (raw file):
```go
previousAttempt := rpcCtx.dialbackMu.m[nodeID]
// Block here for the previous connection to be completed (successfully or
```
nit: comment seems outdated, we don't block here.
pkg/rpc/context.go line 2652 at r14 (raw file):
```go
if previousAttempt != nil {
	select {
	case <-previousAttempt.initialHeartbeatDone:
```
nit: isn't this exactly the same logic as Health(), where the default branch corresponds to ErrNotHeartbeated?
pkg/rpc/heartbeat.proto line 74 at r3 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
This works fine with mixed-version clusters and CLIs from different versions. I will do one more pass to make sure I didn't miss anything, but it seems best to not add a new flag.
Sure. I suppose we could use a version gate whenever we add an enum value instead.
pkg/rpc/context_test.go line 2618 at r14 (raw file):
```go
rpcCtx := NewContext(context.Background(), opts)
// This is normally set up inside the server; we want to hold onto all
// PingRequests that come through.
rpcCtx.OnIncomingPing = func(ctx context.Context, req *PingRequest, resp *PingResponse) error {
```
So we basically rely on failover/*/blackhole-(recv|send) to verify that this works in a real cluster, yeah? Do we feel like that coverage is sufficient, or should we wire up some rudimentary integration tests using a stock TestCluster too?
pkg/rpc/context_test.go line 2715 at r14 (raw file):
```go
// Forcibly shut down listener 2 and the connection node1 -> node2.
// Verify the reverse connection will also close within a DialTimeout.
```
nit: DialTimeout doesn't come into play here because the closed listener will respond with an immediate TCP RST packet. I'm not sure we can easily test DialTimeout here since we have to fiddle with the OS TCP stack.
andrewbaptist
left a comment
TFTR - I updated the PR and comments also. I'm rerunning all the blackhole tests one more time with the final code and will push once that is done and successful.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/rpc/context.go line 2589 at r6 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Thanks, appreciated -- blocking pings didn't really buy us anything, and only had downsides, so I think this is much better.
As for the RTT considerations, if we do want to drop the ping timeout, we would likely differentiate the initial ping timeout and subsequent timeouts to account for the additional dial RTTs, since we only have to account for the additional RTTs on the blocking ping. That would allow us to aggressively lower the ping timeout on low-latency (regional) clusters. There's also head-of-line blocking to consider.
Thanks, I agree this is better. I was worried about the timeouts on connection attempts, but it seems OK since we have a couple of failsafes in place. I'll add a note about this as well.
pkg/rpc/context.go line 2249 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: will this result in duplicate logging? The error is stored in `conn.err` and propagated to callers via `Connect()` and `Health()` where they will presumably log it or propagate it further up the stack. If we should log it, shouldn't it be logged as an error?
It is duplicate logging, but it is better to know when this fails immediately rather than waiting for a future (non-heartbeat) message. Granted, future messages are typically soon, but there is no guarantee when they actually happen. I left it as Info since the connection is also Info and I wanted to be symmetric with that.
pkg/rpc/context.go line 2297 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Should this be
TenantReadOnly? I think we generally default to that unless we have a very good reason to useSystemOnly, since changes otherwise aren't visible to tenants and they'll always use the default value regardless of the host's value.
Updated to TenantReadOnly
pkg/rpc/context.go line 2578 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: we can drop this declaration and use a locally scoped `err :=` inside the branch, since we don't need to keep the result around for later.
Done
pkg/rpc/context.go line 2600 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Did you consider checking this on the sender side instead, where we update the clock/latency measurements? They're only affected when we set `BLOCKING` on the request, and this is controlled by the sender, so it can locally choose to ignore the measurement in that case. That avoids plumbing through the response here, keeps the logic in one place, and also avoids any subtle ordering dependencies where e.g. the OnPing has to be called after `ServerTime` has been populated.

Also, in clusters with clock skew, this will prevent us from detecting the skew on the initial connection attempt, thus we'll run with known clock skew for up to several seconds, potentially violating linearizability in the meanwhile. Maybe this isn't that big of a deal since these checks are always going to be best-effort anyway, but it seems a bit unfortunate that we're relaxing clock skew protections here. Do you have any thoughts on this? Afaict, we're forced to relax either clock skew protection or asymmetric partition detection here.
Good idea! I changed to check on the sender side only and removed zeroing it out here.
pkg/rpc/context.go line 2645 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: comment seems outdated, we don't block here.
Updated comment
pkg/rpc/context.go line 2652 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: isn't this exactly the same logic as `Health()`, where the default branch corresponds to `ErrNotHeartbeated`?
It was very close to calling Health and I considered that, but it needed to handle the dialbackMu.m differently, and by the time I had handled all the different cases it was basically harder to read than just copying the checks here. Two of the three "branches" do something different, so it really became easier to just reimplement this rather than using Health.
pkg/rpc/heartbeat.proto line 74 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Sure. I suppose we could use a version gate whenever we add an enum value instead.
I left it for now; I don't expect we will add new values, and it is probably going to require a version gate if we ever do. Leaving the default value as 0 simplified things a little.
pkg/rpc/context_test.go line 2618 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
So we basically rely on `failover/*/blackhole-(recv|send)` to verify that this works in a real cluster, yeah? Do we feel like that coverage is sufficient, or should we wire up some rudimentary integration tests using a stock `TestCluster` too?
I'm going to add a TODO to wire up a test for this. I don't want to hold up merging this PR though. I'll work on this later this week.
pkg/rpc/context_test.go line 2715 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: `DialTimeout` doesn't come into play here because the closed listener will respond with an immediate TCP `RST` packet. I'm not sure we can easily test `DialTimeout` here since we have to fiddle with the OS TCP stack.
Updated the comment.
bors r=erikgrinaker
Build failed (retrying...):
Build failed:
Fixes: cockroachdb#84289

Previously, a one-way partition, where a node could initiate a successful TCP connection in one direction but the reverse connection failed, caused problems. The node that initiates outgoing connections can acquire leases and cause failures for reads and writes to those ranges. This is particularly a problem if it acquires the liveness range lease, but it is a problem even for other ranges.

This commit adds an additional check during server-to-server communication where the recipient of a new PingRequest first validates that it is able to open a reverse connection to the initiator before responding. Additionally, it will monitor whether it has a successful reverse connection over time and asynchronously validate reverse connections to the sender. The ongoing validation is asynchronous to avoid adding delays to PingResponses, as they are used for measuring clock offsets.

Also, the onlyOnceDialer prevents retrying after a dial error; however, it can get into a state where it continually retries for certain network connections. This is not easy to reproduce in a unit test as it requires killing the connection using iptables (normal closes don't cause this). After this change the onlyOnceDialer will no longer repeatedly retry to reconnect after a broken connection during setup.

Release note (bug fix): RPC connections between nodes now require RPC connections to be established in both directions, otherwise the connection will be closed. This is done to prevent asymmetric network partitions where nodes are able to send outbound messages but not receive inbound messages, which could result in persistent unavailability. This behavior can be disabled by the cluster setting rpc.dialback.enabled.

Epic: CRDB-2488
bors r=erikgrinaker
Build succeeded:
@andrewbaptist i'm making a small docs update based on this PR - can you confirm what versions this is in? I did the following to check which tags contain this PR's commit SHA, but let me know if i'm holding it wrong. If the below is correct, it looks like it's in everything v23.1+ at this point, but please correct me if wrong.
ah i just found it in the release notes as well for v23.1.0 https://www.cockroachlabs.com/docs/releases/v23.1.html#v23-1-0-cluster-settings - guess that answers my question, sorry for the noise