kv: convert uni-directional network partitions to bi-directional #94778
Conversation
pkg/rpc/context.go
Outdated
```go
// We want this connection to block on connection so we verify it
// succeeds.
dialOpts = append(dialOpts, grpc.WithBlock())
conn, err := grpc.DialContext(dialCtx, target, dialOpts...)
```
As we discussed elsewhere, we should go via grpcDialNodeInternal() such that we'll retain this connection for future use and join any in-flight connection attempts.
For an initial inbound connection, this will naïvely result in a circular dependency since we'll be attempting to ping each other in either direction, both waiting on the other's ping response. However, for the dialback connection attempt we don't have to wait for the ping response itself, we can simply wait for the underlying gRPC connection to be established asynchronously, e.g. via ClientConn.WaitForStateChange.
We may want to limit this to only initial ping attempts, such that we later verify that we're actually receiving pings in both directions too, but this is a lesser concern.
pkg/server/server.go
Outdated
```go
// This is somewhat tangled due to the need to use gossip to resolve the
// address. If we are not able to resolve, we simply skip this verification.
rpcContext.PingResolver = func(ctx context.Context, nodeID roachpb.NodeID) string {
```
We can avoid this tangle by not passing rpcContext in to gossip.New. Gossip only uses the RPC context once it starts making outbound connections, and that only happens once we call Gossip.Start() far below here. The cleanest approach is probably to remove the Gossip.rpcContext field entirely, and thread the RPC context via Gossip.Start() down into bootstrap() and manage(), but alternatively we can set the rpcContext field during Gossip.Start() as long as we properly synchronize access to it. This should preferably be done as a separate commit.
When we do that, we can set up gossip before the RPC context and just pass a nodedialer.AddressResolver to rpc.NewContext() (if we get away with it without any dependency cycles, alternatively use some other corresponding type).
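To illustrate the shape of this proposal (all names below are hypothetical stand-ins, not the actual `nodedialer.AddressResolver` signature), passing a resolver closure lets the RPC context defer the gossip lookup until it actually needs an address, which is only after `Gossip.Start()`:

```go
package main

import (
	"errors"
	"fmt"
)

// resolver mirrors the shape of an AddressResolver-style closure. The
// closure captures the lookup source lazily: nothing is resolved at
// construction time, only when a dial actually happens.
type resolver func(nodeID int) (string, error)

func makeResolver(lookup func(int) (string, bool)) resolver {
	return func(nodeID int) (string, error) {
		addr, ok := lookup(nodeID)
		if !ok {
			// If we cannot resolve, the caller simply skips verification.
			return "", errors.New("address not resolved")
		}
		return addr, nil
	}
}

func main() {
	addrs := map[int]string{1: "10.0.0.1:26257"}
	r := makeResolver(func(id int) (string, bool) { a, ok := addrs[id]; return a, ok })
	addr, err := r(1)
	fmt.Println(addr, err == nil) // resolved
	_, err = r(2)
	fmt.Println(err != nil) // unresolved: verification would be skipped
}
```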
pkg/rpc/context.go
Outdated
```go
func (rpcCtx *Context) VerifyDialback(ctx context.Context, nodeID roachpb.NodeID) error {
	log.Errorf(ctx, "Verifying health of connection to n%d", nodeID)
	if nodeID == 0 {
		// FIXME: at startup, the nodeID might not be set. Unfortunately
```
This only happens when bootstrapping a new node, right? Once the node has joined a cluster, this will always be set even on the initial connection attempts? If so, this seems totally fine, since the node can't participate in consensus or acquire any leases before it's allocated a node ID.
pkg/rpc/heartbeat.go
Outdated
```go
// established. If there is already a healthy connection set up, it will
// simply return immediately, however if not, it attempts to establish a
// "dummy" connection which is never used to send messages on.
verifyDialback func(context.Context, roachpb.NodeID) error
```
We already have onHandlePing which effectively does the same thing, should we hook into that instead? We can't go via ContextOptions.OnIncomingPing because that's necessarily constructed before the RPC context itself, but we can inject a handler during RPC context construction before setting onHandlePing. Not particularly important, and I'm not sure if it's really making things more clear or less, so take it or leave it.
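A minimal sketch of the handler-injection idea (names here are illustrative, not the real `ContextOptions` API): during RPC context construction, wrap whatever `OnIncomingPing`-style hook the caller configured with the dialback check, so both run on each ping.

```go
package main

import "fmt"

// pingHandler mirrors the shape of an onHandlePing-style hook.
type pingHandler func(ping string) error

// chain wraps an existing handler with an injected check that runs
// first; the injected check can reject the ping before the caller's
// handler ever sees it.
func chain(existing, injected pingHandler) pingHandler {
	return func(ping string) error {
		if err := injected(ping); err != nil {
			return err
		}
		if existing != nil {
			return existing(ping)
		}
		return nil
	}
}

func main() {
	var order []string
	h := chain(
		func(string) error { order = append(order, "user"); return nil },
		func(string) error { order = append(order, "dialback"); return nil },
	)
	_ = h("ping")
	fmt.Println(order) // [dialback user]
}
```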
erikgrinaker
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)
pkg/rpc/context.go line 2593 at r3 (raw file):
but some methods like ConnHealthTryDial is really broken today. If there is not already a connection, it will establish a new connection, but then call Health on it immediately which is basically guaranteed to return ErrNotHeartbeated
Right. I originally implemented ConnHealthTryDial back in #70017, carrying over the old behavior of ConnHealth for backwards compatibility. I think the immediate motivation here was that if ConnHealthTryDial doesn't kick off a dial attempt then it's possible that no one else will try to dial the node either, and so DistSQL (the only caller) may never be able to reach the remote node and always fall back to local processing. But it may not be better for ConnHealthTryDial to block until connected, since that would increase latencies in the face of network outages (it would have to wait for a timeout instead of scheduling it elsewhere), but it's possible that the RPC circuit breakers will take care of this -- in any case, it really shouldn't be DistSQL's responsibility to maintain RPC connections in the first place.
pkg/rpc/context.go line 2589 at r6 (raw file):
As mentioned above:
We shouldn't block on it, we should accept the ping until we have a dial result. Otherwise we may as well just always use blocking.
These delays will [...] count towards the ping timeout budget (which is not as high as it seems since pings are subject to head of line blocking and can already have been sitting around in an RPC buffer for 4 seconds before we get it).
In fact, now that we use GRPCDialNode for the non-blocking return dial, we may need 6 network roundtrips for this to succeed (one for the outbound connection and one for the blocking return connection).
I think this basically means that we should check ConnHealth(), and allow the ping on ErrNotHeartbeated but reject it on any other error (with the assumption that the setup and heartbeat will eventually succeed or error out).
pkg/rpc/context.go line 2594 at r6 (raw file):
```go
// Check in our regular connection map to see if we are healthy.
err = rpcCtx.ConnHealth(target, nodeID, DefaultClass)
```
Shouldn't this use the same connection class as the incoming connection?
Likely addresses #54762 as well. The dialback will fail due to the reverse ping not working. This may have also been addressed by other changes, but I will test and mark this as fixed once this PR merges.
andrewbaptist
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/rpc/context.go line 2557 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Yeah no, I agree, it makes sense to verify it here.
sounds good!
pkg/rpc/context.go line 2593 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
but some methods like ConnHealthTryDial is really broken today. If there is not already a connection, it will establish a new connection, but then call Health on it immediately which is basically guaranteed to return ErrNotHeartbeated
Right. I originally implemented `ConnHealthTryDial` back in #70017, carrying over the old behavior of `ConnHealth` for backwards compatibility. I think the immediate motivation here was that if `ConnHealthTryDial` doesn't kick off a dial attempt then it's possible that no one else will try to dial the node either, and so DistSQL (the only caller) may never be able to reach the remote node and always fall back to local processing. But it may not be better for `ConnHealthTryDial` to block until connected, since that would increase latencies in the face of network outages (it would have to wait for a timeout instead of scheduling it elsewhere), but it's possible that the RPC circuit breakers will take care of this -- in any case, it really shouldn't be DistSQL's responsibility to maintain RPC connections in the first place.
Makes sense - as we discussed, we don't want to change this behavior as part of this PR, but at a minimum, this method is poorly named. I added comments to avoid confusion for the next person reading this and will make a separate PR to refactor this code.
pkg/rpc/context.go line 2589 at r6 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
As mentioned above:
We shouldn't block on it, we should accept the ping until we have a dial result. Otherwise we may as well just always use blocking.
These delays will [...] count towards the ping timeout budget (which is not as high as it seems since pings are subject to head of line blocking and can already have been sitting around in an RPC buffer for 4 seconds before we get it).
In fact, now that we use `GRPCDialNode` for the non-blocking return dial, we may need 6 network roundtrips for this to succeed (one for the outbound connection and one for the blocking return connection).

I think this basically means that we should check `ConnHealth()`, and allow the ping on `ErrNotHeartbeated` but reject it on any other error (with the assumption that the setup and heartbeat will eventually succeed or error out).
Assume we have healthy connections from A -> B and B -> A. If the reverse connection (B->A) gets broken, the following sequence happens.
- T0: A -> B - send Ping
- T0 + RTT/2 - B -> A - send successful Pong, but initiate GRPCDialNode
- T0 + RTT/2 + 2RTT - A (success case) - receives ping and responds immediately since it has a healthy connection. (the 2RTT is to handle all the network setup).
- T0 + 3*RTT - B is aware that the connection is established.
- T0 + ?? - A (fail case) - connection attempt fails/times out. This
- T0 + 3s - A sleep delay between pings (currently 3 seconds)
- T0 + 3s - A -> B - send next Ping
- T0 + 3s + RTT/2 - B -> A - checks the status of the connection attempt from 3 sec ago - if successful, it continues; if it failed, it sends an err Pong.
So the assumption is that the reverse connection can be set up in less than one PingInterval. I understand we want to lower it further, but it will always have to be at least 3x the network RTT, so I don't see this as a problem as long as that assumption holds. I can't imagine we want to set it lower than that, as we wouldn't be able to deal with much latency. So assuming we support up to 500ms RTT, this allows setting PingTime down to 1.5s.
Let's discuss in more detail. If you are OK with this I will add some more comments to that code to make it clear why this dialback is fine.
pkg/rpc/context.go line 2594 at r6 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Shouldn't this use the same connection class as the incoming connection?
I had considered this, but I'm not exactly sure if this is right. The connection classes are just used for "separating" traffic, not for prioritization as there is no mechanism for doing this.
If we did this it would have two implications. First, it would require passing this over the PingRequest (or at least pass it in the case where there is a DialBack type set). But second, the Default class is the most likely to "stay connected" since it is used for most traffic, so it is the most likely to be an accurate indicator of the health of the reverse connection.
Let's discuss this a little more.
pkg/rpc/heartbeat.proto line 74 at r3 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
I left this for now since we are treating NONE as UNKNOWN. I need to do some compatibility testing on this.
This works fine with mixed-version clusters and CLIs from different versions. I will do one more pass to make sure I didn't miss anything, but it seems best to not add a new flag.
andrewbaptist
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/rpc/context.go line 2589 at r6 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
Assume we have healthy connections from A -> B and B -> A. If the reverse connection (B->A) gets broken, the following sequence happens.
- T0: A -> B - send Ping
- T0 + RTT/2 - B -> A - send successful Pong, but initiate GRPCDialNode
- T0 + RTT/2 + 2RTT - A (success case) - receives ping and responds immediately since it has a healthy connection. (the 2RTT is to handle all the network setup).
- T0 + 3*RTT - B is aware that the connection is established.
- T0 + ?? - A (fail case) - connection attempt fails/times out. This
- T0 + 3s - A sleep delay between pings (currently 3 seconds)
- T0 + 3s - A -> B - send next Ping
- T0 + 3s + RTT/2 - B -> A - checks the status of the connection attempt from 3 sec ago - if successful, it continues; if it failed, it sends an err Pong.
So the assumption is that the reverse connection can be set up in less than one PingInterval. I understand we want to lower it further, but it will always have to be at least 3x the network RTT, so I don't see this as a problem as long as that assumption holds. I can't imagine we want to set it lower than that, as we wouldn't be able to deal with much latency. So assuming we support up to 500ms RTT, this allows setting PingTime down to 1.5s.
Let's discuss in more detail. If you are OK with this I will add some more comments to that code to make it clear why this dialback is fine.
Changed to wait until the dialback connection is completed before the ping is verified, rather than verifying it one ping later.
pkg/rpc/context.go line 2594 at r6 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
I had considered this, but I'm not exactly sure if this is right. The connection classes are just used for "separating" traffic, not for prioritization as there is no mechanism for doing this.
If we did this it would have two implications. First, it would require passing this over the PingRequest (or at least pass it in the case where there is a DialBack type set). But second, the Default class is the most likely to "stay connected" since it is used for most traffic, so it is the most likely to be an accurate indicator of the health of the reverse connection.
Let's discuss this a little more.
Updated to always use the system class and added comments around it.
erikgrinaker
left a comment
Thanks for getting this across the line, I appreciate it was a longer slog than expected.
We should mark this as resolving #84289, and put the GA blocker label on that issue (labels on PRs don't count).
Reviewed 1 of 4 files at r5, 10 of 10 files at r14, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)
pkg/rpc/context.go line 2589 at r6 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
Changed to wait until the dialback connection is completed before the ping is verified, rather than verifying it one ping later.
Thanks, appreciated -- blocking pings didn't really buy us anything, and only had downsides, so I think this is much better.
As for the RTT considerations, if we do want to drop the ping timeout, we would likely differentiate the initial ping timeout and subsequent timeouts to account for the additional dial RTTs, since we only have to account for the additional RTTs on the blocking ping. That would allow us to aggressively lower the ping timeout on low-latency (regional) clusters. There's also head-of-line blocking to consider.
pkg/rpc/context.go line 2249 at r14 (raw file):
```go
// so we ignore it.
err := rpcCtx.runHeartbeat(ctx, conn, target)
log.Health.Infof(ctx, "connection heartbeat loop ended with err: %v", err)
```
nit: will this result in duplicate logging? The error is stored in conn.err and propagated to callers via Connect() and Health() where they will presumably log it or propagate it further up the stack. If we should log it, shouldn't it be logged as an error?
pkg/rpc/context.go line 2297 at r14 (raw file):
```go
"rpc.dialback.enabled",
"if true, require bidirectional RPC connections between nodes to prevent one-way network unavailability",
true,
```
Should this be TenantReadOnly? I think we generally default to that unless we have a very good reason to use SystemOnly, since changes otherwise aren't visible to tenants and they'll always use the default value regardless of the host's value.
pkg/rpc/context.go line 2578 at r14 (raw file):
```go
// not the node. In that case, we can't look up if we have a connection to the
// node and instead need to always try dialback.
var err error
```
nit: we can drop this declaration and use a locally scoped err := inside the branch, since we don't need to keep the result around for later.
pkg/rpc/context.go line 2600 at r14 (raw file):
```go
// Clear out the ServerTime on our response so the receiver does not use
// this response in its latency calculations.
response.ServerTime = 0
```
Did you consider checking this on the sender side instead, where we update the clock/latency measurements? They're only affected when we set BLOCKING on the request, and this is controlled by the sender, so it can locally choose to ignore the measurement in that case. That avoids plumbing through the response here, keeps the logic in one place, and also avoids any subtle ordering dependencies where e.g. the OnPing has to be called after ServerTime has been populated.
Also, in clusters with clock skew, this will prevent us from detecting the skew on the initial connection attempt, thus we'll run with known clock skew for up to several seconds, potentially violating linearizability in the meanwhile. Maybe this isn't that big of a deal since these checks are always going to be best-effort anyway, but it seems a bit unfortunate that we're relaxing clock skew protections here. Do you have any thoughts on this? Afaict, we're forced to relax either clock skew protection or asymmetric partition detection here.
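A sketch of the sender-side alternative (`pingRequest` and `pingResponse` are illustrative stand-ins, not the actual heartbeat.proto types): since only the sender sets BLOCKING, it can locally skip the clock/latency update for those pings instead of having the server zero out `ServerTime`.

```go
package main

import "fmt"

// Minimal stand-ins for the request/response types under discussion;
// the field names are illustrative, not the real proto definitions.
type pingRequest struct{ Blocking bool }
type pingResponse struct{ ServerTime int64 }

// useForClockMeasurement keeps the decision on the sender: a response to
// a BLOCKING ping is excluded from clock/latency bookkeeping, with no
// need to plumb anything through the server-side handler.
func useForClockMeasurement(req pingRequest, resp pingResponse) bool {
	return !req.Blocking
}

func main() {
	fmt.Println(useForClockMeasurement(pingRequest{Blocking: true}, pingResponse{ServerTime: 42}))  // false
	fmt.Println(useForClockMeasurement(pingRequest{Blocking: false}, pingResponse{ServerTime: 42})) // true
}
```

The trade-off noted above remains: skipping these measurements also skips clock-skew detection on the initial connection attempt.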
pkg/rpc/context.go line 2645 at r14 (raw file):
```go
previousAttempt := rpcCtx.dialbackMu.m[nodeID]
// Block here for the previous connection to be completed (successfully or
```
nit: comment seems outdated, we don't block here.
pkg/rpc/context.go line 2652 at r14 (raw file):
```go
if previousAttempt != nil {
	select {
	case <-previousAttempt.initialHeartbeatDone:
```
nit: isn't this exactly the same logic as Health(), where the default branch corresponds to ErrNotHeartbeated?
pkg/rpc/heartbeat.proto line 74 at r3 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
This works fine with mixed-version clusters and CLIs from different versions. I will do one more pass to make sure I didn't miss anything, but it seems best to not add a new flag.
Sure. I suppose we could use a version gate whenever we add an enum value instead.
pkg/rpc/context_test.go line 2618 at r14 (raw file):
```go
rpcCtx := NewContext(context.Background(), opts)
// This is normally set up inside the server; we want to hold onto all
// PingRequests that come through.
rpcCtx.OnIncomingPing = func(ctx context.Context, req *PingRequest, resp *PingResponse) error {
```
So we basically rely on failover/*/blackhole-(recv|send) to verify that this works in a real cluster, yeah? Do we feel like that coverage is sufficient, or should we wire up some rudimentary integration tests using a stock TestCluster too?
pkg/rpc/context_test.go line 2715 at r14 (raw file):
```go
// Forcibly shut down listener 2 and the connection node1 -> node2.
// Verify the reverse connection will also close within a DialTimeout.
```
nit: DialTimeout doesn't come into play here because the closed listener will respond with an immediate TCP RST packet. I'm not sure we can easily test DialTimeout here since we have to fiddle with the OS TCP stack.
andrewbaptist
left a comment
TFTR - I updated the PR and comments also. I'm rerunning all the blackhole tests one more time with the final code and will push once that is done and successful.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/rpc/context.go line 2589 at r6 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Thanks, appreciated -- blocking pings didn't really buy us anything, and only had downsides, so I think this is much better.
As for the RTT considerations, if we do want to drop the ping timeout, we would likely differentiate the initial ping timeout and subsequent timeouts to account for the additional dial RTTs, since we only have to account for the additional RTTs on the blocking ping. That would allow us to aggressively lower the ping timeout on low-latency (regional) clusters. There's also head-of-line blocking to consider.
Thanks, I agree this is better. I was worried about the timeouts on connection attempts, but it seems OK since we have a couple of failsafes in place. I'll add a note about this as well.
pkg/rpc/context.go line 2249 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: will this result in duplicate logging? The error is stored in `conn.err` and propagated to callers via `Connect()` and `Health()` where they will presumably log it or propagate it further up the stack. If we should log it, shouldn't it be logged as an error?
It is duplicate logging, but it is better to know when this fails immediately rather than waiting for a future (non-heartbeat) message. Granted, future messages are typically soon, but there is no guarantee when they actually happen. I left it as Info since the connection is also Info and I wanted to be symmetric with that.
pkg/rpc/context.go line 2297 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Should this be
TenantReadOnly? I think we generally default to that unless we have a very good reason to useSystemOnly, since changes otherwise aren't visible to tenants and they'll always use the default value regardless of the host's value.
Updated to TenantReadOnly
pkg/rpc/context.go line 2578 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: we can drop this declaration and use a locally scoped `err :=` inside the branch, since we don't need to keep the result around for later.
Done
pkg/rpc/context.go line 2600 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Did you consider checking this on the sender side instead, where we update the clock/latency measurements? They're only affected when we set `BLOCKING` on the request, and this is controlled by the sender, so it can locally choose to ignore the measurement in that case. That avoids plumbing through the response here, keeps the logic in one place, and also avoids any subtle ordering dependencies where e.g. the OnPing has to be called after `ServerTime` has been populated.

Also, in clusters with clock skew, this will prevent us from detecting the skew on the initial connection attempt, thus we'll run with known clock skew for up to several seconds, potentially violating linearizability in the meanwhile. Maybe this isn't that big of a deal since these checks are always going to be best-effort anyway, but it seems a bit unfortunate that we're relaxing clock skew protections here. Do you have any thoughts on this? Afaict, we're forced to relax either clock skew protection or asymmetric partition detection here.
Good idea! I changed to check on the sender side only and removed zeroing it out here.
pkg/rpc/context.go line 2645 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: comment seems outdated, we don't block here.
Updated comment
pkg/rpc/context.go line 2652 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: isn't this exactly the same logic as `Health()`, where the default branch corresponds to `ErrNotHeartbeated`?
It was very close to calling Health and I considered that, but it needed to handle the dialbackMu.m differently, and by the time I had handled all the different cases it was basically harder to read than just copying the checks here. Two of the three "branches" do something different, so it really became easier to just reimplement this rather than using Health.
pkg/rpc/heartbeat.proto line 74 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Sure. I suppose we could use a version gate whenever we add an enum value instead.
I left it for now; I don't expect we will add new values, and it is probably going to require a version gate if we ever do. Leaving the default value as 0 simplified things a little.
pkg/rpc/context_test.go line 2618 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
So we basically rely on `failover/*/blackhole-(recv|send)` to verify that this works in a real cluster, yeah? Do we feel like that coverage is sufficient, or should we wire up some rudimentary integration tests using a stock `TestCluster` too?
I'm going to add a TODO to wire up a test for this. I don't want to hold up merging this PR though. I'll work on this later this week.
pkg/rpc/context_test.go line 2715 at r14 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: `DialTimeout` doesn't come into play here because the closed listener will respond with an immediate TCP `RST` packet. I'm not sure we can easily test `DialTimeout` here since we have to fiddle with the OS TCP stack.
Updated the comment.
bors r=erikgrinaker
Build failed (retrying...):
Build failed:
Fixes: cockroachdb#84289

Previously, a one-way partition, where a node could initiate a successful TCP connection in one direction but the reverse connection failed, caused problems. The node that initiates outgoing connections can acquire leases and cause failures for reads and writes to those ranges. This is particularly a problem if it acquires the liveness range lease, but it is a problem even for other ranges.

This commit adds an additional check during server-to-server communication where the recipient of a new PingRequest first validates that it is able to open a reverse connection to the initiator before responding. Additionally, it will monitor whether it has a successful reverse connection over time and asynchronously validate reverse connections to the sender. The ongoing validation is asynchronous to avoid adding delays to PingResponses, as they are used for measuring clock offsets.

Also, the onlyOnceDialer prevents retrying after a dial error; however, it can get into a state where it continually retries for certain network connections. This is not easy to reproduce in a unit test as it requires killing the connection using iptables (normal closes don't cause this). After this change the onlyOnceDialer will no longer repeatedly retry to reconnect after a broken connection during setup.

Release note (bug fix): RPC connections between nodes now require RPC connections to be established in both directions, otherwise the connection will be closed. This is done to prevent asymmetric network partitions where nodes are able to send outbound messages but not receive inbound messages, which could result in persistent unavailability. This behavior can be disabled by the cluster setting rpc.dialback.enabled.

Epic: CRDB-2488
bors r=erikgrinaker
Build succeeded:
@andrewbaptist i'm making a small docs update based on this PR - can you confirm what versions this is in? I did the following to check which tags contain this PR's commit SHA, but let me know if i'm holding it wrong. If the below is correct, it looks like it's in everything v23.1+ at this point, but please correct me if wrong.
ah i just found it in the release notes as well for v23.1.0 https://www.cockroachlabs.com/docs/releases/v23.1.html#v23-1-0-cluster-settings - guess that answers my question, sorry for the noise