gossip: infoStore.mostDistant should never return the local node-id#29398
Conversation
Fixes cockroachdb#28517 Release note: None
a-robinson
left a comment
There was a problem hiding this comment.
Reviewed 2 of 2 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/gossip/infostore_test.go, line 299 at r1 (raw file):
} if expectedHops != hops { t.Errorf("%d: expected node %d; got %d", i, expectedHops, hops)
If you feel like cleaning up, this error message is wrong. Not necessary though.
When adding infos from a remote node, skip any info which originated on the local node. Release note: None
a9b0634 to
3356c79
Compare
petermattis
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained
pkg/gossip/infostore_test.go, line 299 at r1 (raw file):
Previously, a-robinson (Alex Robinson) wrote…
If you feel like cleaning up, this error message is wrong. Not necessary though.
Done.
|
bors r=a-robinson |
Build succeeded |
bdarnell
left a comment
There was a problem hiding this comment.
The second commit here (no loopback) worries me. If the previous incarnation of this node gossiped something that the new incarnation does not re-gossip, we'll have a situation where the rest of the cluster has some state in gossip that we don't (until it expires or we re-gossip it). I think I'd rather allow loopback gossip (encouraging convergence across the nodes) but special-case its harmful effects. (mostDistant and the timestamp assertion in #20986 are the only two that come to mind)
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale)
petermattis
left a comment
There was a problem hiding this comment.
@bdarnell Do you recall if it was intentional to allow loopback gossip info? I see your point that it allows divergence of the gossip state on the local node from the rest of the cluster, but I'm struggling to see where that is a problem. The node-id and store-id gossip info will be generated by the local node quickly. And the sentinel key and first-range descriptor will be gossiped by a new node in the cluster soon. I'm conflicted as to what to do here. We've seen several bugs due to allowing loopback gossip info. On the other hand, I see sanity in having each node have the same gossip state.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale)
|
I don't know whether it was considered or not. For the gossip keys you've mentioned, it's harmless to drop loopback gossip. Is that true for all keys (and will it always be true)? Gossip is used in pkg/sql/stats and pkg/sql/distsqlrun too, for example. I think that anything that would fail if we dropped loopback gossip would be flaky anyway, but allowing gossip to diverge between nodes would lead to very hard-to-debug failures. I think I'd rather err on the side of convergence. |
Receipt of loopback infos was disabled in cockroachdb#29398, but doing so had the unfortunate effect of allowing gossip state to temporarily diverge between nodes. Rather than disallowing loopback infos, we now ratchet the gossip monotonic clock in order to avoid the assertion around the gossip highwater timestamps. Fixes cockroachdb#29992 Fixes cockroachdb#20986 Release note: None
16354: storage: Delay initial metrics until system config is ready r=bdarnell a=bdarnell This was previously done too soon at startup and would clutter the logs (mainly in tests) with "failed initial metrics computation". Fixes #13560 29817: client, roachpb: eliminate log spam when loadgens are killed r=andreimatei a=andreimatei When the TPCC loadgen is CTRL-C'ed, the logs are spammed with: I180906 22:02:41.239771 27177127 internal/client/txn.go:625 [n1] async rollback failed: TransactionStatusError: already committed (REASON_UNKNOWN): "sql txn" id=57c518aa key=/Table/61/1/1224/0 rw=true pri=0.03781170 iso=SERIALIZABLE stat=COMMITTED epo=0 ts=1536271361.068457287,0 orig=1536271361.068457287,0 max=1536271361.073978228,0 wto=false rop=false seq=11 int=5 The "async rollback" part refers to the rollback being done with a canceled ctx (presumably a dropped connection's ctx). I believe the error happens because there's a commit in flight when the ctx is canceled. This patch lowers the message's level for this case. Release note: None 29987: github-post: add timeout handling to the stress issue poster r=andreimatei a=andreimatei Add explicit support to the issue poster for timeouts: - on all runs, publish an artifacts file with a list of slow tests - when timeouts happen, distinguish between the case where the test currently running at the timeout point is the culprit (i.e. if it is the longest running test) versus situations where that test is just an innocent bystender This patch also spruces up the github-post script in various ways. Among them there's now better support for running it on an input that comes directly from a test run, and not from a stress wrapper. Release note: None 30006: opt: fix panic during SELECT MIN(NULL) r=rytaft a=arjunravinarayan Fixes #29695. Release note (bug fix): fix a crash when SELECT MIN(NULL) was run with the optimizer enabled. 30008: gossip: allow receipt of "loopback infos" r=bdarnell a=petermattis Receipt of loopback infos was disabled in #29398, but doing so had the unfortunate effect of allowing gossip state to temporarily diverge between nodes. Rather than disallowing loopback infos, we now ratchet the gossip monotonic clock in order to avoid the assertion around the gossip highwater timestamps. Fixes #29992 Fixes #20986 Release note: None Co-authored-by: Ben Darnell <ben@cockroachlabs.com> Co-authored-by: Andrei Matei <andrei@cockroachlabs.com> Co-authored-by: Arjun Narayan <arjun@cockroachlabs.com> Co-authored-by: Peter Mattis <petermattis@gmail.com>
Receipt of loopback infos was disabled in cockroachdb#29398, but doing so had the unfortunate effect of allowing gossip state to temporarily diverge between nodes. Rather than disallowing loopback infos, we now ratchet the gossip monotonic clock in order to avoid the assertion around the gossip highwater timestamps. Fixes cockroachdb#29992 Fixes cockroachdb#20986 Release note: None
Receipt of loopback infos was disabled in cockroachdb#29398, but doing so had the unfortunate effect of allowing gossip state to temporarily diverge between nodes. Rather than disallowing loopback infos, we now ratchet the gossip monotonic clock in order to avoid the assertion around the gossip highwater timestamps. Fixes cockroachdb#29992 Fixes cockroachdb#20986 Release note: None
Fixes #28517
Release note: None