Skip to content

gossip: infoStore.mostDistant should never return the local node-id#29398

Merged
craig[bot] merged 2 commits intocockroachdb:masterfrom
petermattis:pmattis/gossip-tighten-loopback
Aug 31, 2018
Merged

gossip: infoStore.mostDistant should never return the local node-id#29398
craig[bot] merged 2 commits intocockroachdb:masterfrom
petermattis:pmattis/gossip-tighten-loopback

Conversation

@petermattis
Copy link
Copy Markdown
Collaborator

Fixes #28517

Release note: None

@petermattis petermattis requested a review from a team August 30, 2018 21:29
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@petermattis petermattis requested a review from a-robinson August 30, 2018 21:30
Copy link
Copy Markdown
Contributor

@a-robinson a-robinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/gossip/infostore_test.go, line 299 at r1 (raw file):

		}
		if expectedHops != hops {
			t.Errorf("%d: expected node %d; got %d", i, expectedHops, hops)

If you feel like cleaning up, this error message is wrong. Not necessary though.

When adding infos from a remote node, skip any info which originated on
the local node.

Release note: None
@petermattis petermattis force-pushed the pmattis/gossip-tighten-loopback branch from a9b0634 to 3356c79 Compare August 30, 2018 23:13
Copy link
Copy Markdown
Collaborator Author

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/gossip/infostore_test.go, line 299 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

If you feel like cleaning up, this error message is wrong. Not necessary though.

Done.

@petermattis
Copy link
Copy Markdown
Collaborator Author

bors r=a-robinson

craig bot pushed a commit that referenced this pull request Aug 31, 2018
29398: gossip: infoStore.mostDistant should never return the local node-id r=a-robinson a=petermattis

Fixes #28517

Release note: None

Co-authored-by: Peter Mattis <petermattis@gmail.com>
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 31, 2018

Build succeeded

@craig craig bot merged commit 3356c79 into cockroachdb:master Aug 31, 2018
@petermattis petermattis deleted the pmattis/gossip-tighten-loopback branch August 31, 2018 01:36
craig bot pushed a commit that referenced this pull request Aug 31, 2018
29413: release-2.1: gossip: infoStore.mostDistant should never return the local node-id r=a-robinson a=petermattis

Backport 2/2 commits from #29398.

/cc @cockroachdb/release

---

Fixes #28517

Release note: None


Co-authored-by: Peter Mattis <petermattis@gmail.com>
Copy link
Copy Markdown
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second commit here (no loopback) worries me. If the previous incarnation of this node gossiped something that the new incarnation does not re-gossip, we'll have a situation where the rest of the cluster has some state in gossip that we don't (until it expires or we re-gossip it). I think I'd rather allow loopback gossip (encouraging convergence across the nodes) but special-case its harmful effects. (mostDistant and the timestamp assertion in #20986 are the only two that come to mind)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

Copy link
Copy Markdown
Collaborator Author

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bdarnell Do you recall if it was intentional to allow loopback gossip info? I see your point that it allows divergence of the gossip state on the local node from the rest of the cluster, but I'm struggling to see where that is a problem. The node-id and store-id gossip info will be generated by the local node quickly. And the sentinel key and first-range descriptor will be gossiped by a new node in the cluster soon. I'm conflicted as to what to do here. We've seen several bugs due to allowing loopback gossip info. On the other hand, I see sanity in having each node have the same gossip state.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@bdarnell
Copy link
Copy Markdown
Contributor

bdarnell commented Sep 9, 2018

I don't know whether it was considered or not. For the gossip keys you've mentioned, it's harmless to drop loopback gossip. Is that true for all keys (and will it always be true)? Gossip is used in pkg/sql/stats and pkg/sql/distsqlrun too, for example. I think that anything that would fail if we dropped loopback gossip would be flaky anyway, but allowing gossip to diverge between nodes would lead to very hard-to-debug failures. I think I'd rather err on the side of convergence.

petermattis added a commit to petermattis/cockroach that referenced this pull request Sep 10, 2018
Receipt of loopback infos was disabled in cockroachdb#29398, but doing so had the
unfortunate effect of allowing gossip state to temporarily diverge
between nodes. Rather than disallowing loopback infos, we now ratchet
the gossip monotonic clock in order to avoid the assertion around the
gossip highwater timestamps.

Fixes cockroachdb#29992
Fixes cockroachdb#20986

Release note: None
craig bot pushed a commit that referenced this pull request Sep 10, 2018
16354: storage: Delay initial metrics until system config is ready r=bdarnell a=bdarnell

This was previously done too soon at startup and would clutter the
logs (mainly in tests) with "failed initial metrics computation".

Fixes #13560

29817: client, roachpb: eliminate log spam when loadgens are killed r=andreimatei a=andreimatei

When the TPCC loadgen is CTRL-C'ed, the logs are spammed with:
I180906 22:02:41.239771 27177127 internal/client/txn.go:625  [n1] async rollback failed: TransactionStatusError: already committed (REASON_UNKNOWN): "sql txn" id=57c518aa key=/Table/61/1/1224/0 rw=true pri=0.03781170 iso=SERIALIZABLE stat=COMMITTED epo=0 ts=1536271361.068457287,0 orig=1536271361.068457287,0 max=1536271361.073978228,0 wto=false rop=false seq=11 int=5

The "async rollback" part refers to the rollback being done with a
canceled ctx (presumably a dropped connection's ctx). I believe the
error happens because there's a commit in flight when the ctx is
canceled. This patch lowers the message's level for this case.

Release note: None

29987: github-post: add timeout handling to the stress issue poster r=andreimatei a=andreimatei

Add explicit support to the issue poster for timeouts:
- on all runs, publish an artifacts file with a list of slow tests
- when timeouts happen, distinguish between the case where the test
currently running at the timeout point is the culprit (i.e. if it is the
longest running test) versus situations where that test is just an
innocent bystender

This patch also spruces up the github-post script in various ways. Among
them there's now better support for running it on an input that comes
directly from a test run, and not from a stress wrapper.

Release note: None

30006: opt: fix panic during SELECT MIN(NULL) r=rytaft a=arjunravinarayan

Fixes #29695.

Release note (bug fix): fix a crash when SELECT MIN(NULL) was run with
the optimizer enabled.

30008: gossip: allow receipt of "loopback infos" r=bdarnell a=petermattis

Receipt of loopback infos was disabled in #29398, but doing so had the
unfortunate effect of allowing gossip state to temporarily diverge
between nodes. Rather than disallowing loopback infos, we now ratchet
the gossip monotonic clock in order to avoid the assertion around the
gossip highwater timestamps.

Fixes #29992
Fixes #20986

Release note: None

Co-authored-by: Ben Darnell <ben@cockroachlabs.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Arjun Narayan <arjun@cockroachlabs.com>
Co-authored-by: Peter Mattis <petermattis@gmail.com>
petermattis added a commit to petermattis/cockroach that referenced this pull request Sep 10, 2018
Receipt of loopback infos was disabled in cockroachdb#29398, but doing so had the
unfortunate effect of allowing gossip state to temporarily diverge
between nodes. Rather than disallowing loopback infos, we now ratchet
the gossip monotonic clock in order to avoid the assertion around the
gossip highwater timestamps.

Fixes cockroachdb#29992
Fixes cockroachdb#20986

Release note: None
petermattis added a commit to petermattis/cockroach that referenced this pull request Sep 10, 2018
Receipt of loopback infos was disabled in cockroachdb#29398, but doing so had the
unfortunate effect of allowing gossip state to temporarily diverge
between nodes. Rather than disallowing loopback infos, we now ratchet
the gossip monotonic clock in order to avoid the assertion around the
gossip highwater timestamps.

Fixes cockroachdb#29992
Fixes cockroachdb#20986

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants