
storage: proactively renew expiration-based leases#25322

Merged
craig[bot] merged 2 commits into cockroachdb:master from a-robinson:renewlease
May 15, 2018

Conversation

@a-robinson
Contributor

This mechanism won't scale incredibly well to proactively renewing all
expiration-based leases since it'd require a goroutine per lease. It
should be fine if we only auto-renew r1's lease, though.

Fixes #24753

Release note: None


@bdarnell, do you have any suggestions for testing this? I've verified it manually, but it's hard to properly isolate this code in a realistic way. We could ratchet the lease durations way down and wait until the lease's expiration bumps, but that creates a timing-based test that is either flaky if we use a really short lease duration or slow if we use a long one.

@a-robinson a-robinson requested review from a team and bdarnell May 4, 2018 20:06
@cockroach-teamcity
Member

This change is Reviewable

@bdarnell
Contributor

bdarnell commented May 6, 2018

Review status: 0 of 2 files reviewed at latest revision, all discussions resolved.


pkg/storage/replica.go, line 1544 at r1 (raw file):

			}

			// redirectOnOrAcquireLease doesn't wait until the lease request

I don't like the level of detailed knowledge of the lease request machinery here. Could we do something simpler like send a Get request to Replica.Send and let all the lease stuff happen under the covers?


pkg/storage/replica_proposal.go, line 274 at r1 (raw file):

		// different replicas as maybeGossipFirstRange is called on each (e.g.
		// #24753).
		r.store.Stopper().RunWorker(r.AnnotateCtx(context.Background()), func(ctx context.Context) {

I'm slightly worried that we could have a lot of these goroutines pile up if they're not exiting for some reason. I was imagining a single store-level goroutine that would periodically check to see whether we have a replica of r1 and if so try to do something to refresh the lease.


Comments from Reviewable

@a-robinson
Contributor Author

TFTR!


Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions.


pkg/storage/replica_proposal.go, line 274 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'm slightly worried that we could have a lot of these goroutines pile up if they're not exiting for some reason. I was imagining a single store-level goroutine that would periodically check to see whether we have a replica of r1 and if so try to do something to refresh the lease.

Fair enough. Done. Although only for the leaseholder replica. Do you really think it should be done on all replicas of r1 rather than just on (what we think is) the leaseholder?


pkg/storage/replica.go, line 1544 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I don't like the level of detailed knowledge of the lease request machinery here. Could we do something simpler like send a Get request to Replica.Send and let all the lease stuff happen under the covers?

That doesn't really help with the timing issue, and I'm not sure sending a request is really simpler than calling redirectOnOrAcquireLease. We could just call redirectOnOrAcquireLease frequently enough that we don't even need to think about the timing, because we can be sure we'll call it at least once at a time when it will actually kick off an asynchronous renewal with enough time to reasonably succeed.

I've changed it to do this, just calling it more frequently in place of the awkward timing logic.


Comments from Reviewable

@bdarnell
Contributor

:lgtm:


Reviewed 2 of 3 files at r2.
Review status: 1 of 2 files reviewed at latest revision, 1 unresolved discussion.


pkg/storage/replica_proposal.go, line 274 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Fair enough. Done. Although only for the leaseholder replica. Do you really think it should be done on all replicas of r1 rather than just on (what we think is) the leaseholder?

Yes, I was thinking that all replicas of r1 would periodically try to acquire its lease (driven only by a ticker in the goroutine, no need for a trigger in leasePostApply). But doing it this way so that it just renews leases and leaves new acquisitions to be done lazily is fine too.


pkg/storage/store.go, line 1560 at r2 (raw file):

// maybeGossipFirstRange is called on each (e.g.  #24753).
//
// NOTE: Currently the only lease that we proactively renew is r1, but this

I think we should go ahead and do this for all expiration-based leases now that it's not a goroutine per range.


Comments from Reviewable

@a-robinson
Contributor Author

Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions.


pkg/storage/store.go, line 1560 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think we should go ahead and do this for all expiration-based leases now that it's not a goroutine per range.

Done. The new commit may be worth a look, although it should just be what you expect.


Comments from Reviewable

@bdarnell
Contributor

LGTM

@a-robinson
Contributor Author

bors r+

@craig
Contributor

craig bot commented May 15, 2018

Merge conflict (retrying...)

Run a store-level background worker that proactively renews the lease
for r1. It could easily be extended to also renew other expiration-based
leases if we so desire in the future.

Fixes cockroachdb#24753

Release note: None

This builds on the previous commit, which proactively renewed only the first range's lease.

Release note: None
@a-robinson
Contributor Author

bors r+

@craig
Contributor

craig bot commented May 15, 2018

Build failed (retrying...)

craig bot pushed a commit that referenced this pull request May 15, 2018
25216: sql: fix a bunch of minor spacing issues in the grammar r=BramGruneir a=BramGruneir

I was trying to do some searching and couldn't find what I was looking for due
to inconsistent spacing.

We should move towards adding a linter for the grammar.

Release note: None

25322: storage: proactively renew r1's lease r=a-robinson a=a-robinson

This mechanism won't scale incredibly well to proactively renewing all
expiration-based leases since it'd require a goroutine per lease. It
should be fine if we only auto-renew r1's lease, though.

Fixes #24753

Release note: None

----------------

@bdarnell, do you have any suggestions for testing this? I've verified it manually, but it's hard to properly isolate this code in a realistic way. We could ratchet the lease durations way down and wait until the lease's expiration bumps, but that creates a timing-based test that is either flaky if we use a really short lease duration or slow if we use a long one.

Co-authored-by: Bram Gruneir <bram@cockroachlabs.com>
Co-authored-by: Alex Robinson <alexdwanerobinson@gmail.com>
@craig
Contributor

craig bot commented May 15, 2018

Build succeeded

@craig craig bot merged commit 5e49f9b into cockroachdb:master May 15, 2018
@andreimatei
Contributor

It would appear this made TestBehaviorDuringLeaseTransfer flaky:
#25573

@a-robinson a-robinson changed the title from "storage: proactively renew r1's lease" to "storage: proactively renew expiration-based leases" May 16, 2018
bdarnell added a commit to bdarnell/cockroach that referenced this pull request May 21, 2018
The introduction of proactive lease renewals (cockroachdb#25322) made this test
flaky in two ways. First, the test was (oddly) creating two Replica
objects for range 1 on the same store (since cockroachdb#2996), leading to races
when combined with the background lease renewal thread. Second, the
test expects leases to expire so it needs to disable automatic
renewals.

Fixes cockroachdb#25748

Release note: None
craig bot pushed a commit that referenced this pull request May 21, 2018
25781: storage: Deflake TestReplicaLease r=tschottdorf a=bdarnell

The introduction of proactive lease renewals (#25322) made this test
flaky in two ways. First, the test was (oddly) creating two Replica
objects for range 1 on the same store (since #2996), leading to races
when combined with the background lease renewal thread. Second, the
test expects leases to expire so it needs to disable automatic
renewals.

Fixes #25748

Release note: None

Co-authored-by: Ben Darnell <ben@cockroachlabs.com>
a-robinson added a commit to a-robinson/cockroach that referenced this pull request May 23, 2018
Caused by the proactive renewal of expiration-based leases (see cockroachdb#25322
and cockroachdb#25625 for more detail).

Fixes cockroachdb#25672

Release note: None
craig bot pushed a commit that referenced this pull request May 23, 2018
25676: storage: Fix flaky TestReplicaNotLeaseHolderError r=a-robinson a=a-robinson

Caused by the proactive renewal of expiration-based leases (see #25322
and #25625 for more detail).

Fixes #25731

Release note: None

I'm not sure how many more of these flakes to expect over the next couple weeks. I'm currently running `stressrace` on all of `pkg/storage` on a gceworker to try and ferret them out if there are any more, but there's no guarantee that'll find all of them.

25816: ui: don't show login indicator unless login is enabled r=couchand a=couchand

In #25005 I accidentally let the login indicator leak even if the environment
variable wasn't set. This corrects that issue.

Release note: None
Fixes: #25843

25853: dep: Bump grpc-go to include perf improvements and data race fix r=a-robinson a=a-robinson

Un-reverts #24883 to include grpc/grpc-go#2074

This also removed hdrhistogram-writer because it apparently isn't used
anymore.

Release note: None

Maybe we want to wait until they put this into a more formal release SHA, though.

Co-authored-by: Alex Robinson <alexdwanerobinson@gmail.com>
Co-authored-by: Andrew Couch <andrew@cockroachlabs.com>
craig bot pushed a commit that referenced this pull request Aug 28, 2018
28931: backport-2.0: storage: proactively renew expiration-based leases r=a-robinson a=a-robinson

Backport 2/2 commits from #25322.

/cc @cockroachdb/release

---

This mechanism won't scale incredibly well to proactively renewing all
expiration-based leases since it'd require a goroutine per lease. It
should be fine if we only auto-renew r1's lease, though.

Fixes #24753

Release note: None

----------------

To fix #27731 in v2.0.

Co-authored-by: Alex Robinson <alexdwanerobinson@gmail.com>
a-robinson added a commit to a-robinson/cockroach that referenced this pull request Aug 29, 2018
Caused by the proactive renewal of expiration-based leases (see cockroachdb#25322
and cockroachdb#25625 for more detail).

Fixes cockroachdb#25672

Release note: None
a-robinson pushed a commit to a-robinson/cockroach that referenced this pull request Aug 29, 2018
The introduction of proactive lease renewals (cockroachdb#25322) made this test
flaky in two ways. First, the test was (oddly) creating two Replica
objects for range 1 on the same store (since cockroachdb#2996), leading to races
when combined with the background lease renewal thread. Second, the
test expects leases to expire so it needs to disable automatic
renewals.

Fixes cockroachdb#25748

Release note: None