sqlliveness/slstorage: rework deletion loop to avoid contention#90875

Merged
craig[bot] merged 3 commits into cockroachdb:master from ajwerner:ajwerner/fix-sqlliveness-transactions
Oct 31, 2022

Conversation

@ajwerner
Contributor

@ajwerner ajwerner commented Oct 28, 2022

sqlliveness/slstorage: rework deletion loop to avoid contention

The periodic loop to delete abandoned records used to run a potentially
long-running transaction to discover and delete expired sessions. This
could lead to live-lock starvation in some scenarios. Consider a case
whereby there are live sessions which are heart-beating their records
successfully. In the meantime, local sessions heart-beat regularly and
stay alive. Those sessions are in the read set of the deletion loop but
not in the write set. Imagine now that the deletion loop gets pushed and
has to refresh. It will fail and need to restart, but it will hold its
locks. This livelock can persist forever if the rate of heartbeats of
live sessions has a period shorter than the latency of the deletion
loop's operations. In a cluster with 100 nodes and a heartbeat interval
of 5s, we'd expect a heartbeat every 50ms. If the latency between the
deletion loop and leaseholder is, say, 50ms, we're in big trouble because
just the scan phase will take at least that long.

This change avoids the large transaction altogether. It decouples candidate
discovery from removal. The usual process by which rows are removed is able
to avoid intents altogether and use 1PC. In this way, starvation, or even
waiting on locks should be fully eliminated.

sqlliveness/slstorage: reduce the garbage collection loop frequency

Nothing ever scans the sqlliveness table. We don't expect it to grow very large
very rapidly. Also, we run this loop on every node. Running it every 20s never
made much sense. This commit changes it to run hourly.

sqlliveness/slstorage: use 1PC to avoid intents

There was no reason for these transactions to lay down intents. They can always
commit with 1PC. Use the API to achieve that.

Epic: None

Release note (bug fix): In large, multi-region clusters, it was possible for
the leasing mechanism used for jobs to get caught in a live-lock scenario
whereby jobs could not be adopted. This bug has been resolved.

@ajwerner ajwerner requested a review from andreimatei October 28, 2022 20:26
@cockroach-teamcity
Member

This change is Reviewable

@ajwerner ajwerner force-pushed the ajwerner/fix-sqlliveness-transactions branch from 8f4ec57 to bb0be96 Compare October 31, 2022 01:50
@ajwerner
Contributor Author

@jeffswenson this may conflict a bit with #90408 in terms of decoding the ID from the key, but it won't be big.

Contributor

@andreimatei andreimatei left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)


-- commits line 41 at r3:
Remind me what we believe we say in that cluster. The deletion transaction was constantly retrying, and this was causing problems for jobs leases because... the deletion txn had taken locks on some session records and so those nodes couldn't heartbeat their own record? Or the deletion loop had taken a lock on one specific session record, which owned a job lease, and then nobody could delete that specific record?


pkg/sql/sqlliveness/slstorage/slstorage.go line 216 at r3 (raw file):

	}

	// We think that the session is expired, check, and maybe delete it.

expired; check


pkg/sql/sqlliveness/slstorage/slstorage.go line 219 at r3 (raw file):

	resChan := s.deleteOrFetchSessionSingleFlight(ctx, sid)

	// Release the lock after launching the single-flight to ensure that

I think this comment is weird because "releasing" the lock does not ensure anything. Consider putting a comment above the deleteOrFetch call saying that the singleflight will populate its result under the lock, and that ensures that all of the callers either join the flight, or see the result.


pkg/sql/sqlliveness/slstorage/slstorage.go line 241 at r3 (raw file):

}

func (s *Storage) deleteOrFetchSessionSingleFlight(

Please comment that the result on the channel will be a bool meaning something, and that the singleflight doesn't respond to ctx cancelation.


pkg/sql/sqlliveness/slstorage/slstorage.go line 241 at r3 (raw file):

}

func (s *Storage) deleteOrFetchSessionSingleFlight(

this method has two callers. One of them calls it while holding the mutex, and before calling it checks the cache. The other does neither of the two. I think this makes it difficult for the reader to figure out the singleflight's relationship to the mutex. If feasible, consider moving the cache checks inside this method, and always call the singleflight while holding the mutex.
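The locking protocol being suggested -- cache check and flight launch under one mutex, with the result published under the same mutex -- can be illustrated with a self-contained, hand-rolled stand-in for `singleflight.Group`. All names here (`checker`, `flight`, `check`) are hypothetical, not the slstorage API; the point is that there is no window where a caller can miss both the cache and the in-flight lookup.

```go
package main

import (
	"fmt"
	"sync"
)

// flight lets any number of callers wait for one shared result.
type flight struct {
	done  chan struct{} // closed once alive is populated
	alive bool
}

// checker coalesces concurrent liveness checks per session ID. The
// in-flight map and the dead-session cache live under one mutex.
type checker struct {
	mu       sync.Mutex
	inFlight map[string]*flight
	dead     map[string]bool      // cache of known-dead sessions
	lookup   func(id string) bool // simulated database read
}

// check returns a flight whose done channel is closed once the result is
// known. The cache check and the flight launch happen under one mutex,
// so every caller either joins the in-flight lookup or sees a cached
// result; there is no gap between the two.
func (c *checker) check(id string) *flight {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.dead[id] {
		f := &flight{done: make(chan struct{}), alive: false}
		close(f.done)
		return f
	}
	if f, ok := c.inFlight[id]; ok {
		return f // join the existing flight
	}
	f := &flight{done: make(chan struct{})}
	c.inFlight[id] = f
	go func() {
		alive := c.lookup(id)
		// Publish the result under the lock: callers that grabbed the
		// flight before this point see it via done; later callers see
		// the updated cache instead.
		c.mu.Lock()
		defer c.mu.Unlock()
		f.alive = alive
		if !alive {
			c.dead[id] = true
		}
		delete(c.inFlight, id)
		close(f.done)
	}()
	return f
}

func main() {
	c := &checker{
		inFlight: map[string]*flight{},
		dead:     map[string]bool{},
		lookup:   func(id string) bool { return id == "live" },
	}
	f := c.check("gone")
	<-f.done
	fmt.Println("alive:", f.alive)
}
```

Moving the cache check inside `check`, as the review suggests, is what makes this invariant hold for every caller rather than only the one that remembered to check first.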


pkg/sql/sqlliveness/slstorage/slstorage.go line 244 at r3 (raw file):

	ctx context.Context, sid sqlliveness.SessionID,
) <-chan singleflight.Result {
	// Launch singleflight to go read from the database and maybe delete the

lift this to a method comment


pkg/sql/sqlliveness/slstorage/slstorage.go line 248 at r3 (raw file):

	// cache. If it isn't found, we know it's dead and we can add that to the
	// deadSessions cache.
	resChan, _ := s.g.DoChan(string(sid), func() (interface{}, error) {

should s.g live under s.mu ?


pkg/sql/sqlliveness/slstorage/slstorage.go line 270 at r3 (raw file):

			s.mu.liveSessions.Add(sid, expiration)
		} else {
			s.mu.deadSessions.Del(sid)

what happens if we get rid of this Del and we simply Add blindly


pkg/sql/sqlliveness/slstorage/slstorage.go line 279 at r3 (raw file):

// deleteOrFetchSession returns whether the query session currently exists by
// reading from the database. If passed expiration is non-zero and the existing

stale comment


pkg/sql/sqlliveness/slstorage/slstorage.go line 377 at r3 (raw file):

}

func (s *Storage) fetchMaybeExpiredSessionIDs(

I think the "maybe" in the name is inappropriate. The sessions returned by this function are as expired as they get. The fact that the delete is not transactional with this function and so we need to check again is somebody else's concern.


pkg/sql/sqlliveness/slstorage/slstorage.go line 381 at r3 (raw file):

) (toCheck []sqlliveness.SessionID, _ error) {
	return toCheck, s.db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
		toCheck = nil // reset for restarts

instead of this, don't declare the toCheck variable in the return values; declare it inside this closure. It doesn't add anything to the function signature anyway
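For context on why that reset line exists at all, here is a minimal stand-alone model of a retryable closure; `retryTxn` and `errRetry` are hypothetical stand-ins for the `kv.Txn` retry loop. Without `toCheck = nil` at the top of the closure, a transaction restart would append the same rows a second time.

```go
package main

import (
	"errors"
	"fmt"
)

var errRetry = errors.New("txn restarted")

// retryTxn mimics a kv.Txn-style retry loop: it re-invokes fn until fn
// succeeds or the retry budget is exhausted.
func retryTxn(maxRetries int, fn func(attempt int) error) error {
	var err error
	for i := 0; i <= maxRetries; i++ {
		if err = fn(i); err == nil {
			return nil
		}
	}
	return err
}

func main() {
	// toCheck is declared outside the closure because the caller needs it
	// after the txn commits, but it must be reset at the top of the
	// closure: on a restart the closure runs again from scratch.
	var toCheck []string
	_ = retryTxn(1, func(attempt int) error {
		toCheck = nil // reset for restarts
		toCheck = append(toCheck, "a", "b")
		if attempt == 0 {
			return errRetry // force one restart
		}
		return nil
	})
	fmt.Println(toCheck) // [a b], not [a b a b]
}
```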


pkg/sql/sqlliveness/slstorage/slstorage.go line 397 at r3 (raw file):

				exp, err := decodeValue(rows[i])
				if err != nil {
					log.Warningf(ctx, "failed to decode row %s expiration: %v", rows[i].Key.String(), err)

consider panicking here instead, and below

@ajwerner ajwerner force-pushed the ajwerner/fix-sqlliveness-transactions branch from bb0be96 to 76d6825 Compare October 31, 2022 16:15
Contributor Author

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei)


-- commits line 41 at r3:

Previously, andreimatei (Andrei Matei) wrote…

Remind me what we believe we say in that cluster. The deletion transaction was constantly retrying, and this was causing problems for jobs leases because... the deletion txn had taken locks on some session records and so those nodes couldn't heartbeat their own record? Or the deletion loop had taken a lock on one specific session record, which owned a job lease, and then nobody could delete that specific record?

The jobs table had references to sessions from old processes. The jobs adoption loop is trying to check the liveness of these dead sessions. There are live sessions which are successfully heartbeating. The deletion loop holds locks on the dead sessions. It keeps failing to refresh because of the heartbeating happening underneath it -- it doesn't lock those rows.


pkg/sql/sqlliveness/slstorage/slstorage.go line 216 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

expired; check

Done.


pkg/sql/sqlliveness/slstorage/slstorage.go line 219 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think this comment is weird because "releasing" the lock does not ensure anything. Consider putting a comment above the deleteOrFetch call saying that the singleflight will populate its result under the lock, and that ensures that all of the callers either join the flight, or see the result.

I reworded it a bit.


pkg/sql/sqlliveness/slstorage/slstorage.go line 241 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Please comment that the result on the channel will be a bool meaning something, and that the singleflight doesn't respond to ctx cancelation.

Done.


pkg/sql/sqlliveness/slstorage/slstorage.go line 241 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

this method has two callers. One of them calls it while holding the mutex, and before calling it checks the cache. The other does neither of the two. I think this makes it difficult for the reader to figure out the singleflight's relationship to the mutex. If feasible, consider moving the cache checks inside this method, and always call the singleflight while holding the mutex.

Done.


pkg/sql/sqlliveness/slstorage/slstorage.go line 244 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

lift this to a method comment

Done.


pkg/sql/sqlliveness/slstorage/slstorage.go line 248 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

should s.g live under s.mu ?

Done.


pkg/sql/sqlliveness/slstorage/slstorage.go line 270 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

what happens if we get rid of this Del and we simply Add blindly

Done.


pkg/sql/sqlliveness/slstorage/slstorage.go line 279 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

stale comment

Done.


pkg/sql/sqlliveness/slstorage/slstorage.go line 377 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think the "maybe" in the name is inappropriate. The sessions returned by this function are as expired as they get. The fact that the delete is not transactional with this function and so we need to check again is somebody else's concern.

Done.


pkg/sql/sqlliveness/slstorage/slstorage.go line 381 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

instead of this, don't declare the toCheck variable in the return values; declare it inside this closure. It doesn't add anything to the function signature anyway

I reworked it a bit, but the slice still needs to be declared outside of the closure.


pkg/sql/sqlliveness/slstorage/slstorage.go line 397 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

consider panicking here instead, and below

That seems strictly worse to me. A malformed row in this table shouldn't take down a cluster.

Contributor

@andreimatei andreimatei left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@ajwerner
Contributor Author

TFTR!

bors r+

@craig
Contributor

craig bot commented Oct 31, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Oct 31, 2022

Build failed:

@ajwerner
Contributor Author

bors r+

flake was #76839

@craig
Contributor

craig bot commented Oct 31, 2022

Build succeeded:

@blathers-crl

blathers-crl bot commented Oct 31, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 0b3cc80 to blathers/backport-release-22.1-90875: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
