sqlliveness/slstorage: rework deletion loop to avoid contention by ajwerner · Pull Request #90875 · cockroachdb/cockroach

ajwerner · 2022-10-28T20:26:06Z

sqlliveness/slstorage: rework deletion loop to avoid contention

The periodic loop to delete abandoned records used to to a potentially
long-running transaction to discover and delete expired sessions. This
could lead to live-lock starvation in some scenarios. Consider a case
whereby there are live sessions which are heart-beating their records
successfully. In the meantime, local sessions heart-beat regularly and
stay alive. Those sessions are in the read set of the deletion loop but
not in the write set. Imagine now that the deletion loop gets pushed and
has to refresh. It will fail and need to restart, but it will hold its
locks. This livelock can persist forever if the rate of heartbeats of
live sessions has a period shorter than the latency of the deletion
loop's operations. In a cluster with 100 nodes and a heartbeat interval
of 5s, we'd expect a heartbeat every 50ms. If the latency between the
deletion loop and leaseholder is, say, 50ms, we're in big trouble because
just the scan phase will take at least that long.

This change avoids the large transaction altogether. It decouples candidate
discover from removal. The usual process by which rows are removed is able
to avoid intents altogether and use 1PC. In this way, starvation, or even
waiting on locks should be fully eliminated.

sqlliveness/slstorage: reduce the garbage collection loop frequency

Nothing ever scans the sqlliveness table. We don't expect it to grow very large
very rapidly. Also, we run this loop on every node. Running it every 20s never
made much sense. This commit changes it to run hourly.

sqlliveness/slstorage: use 1PC to avoid intents

There was no reason for these transactions to lay down intents. They can always
commit with 1PC. Use the API to achieve that.

Epic: None

Release note (bug fix): In large, multi-region clusters, it was possible for
the leasing mechanism used for jobs to get caught in a live-lock scenario
whereby jobs could not be adopted. This bug has been resolved.

cockroach-teamcity · 2022-10-28T20:26:18Z

This change is

ajwerner · 2022-10-31T13:44:46Z

@jeffswenson this may conflict a bit with #90408 in terms of decoding the ID from the key, but it won't be big.

andreimatei

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)

-- commits line 41 at r3:
Remind me what we believe we say in that cluster. The deletion transaction was constantly retrying, and this was causing problems for jobs leases because... the deletion txn had taken locks on some session records and so those nodes couldn't heartbeat their own record? Or the deletion loop had taken a lock on one specific session record, which owned a job lease, and then nobody could delete that specific record?

pkg/sql/sqlliveness/slstorage/slstorage.go line 216 at r3 (raw file):

	}

	// We think that the session is expired, check, and maybe delete it.

expired; check

pkg/sql/sqlliveness/slstorage/slstorage.go line 219 at r3 (raw file):

	resChan := s.deleteOrFetchSessionSingleFlight(ctx, sid)

	// Release the lock after launching the single-flight to ensure that

I think this comment is weird because "releasing" the lock does not ensure anything. Consider putting a comment above the deleteOrFetch call saying that the singleflight will populate its result under the lock, and that ensures that all of the callers either join the flight, or see the result.

pkg/sql/sqlliveness/slstorage/slstorage.go line 241 at r3 (raw file):

}

func (s *Storage) deleteOrFetchSessionSingleFlight(

Please comment that the result on the channel will be a bool meaning something, and that the singleflight doesn't respond to ctx cancelation.

pkg/sql/sqlliveness/slstorage/slstorage.go line 241 at r3 (raw file):

}

func (s *Storage) deleteOrFetchSessionSingleFlight(

this method has two callers. One of them calls it while holding the mutex, and before calling it checks the cache. The other does neither of the two. I think this makes it difficult for the reader to figure out the singleflight's relationship to the mutex. If feasible, consider moving the cache checks inside this method, and always call the singleflight while holding the mutex.

pkg/sql/sqlliveness/slstorage/slstorage.go line 244 at r3 (raw file):

	ctx context.Context, sid sqlliveness.SessionID,
) <-chan singleflight.Result {
	// Launch singleflight to go read from the database and maybe delete the

lift this to a method comment

pkg/sql/sqlliveness/slstorage/slstorage.go line 248 at r3 (raw file):

	// cache. If it isn't found, we know it's dead and we can add that to the
	// deadSessions cache.
	resChan, _ := s.g.DoChan(string(sid), func() (interface{}, error) {

should s.g live under s.mu ?

pkg/sql/sqlliveness/slstorage/slstorage.go line 270 at r3 (raw file):

			s.mu.liveSessions.Add(sid, expiration)
		} else {
			s.mu.deadSessions.Del(sid)

what happens if we get rid of this Del and we simply Add blindly

pkg/sql/sqlliveness/slstorage/slstorage.go line 279 at r3 (raw file):

// deleteOrFetchSession returns whether the query session currently exists by
// reading from the database. If passed expiration is non-zero and the existing

stale comment

pkg/sql/sqlliveness/slstorage/slstorage.go line 377 at r3 (raw file):

}

func (s *Storage) fetchMaybeExpiredSessionIDs(

I think the "maybe" in the name is inappropriate. The sessions returned by this function are as expired as they get. The fact that the delete is not transactional with this function and so we need to check again is somebody else's concern.

pkg/sql/sqlliveness/slstorage/slstorage.go line 381 at r3 (raw file):

) (toCheck []sqlliveness.SessionID, _ error) {
	return toCheck, s.db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
		toCheck = nil // reset for restarts

instead of this, don't declare the toCheck variable in the return values; declare it inside this closure. It doesn't add anything to the function signature anyway

pkg/sql/sqlliveness/slstorage/slstorage.go line 397 at r3 (raw file):

				exp, err := decodeValue(rows[i])
				if err != nil {
					log.Warningf(ctx, "failed to decode row %s expiration: %v", rows[i].Key.String(), err)

consider panicking here instead, and below

There was no reason for these transactions to lay down intents. They can always commit with 1PC. Use the API to achieve that. Release note: None

Nothing ever scans the sqlliveness table. We don't expect it to grow very large very rapidly. Also, we run this loop on every node. Running it every 20s never made much sense. This commit changes it to run hourly. Release note: None

The periodic loop to delete abandoned records used to to a potentially long-running transaction to discover and delete expired sessions. This could lead to live-lock starvation in some scenarios. Consider a case whereby there are live sessions which are heart-beating their records successfully. In the meantime, local sessions heart-beat regularly and stay alive. Those sessions are in the read set of the deletion loop but not in the write set. Imagine now that the deletion loop gets pushed and has to refresh. It will fail and need to restart, but it will hold its locks. This livelock can persist forever if the rate of heartbeats of live sessions has a period shorter than the latency of the deletion loop's operations. In a cluster with 100 nodes and a heartbeat interval of 5s, we'd expect a heartbeat every 50ms. If the latency between the deletion loop and leaseholder is, say, 50ms, we're in big trouble because just the scan phase will take at least that long. This change avoids the large transaction altogether. It decouples candidate discover from removal. The usual process by which rows are removed is able to avoid intents altogether and use 1PC. In this way, starvation, or even waiting on locks should be fully eliminated. Release note (bug fix): In large, multi-region clusters, it was possible for the leasing mechanism used for jobs to get caught in a live-lock scenario whereby jobs could not be adopted. This bug has been resolved.

ajwerner

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei)

-- commits line 41 at r3: