sqlliveness/slstorage: rework deletion loop to avoid contention #90875
Conversation
Force-pushed 8f4ec57 to bb0be96
@jeffswenson this may conflict a bit with #90408 in terms of decoding the ID from the key, but it won't be big.
andreimatei
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)
-- commits line 41 at r3:
Remind me what we believe we saw in that cluster. The deletion transaction was constantly retrying, and this was causing problems for jobs leases because... the deletion txn had taken locks on some session records and so those nodes couldn't heartbeat their own record? Or the deletion loop had taken a lock on one specific session record, which owned a job lease, and then nobody could delete that specific record?
pkg/sql/sqlliveness/slstorage/slstorage.go line 216 at r3 (raw file):
}
// We think that the session is expired, check, and maybe delete it.
expired; check
pkg/sql/sqlliveness/slstorage/slstorage.go line 219 at r3 (raw file):
resChan := s.deleteOrFetchSessionSingleFlight(ctx, sid)
// Release the lock after launching the single-flight to ensure that
I think this comment is weird because "releasing" the lock does not ensure anything. Consider putting a comment above the deleteOrFetch call saying that the singleflight will populate its result under the lock, and that ensures that all of the callers either join the flight, or see the result.
pkg/sql/sqlliveness/slstorage/slstorage.go line 241 at r3 (raw file):
}

func (s *Storage) deleteOrFetchSessionSingleFlight(
Please comment that the result on the channel will be a bool meaning something, and that the singleflight doesn't respond to ctx cancelation.
pkg/sql/sqlliveness/slstorage/slstorage.go line 241 at r3 (raw file):
}

func (s *Storage) deleteOrFetchSessionSingleFlight(
this method has two callers. One of them calls it while holding the mutex, and before calling it checks the cache. The other does neither of the two. I think this makes it difficult for the reader to figure out the singleflight's relationship to the mutex. If feasible, consider moving the cache checks inside this method, and always call the singleflight while holding the mutex.
pkg/sql/sqlliveness/slstorage/slstorage.go line 244 at r3 (raw file):
	ctx context.Context, sid sqlliveness.SessionID,
) <-chan singleflight.Result {
	// Launch singleflight to go read from the database and maybe delete the
lift this to a method comment
pkg/sql/sqlliveness/slstorage/slstorage.go line 248 at r3 (raw file):
	// cache. If it isn't found, we know it's dead and we can add that to the
	// deadSessions cache.
	resChan, _ := s.g.DoChan(string(sid), func() (interface{}, error) {
should s.g live under s.mu ?
pkg/sql/sqlliveness/slstorage/slstorage.go line 270 at r3 (raw file):
		s.mu.liveSessions.Add(sid, expiration)
	} else {
		s.mu.deadSessions.Del(sid)
what happens if we get rid of this Del and we simply Add blindly
pkg/sql/sqlliveness/slstorage/slstorage.go line 279 at r3 (raw file):
// deleteOrFetchSession returns whether the query session currently exists by
// reading from the database. If passed expiration is non-zero and the existing
stale comment
pkg/sql/sqlliveness/slstorage/slstorage.go line 377 at r3 (raw file):
}

func (s *Storage) fetchMaybeExpiredSessionIDs(
I think the "maybe" in the name is inappropriate. The sessions returned by this function are as expired as they get. The fact that the delete is not transactional with this function and so we need to check again is somebody else's concern.
pkg/sql/sqlliveness/slstorage/slstorage.go line 381 at r3 (raw file):
) (toCheck []sqlliveness.SessionID, _ error) {
	return toCheck, s.db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
		toCheck = nil // reset for restarts
instead of this, don't declare the toCheck variable in the return values; declare it inside this closure. It doesn't add anything to the function signature anyway
pkg/sql/sqlliveness/slstorage/slstorage.go line 397 at r3 (raw file):
		exp, err := decodeValue(rows[i])
		if err != nil {
			log.Warningf(ctx, "failed to decode row %s expiration: %v", rows[i].Key.String(), err)
consider panicking here instead, and below
Force-pushed bb0be96 to 76d6825
ajwerner
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei)
Previously, andreimatei (Andrei Matei) wrote…
Remind me what we believe we saw in that cluster. The deletion transaction was constantly retrying, and this was causing problems for jobs leases because... the deletion txn had taken locks on some session records and so those nodes couldn't heartbeat their own record? Or the deletion loop had taken a lock on one specific session record, which owned a job lease, and then nobody could delete that specific record?
The jobs table had references to sessions from old processes. The jobs adoption loop is trying to check the liveness of these dead sessions. There are live sessions which are successfully heartbeating. The deletion loop holds locks on the dead sessions. It keeps failing to refresh because of the heartbeating happening underneath it -- it doesn't lock those rows.
pkg/sql/sqlliveness/slstorage/slstorage.go line 216 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
expired; check
Done.
pkg/sql/sqlliveness/slstorage/slstorage.go line 219 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think this comment is weird because "releasing" the lock does not ensure anything. Consider putting a comment above the deleteOrFetch call saying that the singleflight will populate its result under the lock, and that ensures that all of the callers either join the flight, or see the result.
I reworded it a bit.
pkg/sql/sqlliveness/slstorage/slstorage.go line 241 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Please comment that the result on the channel will be a bool meaning something, and that the singleflight doesn't respond to ctx cancelation.
Done.
pkg/sql/sqlliveness/slstorage/slstorage.go line 241 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
this method has two callers. One of them calls it while holding the mutex, and before calling it checks the cache. The other does neither of the two. I think this makes it difficult for the reader to figure out the singleflight's relationship to the mutex. If feasible, consider moving the cache checks inside this method, and always call the singleflight while holding the mutex.
Done.
pkg/sql/sqlliveness/slstorage/slstorage.go line 244 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
lift this to a method comment
Done.
pkg/sql/sqlliveness/slstorage/slstorage.go line 248 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
should s.g live under s.mu?
Done.
pkg/sql/sqlliveness/slstorage/slstorage.go line 270 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
what happens if we get rid of this Del and we simply Add blindly
Done.
pkg/sql/sqlliveness/slstorage/slstorage.go line 279 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
stale comment
Done.
pkg/sql/sqlliveness/slstorage/slstorage.go line 377 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I think the "maybe" in the name is inappropriate. The sessions returned by this function are as expired as they get. The fact that the delete is not transactional with this function and so we need to check again is somebody else's concern.
Done.
pkg/sql/sqlliveness/slstorage/slstorage.go line 381 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
instead of this, don't declare the toCheck variable in the return values; declare it inside this closure. It doesn't add anything to the function signature anyway
I reworked it a bit, but the slice still needs to be declared outside of the closure.
pkg/sql/sqlliveness/slstorage/slstorage.go line 397 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
consider panicking here instead, and below
That seems strictly worse to me. A malformed row in this table shouldn't take down a cluster.
andreimatei
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained
TFTR! bors r+
Build failed (retrying...):
Build failed:
bors r+ flake was #76839
Build succeeded:
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.
error creating merge commit from 0b3cc80 to blathers/backport-release-22.1-90875: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 22.1.x failed. See errors above.
sqlliveness/slstorage: rework deletion loop to avoid contention
The periodic loop to delete abandoned records used to run a potentially
long-running transaction to discover and delete expired sessions. This
could lead to live-lock starvation in some scenarios. Consider a case
where there are live sessions which are heartbeating their records
successfully, and so stay alive. Those sessions are in the read set of
the deletion loop but not in its write set. Imagine now that the
deletion loop gets pushed and has to refresh. It will fail and need to
restart, but it will hold its locks. This livelock can persist forever
if the interval between heartbeats of live sessions is shorter than the
latency of the deletion loop's operations. In a cluster with 100 nodes
and a heartbeat interval of 5s, we'd expect a heartbeat every 50ms. If
the latency between the deletion loop and the leaseholder is, say, 50ms,
we're in big trouble because just the scan phase will take at least that
long.
This change avoids the large transaction altogether. It decouples
candidate discovery from removal. The usual process by which rows are
removed is able to avoid intents altogether and use 1PC. In this way,
starvation, or even waiting on locks, should be fully eliminated.
sqlliveness/slstorage: reduce the garbage collection loop frequency
Nothing ever scans the sqlliveness table. We don't expect it to grow very large
very rapidly. Also, we run this loop on every node. Running it every 20s never
made much sense. This commit changes it to run hourly.
sqlliveness/slstorage: use 1PC to avoid intents
There was no reason for these transactions to lay down intents. They can always
commit with 1PC. Use the API to achieve that.
Epic: None
Release note (bug fix): In large, multi-region clusters, it was possible for
the leasing mechanism used for jobs to get caught in a live-lock scenario
whereby jobs could not be adopted. This bug has been resolved.