storage: separated lock-table keyspace #41720
Summary
Move lock portion of write intents to a separate lock-table keyspace instead of storing it inline with MVCC data.
Motivation
Write intents are currently made up of an MVCCMetadata key and a provisional value key. Both are stored inline in the global MVCC keyspace, with the former stored at timestamp 0 of the write intent's key and the latter stored at the transaction's provisional commit timestamp at the time of writing the record.
Here's what this looks like visually for an intent at key "b":
```
/Table/"a"/123: {val1}
/Table/"b"/0:   {txn: txn1, ts: 999}
/Table/"b"/999: {val2}
/Table/"b"/500: {val3}
/Table/"c"/700: {val4}
```
We can think of the information stored in the MVCCMetadata key as a "write lock" and the information stored in the provisional value key as the "value under lock".
Storing the provisional value where it likely will remain after the transaction has committed is desirable because we only need to write it once. However, storing the write lock inline with the MVCC data is less of a clear win. Here are the only benefits I can think of for doing so:
- can discover locks and read data without needing to read from two different iterators.
- ... that's it?
I propose we move it to a separate "lock-table" keyspace for all of the following reasons:
- simplifies MVCC logic significantly. A scan can check for conflicts before looking at data. After this, scanning the data keyspace doesn't need to be aware of transaction isolation at all. We shouldn't even need to push transaction information into MVCC methods, just the read/write timestamp. This complexity has bitten us in the past and continues to (see time-bound iterator complications)
- avoids cluttering the MVCC keyspace with RocksDB deletion tombstones at timestamp zero of each key. This slows down scans
- opens up the possibility to "blind-write" into the MVCC keyspace (with cooperation from the write timestamp cache if we need to maintain single-key linearizability). This would be a significant win for write-heavy workloads into large LSMs
- opens up the possibility to store multiple write intents on a single key (see storage: allow multiple intents #5861)
- makes it significantly cleaner to store point and ranged read locks
- makes it easier to ignore closed timestamp/timestamp cache when a transaction is updating its own intent (see storage: adaptive closed timestamp target duration #36478 (comment))
- makes it easier to augment the durable lock-table with a cache of known committed transactions, which might allow us to resolve intents (release locks) without holding latches. This would reduce the contention footprint of transactions (see kv: block less on intent resolutions, reduce transaction contention footprint #22349)
- makes optimizations like using SingleDelete when resolving intents easier to make in isolation (see kvserver: investigate SingleDelete for raft log truncation #8979)
- makes scanning the intents on a Range cheap. Multiple other proposals have wanted this access pattern to be improved
- @bdarnell believes this would be good for filling up SSTs in the local keyspace
In addition to all of these benefits, I don't think the one downside (multiple storage engine scans) is as bad as it sounds. The lock table's RocksDB block will likely always remain in cache, so it's unlikely that this will have a noticeable impact on performance. In fact, the speedup due to the simplification of MVCC code (not to mention the other optimizations this allows) will likely outweigh the overhead of separated lock checking.
Strawman Proposal
- Remove Txn and IntentHistory fields from MVCCMetadata. Remove isolation concerns from MVCC logic.
- Create a new range-local lock-table keyspace. Store locks at keys with the structure /Local/<key>/lock/<txnID>. This ends up alongside transaction record keys, which isn't really a problem; it hints at the equivalence between these two concepts.
- Adjust the read and write paths to check the lock table after acquiring latches. The spans declared for latches would probably be the right input when checking for conflicting locks.
- Address one or more of the improvements listed above that can be made with this new structure.
This will require a fairly serious migration, but it seems like an important prereq for a number of desirable projects.
gz#9005