You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Move lock portion of write intents to a separate lock-table keyspace instead of storing it inline with MVCC data.
Motivation
Write intents are currently made up of an MVCCMetadata key and a provisional value key. Both are stored inline in the global MVCC keyspace, with the former stored at timestamp 0 of the write intent's key and the latter stored at the transaction's provisional commit timestamp at the time of writing the record.
Here's what this looks like visually for an intent at key "b":
We can think of the information stored in the MVCCMetadata key as a "write lock" and the information stored in the provisional value key as the "value under lock".
Storing the provisional value where it likely will remain after the transaction has committed is desirable because we only need to write it once. However, storing the write lock inline with the MVCC data is less of a clear win. Here are the only benefits I can think of for doing so:
can discover locks and read data without needing to read from two different iterators.
... that's it?
I propose we move it to a separate "lock-table" keyspace for all of the following reasons:
simplifies MVCC logic significantly. A scan can check for conflicts first before looking at data. After this, scanning the data keyspace doesn't need to be aware of transaction isolation at all. We shouldn't even need to push transaction information into MVCC methods, just the read/write timestamp. This complexity has bitten us in the past and continue to (see time-bound iterator complications)
avoids cluttering the MVCC keyspace with RocksDB deletion tombstones at timestamp zero of each key. This slows down scans
opens up the possibility to "blind-write" into the MVCC keyspace (with cooperation from the write timestamp cache if we need to maintain single-key linearizability). This would be a significant win for write-heavy workloads into large LSMs
makes scanning the intents on a Range cheap. Multiple other proposals have wanted this access pattern to be improved
@bdarnell believes this would be good for filling up SSTs in the local keyspace
In addition to all of these benefits, I don't think the one downside (multiple storage engine scans) is as bad as it sounds. The lock table's RocksDB block will likely always remain in cache, so it's unlikely that this will have a noticeable impact on performance. In fact, the speedup due to the simplification of MVCC code (not to mention the other optimizations this allows) will likely outweigh the overhead of separated lock checking.
Strawman Proposal
Remove Txn and IntentHistory fields from MVCCMetadata. Remove isolation concerns from MVCC logic.
Create new range-local lock-table keyspace. Store locks at keys with the structure: /Local/<key>/lock/<txnID>. This ends up being in-line with transaction record keys, which isn't really a problem. It hints at the equivalence between these two concepts.
Adjust read and write path to check lock-table after acquiring latches. The spans declared for latches would probably be the right input to check for conflicting locks.
Address one or more of the improvements listed above that can be made with this new structure.
This will require a fairly serious migration, but it seems like an important prereq for a number of desirable projects.
Summary
Move lock portion of write intents to a separate lock-table keyspace instead of storing it inline with MVCC data.
Motivation
Write intents are currently made up of an MVCCMetadata key and a provisional value key. Both are stored inline in the global MVCC keyspace, with the former stored at timestamp 0 of the write intent's key and the latter stored at the transaction's provisional commit timestamp at the time of writing the record.
Here's what this looks like visually for an intent at key "b":
We can think of the information stored in the MVCCMetadata key as a "write lock" and the information stored in the provisional value key as the "value under lock".
Storing the provisional value where it likely will remain after the transaction has committed is desirable because we only need to write it once. However, storing the write lock inline with the MVCC data is less of a clear win. Here are the only benefits I can think of for doing so:
I propose we move it to a separate "lock-table" keyspace for all of the following reasons:
SingleDeletefor raft log truncation #8979)In addition to all of these benefits, I don't think the one downside (multiple storage engine scans) is as bad as it sounds. The lock table's RocksDB block will likely always remain in cache, so it's unlikely that this will have a noticeable impact on performance. In fact, the speedup due to the simplification of MVCC code (not to mention the other optimizations this allows) will likely outweigh the overhead of separated lock checking.
Strawman Proposal
/Local/<key>/lock/<txnID>. This ends up being in-line with transaction record keys, which isn't really a problem. It hints at the equivalence between these two concepts.This will require a fairly serious migration, but it seems like an important prereq for a number of desirable projects.
gz#9005