storage: separated lock-table keyspace #41720
Summary
Move lock portion of write intents to a separate lock-table keyspace instead of storing it inline with MVCC data.
Motivation
Write intents are currently made up of an MVCCMetadata key and a provisional value key. Both are stored inline in the global MVCC keyspace, with the former stored at timestamp 0 of the write intent's key and the latter stored at the transaction's provisional commit timestamp at the time of writing the record.
Here's what this looks like visually for an intent at key "b":
```
/Table/"a"/123: {val1}
/Table/"b"/0:   {txn: txn1, ts: 999}
/Table/"b"/999: {val2}
/Table/"b"/500: {val3}
/Table/"c"/700: {val4}
```
We can think of the information stored in the MVCCMetadata key as a "write lock" and the information stored in the provisional value key as the "value under lock".
Storing the provisional value where it likely will remain after the transaction has committed is desirable because we only need to write it once. However, storing the write lock inline with the MVCC data is less of a clear win. Here are the only benefits I can think of for doing so:
- can discover locks and read data without needing to read from two different iterators.
- ... that's it?
I propose we move it to a separate "lock-table" keyspace for all of the following reasons:
- simplifies MVCC logic significantly. A scan can check for conflicts before looking at data. After this, scanning the data keyspace doesn't need to be aware of transaction isolation at all. We shouldn't even need to push transaction information into MVCC methods, just the read/write timestamp. This complexity has bitten us in the past and continues to (see time-bound iterator complications)
- avoids cluttering the MVCC keyspace with RocksDB deletion tombstones at timestamp zero of each key. This slows down scans
- opens up the possibility to "blind-write" into the MVCC keyspace (with cooperation from the write timestamp cache if we need to maintain single-key linearizability). This would be a significant win for write-heavy workloads into large LSMs
- opens up the possibility to store multiple write intents on a single key (see storage: allow multiple intents #5861)
- makes it significantly cleaner to store point and ranged read locks
- makes it easier to ignore closed timestamp/timestamp cache when a transaction is updating its own intent (see storage: adaptive closed timestamp target duration #36478 (comment))
- makes it easier to augment the durable lock-table with a cache of known committed transactions, which might allow us to resolve intents (release locks) without holding latches. This would reduce the contention footprint of transactions (see kv: block less on intent resolutions, reduce transaction contention footprint #22349)
- makes optimizations like using SingleDelete when resolving intents easier to make in isolation (see kvserver: investigate SingleDelete for raft log truncation #8979)
- makes scanning the intents on a Range cheap. Multiple other proposals have wanted this access pattern to be improved
- @bdarnell believes this would be good for filling up SSTs in the local keyspace
In addition to all of these benefits, I don't think the one downside (multiple storage engine scans) is as bad as it sounds. The lock table's RocksDB block will likely always remain in cache, so it's unlikely that this will have a noticeable impact on performance. In fact, the speedup due to the simplification of MVCC code (not to mention the other optimizations this allows) will likely outweigh the overhead of separated lock checking.
Strawman Proposal
- Remove Txn and IntentHistory fields from MVCCMetadata. Remove isolation concerns from MVCC logic.
- Create a new range-local lock-table keyspace. Store locks at keys with the structure /Local/<key>/lock/<txnID>. This ends up alongside transaction record keys, which isn't really a problem; it hints at the equivalence between these two concepts.
- Adjust the read and write paths to check the lock table after acquiring latches. The spans declared for latches would probably be the right input when checking for conflicting locks.
- Address one or more of the improvements listed above that can be made with this new structure.
This will require a fairly serious migration, but it seems like an important prereq for a number of desirable projects.
gz#9005