-
Notifications
You must be signed in to change notification settings - Fork 4.1k
libroach: timebound iterator metadata doesn't consider intents #28358
Description
Note: I haven't confirmed this behavior in practice, it's currently theoretical from reading code. Holding off on the severity label until it's confirmed. The fix will need to be backported into 2.0
When computing which timestamps are present in an sstable for the time bound iterator performance optimization, all keys without a timestamp are ignored, which includes intents. This means that certain (probably unlikely) compactions will result in a situation where a time bounded iterator reading from [t1,t2] will miss an intent at t2.
cockroach/c-deps/libroach/timebound.cc
Lines 38 to 53 in 8b684ff
| rocksdb::Status AddUserKey(const rocksdb::Slice& user_key, const rocksdb::Slice& value, | |
| rocksdb::EntryType type, rocksdb::SequenceNumber seq, | |
| uint64_t file_size) override { | |
| rocksdb::Slice unused; | |
| rocksdb::Slice ts; | |
| if (SplitKey(user_key, &unused, &ts) && !ts.empty()) { | |
| ts.remove_prefix(1); // The NUL prefix. | |
| if (ts_max_.empty() || ts.compare(ts_max_) > 0) { | |
| ts_max_.assign(ts.data(), ts.size()); | |
| } | |
| if (ts_min_.empty() || ts.compare(ts_min_) < 0) { | |
| ts_min_.assign(ts.data(), ts.size()); | |
| } | |
| } | |
| return rocksdb::Status::OK(); | |
| } |
Backup uses timebounded iterators and does a read at a certain timestamp until all intents below that timestamp have been resolved one way or another. This bug means that backup could incorrectly proceed with an unresolved intent. If this intent is later rolled back, the backup would contain a value that was never committed.
CDC will have similar correctness issues.
Several fixes are possible with various performance tradeoffs. An partial list:
- Include intents in the time bound metadata. This requires unmarshalling every mvcc metadata proto in the sst and is probably too expensive, especially for inline values like timeseries.
- Disable time bound metadata for any sstable containing intents. This will cause the time bound iterator to always look in the file, so will be correct but a performance pessimization.
- Some combination of the two based on how many intents are in the file or where in the keyspace it is.