Description
Update: The compaction filter performs RocksDB writes in background threads, and these writes may run concurrently with SST ingestion during apply-snapshot. We therefore introduce a range latch to ensure mutual exclusion between the compaction filter and apply-snapshot ingestion. For more details, refer to this PR: #18096.
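To make the idea of a range latch concrete, here is a minimal sketch, assuming half-open byte-string key ranges and a non-blocking acquire. The `RangeLatch` type and its methods are hypothetical names chosen for illustration; this is not the implementation from #18096.

```rust
use std::sync::Mutex;

/// Hypothetical range latch: holders of overlapping key ranges exclude each other.
/// Ranges are half-open, [start, end).
struct RangeLatch {
    held: Mutex<Vec<(Vec<u8>, Vec<u8>)>>,
}

impl RangeLatch {
    fn new() -> Self {
        RangeLatch { held: Mutex::new(Vec::new()) }
    }

    /// Two half-open ranges overlap when each one starts before the other ends.
    fn overlaps(held: &(Vec<u8>, Vec<u8>), start: &[u8], end: &[u8]) -> bool {
        held.0.as_slice() < end && start < held.1.as_slice()
    }

    /// Returns true and records the range if no overlapping range is held;
    /// returns false otherwise (the caller would retry or skip).
    fn try_acquire(&self, start: &[u8], end: &[u8]) -> bool {
        let mut held = self.held.lock().unwrap();
        if held.iter().any(|r| Self::overlaps(r, start, end)) {
            return false;
        }
        held.push((start.to_vec(), end.to_vec()));
        true
    }

    /// Releases a range previously recorded by `try_acquire`.
    fn release(&self, start: &[u8], end: &[u8]) {
        let mut held = self.held.lock().unwrap();
        if let Some(pos) = held
            .iter()
            .position(|r| r.0.as_slice() == start && r.1.as_slice() == end)
        {
            held.swap_remove(pos);
        }
    }
}
```

In this sketch, the ingestion path would acquire the latch for the region's key range before ingesting, and a background writer would fail to acquire any overlapping range until the ingestion releases it.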
Summary
This proposal introduces a special option to RocksDB that allows writes to continue during SST file ingestion. TiKV can safely enable this option because there are no concurrent threads writing data that overlap with the data being ingested.
Motivation
Currently, RocksDB pauses writes during SST file ingestion, which increases TiKV’s apply wait duration. Avoiding this pause would reduce the duration of foreground writes.
Design Details
We propose adding an allow_write option to RocksDB’s IngestExternalFileOptions. When set to true, RocksDB will no longer pause writes to the database during SST ingestion.
TiKV can safely enable the allow_write option when ingesting SST files during apply-snapshot or destroy-region (delete by ingest) because:
- During apply-snapshot or destroy-region, no foreground writes overlap with the region being ingested.
- The single-threaded region-worker ensures sequential execution of apply-snapshot and destroy-region tasks. Even if a region is migrated and quickly returns, apply-snapshot and destroy-region cannot occur concurrently.
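The second point, sequential execution on a single worker thread, can be sketched as follows. This is a toy model under stated assumptions: `RegionTask` and `run_region_worker` are hypothetical names that mirror the prose, not TiKV's actual types, and the worker only records tasks instead of ingesting SST files.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical task types; the names mirror the prose, not TiKV's code.
#[derive(Debug, Clone, PartialEq)]
enum RegionTask {
    ApplySnapshot { region_id: u64 },
    DestroyRegion { region_id: u64 },
}

/// Runs every task on one worker thread and returns them in execution order.
fn run_region_worker(tasks: Vec<RegionTask>) -> Vec<RegionTask> {
    let (tx, rx) = mpsc::channel::<RegionTask>();
    let worker = thread::spawn(move || {
        let mut executed = Vec::new();
        // The loop handles one task at a time, so two ingestions never overlap.
        for task in rx {
            // A real worker would ingest SST files here.
            executed.push(task);
        }
        executed
    });
    for task in tasks {
        tx.send(task).expect("worker thread is alive");
    }
    // Closing the sending side lets the worker loop finish.
    drop(tx);
    worker.join().expect("worker thread panicked")
}
```

Because all tasks pass through one channel and one thread, a destroy-region followed by an apply-snapshot for the same region runs strictly in submission order, even if the region migrates away and returns quickly.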
Drawbacks
Alternatives
An alternative approach is to modify RocksDB to support ingestion without pausing writes by restructuring the sequence number assignment process and temporarily pausing the memtable flushing instead of writes:
- Assign and publish sequence numbers to the SST before ingestion.
- Flush memtables if there are overlapping keys.
- Pause further memtable flushes.
- Perform the SST ingestion.
- Resume memtable flushes.
While this approach ensures safe sequence number assignment, it introduces two major drawbacks:
- Snapshot immutability is broken: Suppose a new snapshot is created after sequence numbers have been assigned and published to the ingested SST files. Initially, this snapshot does not see the ingested SST files because the ingestion has not started. Once the ingestion completes, the same snapshot can see the files, which breaks snapshot immutability.
- Atomic writes are broken: RocksDB SST file ingestion is not atomic, especially when it involves multiple column families. If sequence numbers are published before the ingestion completes, snapshots may observe a partial state in which some SST data is visible while other data is not, violating atomicity guarantees.
These issues may not impact TiKV in practice due to its usage patterns. However, the complexity and potential risks outweigh the benefits.
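The snapshot immutability problem can be reproduced with a toy model. The sketch below assumes a store in which a snapshot is just the published sequence number at creation time and a read returns the newest version at or below that number. `ToyDb`, `Snapshot`, and their methods are hypothetical and greatly simplified; they are not RocksDB's API.

```rust
use std::collections::BTreeMap;

/// Toy store: each key maps to versions tagged with a sequence number.
#[derive(Default)]
struct ToyDb {
    versions: BTreeMap<String, Vec<(u64, String)>>,
    published_seq: u64,
}

/// A snapshot only remembers the sequence number visible at creation time.
struct Snapshot {
    seq: u64,
}

impl ToyDb {
    /// Step 1 of the alternative: assign and publish a sequence number
    /// before the ingestion happens.
    fn assign_and_publish_seq(&mut self) -> u64 {
        self.published_seq += 1;
        self.published_seq
    }

    fn snapshot(&self) -> Snapshot {
        Snapshot { seq: self.published_seq }
    }

    /// Step 4: ingested data appears under the pre-published sequence number.
    fn ingest(&mut self, seq: u64, key: &str, value: &str) {
        self.versions
            .entry(key.to_string())
            .or_default()
            .push((seq, value.to_string()));
    }

    /// Returns the newest version whose sequence number the snapshot can see.
    fn get(&self, snap: &Snapshot, key: &str) -> Option<&str> {
        self.versions
            .get(key)?
            .iter()
            .filter(|(seq, _)| *seq <= snap.seq)
            .max_by_key(|(seq, _)| *seq)
            .map(|(_, v)| v.as_str())
    }
}
```

In this model, a snapshot taken after `assign_and_publish_seq` but before `ingest` first reads nothing for the ingested key and later reads the ingested value, so the same snapshot returns two different answers.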