[Proposal] Ingest SST without pausing writes #18081

@hhwyt

Update: The compaction filter performs RocksDB writes in background threads, and these writes may run concurrently with SST ingestion during apply-snapshot. We therefore introduce a range latch to ensure mutual exclusion between the compaction filter and apply-snapshot ingestion. For more details, refer to this PR: #18096.
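
To make the idea of a range latch concrete, here is a minimal sketch. It is not the implementation from #18096; the names `KeyRange` and `RangeLatch` and the half-open range convention are assumptions made for illustration only. A caller acquires a key range before writing, and any other caller whose range overlaps must wait until the first one releases.

```rust
use std::sync::{Condvar, Mutex};

/// Hypothetical half-open key range [start, end).
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct KeyRange {
    pub start: Vec<u8>,
    pub end: Vec<u8>,
}

impl KeyRange {
    pub fn new(start: &[u8], end: &[u8]) -> Self {
        KeyRange { start: start.to_vec(), end: end.to_vec() }
    }

    /// Two half-open ranges overlap when each starts before the other ends.
    pub fn overlaps(&self, other: &KeyRange) -> bool {
        self.start < other.end && other.start < self.end
    }
}

/// Hypothetical latch: at most one holder for any set of overlapping ranges.
#[derive(Default)]
pub struct RangeLatch {
    held: Mutex<Vec<KeyRange>>,
    released: Condvar,
}

impl RangeLatch {
    pub fn new() -> Self {
        Self::default()
    }

    /// Non-blocking attempt; returns false if an overlapping range is held.
    pub fn try_acquire(&self, range: &KeyRange) -> bool {
        let mut held = self.held.lock().unwrap();
        if held.iter().any(|r| r.overlaps(range)) {
            return false;
        }
        held.push(range.clone());
        true
    }

    /// Blocks until no held range overlaps `range`, then records it as held.
    pub fn acquire(&self, range: &KeyRange) {
        let mut held = self.held.lock().unwrap();
        while held.iter().any(|r| r.overlaps(range)) {
            held = self.released.wait(held).unwrap();
        }
        held.push(range.clone());
    }

    /// Releases a previously acquired range and wakes all waiters.
    pub fn release(&self, range: &KeyRange) {
        let mut held = self.held.lock().unwrap();
        if let Some(pos) = held.iter().position(|r| r == range) {
            held.remove(pos);
        }
        drop(held);
        self.released.notify_all();
    }
}
```

In this sketch, the ingestion path would call `acquire` on the region's range before ingesting and `release` afterwards, and the compaction filter would do the same around its writes, so the two never write overlapping keys at the same time.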

Summary

This proposal introduces a special option to RocksDB that allows writes to continue during SST file ingestion. TiKV can safely enable this option because no concurrent thread writes data that overlaps with the data being ingested.

Motivation

Currently, RocksDB pauses writes during SST file ingestion, which increases TiKV’s apply wait duration. This behavior should be optimized to reduce foreground duration.

Design Details

We propose adding an allow_write option to RocksDB’s IngestExternalFileOptions. When set to true, RocksDB will no longer pause writes to the database during SST ingestion.

TiKV can safely enable the allow_write option when ingesting SST files during apply-snapshot or destroy-region (delete by ingest) because:

  • During apply-snapshot or destroy-region, no foreground writes overlap with the region being ingested.
  • The single-threaded region-worker ensures sequential execution of apply-snapshot and destroy-region tasks. Even if a region is migrated and quickly returns, apply-snapshot and destroy-region cannot occur concurrently.
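
The second point can be illustrated with a small model of a single-threaded worker. This is a sketch under assumptions, not TiKV's region-worker: `RegionTask` and `run_region_worker` are hypothetical names. All tasks go through one channel and are drained by one thread, so an apply-snapshot and a destroy-region for the same region can never run at the same time, and they run in the order they were submitted.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical task types handled by the worker.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RegionTask {
    ApplySnapshot { region_id: u64 },
    DestroyRegion { region_id: u64 },
}

/// Runs every task on a single worker thread and returns the tasks
/// in the order the worker executed them.
pub fn run_region_worker(tasks: Vec<RegionTask>) -> Vec<RegionTask> {
    let (tx, rx) = mpsc::channel::<RegionTask>();

    let worker = thread::spawn(move || {
        let mut executed = Vec::new();
        // One thread drains the queue, so tasks never overlap in time.
        for task in rx {
            // A real worker would ingest SST files here.
            executed.push(task);
        }
        executed
    });

    for task in tasks {
        tx.send(task).unwrap();
    }
    // Closing the sender ends the worker's loop once the queue is empty.
    drop(tx);

    worker.join().unwrap()
}
```

Even when a region is destroyed and then re-created right away, the worker finishes the destroy task before it starts the following apply task.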

Drawbacks

Alternatives

An alternative approach is to modify RocksDB to support ingestion without pausing writes by restructuring the sequence number assignment process and temporarily pausing memtable flushes instead of writes:

  1. Assign and publish sequence numbers to the SST before ingestion.
  2. Flush memtables if there are overlapping keys.
  3. Pause further memtable flushes.
  4. Perform the SST ingestion.
  5. Resume memtable flushes.

While this approach ensures safe sequence number assignment, it introduces two major drawbacks:

  • Immutable snapshot is broken: Suppose a new snapshot is created after sequence numbers have been assigned and published to the ingested SST files. Initially, this snapshot might not see the ingested SST files because the ingestion has not started. Once the ingestion completes, the snapshot can see the files, breaking snapshot immutability.
  • Atomic writes are broken: RocksDB SST file ingestion is not atomic, especially when it involves multiple column families. If sequence numbers are published before the ingestion completes, snapshots may observe a partial state where some SST data is visible while the rest is not, violating atomicity guarantees.
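
The first drawback can be reproduced in a toy model. The sketch below is not RocksDB code; `ToyDb`, `Snapshot`, and their methods are hypothetical and model only sequence-number visibility. A snapshot taken after the sequence number is published but before the data is ingested returns different results for the same key over time.

```rust
use std::collections::BTreeMap;

/// Toy multi-version store: each key maps to a list of (seq, value) versions.
#[derive(Default)]
pub struct ToyDb {
    data: BTreeMap<String, Vec<(u64, String)>>,
    last_published_seq: u64,
}

/// A snapshot sees every version whose sequence number is <= `seq`.
pub struct Snapshot {
    pub seq: u64,
}

impl ToyDb {
    pub fn new() -> Self {
        Self::default()
    }

    /// Ordinary write: assigns and publishes a new sequence number.
    pub fn put(&mut self, key: &str, value: &str) {
        self.last_published_seq += 1;
        let seq = self.last_published_seq;
        self.data.entry(key.to_string()).or_default().push((seq, value.to_string()));
    }

    /// Step 1 of the alternative: publish a sequence number for the SST
    /// before any of its data is visible.
    pub fn publish_seq_for_ingest(&mut self) -> u64 {
        self.last_published_seq += 1;
        self.last_published_seq
    }

    pub fn snapshot(&self) -> Snapshot {
        Snapshot { seq: self.last_published_seq }
    }

    /// Step 4 of the alternative: the SST data becomes visible under the
    /// sequence number that was published earlier.
    pub fn ingest(&mut self, seq: u64, entries: &[(&str, &str)]) {
        for &(k, v) in entries {
            self.data.entry(k.to_string()).or_default().push((seq, v.to_string()));
        }
    }

    /// Returns the newest version of `key` visible to `snap`.
    pub fn get(&self, snap: &Snapshot, key: &str) -> Option<String> {
        self.data
            .get(key)?
            .iter()
            .filter(|(seq, _)| *seq <= snap.seq)
            .max_by_key(|(seq, _)| *seq)
            .map(|(_, v)| v.clone())
    }
}
```

In this model, reading key "a" through the same snapshot returns the old value before `ingest` runs and the ingested value afterwards, which is exactly the immutability violation described above.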

These issues may not impact TiKV in practice due to its usage patterns. However, the complexity and potential risks outweigh the benefits.

Metadata

Labels

  • affects-7.5: This bug affects the 7.5.x (LTS) versions.
  • affects-8.1: This bug affects the 8.1.x (LTS) versions.
  • affects-8.5: This bug affects the 8.5.x (LTS) versions.
  • report/customer: Customers have encountered this bug.
  • type/enhancement: The issue or PR belongs to an enhancement.
