Flaky WAL page-version-gap on TimeSeries shard page-0 under concurrent inserts + background compaction in Raft HA #4458

@robfrank

Summary

Under a Raft HA cluster, concurrent single-row INSERTs into a TimeSeries type that overlap a background compaction can cause a follower to observe a WAL page-version gap on the shard's page-0 (the mutable-bucket header page). The follower logs WALVersionGapException, force-triggers a snapshot resync, and during the resync window queries transiently fail (e.g. Type with name '' was not found) or ingestion stalls until recovery.

This is the root cause of the intermittent failure of ThreeNodesTimeSeriesLoadTestIT on the gRPC protocol.

The race is flaky and timing/load-dependent: it only manifests when execution is slow enough for the compaction page-0 write and a concurrent append page-0 write to interleave unfavorably. It is not caused by the #4453 per-row append fix (see "Not related to #4453" below).

Symptoms

Follower-side log (gap starts small, then cascades as resync churns):

WARNI [TransactionManager] Cannot apply changes to the database because modified page
      PageId(<db>/28/0) version in WAL (18) does not match with existent version (16) fileId=28
SEVER [ArcadeStateMachine] WAL version gap on follower - state divergence detected,
      triggering snapshot resync (db=<db>, txId=191): ...
SEVER [ArcadeStateMachine] Replication error at index 357: WAL version gap detected - snapshot resync required

Client side (during the resync window): Type with name '' was not found, or ingestion that never completes.

PageId(<db>/28/0) is page 0 of a TimeSeries shard file (the mutable-bucket header page), which is modified by every append (updateHeaderStats) and by compaction (clearDataPages, compaction flag).
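The numbers in the log above can be read against a simple version rule. The sketch below is an illustration only: it assumes the follower applies a page change only when the WAL version is exactly one ahead of the version it already has, and the class, enum, and method names are hypothetical. The real check lives in TransactionManager.applyChanges() and may differ in detail.

```java
// Illustrative model of a page-version check; not ArcadeDB code.
class PageVersionCheckSketch {
  enum Outcome { APPLY, ALREADY_APPLIED, GAP }

  // Assumed rule: a change is applicable only if it moves the page from
  // existingVersion to existingVersion + 1.
  static Outcome check(int existingVersion, int walVersion) {
    if (walVersion <= existingVersion)
      return Outcome.ALREADY_APPLIED; // stale or duplicate change
    if (walVersion == existingVersion + 1)
      return Outcome.APPLY;           // next expected version
    return Outcome.GAP;               // at least one intermediate version is missing
  }

  public static void main(String[] args) {
    // Values from the follower log: WAL version 18, existing version 16.
    System.out.println(check(16, 18)); // GAP: version 17 never arrived
    System.out.println(check(16, 17)); // APPLY
  }
}
```

Under this assumed rule, the logged pair (18 vs 16) means exactly one page-0 version, 17, is missing on the follower.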

Reproduction

Added an embedded (no Docker) regression test that drives the scenario:
grpcw/src/test/java/com/arcadedb/server/grpc/TimeSeriesGrpcHaConcurrentInsertIT.java (2-node Raft HA + gRPC, 3 threads x 2000 single-row INSERT INTO sensor SET ...).

Observed behaviour (each config run 3x):

Config                              Gaps        Notes
main                                0/3         clean at normal speed (~95s)
#4453 branch HEAD (with dispatch)   0/3         clean at normal speed
#4453 branch, dispatch removed      3/3 clean   clean at normal speed

The gap reproduced only in runs that were abnormally slow (240-320s instead of ~95s), i.e. when the machine was under heavy load (immediately after a parallel mvn -am build). So the embedded test is a useful smoke test but is not yet a deterministic reproduction. A deterministic repro likely needs an injected delay/hook between compaction Phase 4b and the SCHEMA_ENTRY ship, or a forced compaction overlap.

Hypothesized mechanism

During compaction, mutating commits run under RaftReplicatedDatabase.runWithCompactionReplication with isSchemaCommitThread=true, so their WAL is buffered (schemaWalBuffer) and shipped later, atomically, as a single SCHEMA_ENTRY (replicateSchema). Meanwhile, compaction Phase 4b runs lock-free, so concurrent appends keep committing and ship their page-0 changes immediately as TX_ENTRY.

Sequence that produces the gap:

  1. Compaction writes page-0 (Phase 0 flag and/or Phase 4c clear) at version V; this commit is buffered, not yet shipped.
  2. A concurrent append writes page-0 at version V+1 and ships it immediately as TX_ENTRY.
  3. The follower applies the append's TX_ENTRY (V+1) while still at V-1 (it never received V, which is sitting in the not-yet-shipped SCHEMA_ENTRY) -> WALVersionGapException -> snapshot resync.

The code comment at RaftReplicatedDatabase (around the schemaWalBuffer drain) asserts the buffered compaction WAL is "correctly-versioned ... without a version gap", but that holds only if no concurrent, immediately shipped TX_ENTRY advances the same page's version while the compaction entry is still buffered.
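The hypothesized sequence can be modelled as a pure ordering problem, independent of the real replication code. The sketch below is a simplified simulation: each log entry is reduced to the page-0 version it carries, the follower rule (apply only version current + 1) is an assumption, and all names are hypothetical.

```java
// Illustrative simulation of log order vs. page-version order; not ArcadeDB code.
class BufferedVsImmediateShipSketch {

  // Replays page-0 versions in the order they appear in the replicated log.
  // Returns the index of the first entry the follower cannot apply, or -1
  // if the whole sequence applies cleanly.
  static int firstGap(int followerVersion, int... shippedVersions) {
    int current = followerVersion;
    for (int i = 0; i < shippedVersions.length; i++) {
      int wal = shippedVersions[i];
      if (wal <= current)
        continue;          // already applied, skip
      if (wal != current + 1)
        return i;          // version gap: an earlier change has not arrived
      current = wal;
    }
    return -1;
  }

  public static void main(String[] args) {
    int v = 17; // version written by the buffered compaction commit (V)

    // Racy order: the append's TX_ENTRY (V+1) ships before the SCHEMA_ENTRY (V).
    System.out.println(firstGap(v - 1, v + 1, v)); // 0: gap on the first entry

    // Log order matching page-version order: SCHEMA_ENTRY (V), then TX_ENTRY (V+1).
    System.out.println(firstGap(v - 1, v, v + 1)); // -1: no gap
  }
}
```

In this model the gap depends only on the order in which the two entries reach the log, which is consistent with the race being timing-dependent.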

Relevant code

  • ha-raft/.../RaftReplicatedDatabase.java - commit() (immediate TX_ENTRY path), runWithCompactionReplication() / schemaWalBuffer / replicateSchema() (buffered SCHEMA_ENTRY path), isSchemaCommitThread.
  • engine/.../timeseries/TimeSeriesShard.java - compactInternal() Phase 0 (page-0 flag) and Phase 4c (clearDataPages, page-0 reset); Phase 4b is lock-free by design.
  • engine/.../timeseries/TimeSeriesBucket.java - updateHeaderStats() / getOrCreateActiveDataPage() write page-0 on every append.
  • engine/.../TransactionManager.applyChanges() - the page-version-gap check that throws WALVersionGapException.
  • ha-raft/.../ArcadeStateMachine.applyTxEntry() - converts the gap into a ReplicationException -> snapshot resync.

Investigation / fix directions to evaluate

  1. Make the compaction page-0 writes ship in-order with concurrent append TX_ENTRYs (e.g. do not buffer page-0 mutations that race with live appends; or hold the append lock across the buffered page-0 commit so no immediate TX_ENTRY can interleave a higher version while the buffered entry is in flight).
  2. Alternatively, ship compaction mutable-bucket page changes via the same ordered TX_ENTRY path rather than the deferred SCHEMA_ENTRY buffer, so the Raft log order matches the page-version order.
  3. Make follower apply tolerant of the buffered/immediate interleaving for TS shard pages without resorting to a full snapshot resync (last resort).
  4. Build a deterministic reproduction (test hook to delay the SCHEMA_ENTRY ship relative to a concurrent append) so any fix is verifiable.
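For direction 4, one possible shape for a deterministic hook is a pair of latches that force the unfavorable interleaving. The sketch below is a standalone model, not a patch: the two threads, the latch names, and the list standing in for the Raft log are all hypothetical, and it does not use any ArcadeDB API.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative test-hook model that forces "append ships before buffered compaction".
class DelayedShipHookSketch {

  // Returns the page-0 versions in the order they reached the simulated log.
  static List<Integer> runForcedInterleaving() {
    final int v = 17; // version written by the compaction commit (V)
    List<Integer> raftLog = new CopyOnWriteArrayList<>();
    CountDownLatch compactionBuffered = new CountDownLatch(1);
    CountDownLatch appendShipped = new CountDownLatch(1);

    Thread compaction = new Thread(() -> {
      compactionBuffered.countDown();          // page-0 written at V, WAL only buffered
      if (!await(appendShipped)) return;       // hypothetical hook: delay the ship
      raftLog.add(v);                          // buffered entry ships late
    });
    Thread append = new Thread(() -> {
      if (!await(compactionBuffered)) return;  // append starts after the buffered write
      raftLog.add(v + 1);                      // page-0 at V+1 ships immediately
      appendShipped.countDown();
    });

    compaction.start();
    append.start();
    try {
      compaction.join();
      append.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for workers", e);
    }
    return raftLog;
  }

  // Waits up to 5 seconds so a broken hook fails the run instead of hanging it.
  private static boolean await(CountDownLatch latch) {
    try {
      return latch.await(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(runForcedInterleaving()); // [18, 17]
  }
}
```

With the latches in place the simulated log always ends up as V+1 followed by V, regardless of machine load. A real hook would need an equivalent delay point between the buffered page-0 commit and the SCHEMA_ENTRY ship.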

Not related to #4453

#4453 (silent sample loss on concurrent single-row TS INSERTs) is fixed by the compaction Phase-4 change alone; the per-row append shardExecutor dispatch in that PR is unnecessary and does not cause this gap (HEAD with the dispatch is also clean at normal speed). This page-0 ordering race is pre-existing/orthogonal and should be tracked and fixed separately.
