Flaky WAL page-version-gap on TimeSeries shard page-0 under concurrent inserts + background compaction in Raft HA #4458

## Summary

Under a Raft HA cluster, concurrent single-row INSERTs into a TimeSeries type that overlap a background compaction can cause a follower to observe a **WAL page-version gap** on the shard's page-0 (the mutable-bucket header page). The follower logs `WALVersionGapException`, force-triggers a snapshot resync, and during the resync window queries transiently fail (e.g. `Type with name '' was not found`) or ingestion stalls until recovery.

This is the root cause of the intermittent failure of `ThreeNodesTimeSeriesLoadTestIT` on the gRPC protocol.

The race is **flaky and timing/load-dependent**: it only manifests when execution is slow enough for the compaction page-0 write and a concurrent append page-0 write to interleave unfavorably. It is **not** caused by the `#4453` per-row append fix (see "Not related to #4453" below).

## Symptoms

Follower-side log (gap starts small, then cascades as resync churns):

```
WARNI [TransactionManager] Cannot apply changes to the database because modified page
      PageId(<db>/28/0) version in WAL (18) does not match with existent version (16) fileId=28
SEVER [ArcadeStateMachine] WAL version gap on follower - state divergence detected,
      triggering snapshot resync (db=<db>, txId=191): ...
SEVER [ArcadeStateMachine] Replication error at index 357: WAL version gap detected - snapshot resync required
```

Client side (during the resync window): `Type with name '' was not found`, or ingestion that never completes.

`PageId(<db>/28/0)` is page 0 of a TimeSeries shard file (the mutable-bucket **header** page), which is modified by every append (`updateHeaderStats`) and by compaction (`clearDataPages`, compaction flag).
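When triaging follower logs, the size of the gap can be pulled straight out of the warning text. The helper below is a minimal sketch, not ArcadeDB code: the class name `WalGapLogParser` is hypothetical, and it assumes the warning appears on a single line in the raw log (it is wrapped above for readability) and that a healthy apply advances the page version by exactly one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical triage helper (not part of ArcadeDB). */
final class WalGapLogParser {
  // Matches the fragment "version in WAL (18) does not match with existent version (16)".
  private static final Pattern GAP = Pattern.compile(
      "version in WAL \\((\\d+)\\) does not match with existent version \\((\\d+)\\)");

  /** Returns walVersion - existingVersion, or -1 if the line is not a gap warning. */
  static long gapSize(final String logLine) {
    final Matcher m = GAP.matcher(logLine);
    if (!m.find())
      return -1;
    return Long.parseLong(m.group(1)) - Long.parseLong(m.group(2));
  }

  public static void main(final String[] args) {
    final String line = "Cannot apply changes to the database because modified page "
        + "PageId(<db>/28/0) version in WAL (18) does not match with existent version (16) fileId=28";
    System.out.println(gapSize(line)); // prints 2
  }
}
```

Under the one-version-per-apply assumption, a result of 2 means exactly one intermediate page-0 version never reached the follower before this entry.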

## Reproduction

Added an embedded (no Docker) regression test that drives the scenario:
`grpcw/src/test/java/com/arcadedb/server/grpc/TimeSeriesGrpcHaConcurrentInsertIT.java` (2-node Raft HA + gRPC, 3 threads x 2000 single-row `INSERT INTO sensor SET ...`).

Observed behaviour (each config run 3x):

| Config | Gaps | Notes |
|---|---|---|
| main | 0/3 | clean at normal speed (~95s) |
| `#4453` branch HEAD (with dispatch) | 0/3 | clean at normal speed |
| `#4453` branch, dispatch removed | 0/3 | clean at normal speed |

The gap reproduced **only** in runs that were abnormally slow (240-320s instead of ~95s), i.e. when the machine was under heavy load (immediately after a parallel `mvn -am` build). So the embedded test is a useful smoke test but is **not yet a deterministic reproduction**. A deterministic repro likely needs an injected delay/hook between compaction Phase 4b and the SCHEMA_ENTRY ship, or a forced compaction overlap.
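One possible shape for such a hook is a pair of latches: the compaction path parks just before shipping, the test lets an append commit, then releases the ship. This is a sketch under assumptions only. `ShipDelayHook`, `beforeShip()` and the call site are hypothetical and do not exist in ArcadeDB; the demo uses plain threads in place of the real compaction and append paths.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical test hook; ArcadeDB does not expose this class. */
final class ShipDelayHook {
  private final CountDownLatch reached = new CountDownLatch(1);
  private final CountDownLatch release = new CountDownLatch(1);

  /** Would be called by the compaction path just before it ships its buffered WAL. */
  void beforeShip() throws InterruptedException {
    reached.countDown();
    if (!release.await(30, TimeUnit.SECONDS))
      throw new IllegalStateException("test never released the ship");
  }

  /** Called by the test: waits until compaction is parked at the hook. */
  boolean awaitReached(final long timeout, final TimeUnit unit) throws InterruptedException {
    return reached.await(timeout, unit);
  }

  /** Called by the test once the concurrent append has committed. */
  void release() {
    release.countDown();
  }

  /** Stand-in demo: forces "append" to be recorded before "ship". */
  static List<String> demoOrder() {
    final ShipDelayHook hook = new ShipDelayHook();
    final List<String> events = Collections.synchronizedList(new ArrayList<>());
    final Thread compaction = new Thread(() -> {
      try {
        hook.beforeShip();   // parks here until the test releases
        events.add("ship");
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    compaction.start();
    try {
      if (!hook.awaitReached(10, TimeUnit.SECONDS))
        throw new IllegalStateException("compaction never reached the hook");
      events.add("append");  // the concurrent append commits while the ship is held
      hook.release();
      compaction.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return new ArrayList<>(events);
  }
}
```

`ShipDelayHook.demoOrder()` returns `[append, ship]` on every run, which is the interleaving the slow runs appear to hit by chance.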

## Hypothesized mechanism

During compaction, mutating commits run under `RaftReplicatedDatabase.runWithCompactionReplication` with `isSchemaCommitThread=true`, so their WAL is **buffered** (`schemaWalBuffer`) and shipped later, atomically, as a single `SCHEMA_ENTRY` (`replicateSchema`). Meanwhile, compaction Phase 4b runs **lock-free**, so concurrent appends keep committing and ship their page-0 changes **immediately** as `TX_ENTRY`.

Sequence that produces the gap:

1. Compaction writes page-0 (Phase 0 flag and/or Phase 4c clear) at version V; this commit is **buffered**, not yet shipped.
2. A concurrent append writes page-0 at version V+1 and ships it **immediately** as `TX_ENTRY`.
3. The follower applies the append's `TX_ENTRY` (V+1) while still at V-1 (it never received V, which is sitting in the not-yet-shipped `SCHEMA_ENTRY`) -> `WALVersionGapException` -> snapshot resync.

The code comment in `RaftReplicatedDatabase` (around the `schemaWalBuffer` drain) asserts the buffered compaction WAL is "correctly-versioned ... without a version gap", but that holds only if no concurrent immediate `TX_ENTRY` bumps the same page's version while the compaction entry is still buffered and unshipped.
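The sequence above can be replayed with a toy model of the follower. Nothing here is ArcadeDB code: `PageZeroGapModel` and `Entry` are hypothetical, and the apply rule (an entry must advance the page version by exactly one) is an assumption inferred from the log in "Symptoms". The numbers read that log as V = 17: the follower sits at 16, the buffered compaction write is 17, and the immediate append is 18.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the hypothesized race; none of these types exist in ArcadeDB. */
final class PageZeroGapModel {
  /** One replicated page-0 write: how it was shipped and the page version it produces. */
  static final class Entry {
    final String kind;
    final int version;

    Entry(final String kind, final int version) {
      this.kind = kind;
      this.version = version;
    }
  }

  /** Applies entries in Raft log order, stopping at the first version gap. */
  static List<String> apply(final int startVersion, final List<Entry> raftLogOrder) {
    final List<String> outcome = new ArrayList<>();
    int current = startVersion;
    for (final Entry e : raftLogOrder) {
      if (e.version != current + 1) {
        outcome.add(e.kind + ": gap, WAL " + e.version + " vs existing " + current);
        break; // the real follower would trigger a snapshot resync at this point
      }
      current = e.version;
      outcome.add(e.kind + ": applied " + current);
    }
    return outcome;
  }

  public static void main(final String[] args) {
    // Append shipped first, buffered compaction entry shipped later.
    System.out.println(apply(16, Arrays.asList(
        new Entry("TX_ENTRY", 18), new Entry("SCHEMA_ENTRY", 17))));
    // Same writes shipped in page-version order.
    System.out.println(apply(16, Arrays.asList(
        new Entry("SCHEMA_ENTRY", 17), new Entry("TX_ENTRY", 18))));
  }
}
```

The first call yields `[TX_ENTRY: gap, WAL 18 vs existing 16]`, matching the 18-versus-16 warning; the second yields `[SCHEMA_ENTRY: applied 17, TX_ENTRY: applied 18]`.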

## Relevant code

- `ha-raft/.../RaftReplicatedDatabase.java` - `commit()` (immediate `TX_ENTRY` path), `runWithCompactionReplication()` / `schemaWalBuffer` / `replicateSchema()` (buffered `SCHEMA_ENTRY` path), `isSchemaCommitThread`.
- `engine/.../timeseries/TimeSeriesShard.java` - `compactInternal()` Phase 0 (page-0 flag) and Phase 4c (`clearDataPages`, page-0 reset); Phase 4b is lock-free by design.
- `engine/.../timeseries/TimeSeriesBucket.java` - `updateHeaderStats()` / `getOrCreateActiveDataPage()` write page-0 on every append.
- `engine/.../TransactionManager.applyChanges()` - the page-version-gap check that throws `WALVersionGapException`.
- `ha-raft/.../ArcadeStateMachine.applyTxEntry()` - converts the gap into a `ReplicationException` -> snapshot resync.

## Investigation / fix directions to evaluate

1. Make the compaction page-0 writes ship in-order with concurrent append `TX_ENTRY`s (e.g. do not buffer page-0 mutations that race with live appends; or hold the append lock across the buffered page-0 commit so no immediate `TX_ENTRY` can interleave a higher version while the buffered entry is in flight).
2. Alternatively, ship compaction mutable-bucket page changes via the same ordered `TX_ENTRY` path rather than the deferred `SCHEMA_ENTRY` buffer, so the Raft log order matches the page-version order.
3. Make follower apply tolerant of the buffered/immediate interleaving for TS shard pages without resorting to a full snapshot resync (last resort).
4. Build a deterministic reproduction (test hook to delay the `SCHEMA_ENTRY` ship relative to a concurrent append) so any fix is verifiable.
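To make the ordering goal behind directions 1 and 2 concrete, here is a minimal sketch of a per-page sequencer that only releases writes in page-version order. It is an illustration, not a proposed patch: `InOrderPageShipper` is hypothetical, it ignores Raft batching and multi-page transactions, and holding an append back until the buffered entry arrives would add latency that a real fix would need to weigh.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical per-page sequencer; not part of ArcadeDB. */
final class InOrderPageShipper {
  private final TreeMap<Integer, String> pending = new TreeMap<>();
  private int nextVersion;

  InOrderPageShipper(final int lastShippedVersion) {
    this.nextVersion = lastShippedVersion + 1;
  }

  /** Registers a page write and returns the entries that may now ship, lowest version first. */
  synchronized List<String> offer(final int version, final String kind) {
    pending.put(version, kind);
    final List<String> ready = new ArrayList<>();
    // Drain only while the lowest pending version is the next one the follower expects.
    while (!pending.isEmpty() && pending.firstKey() == nextVersion) {
      ready.add(pending.pollFirstEntry().getValue() + "@" + nextVersion);
      nextVersion++;
    }
    return ready;
  }
}
```

With `new InOrderPageShipper(16)`, `offer(18, "TX_ENTRY")` returns an empty list because version 17 is still outstanding, and the following `offer(17, "SCHEMA_ENTRY")` returns `[SCHEMA_ENTRY@17, TX_ENTRY@18]`, so the log order matches the page-version order.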

## Not related to #4453

`#4453` (silent sample loss on concurrent single-row TS INSERTs) is fixed by the compaction Phase-4 change alone; the per-row append `shardExecutor` dispatch in that PR is unnecessary and does not cause this gap (HEAD with the dispatch is also clean at normal speed). This page-0 ordering race is pre-existing/orthogonal and should be tracked and fixed separately.

