
fix(lite): preserve follow wait budget after lagged recovery #311

Merged
shikhar merged 4 commits into main from codex/fix-follow-wait-deadline
Mar 6, 2026


@shikhar (Member) commented Mar 6, 2026

Summary

  • persist the follow-mode wait deadline in read session state so lagged recovery does not reset it
  • only re-arm the wait budget when a read batch is actually delivered
  • add a regression test for lagged follow recovery when DB catch-up yields no records

Testing

  • cargo test -p s2-lite read_wait_is_not_reset_after_follow_lag_without_catchup_records -- --nocapture
  • cargo test -p s2-lite

@greptile-apps bot (Contributor) commented Mar 6, 2026

Greptile Summary

This PR fixes a bug in the lite backend's read session where a lagged broadcast recovery would silently reset the follow-mode wait deadline, potentially causing a session to block indefinitely instead of honouring the original wait budget.

Key changes:

  • wait and wait_deadline are moved from transient local variables into ReadSessionState, so they survive each continue 'session restart of the session loop triggered by a Lagged broadcast error.
  • ensure_wait_deadline() initialises the deadline idempotently (only when None), preserving any previously set deadline across re-entries into follow mode.
  • A wait_deadline_expired() guard is added at the top of each follow-mode entry, so that if the budget has already expired during lagged DB catch-up (especially when the catch-up scan yields no records), the session breaks out immediately rather than re-entering the select loop.
  • on_batch re-arms the deadline whenever real data is actually delivered (both during DB catch-up and live follow), which is a reasonable extension: the wait timer should reset when new records arrive.
  • A regression test (read_wait_is_not_reset_after_follow_lag_without_catchup_records) exercises the exact failure path using paused Tokio time, verifying that the session closes once the original deadline elapses even if the lagged DB catch-up produced no records.
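
The state handling described in the key changes above could look roughly like the following std-only sketch. The code in lite/src/backend/read.rs is not reproduced on this page, so the field types, the constructor, and the explicit `now` parameter are assumptions made for illustration; only the names ReadSessionState, wait, wait_deadline, ensure_wait_deadline, wait_deadline_expired, and on_batch come from the summary.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the read session state (field types assumed).
#[derive(Debug)]
struct ReadSessionState {
    /// Follow-mode wait budget; `None` means no wait limit was requested.
    wait: Option<Duration>,
    /// Absolute deadline derived from `wait`. Keeping it in the state lets it
    /// survive restarts of the session loop.
    wait_deadline: Option<Instant>,
}

impl ReadSessionState {
    fn new(wait: Option<Duration>) -> Self {
        Self { wait, wait_deadline: None }
    }

    /// Arm the deadline only if it is not already set (idempotent).
    fn ensure_wait_deadline(&mut self, now: Instant) {
        if self.wait_deadline.is_none() {
            self.wait_deadline = self.wait.map(|w| now + w);
        }
    }

    /// True once the current deadline, if any, has been reached.
    fn wait_deadline_expired(&self, now: Instant) -> bool {
        self.wait_deadline.map_or(false, |d| now >= d)
    }

    /// Re-arm the budget because a batch was actually delivered.
    fn on_batch(&mut self, now: Instant) {
        self.wait_deadline = self.wait.map(|w| now + w);
    }
}

fn main() {
    let t0 = Instant::now();
    let mut state = ReadSessionState::new(Some(Duration::from_secs(5)));

    state.ensure_wait_deadline(t0);
    // Re-entering follow mode 3 s later (for example after a lag) keeps the
    // original deadline instead of pushing it out.
    state.ensure_wait_deadline(t0 + Duration::from_secs(3));
    assert_eq!(state.wait_deadline, Some(t0 + Duration::from_secs(5)));
    assert!(state.wait_deadline_expired(t0 + Duration::from_secs(5)));

    // Delivering a batch at 4 s re-arms the budget relative to that moment.
    state.on_batch(t0 + Duration::from_secs(4));
    assert!(!state.wait_deadline_expired(t0 + Duration::from_secs(5)));
    println!("deadline is {:?} after t0", state.wait_deadline.unwrap() - t0);
}
```

Passing `now` explicitly is a choice made for this sketch so the behaviour can be checked without sleeping; the PR's own regression test is described as relying on paused Tokio time instead.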

Confidence Score: 5/5

  • This PR is safe to merge — it fixes a well-scoped bug with minimal, targeted changes and adds a deterministic regression test.
  • The change is minimal and contained entirely within lite/src/backend/read.rs. The logic is straightforward: moving a local variable into persistent state and guarding re-entry with an idempotent initialiser. The regression test covers the previously broken code path with paused Tokio time, making it deterministic. No other files or public APIs are affected. The fix directly addresses the root cause without introducing side effects.
  • No files require special attention.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Read Session Start\nwait_deadline = None] --> B{start_seq_num < tail?}
    B -- Yes --> C[DB Scan]
    C --> D{Records found?}
    D -- Yes --> E[on_batch\nresets wait_deadline = now + wait]
    E --> B
    D -- No --> F[start_seq_num = tail.seq_num]
    F --> B
    B -- No --> G{may_follow?}
    G -- No --> Z[End Session]
    G -- Yes --> H[client.follow]
    H --> I[ensure_wait_deadline\ninitialises ONLY if None]
    I --> J{wait_deadline_expired?}
    J -- Yes --> Z
    J -- No --> K[yield Heartbeat]
    K --> L{tokio::select! biased}
    L -- follow_rx msg OK --> M[on_batch\nresets wait_deadline = now + wait]
    M --> L
    L -- follow_rx Lagged --> N[continue 'session\ndeadline PRESERVED in state]
    N --> B
    L -- heartbeat sleep --> O[yield Heartbeat]
    O --> L
    L -- wait_sleep_until deadline --> Z
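
To make the difference in the flowchart concrete, here is a small simulation of the follow loop on a fake clock. It is a sketch under stated assumptions, not the backend code: the `Event` enum, `run_session`, and the `preserve_deadline` switch are hypothetical, time is modelled as a `Duration` since session start, and heartbeats are left out. With `preserve_deadline = false` the deadline is recomputed on every follow entry (the behaviour the PR describes as the bug); with `true` it is armed only when unset.

```rust
use std::time::Duration;

/// Illustrative events seen by the follow loop (not the backend's types).
#[derive(Clone, Copy)]
enum Event {
    /// Live records arrive `after` the previous event.
    Batch { after: Duration },
    /// The receiver lags `after` the previous event; the DB catch-up then
    /// takes `catchup` and yields no records.
    Lagged { after: Duration, catchup: Duration },
}

/// Returns the simulated time at which the session ends.
fn run_session(wait: Duration, events: &[Event], preserve_deadline: bool) -> Duration {
    let mut now = Duration::ZERO; // simulated clock
    let mut deadline: Option<Duration> = None; // lives across 'session restarts
    let mut pending = events.iter().copied();

    'session: loop {
        // Follow-mode entry: arm only when unset (fix) or always (old behaviour).
        if deadline.is_none() || !preserve_deadline {
            deadline = Some(now + wait);
        }
        // Expiry guard: the budget may have run out during catch-up.
        if deadline.map_or(false, |d| now >= d) {
            return now;
        }
        loop {
            let d = deadline.expect("armed on follow entry");
            // No more events: the wait timer is the next thing to fire.
            let Some(event) = pending.next() else { return d };
            let after = match event {
                Event::Batch { after } | Event::Lagged { after, .. } => after,
            };
            if now + after >= d {
                return d; // the wait timer fires before the event arrives
            }
            now += after;
            match event {
                // Delivered data re-arms the budget.
                Event::Batch { .. } => deadline = Some(now + wait),
                // Lag: run the (empty) catch-up, then restart the session loop.
                Event::Lagged { catchup, .. } => {
                    now += catchup;
                    continue 'session;
                }
            }
        }
    }
}

fn main() {
    let s = Duration::from_secs;
    let lags = [
        Event::Lagged { after: s(3), catchup: s(1) },
        Event::Lagged { after: s(3), catchup: s(1) },
    ];
    // Preserved deadline: the session ends when the original 5 s budget runs out.
    assert_eq!(run_session(s(5), &lags, true), s(5));
    // Deadline reset on every re-entry: each lag extends the wait.
    assert_eq!(run_session(s(5), &lags, false), s(13));
    println!("fixed: {:?}, old: {:?}",
        run_session(s(5), &lags, true), run_session(s(5), &lags, false));
}
```

In this model a stream that keeps lagging without producing catch-up records postpones the end of the session on every re-entry when the deadline is reset, which matches the "block indefinitely" risk noted in the summary.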

Last reviewed commit: ca1ae78

@shikhar shikhar merged commit 5f44437 into main Mar 6, 2026
16 checks passed
@shikhar shikhar deleted the codex/fix-follow-wait-deadline branch March 6, 2026 17:59
@s2-release-plz s2-release-plz bot mentioned this pull request Mar 6, 2026
shikhar pushed a commit that referenced this pull request Mar 6, 2026
## 🤖 New release

* `s2-lite`: 0.29.20 -> 0.29.21 (✓ API compatible changes)
* `s2-sdk`: 0.24.7 -> 0.24.8 (✓ API compatible changes)
* `s2-cli`: 0.29.20 -> 0.29.21

<details><summary><i><b>Changelog</b></i></summary><p>

## `s2-lite`

<blockquote>

## [0.29.21] - 2026-03-06

### Bug Fixes

- Allow http endpoints for S3-compatible object stores
([#303](#303))
- Preserve follow wait budget after lagged recovery
([#311](#311))
- Initialize durability notifier from close status
([#312](#312))

<!-- generated by git-cliff -->
</blockquote>

## `s2-sdk`

<blockquote>

## [0.24.8] - 2026-03-06

### Bug Fixes

- Compression and FrameSignal ordering
([#313](#313))

<!-- generated by git-cliff -->
</blockquote>

## `s2-cli`

<blockquote>

## [0.29.21] - 2026-03-06

<!-- generated by git-cliff -->
</blockquote>


</p></details>

---
This PR was generated with
[release-plz](https://github.com/release-plz/release-plz/).

Co-authored-by: s2-release-plz[bot] <262023388+s2-release-plz[bot]@users.noreply.github.com>