Skip to content

fix: resume watch-mode tailing after flock race marks session completed#305

Merged
umputun merged 15 commits intomasterfrom
watch-mode-reactivate-tailing
Apr 25, 2026
Merged

fix: resume watch-mode tailing after flock race marks session completed#305
umputun merged 15 commits intomasterfrom
watch-mode-reactivate-tailing

Conversation

@umputun
Copy link
Copy Markdown
Owner

@umputun umputun commented Apr 25, 2026

Fixes #283.

When the watch-mode dashboard runs as a separate process from the executor (pm2 start ralphex -s -w /path + separate ralphex docs/plans/...), RefreshStates would mark the session completed whenever the 5s tick hit a moment where TryLockFile won the race against the executor's flock. Once marked completed, startTailingIfNeeded refused to restart tailing, so the dashboard froze even though the executor was still writing to the progress file.

Fix — option B from the issue thread: on a write event for a completed session, reactivate it and resume tailing from a tracked offset. Flock-based detection in RefreshStates stays — reactivation is the recovery path, not a replacement.

Mechanics

  • Tailer.Offset() and Tailer.StartFromOffset(offset) expose and seek the internal offset
  • Session.lastOffset captured inside StopTailing() before the tailer is stopped
  • Session.Reactivate() starts a new tailer at lastOffset (or fromEnd if offset is 0), then flips state to active only on success
  • loadProgressFileIntoSession records bytes read pre-trim, so CRLF/LF/no-newline all count correctly
  • Watcher.handleProgressFileChange calls Reactivate() for the specific written path when the session is completed and IsLoaded() (avoids racing the loader)

Windows noteflock is a no-op on Windows so IsActive always returns false. Reactivation-on-write becomes the primary streaming mechanism on Windows, not just a recovery path. No platform-specific code needed.

Tests

  • TDD reproduction: TestWatcher_ResumesStreamingAfterFlockRace (red without the fix, green with it)
  • offset/seek edges: zero-fallback, beyond-file-size, CRLF, LF, no-trailing-newline
  • Reactivate: idempotent, resume-from-offset (no duplicates), failed-start-leaves-state-unchanged, on-closed-session
  • watcher: only the written path is reactivated (sibling completed sessions stay completed)
  • 86.9% coverage in pkg/web

Verified end-to-end against the toy project — dashboard streaming survives the 5s RefreshStates tick mid-run.

umputun added 14 commits April 24, 2026 12:57
Adds TestWatcher_ResumesStreamingAfterFlockRace to pkg/web/watcher_test.go
that reproduces issue #283: after the flock race marks a session completed,
subsequent writes to the progress file do not reactivate the session and
streaming stops.

Skipped with t.Skip so the commit stays green; verified manually that the
test fails on master when un-skipped. Will be un-skipped in Task 5 once
the reactivation fix lands, completing the red-green TDD cycle for this bug.
Adds Offset() accessor and StartFromOffset(offset) to Tailer so callers
can capture the current read position and later resume tailing from
that byte offset without re-emitting already-processed content.
Offsets beyond EOF are clamped; offset <= 0 falls back to seek-to-end.

Groundwork for fixing watch-mode dashboard tailing after a flock race
(issue #283) — by itself this change has no runtime effect, subsequent
tasks will wire the offset through Session and Watcher.
add Session.lastOffset field with unexported getLastOffset/setLastOffset
accessors. StopTailing captures the tailer's current byte offset before
stopping so a subsequent Reactivate can resume from the exact byte where
tailing stopped, avoiding duplicate events in the SSE replay buffer.
Extract a private startTailerLocked helper and a tailerStartMode enum
(modeFromStart, modeFromEnd, modeResume) so StartTailing and Reactivate
share one tailer-launch path under the session write lock.

Reactivate resumes a completed session when a write event arrives after
a flock race in RefreshStates. It picks modeResume when lastOffset>0 so
events already in the SSE replay buffer aren't re-emitted; when
lastOffset is 0 it falls back to modeFromEnd (not fromStart) to avoid
replaying historical content the loader already published. State flips
to active only after the tailer is confirmed started, and the call is
idempotent while a tailer is running.
loadProgressFileIntoSession now accumulates bytes read from each
ReadString('\n') (including the delimiter, before trimLineEnding) and
stores the total in session.lastOffset. A subsequent Reactivate() can
then resume tailing from after the already-loaded content instead of
re-emitting events into the SSE replay buffer.
After Discover runs, handleProgressFileChange looks up the session for the
written path and calls Reactivate() when the session is still completed and
loaded. This recovers from the RefreshStates flock race that briefly marks a
running session as completed.

Scoped to the exact written path so other completed sessions in the same
directory are not reactivated. IsLoaded() gate avoids racing the initial
loader.

Un-skips TestWatcher_ResumesStreamingAfterFlockRace (the Task 0 TDD
reproduction) and adds coverage:
- TestWatcher_ReactivatesCompletedSessionOnWrite
- TestWatcher_DoesNotReactivateActiveSession
- TestWatcher_OnlyReactivatesWrittenPath

Tests subscribe to SSE via httptest + sse.Read to verify events arrive exactly
once without relying on internal tailer state.

Related to #283
Run full test suite, race detector, linter, formatter, and coverage check
for the issue #283 fix. All gates pass: tests green, no races, lint clean,
pkg/web coverage at 87.6% with new methods at 100% (Reactivate, Offset,
getLastOffset, setLastOffset) and the bug-fix hot paths above 84%.
Ran the two-process scenario from the plan: watch-mode dashboard on one
side, real ralphex executor on the other, both against /tmp/ralphex-test.

Verified live SSE streaming throughout an 80+ second claude-driven task
iteration, including survival past the 5-second RefreshStates tick with
no freezing. Verified reactivation after kill+restart: a completed
session flipped back to active on the next write event from a new
executor targeting the same progress file. Clean shutdown confirmed -
no zombie tailers, state converged to completed after flock release.

Also ran a direct-write simulation that reproduces the exact flock race
(initial unlocked file -> completed -> post-discovery writes trigger
Reactivate -> lastOffset-based resume with no SSE duplicates), covering
the scenario deterministically at the wire level.
Add a one-line note to CLAUDE.md Key Patterns describing the
reactivation-on-write recovery path for the flock race in
RefreshStates (issue #283). Move the completed plan file into
docs/plans/completed/.
- remove redundant assert.GreaterOrEqual after require.NotEmpty in
  TestWatcher_ReactivatesCompletedSessionOnWrite (dead assertion)
- strengthen concurrent-access assertion in TestSession_LastOffset to
  verify final value matches one of the written values instead of the
  trivially true final >= 0 check
- clarify Reactivate godoc: idempotent branch leaves state as-is
- clarify StopTailing godoc: nil-tailer preserves lastOffset
- rewrite startTailerLocked godoc to describe the RLock-block liveness
  property (not a safety one)
Copilot AI review requested due to automatic review settings April 25, 2026 16:32
@cloudflare-workers-and-pages
Copy link
Copy Markdown

cloudflare-workers-and-pages Bot commented Apr 25, 2026

Deploying ralphex with  Cloudflare Pages  Cloudflare Pages

Latest commit: 28f14bb
Status: ✅  Deploy successful!
Preview URL: https://9b29a17f.ralphex.pages.dev
Branch Preview URL: https://watch-mode-reactivate-tailin.ralphex.pages.dev

View logs

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a watch-mode reliability issue where log streaming can freeze if a transient flock race causes a still-running session to be marked completed. It adds an on-write reactivation path that resumes tailing from a persisted byte offset (and preserves parser phase / deferred section state) to avoid duplicate SSE events.

Changes:

  • Add tailer/session resume primitives (byte offset + phase/pending-section carryover) and a Session.Reactivate() flow.
  • Update watcher/session-manager logic to reactivate completed sessions on fsnotify Write events and to handle truncate+rewrite restarts safely.
  • Expand pkg/web test coverage around reactivation, offsets, truncation/restart detection, and SSE delivery.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
pkg/web/watcher.go Reactivates a completed+loaded session on write events; delegates activation logic to SessionManager.
pkg/web/watcher_test.go Adds end-to-end watch-mode reproduction/regression tests and SSE subscription helpers.
pkg/web/tail.go Exposes tailer offset/phase/pending-section and supports starting from a byte offset with truncation fallback.
pkg/web/tail_test.go Adds tests for offset accounting, resume behavior, truncation fallback, and pending-section preservation.
pkg/web/session.go Adds per-run state (lastOffset/phase/pending-section), serialized StopTailing, and Reactivate/resume logic.
pkg/web/session_test.go Adds concurrency/offset/reactivation tests, including drain guarantees and phase/pending-section preservation.
pkg/web/session_progress.go Enhances header parsing with “complete” flag and records loader byte offsets + ending phase.
pkg/web/session_progress_test.go Updates header parsing tests and adds loader offset/partial-line coverage.
pkg/web/session_manager.go Adds restart detection via header Started time; ensures activation uses Reactivate when resuming; gates loader on header completeness.
pkg/web/session_manager_test.go Adds tests for completed→active resume-from-offset, loader suppression when offset already ingested, restart detection, and partial-header behavior.
docs/plans/completed/20260424-watch-mode-reactivate-tailing.md Design/implementation plan documenting the chosen recovery approach and test strategy.
CLAUDE.md Documents the new watch-mode reactivation behavior and invariants.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/web/watcher_test.go
Comment on lines +716 to +723
// register an active session directly (simulate a session that is already
// tailing, bypassing flock-based discovery which would mark it completed)
session := NewSession(sessionID, progressFile)
sm.Register(session)
require.True(t, session.MarkLoadedIfNot(), "simulate loader has completed")
session.SetState(SessionStateActive)
require.NoError(t, session.StartTailing(true))
require.Eventually(t, func() bool { return session.IsTailing() },
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestWatcher_DoesNotReactivateActiveSession registers a session as active+tailing, but once w.Start() runs its initial DiscoverRecursive, SessionManager.updateSession will call IsActive() on the file. Since this test file isn't flock-locked by any process, IsActive() will return false and updateSession will transition the session to completed and call StopTailing(), making the final SessionStateActive assertion unreliable/incorrect. To make this test valid, ensure the progress file is actually lock-held (e.g., create it via progress.NewLogger and keep the logger open during the test, or otherwise acquire/hold the same lock mechanism that IsActive() checks).

Copilot uses AI. Check for mistakes.
Comment thread pkg/web/watcher_test.go
Comment on lines +533 to +547
// wait for initial discovery
sessionID := sessionIDFromPath(progressFile)
require.Eventually(t, func() bool {
return sm.Get(sessionID) != nil
}, time.Second, 10*time.Millisecond, "session should be discovered")

session := sm.Get(sessionID)
require.NotNil(t, session)

// simulate the flock race: force state to completed and stop tailing
// (models what RefreshStates does when TryLockFile transiently succeeds)
session.SetState(SessionStateCompleted)
session.StopTailing()
assert.Eventually(t, func() bool { return !session.IsTailing() }, time.Second, 10*time.Millisecond)
require.Equal(t, SessionStateCompleted, session.GetState())
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestWatcher_ResumesStreamingAfterFlockRace appends to the progress file after waiting only for sm.Get(sessionID) != nil. The watcher's reactivation path is gated on session.IsLoaded(), so if the loader hasn’t finished yet when this single append happens, handleProgressFileChange will skip Reactivate() and there may be no later write event to recover, making the test flaky. Consider waiting for session.IsLoaded() (and ideally a non-zero session.getLastOffset()) before forcing the completed state and appending the new line.

Copilot uses AI. Check for mistakes.
Comment thread pkg/web/session.go Outdated
Comment on lines +302 to +306
// the caller MUST hold s.mu as a write lock. on success, the new tailer is
// stored on the session and a feedEvents goroutine is launched; the goroutine
// blocks on s.mu.RLock() until the caller releases s.mu, then picks up the
// newly assigned s.tailer/s.stopTailCh. on error, s.tailer and s.stopTailCh
// are left unchanged.
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc comment for startTailerLocked says the spawned feedEvents goroutine “blocks on s.mu.RLock() until the caller releases s.mu, then picks up the newly assigned s.tailer/s.stopTailCh”. That’s no longer accurate: feedEvents now receives tailer, stopCh, and feedDone as arguments and doesn’t need to wait for a lock to read session fields. Please update the comment to match the current implementation so future concurrency reasoning doesn’t rely on outdated behavior.

Suggested change
// the caller MUST hold s.mu as a write lock. on success, the new tailer is
// stored on the session and a feedEvents goroutine is launched; the goroutine
// blocks on s.mu.RLock() until the caller releases s.mu, then picks up the
// newly assigned s.tailer/s.stopTailCh. on error, s.tailer and s.stopTailCh
// are left unchanged.
// the caller MUST hold s.mu as a write lock. on success, the new tailer and
// channels are stored on the session and a feedEvents goroutine is launched
// with tailer, stopCh, and feedDone passed directly as arguments. on error,
// s.tailer and s.stopTailCh are left unchanged.

Copilot uses AI. Check for mistakes.
- watcher_test.go: TestWatcher_DoesNotReactivateActiveSession now holds
  an exclusive flock on the progress file so IsActive() reports true
  during w.Start's initial DiscoverRecursive. without this, updateSession
  would flip the session to completed and stop tailing, defeating the
  test's premise. unix-only via build-tagged holdFileLockForTest helper;
  windows skips since flock-based active detection is unix-only.
- session.go: rewrite startTailerLocked godoc to reflect current
  behavior. feedEvents receives tailer/stopCh/feedDone as arguments at
  spawn time and does not acquire s.mu, so the previous reference to
  goroutine blocking on s.mu.RLock was stale.
@umputun
Copy link
Copy Markdown
Owner Author

umputun commented Apr 25, 2026

addressed findings 1 and 3 in 28f14bb. pushing back on finding 2.

watcher_test.go:723 (active-session test) — valid. once w.Start runs DiscoverRecursive, updateSession sees IsActive=false (no flock holder), flips state to completed and stops the tailer, so the assertions pass for the wrong reason. fixed by adding a unix-only test helper that holds an exclusive flock on the progress file before the watcher starts. windows skips since flock-based detection is a no-op there anyway.

session.go:303-306 (startTailerLocked godoc) — valid, doc was stale. rewrote to describe the current implementation: feedEvents receives tailer/stopCh/feedDone as args at spawn time and doesn't acquire s.mu.

watcher_test.go:547 (reactivation test) — disagree, the test is fine. Discover registers the session in m.sessions after updateSession returns, and updateSession runs loadProgressFileIntoSession synchronously inside the MarkLoadedIfNot block (session_manager.go:213). so sm.Get(sessionID) != nilupdateSession completed ⟹ IsLoaded() == true. the test's existing wait condition is sufficient — there's no observable window where Get returns the session before it's loaded.

@umputun umputun merged commit 7e36f67 into master Apr 25, 2026
5 checks passed
@umputun umputun deleted the watch-mode-reactivate-tailing branch April 25, 2026 16:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Watch mode: tailing stops after RefreshStates marks session completed (flock not held by executor)

2 participants