fix: resume watch-mode tailing after flock race marks session completed#305
fix: resume watch-mode tailing after flock race marks session completed#305
Conversation
Adds TestWatcher_ResumesStreamingAfterFlockRace to pkg/web/watcher_test.go that reproduces issue #283: after the flock race marks a session completed, subsequent writes to the progress file do not reactivate the session and streaming stops. Skipped with t.Skip so the commit stays green; verified manually that the test fails on master when un-skipped. Will be un-skipped in Task 5 once the reactivation fix lands, completing the red-green TDD cycle for this bug.
Adds Offset() accessor and StartFromOffset(offset) to Tailer so callers can capture the current read position and later resume tailing from that byte offset without re-emitting already-processed content. Offsets beyond EOF are clamped; offset <= 0 falls back to seek-to-end. Groundwork for fixing watch-mode dashboard tailing after a flock race (issue #283) — by itself this change has no runtime effect, subsequent tasks will wire the offset through Session and Watcher.
add Session.lastOffset field with unexported getLastOffset/setLastOffset accessors. StopTailing captures the tailer's current byte offset before stopping so a subsequent Reactivate can resume from the exact byte where tailing stopped, avoiding duplicate events in the SSE replay buffer.
Extract a private startTailerLocked helper and a tailerStartMode enum (modeFromStart, modeFromEnd, modeResume) so StartTailing and Reactivate share one tailer-launch path under the session write lock. Reactivate resumes a completed session when a write event arrives after a flock race in RefreshStates. It picks modeResume when lastOffset>0 so events already in the SSE replay buffer aren't re-emitted; when lastOffset is 0 it falls back to modeFromEnd (not fromStart) to avoid replaying historical content the loader already published. State flips to active only after the tailer is confirmed started, and the call is idempotent while a tailer is running.
loadProgressFileIntoSession now accumulates bytes read from each
ReadString('\n') (including the delimiter, before trimLineEnding) and
stores the total in session.lastOffset. A subsequent Reactivate() can
then resume tailing from after the already-loaded content instead of
re-emitting events into the SSE replay buffer.
After Discover runs, handleProgressFileChange looks up the session for the written path and calls Reactivate() when the session is still completed and loaded. This recovers from the RefreshStates flock race that briefly marks a running session as completed. Scoped to the exact written path so other completed sessions in the same directory are not reactivated. IsLoaded() gate avoids racing the initial loader. Un-skips TestWatcher_ResumesStreamingAfterFlockRace (the Task 0 TDD reproduction) and adds coverage: - TestWatcher_ReactivatesCompletedSessionOnWrite - TestWatcher_DoesNotReactivateActiveSession - TestWatcher_OnlyReactivatesWrittenPath Tests subscribe to SSE via httptest + sse.Read to verify events arrive exactly once without relying on internal tailer state. Related to #283
Run full test suite, race detector, linter, formatter, and coverage check for the issue #283 fix. All gates pass: tests green, no races, lint clean, pkg/web coverage at 87.6% with new methods at 100% (Reactivate, Offset, getLastOffset, setLastOffset) and the bug-fix hot paths above 84%.
Ran the two-process scenario from the plan: watch-mode dashboard on one side, real ralphex executor on the other, both against /tmp/ralphex-test. Verified live SSE streaming throughout an 80+ second claude-driven task iteration, including survival past the 5-second RefreshStates tick with no freezing. Verified reactivation after kill+restart: a completed session flipped back to active on the next write event from a new executor targeting the same progress file. Clean shutdown confirmed - no zombie tailers, state converged to completed after flock release. Also ran a direct-write simulation that reproduces the exact flock race (initial unlocked file -> completed -> post-discovery writes trigger Reactivate -> lastOffset-based resume with no SSE duplicates), covering the scenario deterministically at the wire level.
Add a one-line note to CLAUDE.md Key Patterns describing the reactivation-on-write recovery path for the flock race in RefreshStates (issue #283). Move the completed plan file into docs/plans/completed/.
- remove redundant assert.GreaterOrEqual after require.NotEmpty in TestWatcher_ReactivatesCompletedSessionOnWrite (dead assertion) - strengthen concurrent-access assertion in TestSession_LastOffset to verify final value matches one of the written values instead of the trivially true final >= 0 check - clarify Reactivate godoc: idempotent branch leaves state as-is - clarify StopTailing godoc: nil-tailer preserves lastOffset - rewrite startTailerLocked godoc to describe the RLock-block liveness property (not a safety one)
Deploying ralphex with
|
| Latest commit: |
28f14bb
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://9b29a17f.ralphex.pages.dev |
| Branch Preview URL: | https://watch-mode-reactivate-tailin.ralphex.pages.dev |
There was a problem hiding this comment.
Pull request overview
This PR addresses a watch-mode reliability issue where log streaming can freeze if a transient flock race causes a still-running session to be marked completed. It adds an on-write reactivation path that resumes tailing from a persisted byte offset (and preserves parser phase / deferred section state) to avoid duplicate SSE events.
Changes:
- Add tailer/session resume primitives (byte offset + phase/pending-section carryover) and a
Session.Reactivate()flow. - Update watcher/session-manager logic to reactivate completed sessions on fsnotify
Writeevents and to handle truncate+rewrite restarts safely. - Expand
pkg/webtest coverage around reactivation, offsets, truncation/restart detection, and SSE delivery.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/web/watcher.go | Reactivates a completed+loaded session on write events; delegates activation logic to SessionManager. |
| pkg/web/watcher_test.go | Adds end-to-end watch-mode reproduction/regression tests and SSE subscription helpers. |
| pkg/web/tail.go | Exposes tailer offset/phase/pending-section and supports starting from a byte offset with truncation fallback. |
| pkg/web/tail_test.go | Adds tests for offset accounting, resume behavior, truncation fallback, and pending-section preservation. |
| pkg/web/session.go | Adds per-run state (lastOffset/phase/pending-section), serialized StopTailing, and Reactivate/resume logic. |
| pkg/web/session_test.go | Adds concurrency/offset/reactivation tests, including drain guarantees and phase/pending-section preservation. |
| pkg/web/session_progress.go | Enhances header parsing with “complete” flag and records loader byte offsets + ending phase. |
| pkg/web/session_progress_test.go | Updates header parsing tests and adds loader offset/partial-line coverage. |
| pkg/web/session_manager.go | Adds restart detection via header Started time; ensures activation uses Reactivate when resuming; gates loader on header completeness. |
| pkg/web/session_manager_test.go | Adds tests for completed→active resume-from-offset, loader suppression when offset already ingested, restart detection, and partial-header behavior. |
| docs/plans/completed/20260424-watch-mode-reactivate-tailing.md | Design/implementation plan documenting the chosen recovery approach and test strategy. |
| CLAUDE.md | Documents the new watch-mode reactivation behavior and invariants. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // register an active session directly (simulate a session that is already | ||
| // tailing, bypassing flock-based discovery which would mark it completed) | ||
| session := NewSession(sessionID, progressFile) | ||
| sm.Register(session) | ||
| require.True(t, session.MarkLoadedIfNot(), "simulate loader has completed") | ||
| session.SetState(SessionStateActive) | ||
| require.NoError(t, session.StartTailing(true)) | ||
| require.Eventually(t, func() bool { return session.IsTailing() }, |
There was a problem hiding this comment.
TestWatcher_DoesNotReactivateActiveSession registers a session as active+tailing, but once w.Start() runs its initial DiscoverRecursive, SessionManager.updateSession will call IsActive() on the file. Since this test file isn't flock-locked by any process, IsActive() will return false and updateSession will transition the session to completed and call StopTailing(), making the final SessionStateActive assertion unreliable/incorrect. To make this test valid, ensure the progress file is actually lock-held (e.g., create it via progress.NewLogger and keep the logger open during the test, or otherwise acquire/hold the same lock mechanism that IsActive() checks).
| // wait for initial discovery | ||
| sessionID := sessionIDFromPath(progressFile) | ||
| require.Eventually(t, func() bool { | ||
| return sm.Get(sessionID) != nil | ||
| }, time.Second, 10*time.Millisecond, "session should be discovered") | ||
|
|
||
| session := sm.Get(sessionID) | ||
| require.NotNil(t, session) | ||
|
|
||
| // simulate the flock race: force state to completed and stop tailing | ||
| // (models what RefreshStates does when TryLockFile transiently succeeds) | ||
| session.SetState(SessionStateCompleted) | ||
| session.StopTailing() | ||
| assert.Eventually(t, func() bool { return !session.IsTailing() }, time.Second, 10*time.Millisecond) | ||
| require.Equal(t, SessionStateCompleted, session.GetState()) |
There was a problem hiding this comment.
TestWatcher_ResumesStreamingAfterFlockRace appends to the progress file after waiting only for sm.Get(sessionID) != nil. The watcher's reactivation path is gated on session.IsLoaded(), so if the loader hasn’t finished yet when this single append happens, handleProgressFileChange will skip Reactivate() and there may be no later write event to recover, making the test flaky. Consider waiting for session.IsLoaded() (and ideally a non-zero session.getLastOffset()) before forcing the completed state and appending the new line.
| // the caller MUST hold s.mu as a write lock. on success, the new tailer is | ||
| // stored on the session and a feedEvents goroutine is launched; the goroutine | ||
| // blocks on s.mu.RLock() until the caller releases s.mu, then picks up the | ||
| // newly assigned s.tailer/s.stopTailCh. on error, s.tailer and s.stopTailCh | ||
| // are left unchanged. |
There was a problem hiding this comment.
The doc comment for startTailerLocked says the spawned feedEvents goroutine “blocks on s.mu.RLock() until the caller releases s.mu, then picks up the newly assigned s.tailer/s.stopTailCh”. That’s no longer accurate: feedEvents now receives tailer, stopCh, and feedDone as arguments and doesn’t need to wait for a lock to read session fields. Please update the comment to match the current implementation so future concurrency reasoning doesn’t rely on outdated behavior.
| // the caller MUST hold s.mu as a write lock. on success, the new tailer is | |
| // stored on the session and a feedEvents goroutine is launched; the goroutine | |
| // blocks on s.mu.RLock() until the caller releases s.mu, then picks up the | |
| // newly assigned s.tailer/s.stopTailCh. on error, s.tailer and s.stopTailCh | |
| // are left unchanged. | |
| // the caller MUST hold s.mu as a write lock. on success, the new tailer and | |
| // channels are stored on the session and a feedEvents goroutine is launched | |
| // with tailer, stopCh, and feedDone passed directly as arguments. on error, | |
| // s.tailer and s.stopTailCh are left unchanged. |
- watcher_test.go: TestWatcher_DoesNotReactivateActiveSession now holds an exclusive flock on the progress file so IsActive() reports true during w.Start's initial DiscoverRecursive. without this, updateSession would flip the session to completed and stop tailing, defeating the test's premise. unix-only via build-tagged holdFileLockForTest helper; windows skips since flock-based active detection is unix-only. - session.go: rewrite startTailerLocked godoc to reflect current behavior. feedEvents receives tailer/stopCh/feedDone as arguments at spawn time and does not acquire s.mu, so the previous reference to goroutine blocking on s.mu.RLock was stale.
|
addressed findings 1 and 3 in 28f14bb. pushing back on finding 2. watcher_test.go:723 (active-session test) — valid. once session.go:303-306 (startTailerLocked godoc) — valid, doc was stale. rewrote to describe the current implementation: watcher_test.go:547 (reactivation test) — disagree, the test is fine. |
Fixes #283.
When the watch-mode dashboard runs as a separate process from the executor (
pm2 start ralphex -s -w /path+ separateralphex docs/plans/...),RefreshStateswould mark the sessioncompletedwhenever the 5s tick hit a moment whereTryLockFilewon the race against the executor's flock. Once marked completed,startTailingIfNeededrefused to restart tailing, so the dashboard froze even though the executor was still writing to the progress file.Fix — option B from the issue thread: on a write event for a
completedsession, reactivate it and resume tailing from a tracked offset. Flock-based detection inRefreshStatesstays — reactivation is the recovery path, not a replacement.Mechanics
Tailer.Offset()andTailer.StartFromOffset(offset)expose and seek the internal offsetSession.lastOffsetcaptured insideStopTailing()before the tailer is stoppedSession.Reactivate()starts a new tailer atlastOffset(orfromEndif offset is 0), then flips state toactiveonly on successloadProgressFileIntoSessionrecords bytes read pre-trim, so CRLF/LF/no-newline all count correctlyWatcher.handleProgressFileChangecallsReactivate()for the specific written path when the session iscompletedandIsLoaded()(avoids racing the loader)Windows note —
flockis a no-op on Windows soIsActivealways returns false. Reactivation-on-write becomes the primary streaming mechanism on Windows, not just a recovery path. No platform-specific code needed.Tests
TestWatcher_ResumesStreamingAfterFlockRace(red without the fix, green with it)Reactivate: idempotent, resume-from-offset (no duplicates), failed-start-leaves-state-unchanged, on-closed-sessionpkg/webVerified end-to-end against the toy project — dashboard streaming survives the 5s
RefreshStatestick mid-run.