Skip to content

Commit 1d91f1a

Browse files
spencer2211claude
andcommitted
fix(agents): release session write lock if fence read throws on prompt release
releaseHeldLockWithFence (the embedded-attempt session-lock controller's release-for-prompt path) cleared its in-memory `heldLock` reference and then performed fence bookkeeping — readSessionFileFingerprint and readSessionFileFenceSnapshot — before calling lock.release(). Both perform filesystem I/O that can reject with a transient non-ENOENT error (e.g. EIO/EMFILE under load). Because release() was not in a finally, such a throw exited the function with `heldLock` already nulled and the underlying file lock never released. The lock then leaked on the live gateway process for the full maxHoldMs lease. Since the controller holds the lock with maxHoldMs derived from the agent timeout (~17 min) and the holder is the live pid (its /proc starttime matches the lock payload), the stale-lock breaker correctly refuses to reclaim it — only the watchdog TTL frees it. Meanwhile the attempt teardown still releases the in-process session-file owner mutex, so the next interactive turn sails past the mutex and blocks on the orphaned file lock, failing every turn with SessionWriteLockTimeoutError for up to 17 minutes (observed: a Discord channel dead 00:08–00:27). Wrap the fence bookkeeping in try/finally so the file lock is always released once heldLock is cleared, even if the fence reads throw. The original error still propagates; only the lock-release invariant is made exception-safe. Add regression coverage: a fence-read failure during releaseForPrompt must still release the underlying lock. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 0fa384c commit 1d91f1a

2 files changed

Lines changed: 46 additions & 11 deletions

File tree

src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts

Lines changed: 25 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -136,6 +136,31 @@ describe("embedded attempt session lock lifecycle", () => {
136136
expect(releases).toEqual(["held"]);
137137
});
138138

139+
it("releases the eagerly-held lock even when the fence read throws during prompt release (FAD-845)", async () => {
140+
const release = vi.fn(async () => {});
141+
const acquireSessionWriteLockLocalFad845 = vi.fn(async () => ({ release }));
142+
const controller = await createEmbeddedAttemptSessionLockController({
143+
acquireSessionWriteLock: acquireSessionWriteLockLocalFad845,
144+
lockOptions,
145+
});
146+
147+
// Simulate a transient, non-ENOENT filesystem error (e.g. EIO/EMFILE) while the
148+
// prompt-release path reads the session-file fence. This fires AFTER the controller
149+
// has cleared its in-memory `heldLock` reference but BEFORE the underlying file lock
150+
// is released, which is the window that orphans the lock.
151+
const statError = Object.assign(new Error("simulated I/O failure"), { code: "EIO" });
152+
const statSpy = vi.spyOn(fs, "stat").mockRejectedValueOnce(statError);
153+
154+
await expect(controller.releaseForPrompt()).rejects.toThrow();
155+
156+
// The underlying file lock MUST still be released; otherwise it is orphaned on the
157+
// live gateway process for the full maxHoldMs lease (~17 min), wedging every
158+
// subsequent interactive turn with SessionWriteLockTimeoutError. See FAD-845.
159+
expect(release).toHaveBeenCalledTimes(1);
160+
161+
statSpy.mockRestore();
162+
});
163+
139164
it("releaseHeldLockForAbort and dispose are idempotent in succession (#86816)", async () => {
140165
const release = vi.fn(async () => {});
141166
const acquireSessionWriteLockLocal26 = vi.fn(async () => ({ release }));

src/agents/embedded-agent-runner/run/attempt.session-lock.ts

Lines changed: 21 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -801,17 +801,27 @@ export async function createEmbeddedAttemptSessionLockController(params: {
801801
}
802802
const lock = heldLock;
803803
heldLock = undefined;
804-
const fingerprint = await readSessionFileFingerprint(params.lockOptions.sessionFile);
805-
const ownedWrite = ownedSessionFileWrites.get(sessionFileFenceKey);
806-
const trustedGeneration = trustSessionFileState(sessionFileFenceKey, fingerprint);
807-
fenceFingerprint = fingerprint;
808-
fenceSnapshot = await readSessionFileFenceSnapshot(params.lockOptions.sessionFile);
809-
fenceGeneration =
810-
ownedWrite && sameSessionFileFingerprint(ownedWrite.fingerprint, fingerprint)
811-
? ownedWrite.generation
812-
: (trustedGeneration ?? fenceGeneration);
813-
fenceActive = true;
814-
await lock.release();
804+
// Once `heldLock` is cleared the controller no longer tracks this lock, so the
805+
// underlying file lock MUST be released even if the fence bookkeeping below
806+
// throws. The fence reads perform filesystem I/O that can fail with a transient
807+
// non-ENOENT error (e.g. EIO/EMFILE under load); without this guard such a throw
808+
// would orphan the lock on the live gateway process for the full maxHoldMs lease
809+
// (~17 min), wedging every subsequent interactive turn with
810+
// SessionWriteLockTimeoutError until the watchdog reclaims it. See FAD-845.
811+
try {
812+
const fingerprint = await readSessionFileFingerprint(params.lockOptions.sessionFile);
813+
const ownedWrite = ownedSessionFileWrites.get(sessionFileFenceKey);
814+
const trustedGeneration = trustSessionFileState(sessionFileFenceKey, fingerprint);
815+
fenceFingerprint = fingerprint;
816+
fenceSnapshot = await readSessionFileFenceSnapshot(params.lockOptions.sessionFile);
817+
fenceGeneration =
818+
ownedWrite && sameSessionFileFingerprint(ownedWrite.fingerprint, fingerprint)
819+
? ownedWrite.generation
820+
: (trustedGeneration ?? fenceGeneration);
821+
fenceActive = true;
822+
} finally {
823+
await lock.release();
824+
}
815825
} finally {
816826
finishHeldLockDrain(drainOwner);
817827
}

0 commit comments

Comments
 (0)