✨ feat(agent): inactivity watchdog finalize endpoint + agent-hono migration #14476
The snapshot accumulation + finalize logic previously lived inline in `AgentRuntimeService.executeStep` (per-step header init, message diff, event stripping, tool delta, partial save) plus a separate helper for finalize. Two distinct call sites for finalize, three places touching `snapshotStore` directly, and ~120 lines of branching inside an already overgrown method.

Pull all of it into `OperationTraceRecorder` with two methods:

- `appendStep(operationId, params)`: owns partial header init, incremental message diff, llm_stream / done-event finalState pruning, toolResults-from-payload stripping, and activatedStepTools delta.
- `finalize(operationId, params)`: owns success+error finalize, optional `failedStep` synthesis (LOBE-8533), and append-to-last-step enrichment for completion signal events.

The recorder always exists on the service; when the underlying store is null, methods are no-ops, so the call sites no longer gate on `if (this.snapshotStore)`. Public `AgentRuntimeServiceOptions.snapshotStore` stays unchanged so existing tests keep injecting through the same surface.

`AgentRuntimeService.ts` shrinks from 2274 → 2084 lines (-190 net).

Tests: existing 70 agentRuntime tests + 11 new recorder unit tests (partial header init, llm_stream stripping, done-event pruning, messagesDelta vs baseline, compression reset, activatedStepTools delta, success finalize, failed-step synthesis, no-partial skip, appendEventsToLastStep, store=null no-op); all 81 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
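The "always present, no-op when the store is null" shape described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `TraceRecorderSketch`, `StepRecordSketch`, and `SnapshotStoreSketch` are hypothetical names, the methods are synchronous for brevity, and none of the real diffing or pruning logic is reproduced.

```typescript
// Hypothetical, simplified shapes; the real types in the PR are not shown here.
interface StepRecordSketch {
  stepIndex: number;
  events: unknown[];
}

interface SnapshotStoreSketch {
  savePartial(operationId: string, steps: StepRecordSketch[]): void;
  finalize(operationId: string, steps: StepRecordSketch[], error?: string): void;
}

class TraceRecorderSketch {
  private store: SnapshotStoreSketch | null;
  private partials = new Map<string, StepRecordSketch[]>();

  constructor(store: SnapshotStoreSketch | null) {
    this.store = store;
  }

  // Accumulates one step and persists the partial snapshot.
  appendStep(operationId: string, step: StepRecordSketch): void {
    if (!this.store) return; // null store: silently do nothing
    const steps = this.partials.get(operationId) ?? [];
    steps.push(step);
    this.partials.set(operationId, steps);
    this.store.savePartial(operationId, steps);
  }

  // Writes the final snapshot (success or error) and drops the partial.
  finalize(operationId: string, error?: string): void {
    if (!this.store) return; // null store: silently do nothing
    const steps = this.partials.get(operationId) ?? [];
    this.store.finalize(operationId, steps, error);
    this.partials.delete(operationId);
  }
}
```

With this shape, a caller can invoke `recorder.appendStep(...)` unconditionally; the null check lives in one place instead of at every call site.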
…everse-trigger
When the agent-runtime Vercel function is killed mid-flight (LOBE-8533),
nothing reports the failure: no error reaches the gateway dashboard, the
`_partial/` snapshot is orphaned in S3, and the assistant message stays
dangling in DB. The agent-gateway DO is the only external observer that
can detect "operation went silent" — but it needs an endpoint to call so
finalization runs in a fresh function invocation.
Adds `AbandonOperationService.finalizeAbandoned(operationId, reason)` that
loads agent state from the Redis coordinator, mutates it to errored,
runs `OperationTraceRecorder.finalize()` with a synthetic failedStep
record (matching the existing LOBE-8533 error path), updates the
dangling assistant message, and cleans Redis state. Idempotent.
Exposed via `POST /api/agent/finalize-abandoned` with QStash signature
auth, body `{ operationId, reason }`. Mirrors the auth + DI pattern of
the existing `/api/agent/run` route. 6 unit tests cover the missing-state,
no-partial, no-assistantMessageId, and best-effort-cleanup paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
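The flow described in this commit (load state, mark errored, finalize the trace, update the message, clean up) can be sketched as below. Everything here is an assumption made for illustration: `finalizeAbandonedSketch`, `AbandonDeps`, and `AgentStateSketch` are hypothetical names, the dependencies are synchronous stand-ins for Redis, the recorder, and the DB, and the `"abandoned"` error type is invented for the example.

```typescript
// Hypothetical stand-ins; none of these names come from the PR.
interface AgentStateSketch {
  status: "running" | "error" | "done";
  assistantMessageId?: string;
  error?: { type: string; message: string };
}

interface AbandonDeps {
  loadState(operationId: string): AgentStateSketch | undefined;
  finalizeTrace(operationId: string, state: AgentStateSketch, failedStep: { reason: string }): void;
  markMessageErrored(messageId: string, reason: string): void;
  cleanup(operationId: string): void;
}

type AbandonResult = "finalized" | "skipped";

function finalizeAbandonedSketch(
  deps: AbandonDeps,
  operationId: string,
  reason: string,
): AbandonResult {
  const state = deps.loadState(operationId);
  // No state left: the operation completed or was already finalized.
  // Returning early here is what makes a repeated call harmless.
  if (!state) return "skipped";

  // Mark the state as errored and hand it to the trace finalizer
  // together with a synthetic failed-step record.
  const errored: AgentStateSketch = {
    ...state,
    status: "error",
    error: { type: "abandoned", message: reason },
  };
  deps.finalizeTrace(operationId, errored, { reason });

  // Only touch the assistant message when the state references one.
  if (errored.assistantMessageId) {
    deps.markMessageErrored(errored.assistantMessageId, reason);
  }

  // Cleanup is best-effort: a failure here must not fail the finalize.
  try {
    deps.cleanup(operationId);
  } catch {
    // intentionally ignored in this sketch
  }
  return "finalized";
}
```

In this sketch idempotency falls out of the cleanup step: once the coordinator state is gone, a second call finds nothing to finalize and returns `"skipped"`.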
Mirrors src/server/workflows-hono/ for QStash workflows. The finalize-abandoned endpoint moves out of a dedicated Next.js route.ts into a Hono handler under src/server/agent-hono/handlers/. URL is unchanged (POST /api/agent/finalize-abandoned).

The Hono app is mounted via a catch-all at src/app/(backend)/api/agent/[...route]/route.ts. Next.js App Router prefers static segments over dynamic ones, so existing routes (run/tool-result/stream/gateway/webhooks) continue to win; they can migrate to Hono one at a time by deleting the static route.ts and adding a handler here.

Auth is now factored into a reusable serviceTokenAuth() middleware that mirrors the per-route Bearer check in /api/agent/tool-result.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
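The Bearer check that `serviceTokenAuth()` factors out can be approximated by a small pure function. This is only a sketch under assumptions: `hasValidServiceToken` is a hypothetical helper, it takes the raw header value rather than a Hono context, and the real middleware's status codes and error bodies are not shown in this PR text.

```typescript
// Hypothetical helper: returns true only for "Bearer <expectedToken>".
function hasValidServiceToken(
  authorization: string | undefined,
  expectedToken: string | undefined,
): boolean {
  // Fail closed when the server has no token configured.
  if (!expectedToken) return false;
  if (!authorization) return false;

  // Expect exactly two whitespace-separated parts: scheme and token.
  const [scheme, token, ...extra] = authorization.trim().split(/\s+/);
  if (extra.length > 0) return false;
  if (scheme === undefined || scheme.toLowerCase() !== "bearer") return false;

  return token === expectedToken;
}
```

A middleware wrapper would read the `Authorization` header, call a check like this, and short-circuit with an error response when it returns `false`.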
Moves the two simplest /api/agent/* endpoints onto the Hono app added in the previous commit:

- POST /api/agent (execAgent): new handler at agent-hono/handlers/execAgent.ts; dual auth (QStash sig OR AGENT_EXEC_API_KEY) factored into a reusable qstashOrApiKeyAuth() middleware. URL unchanged.
- POST /api/agent/tool-result: new handler at agent-hono/handlers/toolResult.ts reusing serviceTokenAuth(). URL unchanged. Existing route test ported to a handler-direct unit test (5 tests, mirrors original 6 minus the auth middleware ones now covered by serviceTokenAuth's own contract).

Catch-all switched from required `[...route]` to optional `[[...route]]` so the bare /api/agent path also falls through to Hono. Deleted the static route.ts files for both endpoints. Routing precedence still puts surviving static routes (run/stream/gateway/webhooks) in front of the Hono catch-all, so they keep working unchanged until they're individually migrated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
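The "QStash sig OR API key" rule behind `qstashOrApiKeyAuth()` reduces to a short-circuiting OR of two checks. The sketch below assumes a hypothetical `isExecAgentAuthorized` helper and a synchronous `verifySignature` callback standing in for real signature verification, which this PR text does not show.

```typescript
// Hypothetical input shape: the two credentials a request may carry.
interface DualAuthInput {
  authorization?: string; // e.g. "Bearer <AGENT_EXEC_API_KEY>"
  qstashSignature?: string; // raw signature header value, if present
}

function isExecAgentAuthorized(
  input: DualAuthInput,
  apiKey: string | undefined,
  // Stand-in for real signature verification (an assumption of this sketch).
  verifySignature: (signature: string) => boolean,
): boolean {
  // Path 1: a signature is present and verifies.
  if (input.qstashSignature && verifySignature(input.qstashSignature)) {
    return true;
  }
  // Path 2: an API key is configured and the Bearer value matches it.
  if (apiKey && input.authorization === `Bearer ${apiKey}`) {
    return true;
  }
  // Neither credential is valid.
  return false;
}
```

Either credential alone is sufficient, and an unconfigured API key can never match, so the second path fails closed.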
💡 Codex Review
lobehub/src/server/services/agentRuntime/AbandonOperationService.ts Lines 94 to 98 in 97fb098
When the watchdog hits a transient snapshot-store failure (for example S3 …
Codecov Report
❌ Patch coverage is …
@@ Coverage Diff @@
## canary #14476 +/- ##
==========================================
+ Coverage 68.67% 68.70% +0.02%
==========================================
Files 2543 2545 +2
Lines 220888 221099 +211
Branches 22483 27989 +5506
==========================================
+ Hits 151703 151900 +197
- Misses 69042 69055 +13
- Partials 143 144 +1
💻 Change Type
🔗 Related Issue
Pairs with the agent-gateway DO inactivity watchdog PR (lobehub-biz/agent-gateway). Continues the LOBE-8533 thread: when the agent-runtime Vercel function is killed mid-flight, no error reaches the gateway dashboard, the `_partial/` snapshot is orphaned in S3, and the assistant message stays dangling in DB.

🔀 Description of Change
Two threads in this branch.
Thread 1 (the load-bearing change): finalize-abandoned endpoint

- `OperationTraceRecorder` extraction from #14441 (♻️ refactor(agent-runtime): extract CompletionLifecycle, HumanInterventionHandler, stepPresentation), commit 1 (`a3a2c23`). Will rebase cleanly when #14441 lands on canary.
- `AbandonOperationService.finalizeAbandoned(operationId, reason)` that loads agent state from the Redis coordinator, mutates it to errored, runs `OperationTraceRecorder.finalize()` with a synthetic `failedStep` (matching the existing LOBE-8533 path), updates the dangling assistant message, and cleans Redis. Idempotent.
- `POST /api/agent/finalize-abandoned`, `Authorization: Bearer <AGENT_GATEWAY_SERVICE_TOKEN>`: same trust boundary as `/api/agent/tool-result`. The agent-gateway DO is the only caller.

Thread 2: `agent-hono` framework + first migrations

- `src/server/agent-hono/` mirrors the existing `src/server/workflows-hono/` pattern.
- Mounted via the optional catch-all at `src/app/(backend)/api/agent/[[...route]]/route.ts`. Existing static route.ts files (run / stream / gateway / webhooks) keep winning by Next's static-segment precedence; they migrate one at a time when convenient.
- Endpoints now served by Hono:
  - `POST /api/agent` (execAgent)
  - `POST /api/agent/tool-result`
  - `POST /api/agent/finalize-abandoned` (new)
- Reusable auth middlewares: `serviceTokenAuth` (Bearer SERVICE_TOKEN) and `qstashOrApiKeyAuth` (QStash sig OR `AGENT_EXEC_API_KEY` Bearer).

🧪 How to Test
24 tests pass (3 files):
- `OperationTraceRecorder.test.ts`: 13 (from #14441)
- `AbandonOperationService.test.ts`: 6 new (idempotency, missing partial, missing assistantMessageId, best-effort cleanup)
- `toolResult.test.ts`: 5 ported from the deleted Next.js route test (handler-direct + minimal Hono Context stub; vitest can't load the `hono` module in this repo's pnpm-isolated layout, and the same issue affects `workflows-hono/*`)

📝 Additional Information
Routing precedence sanity check. Next.js prefers static segments over dynamic ones, so:
- `/api/agent/run` → existing `run/route.ts`
- `/api/agent/stream` → existing `stream/route.ts`
- `/api/agent/gateway/*` → existing `gateway/*/route.ts`
- `/api/agent/webhooks/*` → existing `webhooks/*/route.ts`
- `/api/agent` → catch-all → Hono `execAgent`
- `/api/agent/tool-result` → catch-all → Hono `toolResult`
- `/api/agent/finalize-abandoned` → catch-all → Hono `finalizeAbandoned`

Pre-existing TS errors for `Cannot find module 'hono'` are the same `hono`-resolution issue that `src/server/workflows-hono/*` already produces. Next.js bundles `hono` correctly at build time so production works; vitest can't resolve it because `hono` lives at `node_modules/.pnpm/hono@4.12.10/` (no top-level hoist). Fixing repo-level resolution is out of scope for this PR.

Why factor `OperationTraceRecorder` first. Its `finalize()` already has a `failedStep` synthesis branch built specifically for LOBE-8533, which is exactly what `AbandonOperationService` needs. Cherry-picking just commit 1 of #14441 keeps this PR's diff focused.

🤖 Generated with Claude Code
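The routing precedence check in Additional Information can be modelled with a small resolver. This is a deliberately simplified model of "static segment wins, otherwise fall through to the catch-all", not Next.js's actual matching algorithm; `resolveAgentRoute` and its return shape are hypothetical, and it treats any path whose first segment is static (including bare `/api/agent/gateway`) as a static match.

```typescript
// First path segments that still have a static route.ts (per the list above).
const STATIC_SEGMENTS = new Set(["run", "stream", "gateway", "webhooks"]);
const BASE = "/api/agent";

type Resolution =
  | { kind: "static"; segment: string }
  | { kind: "hono" }
  | { kind: "unmatched" };

function resolveAgentRoute(pathname: string): Resolution {
  // Paths outside /api/agent are not handled by this subtree at all.
  if (pathname !== BASE && !pathname.startsWith(BASE + "/")) {
    return { kind: "unmatched" };
  }
  // Segments after the base; empty for the bare /api/agent path.
  const rest = pathname.slice(BASE.length).split("/").filter(Boolean);
  const first = rest[0];
  // A surviving static segment takes precedence over the catch-all.
  if (first !== undefined && STATIC_SEGMENTS.has(first)) {
    return { kind: "static", segment: first };
  }
  // Everything else, including the bare base path, reaches the Hono app
  // through the optional catch-all.
  return { kind: "hono" };
}
```

Migrating another endpoint then corresponds to removing its entry from `STATIC_SEGMENTS` in this model, which mirrors deleting its static `route.ts`.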