β¨ feat(verify): Agent Run delivery checker system#15489
Merged
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Codecov Reportβ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## canary #15489 +/- ##
==========================================
- Coverage 70.70% 70.55% -0.16%
==========================================
Files 3285 3302 +17
Lines 324626 326172 +1546
Branches 28728 35517 +6789
==========================================
+ Hits 229533 230121 +588
- Misses 94910 95868 +958
Partials 183 183
Flags with carried forward coverage won't be shown. Click here to find out more.
π New features to boost your workflow:
|
869e07c to
7df45ba
Compare
fbb1b03 to
dae68df
Compare
β¦ecker Implement the database layer for the Agent Run delivery checker (Verify System). Reuse / definition layer: - verify_criteria: a single reusable pass/fail standard (atomic unit), carrying its verifier config + onFail default and bound to a document for judging guidance (iteration history reuses document_history; no version columns) - verify_rubrics: a named group that aggregates criteria β the reusable unit - verify_rubric_criteria: junction, which criteria a rubric aggregates (criteria are reusable across rubrics) Mounted onto an agent via the existing agency config jsonb: - agencyConfig.verifyRubricId: a reusable rubric (criteria template) - agencyConfig.verifyCriteriaIds: ad-hoc one-off criteria A run's plan instantiates the union of both. No dedicated bindings table. Snapshot + result layer: - agent_operations.verify_plan (jsonb) + verify_plan_confirmed_at: the per-run immutable check-item snapshot lives ON the operation (1:1 β auto-repair spawns a new operation), instead of a separate plans table - agent_operations.verify_status: denormalized rollup for list-page badges - verify_check_results: per-criterion result with the Toulmin model (verdict/confidence as columns, narrative in a typed toulmin jsonb), N:1 verifier_tracing_id for batch judging, FP/FN flags for the data flywheel; relates to the plan via operation_id + stable check_item_id Ref: LOBE-10019 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implements the verify system on top of the schema (PR #15480): - models: verifyCriterion / verifyRubric (+junction) / verifyCheckResult; agentOperation verify plan/status methods - services/verify: AI plan generation (auto-create criteria), executor with LLM Toulmin judge (per-criterion + batch), program placeholder, agent & auto-repair spawner seams, rollup chokepoint, feedback fp/fn, completion lifecycle bridge - lambda verify router (criteria/rubric CRUD, plan, results, feedback) - frontend feature module: service, SWR hooks, CheckerDock state machine, RunArtifact, verify i18n namespace - tracing scenarios: VerifyPlanGen / VerifyJudge Live UI mount (dock/artifact into chat) pending server operationId source. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
β¦d border - The kicker said "Run Artifact" β it's automated verification, not an artifact. Renamed to "Verification Β· Round N". - Removed the red error border on a failed check β a normal card reads better. - Fixes a render crash (`useVerifyState is not defined`): the border removal left a dangling reference after the import was dropped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
When the live stream (gateway WebSocket / SSE) closes before the run finishes, the run is still executing server-side β so instead of hard-exiting, fall back to polling `aiAgent.getOperationStatus` every 10s until the run reaches a terminal state (or is no longer tracked). Pairs with surfacing the WS close code/reason. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The generateVerifyPlan tool call rendered as the default param/result dump. Add a Render that lists the generated delivery checks (title + gate/auto-fill tag), and surface the items on the tool state so the Render can read them. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The agent generated a plan but it stayed `planned`/unconfirmed, so the completion hook (which gates on a confirmed plan) never ran the checks β the card was stuck at "awaiting confirmation" with no pass/fail. In the headless agent flow there's no one to click Confirm, so `generateVerifyPlan` now auto-confirms the plan it generates; the checks then run automatically on completion. (An interactive "review before run" gate is a future enhancement.) Also: the verify card header disappeared in the draft/planned phase (`phaseToArtifact.draft` was null). Give it a header so the card always shows its "Verification Β· Round N" heading. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
β¦tural noise The first structuralNoiseRatio charged ALL markup (every <...> tag) as noise, which over-penalized legitimately structured results 3x. Grounding against real web-search output (`<item title="β¦" url="β¦">snippet</item>`) showed the tags and the title=/url= attributes ARE the signal the model reads. Now only opaque/presentational attribute names (id, class, style, data-*, aria-*, role, on*) count as noise; semantic element tags and content-bearing attributes (title, url, href, nameβ¦) are kept. On a 57-op user-interrupted sample this drops web-search noise 42%β0% and overall estimated waste 16%β5%, leaving large-payload (readDocument) and high error-rate tools as the real signal. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
β¦tion-in-document + agent verifier Restructure the generateVerifyPlan tool to a createDocument-style full-create flow and wire up the agent verifier path: - criteria now = title + description (required one-liner) + instruction (required detailed rubric); instruction lives in a linked document (verify_criteria.documentId), description is a new verify_criteria column (migration 0111). verifierConfig no longer holds description/instruction. - generateVerifyPlan creates verify_criteria + a rubric, snapshots the plan onto the operation and confirms it; judge resolves the instruction from the document. - agent-type checks run as verifier sub-agents (execAgent + isolated thread) whose onComplete hook parses a VERDICT and writes it back to verify_check_results (renamed AgentVerifierSpawner β VerifierAgentRunner). - UI: custom Inspector for the tool header; check list shows per-verifier-type icons (llm/agent/program) + description + required/optional tag; i18n en/zh. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The three verifier kinds are independent; previously the agent spawn waited for the batched LLM judge to finish. Run them via Promise.all so agent sub-agents start immediately alongside the LLM batch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
β¦=verify message, portal check editor - Add `@lobechat/builtin-tool-verify` (submitVerifyResult) + builtin `verify-agent`; agent-type checks now run as the dedicated verify agent (not the user's agent), which investigates and writes its verdict back via the tool during its run. - Verifier inherits the parent run's model/provider (builtin default may be unconfigured locally). - role=verify completion message no longer requires an assistantMessageId, so the delivery-checker card always surfaces when a plan exists. - Portal editor for verify checks (title/description/instruction/verifier/onFail). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
β¦ng loader icon Root cause of stuck `running` agent checks: the verify-agent ran in agent mode and inherited all default tools (web-browsing, cloud-sandbox, skills, activator), so it went off web-searching/crawling to "investigate" and never called submitVerifyResult. - Run the verify-agent in chat mode (enableAgentMode: false, searchMode: off) β the strict whitelist β and whitelist `lobe-verify` for chat mode so the verifier gets ONLY its writeback tool. - Sharpen the verify systemRole: judge from the provided deliverable/instruction (no external tools), always reach a verdict, and always call submitVerifyResult. - CheckerDock: running check now uses the standard RingLoadingIcon (warning ring), matching the app's loader instead of a blue spinner. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
β¦back on failed checks When required checks fail with onFail=auto_repair, automatically run a second iteration instead of ending at `failed`: - createRepairRunner: re-runs the SAME agent in the same topic with the failure feedback as the prompt, re-snapshots the plan onto the repair operation and confirms it so it re-verifies on completion (the next round). Capped at MAX_REPAIR_ROUNDS via parent-chain depth to prevent runaway loops. - maybeAutoRepair: fires only once every required check has a terminal result, so it works for inline LLM checks (triggered from lifecycle) and async agent checks (triggered from the verify tool's writeback path). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
β¦result - add a VerifyResult portal view: clicking any check row opens that result's detail (verdict, confidence, Toulmin sections, suggestion) on the right; agent checks expose their execution trace from inside the panel - CheckerDock rows are all clickable now (chevron affordance), status shown by icon only; verify card uses colorBgElevated - rename the run-result surface from "artifact" to "result" everywhere: RunArtifact β RunResult, phaseToArtifact β phaseToResult, and all `artifact.*` i18n keys β `result.*` - ship verify namespace zh-CN / en-US locales Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
β¦r detail view Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
β¦ify card Auto-repair feedback now lives on the failed round's role=verify message (content), and the VerifyMessageProcessor surfaces it into the repair run's context as a tagged user turn β so the repair op runs off history via a new execAgent `suppressUserMessage` path instead of injecting a synthetic user message. createVerifyMessage is awaited before verification to avoid a race. maxRepairRounds becomes a rubric-level config: new `verify_rubrics.config` jsonb column, read live at repair time via the plan's sourceRubricId. Adds a RubricConfig portal panel (reachable from the plan card's settings affordance) to view/edit it, wired through the verify store + TRPC. Verify domain types/vocab/config are extracted from the DB schema into @lobechat/types as the single source of truth; schema and consumers import from there. Tests: VerifyMessageProcessor dual behavior; VerifyRubricModel config round-trip; MessageModel.findVerifyMessageByOperationId. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Collapse 0110 (tables) + 0111 (criteria.description) + 0112 (rubrics.config) into a single regenerated 0110_add_verify_tables so the PR ships one clean, idempotent migration. No schema change vs the three combined. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
β¦g-rule editor font CLI: `verify rubric create --max-repair-rounds`, `verify rubric view`, and `verify rubric update` exercise the rubric config endpoints end-to-end; adds a mocked command test. UI: judging-rule editor font 16px β 14px. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
β¦repair rounds Add a name (title) field to the RubricConfig portal, persisted via a new updateRubricTitle store action + service (optimistic + debounced, alongside the config write-back). Bump DEFAULT_MAX_REPAIR_ROUNDS 2 β 3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
β¦-delivery-checker tool Move the delivery-checker plan-creation flow out of the always-on lobe-agent tool into a new standalone, installable builtin tool `lobe-delivery-checker` (Skill Store, opt-in per agent β not loaded by default). lobe-agent no longer ships generateVerifyPlan. - new packages/builtin-tool-lobe-delivery-checker (manifest/types/systemRole + client Render/Inspector/Portal moved wholesale from lobe-agent) - new serverRuntimes/lobeDeliveryChecker.ts (generateVerifyPlan moved out of lobeAgent.ts), registered alongside verifyResult - registered installable in builtin-tools (no hidden/discoverable:false, not in defaultToolIds/alwaysOnToolIds/runtimeManagedToolIds); renders/inspectors/ portals/identifiers wired; lobe-agent portal entries removed - i18n keys moved builtins.lobe-agent.verifyPlan.* β builtins.lobe-delivery-checker.* Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
dae68df to
865071a
Compare
f4780ed to
7780a8b
Compare
β¦f chat-mode Chat mode's contract is to strip ALL user/agent plugins (strict KB/memory/web allow-list) β so the verify sub-agent couldn't get its writeback tool without a leaky blanket rule. Introduce a third tool mode `custom` where the toolset is EXACTLY the agent's declared plugins (no always-on, no defaults, no activator), for focused builtin sub-agents. - chatConfig.toolMode: 'agent' | 'chat' | 'custom' (overrides enableAgentMode) - AgentToolsEngine: custom branch (defaultToolIds = plugins, rules = plugins-on, allowExplicitActivation only in agent mode); chatModeRules restored to strict - verify agent β toolMode: 'custom'; lobe-verify dropped from chatModeAllowedToolIds - test: custom mode enables exactly the declared plugin, no always-on / defaults Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
7780a8b to
594fd05
Compare
This was referenced Jun 9, 2026
arvinxx
added a commit
that referenced
this pull request
Jun 10, 2026
# π LobeHub Release (20260610) **Release Date:** June 10, 2026 **Since v2.2.2:** 131 merged PRs Β· 13 contributors > This weekly release strengthens agent collaboration across cloud, desktop, CLI, and workspace flows, with steadier runtime behavior and a broader foundation for workspace-scoped data. --- ## β¨ Highlights - **Agent execution across devices** β Unifies per-device working directories, project skill discovery, and sub-agent suspend/resume behavior across server, QStash, and device RPC flows. (#15543, #15566, #15481, #15620, #15591) - **Connector and sandbox platform** β Expands connector permissions, custom OAuth MCP connector onboarding, sandbox provider support, and user-uploaded file sync into cloud sandbox runs. (#15463, #15546, #15184, #15550) - **Desktop and CLI reliability** β Fixes desktop cold-start, auto-update, Windows build, CLI skill discovery, and `lh connect` agent dispatch paths. (#15547, #15525, #15527, #15562, #15632, #15634) - **Pages and sharing** β Refreshes topic sharing, improves Page Editor layout behavior, and routes Page Agent tool execution through the server-side editor path. (#15581, #15556, #15588, #15023, #15610) - **Model availability and provider updates** β Adds user-scoped LobeHub model availability, Claude Fable 5, Qwen thinking preservation, and MiniMax M3 updates. (#15590, #15639, #13494, #15376) --- ## ποΈ Core Product & Architecture ### Agent Runtime & Heterogeneous Agents - Improves sub-agent lifecycle handling, including async suspend/resume, queue-mode QStash resume delivery, and blocking nested sub-agent calls. (#15481, #15620, #15575) - Stabilizes heterogeneous agent ingestion and streaming with raw stream dumps, per-turn usage, image forwarding on regenerate, and duplicate-text fixes. (#15602, #15577, #15592, #15585) - Adds execution-device and working-directory controls across device RPC, legacy defaults, and remote-spawned Claude Code sessions. (#15543, #15566, #15591, #15572) - Improves runtime diagnostics and compatibility, including Gemini multimodal output capture, abort stream semantics, and trace quality analysis. (#15535, #13677, #15508) --- ## π± Platforms, Integrations & UX ### Connectors, Sandbox & Tools - Ships API-level connector tool permissions, custom OAuth MCP connector onboarding, and connector-first runtime execution. (#15463, #15546) - Adds sandbox provider support, cloud sandbox file sync, and safer external URL file input handling with SSRF validation. (#15184, #15550, #12657) - Improves tool visibility and execution with pinned app-fixed tools, ANSI output rendering, gateway-tunneled MCP calls, and automatic headless tool runs. (#15509, #15516, #15469, #15492) ### Desktop, CLI & Web UX - Restores desktop startup and reload behavior, preserves IPC error causes, and keeps the tab bar new-tab action visible across routes. (#15547, #15597, #15638) - Fixes desktop update and build stability for browser quit guards, macOS update signing, and Windows Visual Studio detection. (#15525, #15527, #15562) - Shows the plan-limit upgrade UI on desktop builds. (#15628) - Adds the Agent Run delivery checker and fixes CLI device dispatch plus skill list/search output. (#15489, #15634, #15632) - Refreshes onboarding, auth source preservation, topic UI states, referral/Fable campaign copy, and chat-input control bar behavior. (#15629, #15544, #15573, #15614, #15616, #15617, #15622, #15643) --- ## π Security, Reliability & Rollout Notes - External URL file input now includes SSRF validation for safer Google file handling. (#12657) - Database workspace-scope migrations are part of this release; self-hosted operators should run the normal migration path before serving the updated app. (#15446, #15465, #15468, #15472) - The release branch was re-cut from `canary` and includes the latest `main` release-version commit so `v2.2.2` is the verified compare base. --- ## π₯ Contributors @ONLY-yours, @sxjeru, @hardy-one, @xujingli, @hezhijie0327, @Coooolfan, @arvinxx, @tjx666, @Innei, @rivertwilight, @rdmclin2, @cy948, @AmAzing129 **Full Changelog**: v2.2.2...release/weekly-20260610-recut-3
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
π» Change Type
π Related Issue
Part of LOBE-10019
π Description of Change
Adds the Agent Run delivery checker ("verify system") β an LLM-as-judge gate that scores a run's deliverable against reusable criteria before it is delivered, with a Toulmin-style verdict and a data flywheel for human feedback.
Layers
0110_add_verify_tables):verify_criteria,verify_rubrics,verify_rubric_criteria,verify_check_results, plusagent_operations.verify_status / verify_plan / verify_plan_confirmed_at.verifyservices (plan generator, executor/LLM judge, status rollup, repair, feedback) +verifylambda TRPC router + completion-lifecycle hook.lh verifyβ criteria/rubric CRUD, per-run plan (generate/state/confirm/skip), execute, results, feedback β for end-to-end backend testing.features/Verify(CheckerDock,RunArtifact) β module only, not yet wired into a route.Bug fixes found while verifying end-to-end (this branch)
writeVerdictset theverifier_tracing_idFK synchronously, but thellm_generation_tracingrow is written asynchronously (best-effort, after the response), so the FK was violated and every verdict write rolled back β runs stuck atverifying. Now the verdict is written with a null link and the id is backfilled by anonPersistedcallback once the tracing row commits (non-blocking; link stays null when tracing is disabled).null; the Zod validator used.optional()(rejects null) and discarded whole batches. Switched to.nullish().π§ͺ How to Test
Tested locally
Added/updated tests
No tests needed
Migration: verified
drizzle-kit generateis drift-free, replayed all migrations through PGlite (tables/cols/FKs/indexes/idempotency/cascade), and confirmed0110applied on the dev DB.Unit: verify model tests (12) + service/feature tests (8) pass.
Backend e2e (CLI β TRPC β services β DB, real LLM):
lh verify criterion/rubricCRUD,plan generate/state/confirm,verify runβ verdicts persist (passed, conf 0.95/0.98), rollup βpassed,verifier_tracing_idbackfilled and FK target present.