Skip to content

✨ feat(verify): Agent Run delivery checker system#15489

Merged
arvinxx merged 28 commits into
canaryfrom
feat/lobe-10019-verify-system
Jun 8, 2026
Merged

✨ feat(verify): Agent Run delivery checker system#15489
arvinxx merged 28 commits into
canaryfrom
feat/lobe-10019-verify-system

Conversation

@arvinxx

@arvinxx arvinxx commented Jun 5, 2026

Copy link
Copy Markdown
Member

πŸ’» Change Type

  • ✨ feat
  • πŸ› fix

πŸ”— Related Issue

Part of LOBE-10019

πŸ”€ Description of Change

Adds the Agent Run delivery checker ("verify system") β€” an LLM-as-judge gate that scores a run's deliverable against reusable criteria before it is delivered, with a Toulmin-style verdict and a data flywheel for human feedback.

Layers

  • DB (0110_add_verify_tables): verify_criteria, verify_rubrics, verify_rubric_criteria, verify_check_results, plus agent_operations.verify_status / verify_plan / verify_plan_confirmed_at.
  • Server: verify services (plan generator, executor/LLM judge, status rollup, repair, feedback) + verify lambda TRPC router + completion-lifecycle hook.
  • CLI: lh verify β€” criteria/rubric CRUD, per-run plan (generate/state/confirm/skip), execute, results, feedback β€” for end-to-end backend testing.
  • Frontend: features/Verify (CheckerDock, RunArtifact) β€” module only, not yet wired into a route.

Bug fixes found while verifying end-to-end (this branch)

  1. Verdicts never persisted β€” writeVerdict set the verifier_tracing_id FK synchronously, but the llm_generation_tracing row is written asynchronously (best-effort, after the response), so the FK was violated and every verdict write rolled back β†’ runs stuck at verifying. Now the verdict is written with a null link and the id is backfilled by an onPersisted callback once the tracing row commits (non-blocking; link stays null when tracing is disabled).
  2. Brittle verdict parse β€” the non-strict judge schema returns optional Toulmin fields as explicit null; the Zod validator used .optional() (rejects null) and discarded whole batches. Switched to .nullish().

πŸ§ͺ How to Test

  • Tested locally

  • Added/updated tests

  • No tests needed

  • Migration: verified drizzle-kit generate is drift-free, replayed all migrations through PGlite (tables/cols/FKs/indexes/idempotency/cascade), and confirmed 0110 applied on the dev DB.

  • Unit: verify model tests (12) + service/feature tests (8) pass.

  • Backend e2e (CLI β†’ TRPC β†’ services β†’ DB, real LLM): lh verify criterion/rubric CRUD, plan generate/state/confirm, verify run β†’ verdicts persist (passed, conf 0.95/0.98), rollup β†’ passed, verifier_tracing_id backfilled and FK target present.

Note: the features/Verify UI module is not yet mounted into any route and has known type-check issues (i18n flat-key convention, Flexbox/TextArea imports) β€” backend is complete and verified; the UI is intended to be wired up separately.

@arvinxx arvinxx requested review from nekomeowww and tjx666 as code owners June 5, 2026 05:47
@dosubot dosubot Bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Jun 5, 2026

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @arvinxx, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@vercel

vercel Bot commented Jun 5, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
lobehub Ready Ready Preview, Comment Jun 7, 2026 6:58pm

Request Review

@dosubot dosubot Bot added the feature:agent Assistant/Agent configuration and behavior label Jun 5, 2026
Comment thread src/features/Verify/CheckerDock.tsx Fixed
@codecov

codecov Bot commented Jun 5, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 38.07692% with 966 lines in your changes missing coverage. Please review.
βœ… Project coverage is 70.55%. Comparing base (ee6a74b) to head (594fd05).
⚠️ Report is 3 commits behind head on canary.

Additional details and impacted files
@@            Coverage Diff             @@
##           canary   #15489      +/-   ##
==========================================
- Coverage   70.70%   70.55%   -0.16%     
==========================================
  Files        3285     3302      +17     
  Lines      324626   326172    +1546     
  Branches    28728    35517    +6789     
==========================================
+ Hits       229533   230121     +588     
- Misses      94910    95868     +958     
  Partials      183      183              
Flag Coverage Ξ”
app 61.28% <29.36%> (-0.19%) ⬇️
database 92.22% <75.00%> (-0.27%) ⬇️
packages/agent-manager-runtime 49.69% <ΓΈ> (ΓΈ)
packages/agent-runtime 81.06% <ΓΈ> (ΓΈ)
packages/builtin-tool-lobe-agent 18.52% <ΓΈ> (ΓΈ)
packages/context-engine 84.25% <96.87%> (+0.06%) ⬆️
packages/conversation-flow 91.29% <ΓΈ> (ΓΈ)
packages/device-gateway-client 90.18% <ΓΈ> (ΓΈ)
packages/eval-dataset-parser 95.15% <ΓΈ> (ΓΈ)
packages/eval-rubric 76.11% <ΓΈ> (ΓΈ)
packages/fetch-sse 85.57% <ΓΈ> (ΓΈ)
packages/file-loaders 87.89% <ΓΈ> (ΓΈ)
packages/memory-user-memory 74.99% <ΓΈ> (ΓΈ)
packages/model-bank 99.99% <ΓΈ> (ΓΈ)
packages/model-runtime 84.22% <ΓΈ> (ΓΈ)
packages/prompts 72.51% <ΓΈ> (ΓΈ)
packages/python-interpreter 92.90% <ΓΈ> (ΓΈ)
packages/ssrf-safe-fetch 0.00% <ΓΈ> (ΓΈ)
packages/types 35.23% <66.66%> (+0.05%) ⬆️
packages/utils 84.98% <ΓΈ> (ΓΈ)
packages/web-crawler 88.08% <ΓΈ> (ΓΈ)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Ξ”
Store 68.37% <93.33%> (+<0.01%) ⬆️
Services 54.77% <ΓΈ> (ΓΈ)
Server 71.44% <27.09%> (-0.63%) ⬇️
Libs 54.62% <ΓΈ> (+0.13%) ⬆️
Utils 81.93% <ΓΈ> (ΓΈ)
πŸš€ New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • πŸ“¦ JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread src/features/Verify/CheckerDock.tsx Fixed
Comment thread src/features/Verify/CheckerDock.tsx Fixed
Comment thread src/features/Verify/CheckerDock.tsx Fixed
arvinxx and others added 2 commits June 7, 2026 23:54
…ecker

Implement the database layer for the Agent Run delivery checker (Verify System).

Reuse / definition layer:
- verify_criteria: a single reusable pass/fail standard (atomic unit), carrying
  its verifier config + onFail default and bound to a document for judging
  guidance (iteration history reuses document_history; no version columns)
- verify_rubrics: a named group that aggregates criteria β€” the reusable unit
- verify_rubric_criteria: junction, which criteria a rubric aggregates
  (criteria are reusable across rubrics)

Mounted onto an agent via the existing agency config jsonb:
- agencyConfig.verifyRubricId: a reusable rubric (criteria template)
- agencyConfig.verifyCriteriaIds: ad-hoc one-off criteria
A run's plan instantiates the union of both. No dedicated bindings table.

Snapshot + result layer:
- agent_operations.verify_plan (jsonb) + verify_plan_confirmed_at: the per-run
  immutable check-item snapshot lives ON the operation (1:1 β€” auto-repair spawns
  a new operation), instead of a separate plans table
- agent_operations.verify_status: denormalized rollup for list-page badges
- verify_check_results: per-criterion result with the Toulmin model
  (verdict/confidence as columns, narrative in a typed toulmin jsonb), N:1
  verifier_tracing_id for batch judging, FP/FN flags for the data flywheel;
  relates to the plan via operation_id + stable check_item_id

Ref: LOBE-10019

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implements the verify system on top of the schema (PR #15480):
- models: verifyCriterion / verifyRubric (+junction) / verifyCheckResult;
  agentOperation verify plan/status methods
- services/verify: AI plan generation (auto-create criteria), executor with
  LLM Toulmin judge (per-criterion + batch), program placeholder, agent &
  auto-repair spawner seams, rollup chokepoint, feedback fp/fn, completion
  lifecycle bridge
- lambda verify router (criteria/rubric CRUD, plan, results, feedback)
- frontend feature module: service, SWR hooks, CheckerDock state machine,
  RunArtifact, verify i18n namespace
- tracing scenarios: VerifyPlanGen / VerifyJudge

Live UI mount (dock/artifact into chat) pending server operationId source.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
arvinxx and others added 17 commits June 7, 2026 23:54
…d border

- The kicker said "Run Artifact" β€” it's automated verification, not an artifact.
  Renamed to "Verification Β· Round N".
- Removed the red error border on a failed check β€” a normal card reads better.
- Fixes a render crash (`useVerifyState is not defined`): the border removal left
  a dangling reference after the import was dropped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
When the live stream (gateway WebSocket / SSE) closes before the run finishes,
the run is still executing server-side β€” so instead of hard-exiting, fall back to
polling `aiAgent.getOperationStatus` every 10s until the run reaches a terminal
state (or is no longer tracked). Pairs with surfacing the WS close code/reason.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The generateVerifyPlan tool call rendered as the default param/result dump. Add a
Render that lists the generated delivery checks (title + gate/auto-fill tag), and
surface the items on the tool state so the Render can read them.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The agent generated a plan but it stayed `planned`/unconfirmed, so the completion
hook (which gates on a confirmed plan) never ran the checks β€” the card was stuck
at "awaiting confirmation" with no pass/fail. In the headless agent flow there's
no one to click Confirm, so `generateVerifyPlan` now auto-confirms the plan it
generates; the checks then run automatically on completion. (An interactive
"review before run" gate is a future enhancement.)

Also: the verify card header disappeared in the draft/planned phase
(`phaseToArtifact.draft` was null). Give it a header so the card always shows its
"Verification Β· Round N" heading.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tural noise

The first structuralNoiseRatio charged ALL markup (every <...> tag) as noise,
which over-penalized legitimately structured results 3x. Grounding against real
web-search output (`<item title="…" url="…">snippet</item>`) showed the tags and
the title=/url= attributes ARE the signal the model reads.

Now only opaque/presentational attribute names (id, class, style, data-*, aria-*,
role, on*) count as noise; semantic element tags and content-bearing attributes
(title, url, href, name…) are kept. On a 57-op user-interrupted sample this drops
web-search noise 42%β†’0% and overall estimated waste 16%β†’5%, leaving large-payload
(readDocument) and high error-rate tools as the real signal.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tion-in-document + agent verifier

Restructure the generateVerifyPlan tool to a createDocument-style full-create flow
and wire up the agent verifier path:

- criteria now = title + description (required one-liner) + instruction (required
  detailed rubric); instruction lives in a linked document (verify_criteria.documentId),
  description is a new verify_criteria column (migration 0111). verifierConfig no
  longer holds description/instruction.
- generateVerifyPlan creates verify_criteria + a rubric, snapshots the plan onto
  the operation and confirms it; judge resolves the instruction from the document.
- agent-type checks run as verifier sub-agents (execAgent + isolated thread) whose
  onComplete hook parses a VERDICT and writes it back to verify_check_results
  (renamed AgentVerifierSpawner β†’ VerifierAgentRunner).
- UI: custom Inspector for the tool header; check list shows per-verifier-type icons
  (llm/agent/program) + description + required/optional tag; i18n en/zh.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The three verifier kinds are independent; previously the agent spawn waited for
the batched LLM judge to finish. Run them via Promise.all so agent sub-agents
start immediately alongside the LLM batch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…=verify message, portal check editor

- Add `@lobechat/builtin-tool-verify` (submitVerifyResult) + builtin `verify-agent`;
  agent-type checks now run as the dedicated verify agent (not the user's agent),
  which investigates and writes its verdict back via the tool during its run.
- Verifier inherits the parent run's model/provider (builtin default may be
  unconfigured locally).
- role=verify completion message no longer requires an assistantMessageId, so the
  delivery-checker card always surfaces when a plan exists.
- Portal editor for verify checks (title/description/instruction/verifier/onFail).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ng loader icon

Root cause of stuck `running` agent checks: the verify-agent ran in agent mode and
inherited all default tools (web-browsing, cloud-sandbox, skills, activator), so it
went off web-searching/crawling to "investigate" and never called submitVerifyResult.

- Run the verify-agent in chat mode (enableAgentMode: false, searchMode: off) β€” the
  strict whitelist β€” and whitelist `lobe-verify` for chat mode so the verifier gets
  ONLY its writeback tool.
- Sharpen the verify systemRole: judge from the provided deliverable/instruction
  (no external tools), always reach a verdict, and always call submitVerifyResult.
- CheckerDock: running check now uses the standard RingLoadingIcon (warning ring),
  matching the app's loader instead of a blue spinner.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…back on failed checks

When required checks fail with onFail=auto_repair, automatically run a second
iteration instead of ending at `failed`:

- createRepairRunner: re-runs the SAME agent in the same topic with the failure
  feedback as the prompt, re-snapshots the plan onto the repair operation and
  confirms it so it re-verifies on completion (the next round). Capped at
  MAX_REPAIR_ROUNDS via parent-chain depth to prevent runaway loops.
- maybeAutoRepair: fires only once every required check has a terminal result, so
  it works for inline LLM checks (triggered from lifecycle) and async agent checks
  (triggered from the verify tool's writeback path).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…result

- add a VerifyResult portal view: clicking any check row opens that result's
  detail (verdict, confidence, Toulmin sections, suggestion) on the right; agent
  checks expose their execution trace from inside the panel
- CheckerDock rows are all clickable now (chevron affordance), status shown by
  icon only; verify card uses colorBgElevated
- rename the run-result surface from "artifact" to "result" everywhere: RunArtifact
  β†’ RunResult, phaseToArtifact β†’ phaseToResult, and all `artifact.*` i18n keys β†’
  `result.*`
- ship verify namespace zh-CN / en-US locales

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…r detail view

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ify card

Auto-repair feedback now lives on the failed round's role=verify message
(content), and the VerifyMessageProcessor surfaces it into the repair run's
context as a tagged user turn β€” so the repair op runs off history via a new
execAgent `suppressUserMessage` path instead of injecting a synthetic user
message. createVerifyMessage is awaited before verification to avoid a race.

maxRepairRounds becomes a rubric-level config: new `verify_rubrics.config`
jsonb column, read live at repair time via the plan's sourceRubricId. Adds a
RubricConfig portal panel (reachable from the plan card's settings affordance)
to view/edit it, wired through the verify store + TRPC.

Verify domain types/vocab/config are extracted from the DB schema into
@lobechat/types as the single source of truth; schema and consumers import
from there.

Tests: VerifyMessageProcessor dual behavior; VerifyRubricModel config
round-trip; MessageModel.findVerifyMessageByOperationId.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Collapse 0110 (tables) + 0111 (criteria.description) + 0112 (rubrics.config)
into a single regenerated 0110_add_verify_tables so the PR ships one clean,
idempotent migration. No schema change vs the three combined.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…g-rule editor font

CLI: `verify rubric create --max-repair-rounds`, `verify rubric view`, and
`verify rubric update` exercise the rubric config endpoints end-to-end; adds a
mocked command test. UI: judging-rule editor font 16px β†’ 14px.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…repair rounds

Add a name (title) field to the RubricConfig portal, persisted via a new
updateRubricTitle store action + service (optimistic + debounced, alongside
the config write-back). Bump DEFAULT_MAX_REPAIR_ROUNDS 2 β†’ 3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-delivery-checker tool

Move the delivery-checker plan-creation flow out of the always-on lobe-agent
tool into a new standalone, installable builtin tool `lobe-delivery-checker`
(Skill Store, opt-in per agent β€” not loaded by default). lobe-agent no longer
ships generateVerifyPlan.

- new packages/builtin-tool-lobe-delivery-checker (manifest/types/systemRole +
  client Render/Inspector/Portal moved wholesale from lobe-agent)
- new serverRuntimes/lobeDeliveryChecker.ts (generateVerifyPlan moved out of
  lobeAgent.ts), registered alongside verifyResult
- registered installable in builtin-tools (no hidden/discoverable:false, not in
  defaultToolIds/alwaysOnToolIds/runtimeManagedToolIds); renders/inspectors/
  portals/identifiers wired; lobe-agent portal entries removed
- i18n keys moved builtins.lobe-agent.verifyPlan.* β†’ builtins.lobe-delivery-checker.*

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@arvinxx arvinxx force-pushed the feat/lobe-10019-verify-system branch from dae68df to 865071a Compare June 7, 2026 16:12
@arvinxx arvinxx force-pushed the feat/lobe-10019-verify-system branch from f4780ed to 7780a8b Compare June 7, 2026 18:29
…f chat-mode

Chat mode's contract is to strip ALL user/agent plugins (strict KB/memory/web
allow-list) β€” so the verify sub-agent couldn't get its writeback tool without a
leaky blanket rule. Introduce a third tool mode `custom` where the toolset is
EXACTLY the agent's declared plugins (no always-on, no defaults, no activator),
for focused builtin sub-agents.

- chatConfig.toolMode: 'agent' | 'chat' | 'custom' (overrides enableAgentMode)
- AgentToolsEngine: custom branch (defaultToolIds = plugins, rules = plugins-on,
  allowExplicitActivation only in agent mode); chatModeRules restored to strict
- verify agent β†’ toolMode: 'custom'; lobe-verify dropped from chatModeAllowedToolIds
- test: custom mode enables exactly the declared plugin, no always-on / defaults

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@arvinxx arvinxx force-pushed the feat/lobe-10019-verify-system branch from 7780a8b to 594fd05 Compare June 7, 2026 18:51
@arvinxx arvinxx merged commit dbf743c into canary Jun 8, 2026
33 of 35 checks passed
@arvinxx arvinxx deleted the feat/lobe-10019-verify-system branch June 8, 2026 01:16
arvinxx added a commit that referenced this pull request Jun 10, 2026
# πŸš€ LobeHub Release (20260610)

**Release Date:** June 10, 2026  
**Since v2.2.2:** 131 merged PRs Β· 13 contributors

> This weekly release strengthens agent collaboration across cloud,
desktop, CLI, and workspace flows, with steadier runtime behavior and a
broader foundation for workspace-scoped data.

---

## ✨ Highlights

- **Agent execution across devices** β€” Unifies per-device working
directories, project skill discovery, and sub-agent suspend/resume
behavior across server, QStash, and device RPC flows. (#15543, #15566,
#15481, #15620, #15591)
- **Connector and sandbox platform** β€” Expands connector permissions,
custom OAuth MCP connector onboarding, sandbox provider support, and
user-uploaded file sync into cloud sandbox runs. (#15463, #15546,
#15184, #15550)
- **Desktop and CLI reliability** β€” Fixes desktop cold-start,
auto-update, Windows build, CLI skill discovery, and `lh connect` agent
dispatch paths. (#15547, #15525, #15527, #15562, #15632, #15634)
- **Pages and sharing** β€” Refreshes topic sharing, improves Page Editor
layout behavior, and routes Page Agent tool execution through the
server-side editor path. (#15581, #15556, #15588, #15023, #15610)
- **Model availability and provider updates** β€” Adds user-scoped LobeHub
model availability, Claude Fable 5, Qwen thinking preservation, and
MiniMax M3 updates. (#15590, #15639, #13494, #15376)

---

## πŸ—οΈ Core Product & Architecture

### Agent Runtime & Heterogeneous Agents

- Improves sub-agent lifecycle handling, including async suspend/resume,
queue-mode QStash resume delivery, and blocking nested sub-agent calls.
(#15481, #15620, #15575)
- Stabilizes heterogeneous agent ingestion and streaming with raw stream
dumps, per-turn usage, image forwarding on regenerate, and
duplicate-text fixes. (#15602, #15577, #15592, #15585)
- Adds execution-device and working-directory controls across device
RPC, legacy defaults, and remote-spawned Claude Code sessions. (#15543,
#15566, #15591, #15572)
- Improves runtime diagnostics and compatibility, including Gemini
multimodal output capture, abort stream semantics, and trace quality
analysis. (#15535, #13677, #15508)

---

## πŸ“± Platforms, Integrations & UX

### Connectors, Sandbox & Tools

- Ships API-level connector tool permissions, custom OAuth MCP connector
onboarding, and connector-first runtime execution. (#15463, #15546)
- Adds sandbox provider support, cloud sandbox file sync, and safer
external URL file input handling with SSRF validation. (#15184, #15550,
#12657)
- Improves tool visibility and execution with pinned app-fixed tools,
ANSI output rendering, gateway-tunneled MCP calls, and automatic
headless tool runs. (#15509, #15516, #15469, #15492)

### Desktop, CLI & Web UX

- Restores desktop startup and reload behavior, preserves IPC error
causes, and keeps the tab bar new-tab action visible across routes.
(#15547, #15597, #15638)
- Fixes desktop update and build stability for browser quit guards,
macOS update signing, and Windows Visual Studio detection. (#15525,
#15527, #15562)
- Shows the plan-limit upgrade UI on desktop builds. (#15628)
- Adds the Agent Run delivery checker and fixes CLI device dispatch plus
skill list/search output. (#15489, #15634, #15632)
- Refreshes onboarding, auth source preservation, topic UI states,
referral/Fable campaign copy, and chat-input control bar behavior.
(#15629, #15544, #15573, #15614, #15616, #15617, #15622, #15643)

---

## πŸ”’ Security, Reliability & Rollout Notes

- External URL file input now includes SSRF validation for safer Google
file handling. (#12657)
- Database workspace-scope migrations are part of this release;
self-hosted operators should run the normal migration path before
serving the updated app. (#15446, #15465, #15468, #15472)
- The release branch was re-cut from `canary` and includes the latest
`main` release-version commit so `v2.2.2` is the verified compare base.

---

## πŸ‘₯ Contributors

@ONLY-yours, @sxjeru, @hardy-one, @xujingli, @hezhijie0327, @Coooolfan,
@arvinxx, @tjx666, @Innei, @rivertwilight, @rdmclin2, @cy948,
@AmAzing129

**Full Changelog**:
v2.2.2...release/weekly-20260610-recut-3
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature:agent Assistant/Agent configuration and behavior size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant