
✨ feat(agent-tracing): tool-result feedback quality analysis (tq command) #15508

Merged
arvinxx merged 2 commits into canary from arvinxx/feat/tool-feedback-quality
Jun 6, 2026

@arvinxx arvinxx commented Jun 6, 2026

💻 Change Type

  • ✨ feat

🔗 Related Issue

Related to LOBE-10057

🔀 Description of Change

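Introduces an objective, quantitative measure of how clean and LLM-friendly tool return content (environment feedback) is, as the first phase of agent harness evaluation.

Environment feedback is the only basis the model has for each decision in the agent loop. It is also the part whose format the harness controls most directly, and the part that gets dirty most easily. This PR provides a shared analysis library that needs no LLM, plus a CLI preview command, to quickly see which tools are pouring noise into the context.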

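  • src/analysis/toolFeedback.ts: pure-function analysis library (the single core that the CLI, the later DC database ingestion, and the judge can all reuse). It reads step.toolsResult[].output and unwraps the {"content":...} envelope. Metrics per tool result:
    • tokens (gpt-tokenizer)
    • selfRedundancy (80-char shingle dedup ratio; catches degenerate dumps and repeated errors)
    • structuralNoiseRatio (share of xml/html tags; catches markup noise)
    • isError + error size (errors should be small)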
    • format
    • estWasteTokens (token-weighted waste)
    • plus op-level and corpus-level rollups
  • src/cli/tool-quality.ts: agent-tracing tq (alias tool-quality), with a token-size histogram, a dirty leaderboard ranked by token-weighted waste, single-op drill-down, and --json
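
The per-result metrics listed above could be sketched roughly as follows. This is a minimal illustration, not the code in src/analysis/toolFeedback.ts: the function names unwrapContent, selfRedundancy, and structuralNoiseRatio, the non-overlapping 80-character windows, and the tag regex are all assumptions made for the example. Token counting is left out so the sketch has no dependencies.

```typescript
// Hypothetical helpers for illustration only; not the PR's actual implementation.

/** Unwrap a `{"content": ...}` JSON envelope; fall back to the raw string. */
function unwrapContent(output: string): string {
  try {
    const parsed: unknown = JSON.parse(output);
    if (parsed !== null && typeof parsed === 'object' && 'content' in parsed) {
      const content = (parsed as { content: unknown }).content;
      return typeof content === 'string' ? content : JSON.stringify(content);
    }
  } catch {
    // Not JSON: treat the output as plain text.
  }
  return output;
}

/**
 * Share of fixed-size shingles that repeat an earlier shingle (0..1).
 * Assumption: non-overlapping windows, which keeps the sketch cheap but
 * misses repeats that do not align with the window size.
 */
function selfRedundancy(text: string, size = 80): number {
  const seen = new Set<string>();
  let total = 0;
  let duplicates = 0;
  for (let i = 0; i + size <= text.length; i += size) {
    const shingle = text.slice(i, i + size);
    total += 1;
    if (seen.has(shingle)) duplicates += 1;
    else seen.add(shingle);
  }
  return total === 0 ? 0 : duplicates / total;
}

/** Share of characters that sit inside xml/html-style tags (0..1). */
function structuralNoiseRatio(text: string): number {
  if (text.length === 0) return 0;
  const tags = text.match(/<\/?[A-Za-z][^<>]*>/g) ?? [];
  const tagChars = tags.reduce((sum, tag) => sum + tag.length, 0);
  return tagChars / text.length;
}
```

Under these assumptions, a result made of one 80-character line repeated twice scores a selfRedundancy of 0.5, and text shorter than one window scores 0.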

Purely additive, plus registering one subcommand in cli/index.ts; no existing logic is changed.

🧪 How to Test

Run in any directory that has a .agent-tracing/_remote/ snapshot cache:

agent-tracing tq                 # corpus histogram + dirty leaderboard
agent-tracing tq <opId>          # per-tool-result drill-down for a single op
agent-tracing tq --json          # machine-readable output

Sample output from a measured run over 98 ops / 770 results / 1.4M tokens:

Tool-result feedback quality  (98 ops · 770 results · 1.4M tokens)
  est. wasted ≈ 165.9k (12%)  of all tool-result tokens

  token-size distribution   bar = % of results · right = % of tokens
  <128   ████████████████████ 59%    1% tok
  <512   ████                 12%    2% tok
  <2048  ██████               16%    10% tok
  <8192  ███                  9%     20% tok
  <32768 █                    3%     18% tok
  ≥32768                      0%     49% tok

  dirty leaderboard  (ranked by token-weighted waste)
  tool                              calls p99    redund noise err%  waste
  lobe-agent-documents/readDocument 49    663.8k 0%     5%    20%   ≈45.1k
  lobe-web-browsing/search          39    2.4k   0%     59%   5%    ≈28.9k
  lobe-local-system/runCommand      205   3.8k   0%     0%    40%   ≈19.3k
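
The token-size distribution in the sample output reports two percentages per bucket: the share of results and the share of tokens. A rough sketch of that bucketing is shown here. The bucket bounds come from the sample output; the function name tokenHistogram, the Bucket shape, and the rounding are assumptions for illustration.

```typescript
// Bucket upper bounds as they appear in the sample output.
const BOUNDS = [128, 512, 2048, 8192, 32768];

interface Bucket {
  label: string;
  resultPct: number; // % of results that fall in this bucket
  tokenPct: number; // % of all tokens carried by this bucket
}

// Hypothetical helper: groups per-result token counts into size buckets.
function tokenHistogram(tokenCounts: number[]): Bucket[] {
  const labels = BOUNDS.map((b) => `<${b}`).concat(`≥${BOUNDS[BOUNDS.length - 1]}`);
  const results = labels.map(() => 0);
  const tokens = labels.map(() => 0);

  for (const count of tokenCounts) {
    const found = BOUNDS.findIndex((bound) => count < bound);
    const index = found === -1 ? BOUNDS.length : found; // overflow bucket
    results[index] += 1;
    tokens[index] += count;
  }

  // Guard against division by zero on an empty corpus.
  const totalResults = tokenCounts.length || 1;
  const totalTokens = tokenCounts.reduce((a, b) => a + b, 0) || 1;

  return labels.map((label, i) => ({
    label,
    resultPct: Math.round((results[i] / totalResults) * 100),
    tokenPct: Math.round((tokens[i] / totalTokens) * 100),
  }));
}
```

Reporting both shares is what makes the skew visible: in the sample, the ≥32768 bucket rounds to 0% of results yet holds 49% of the tokens.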
  • Tested locally
  • Added/updated tests
  • No tests needed
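
To make the "ranked by token-weighted waste" ordering of the leaderboard concrete, here is a minimal sketch of a per-tool rollup. The waste formula is an assumption (tokens scaled by the larger of the two noise ratios) and is not taken from the PR; the names ResultMetrics, ToolRow, estimateWaste, and dirtyLeaderboard are likewise hypothetical.

```typescript
interface ResultMetrics {
  tool: string;
  tokens: number;
  selfRedundancy: number; // 0..1
  structuralNoiseRatio: number; // 0..1
  isError: boolean;
}

interface ToolRow {
  tool: string;
  calls: number;
  errorRate: number; // 0..1
  wasteTokens: number;
}

// Assumption: waste = tokens * max(redundancy, noise). The real estimate may differ.
function estimateWaste(r: ResultMetrics): number {
  return r.tokens * Math.max(r.selfRedundancy, r.structuralNoiseRatio);
}

// Roll results up per tool, then sort by total estimated waste, largest first.
function dirtyLeaderboard(results: ResultMetrics[]): ToolRow[] {
  const rows = new Map<string, { calls: number; errors: number; waste: number }>();
  for (const r of results) {
    const row = rows.get(r.tool) ?? { calls: 0, errors: 0, waste: 0 };
    row.calls += 1;
    if (r.isError) row.errors += 1;
    row.waste += estimateWaste(r);
    rows.set(r.tool, row);
  }
  return Array.from(rows.entries())
    .map(([tool, row]) => ({
      tool,
      calls: row.calls,
      errorRate: row.errors / row.calls,
      wasteTokens: row.waste,
    }))
    .sort((a, b) => b.wasteTokens - a.wasteTokens);
}
```

Weighting by tokens means a tool with a modest noise ratio but very large outputs can outrank a noisier tool with small outputs. Note that in the sample output lobe-local-system/runCommand shows 0% redundancy and 0% noise but still has nonzero waste, which suggests the actual estimate also accounts for error size; this sketch leaves that out.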

📝 Additional Information

CLI and analysis library only; no runtime, UI, or schema changes. For later phases (persisting rollups to agent_operations, LLM-as-judge semantic metrics), see LOBE-10057.

🤖 Generated with Claude Code

…command)

Adds a shared, no-LLM analyzer that scores how "clean / LLM-friendly" the
environment feedback (tool return content) is, plus an `agent-tracing tq`
CLI command to preview it over a snapshot corpus.

- src/analysis/toolFeedback.ts: pure analysis lib (reusable core) — per
  tool-result metrics (tokens, self-redundancy, structural-noise ratio,
  error flag/size, format) + op-level and corpus-level rollups.
- src/cli/tool-quality.ts: `tq` (alias `tool-quality`) — token-size
  histogram, dirty leaderboard ranked by token-weighted waste, single-op
  drill-down, and --json.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@dosubot dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jun 6, 2026
vercel Bot commented Jun 6, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
lobehub Ready Ready Preview, Comment Jun 6, 2026 10:21am

@dosubot dosubot Bot added the feature:agent Assistant/Agent configuration and behavior label Jun 6, 2026

@sourcery-ai sourcery-ai Bot left a comment

We've reviewed this pull request using the Sourcery rules engine

codecov Bot commented Jun 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.64%. Comparing base (6f5a633) to head (684d1c1).
⚠️ Report is 1 commit behind head on canary.

Additional details and impacted files
@@            Coverage Diff             @@
##           canary   #15508      +/-   ##
==========================================
- Coverage   70.64%   70.64%   -0.01%     
==========================================
  Files        3274     3274              
  Lines      322959   322959              
  Branches    29419    29421       +2     
==========================================
- Hits       228155   228152       -3     
- Misses      94621    94624       +3     
  Partials      183      183              
Flag Coverage Δ
app 61.30% <ø> (+<0.01%) ⬆️
database 92.54% <ø> (ø)
packages/agent-manager-runtime 49.69% <ø> (ø)
packages/agent-runtime 81.04% <ø> (ø)
packages/builtin-tool-lobe-agent 18.52% <ø> (ø)
packages/context-engine 84.19% <ø> (ø)
packages/conversation-flow 91.29% <ø> (ø)
packages/device-gateway-client 90.51% <ø> (ø)
packages/eval-dataset-parser 95.15% <ø> (ø)
packages/eval-rubric 76.11% <ø> (ø)
packages/fetch-sse 85.57% <ø> (-1.72%) ⬇️
packages/file-loaders 87.89% <ø> (ø)
packages/memory-user-memory 74.99% <ø> (ø)
packages/model-bank 99.99% <ø> (ø)
packages/model-runtime 84.22% <ø> (ø)
packages/prompts 72.51% <ø> (ø)
packages/python-interpreter 92.90% <ø> (ø)
packages/ssrf-safe-fetch 0.00% <ø> (ø)
packages/types 35.38% <ø> (ø)
packages/utils 84.98% <ø> (ø)
packages/web-crawler 88.08% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
Store 68.39% <ø> (ø)
Services 54.77% <ø> (ø)
Server 71.82% <ø> (-0.01%) ⬇️
Libs 54.34% <ø> (+0.13%) ⬆️
Utils 81.71% <ø> (ø)
…ldCorpusReport

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@arvinxx arvinxx merged commit ad87e43 into canary Jun 6, 2026
34 of 35 checks passed
@arvinxx arvinxx deleted the arvinxx/feat/tool-feedback-quality branch June 6, 2026 10:31
This was referenced Jun 10, 2026
arvinxx added a commit that referenced this pull request Jun 10, 2026
# 🚀 LobeHub Release (20260610)

**Release Date:** June 10, 2026  
**Since v2.2.2:** 131 merged PRs · 13 contributors

> This weekly release strengthens agent collaboration across cloud,
desktop, CLI, and workspace flows, with steadier runtime behavior and a
broader foundation for workspace-scoped data.

---

## ✨ Highlights

- **Agent execution across devices** — Unifies per-device working
directories, project skill discovery, and sub-agent suspend/resume
behavior across server, QStash, and device RPC flows. (#15543, #15566,
#15481, #15620, #15591)
- **Connector and sandbox platform** — Expands connector permissions,
custom OAuth MCP connector onboarding, sandbox provider support, and
user-uploaded file sync into cloud sandbox runs. (#15463, #15546,
#15184, #15550)
- **Desktop and CLI reliability** — Fixes desktop cold-start,
auto-update, Windows build, CLI skill discovery, and `lh connect` agent
dispatch paths. (#15547, #15525, #15527, #15562, #15632, #15634)
- **Pages and sharing** — Refreshes topic sharing, improves Page Editor
layout behavior, and routes Page Agent tool execution through the
server-side editor path. (#15581, #15556, #15588, #15023, #15610)
- **Model availability and provider updates** — Adds user-scoped LobeHub
model availability, Claude Fable 5, Qwen thinking preservation, and
MiniMax M3 updates. (#15590, #15639, #13494, #15376)

---

## 🏗️ Core Product & Architecture

### Agent Runtime & Heterogeneous Agents

- Improves sub-agent lifecycle handling, including async suspend/resume,
queue-mode QStash resume delivery, and blocking nested sub-agent calls.
(#15481, #15620, #15575)
- Stabilizes heterogeneous agent ingestion and streaming with raw stream
dumps, per-turn usage, image forwarding on regenerate, and
duplicate-text fixes. (#15602, #15577, #15592, #15585)
- Adds execution-device and working-directory controls across device
RPC, legacy defaults, and remote-spawned Claude Code sessions. (#15543,
#15566, #15591, #15572)
- Improves runtime diagnostics and compatibility, including Gemini
multimodal output capture, abort stream semantics, and trace quality
analysis. (#15535, #13677, #15508)

---

## 📱 Platforms, Integrations & UX

### Connectors, Sandbox & Tools

- Ships API-level connector tool permissions, custom OAuth MCP connector
onboarding, and connector-first runtime execution. (#15463, #15546)
- Adds sandbox provider support, cloud sandbox file sync, and safer
external URL file input handling with SSRF validation. (#15184, #15550,
#12657)
- Improves tool visibility and execution with pinned app-fixed tools,
ANSI output rendering, gateway-tunneled MCP calls, and automatic
headless tool runs. (#15509, #15516, #15469, #15492)

### Desktop, CLI & Web UX

- Restores desktop startup and reload behavior, preserves IPC error
causes, and keeps the tab bar new-tab action visible across routes.
(#15547, #15597, #15638)
- Fixes desktop update and build stability for browser quit guards,
macOS update signing, and Windows Visual Studio detection. (#15525,
#15527, #15562)
- Shows the plan-limit upgrade UI on desktop builds. (#15628)
- Adds the Agent Run delivery checker and fixes CLI device dispatch plus
skill list/search output. (#15489, #15634, #15632)
- Refreshes onboarding, auth source preservation, topic UI states,
referral/Fable campaign copy, and chat-input control bar behavior.
(#15629, #15544, #15573, #15614, #15616, #15617, #15622, #15643)

---

## 🔒 Security, Reliability & Rollout Notes

- External URL file input now includes SSRF validation for safer Google
file handling. (#12657)
- Database workspace-scope migrations are part of this release;
self-hosted operators should run the normal migration path before
serving the updated app. (#15446, #15465, #15468, #15472)
- The release branch was re-cut from `canary` and includes the latest
`main` release-version commit so `v2.2.2` is the verified compare base.

---

## 👥 Contributors

@ONLY-yours, @sxjeru, @hardy-one, @xujingli, @hezhijie0327, @Coooolfan,
@arvinxx, @tjx666, @Innei, @rivertwilight, @rdmclin2, @cy948,
@AmAzing129

**Full Changelog**:
v2.2.2...release/weekly-20260610-recut-3