feat(session): add run observability diagnostics#788

Merged
Astro-Han merged 7 commits into dev from pawwork/issue-783-run-observability
May 20, 2026

@Astro-Han Astro-Han commented May 20, 2026

Owner

Summary

Add a run-level observability summary beside existing llm_trace diagnostics for session exports. This PR introduces a small diagnostic spine that records provider progress, visible output, tool-call/tool-execution facts, terminal failure classification, and retry-safety facts without changing retry behavior or user-facing UI.

Why

#783 needs recurring terminated / UND_ERR_SOCKET failures and local 已中断 ("interrupted") failures to be debuggable from exports instead of appearing as opaque failures. This PR lays the first diagnostic foundation: it distinguishes success, external stream disconnects, unknown local scope closes, setup failures, and tool failures, while keeping exported diagnostics bounded and safe.
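
To make the five outcomes above concrete, here is a minimal sketch of how a classification could be paired with a conservative retry-safety fact. Every name in it (`Classification`, `RunFacts`, `RetrySafety`, `retrySafety`) and every rule is an assumption for illustration, not the identifiers or policy this PR actually ships.

```typescript
// Hypothetical classification names; the PR's real identifiers may differ.
type Classification =
  | "success"
  | "external_stream_disconnect"
  | "unknown_local_scope_close"
  | "setup_failure"
  | "tool_failure"

// Facts a recorder might accumulate during one run (field names are assumptions).
interface RunFacts {
  providerProgress: boolean // at least one provider event arrived
  visibleOutput: boolean // user-visible text was already emitted
  toolExecutionStarted: boolean // a tool began executing, so side effects are possible
}

type RetrySafety = "not_applicable" | "safe" | "unsafe" | "unknown"

// Derives a diagnostic fact only; nothing here triggers a retry.
function retrySafety(c: Classification, f: RunFacts): RetrySafety {
  if (c === "success") return "not_applicable"
  // Conservative rule: any started tool execution may have had side effects.
  if (f.toolExecutionStarted) return "unsafe"
  if (c === "setup_failure" && !f.providerProgress) return "safe"
  if (c === "external_stream_disconnect" && !f.visibleOutput) return "safe"
  // Anything else stays undecided rather than guessing.
  return "unknown"
}
```

The point of the sketch is the default: when the recorded facts do not clearly prove a retry is harmless, the answer is `unknown` or `unsafe`, never `safe`.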

Related Issue

Addresses part of #783. It does not close #783; follow-ups remain for watchdog/setup taxonomy wiring, lifecycle action provenance, and broader deterministic harness coverage.

Human Review Status

Pending

Review Focus

  • Whether run_observability captures only safe control-flow/failure facts and no prompt/tool/body/path content.
  • Whether success, stream disconnect, scope-close, setup, and tool classifications have conservative retry-safety semantics.
  • Whether export projection and sanitizeSnapshot preserve useful safe evidence like UND_ERR_SOCKET without leaking sensitive data.
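
One way to reason about the third focus point is an allowlist-style sanitizer: only known-safe control-flow fields survive, and everything else is dropped. The sketch below is an assumption about the general technique, not the PR's `sanitizeSnapshot`; `SAFE_ERROR_CODES`, `MAX_ENTRIES`, `RawEvent`, `SafeEvent`, and `sanitizeEvents` are hypothetical names.

```typescript
// Hypothetical allowlist of transport error codes that are safe to export verbatim.
const SAFE_ERROR_CODES = new Set(["UND_ERR_SOCKET", "ECONNRESET", "ETIMEDOUT"])
// Hypothetical cap that keeps the exported diagnostics bounded.
const MAX_ENTRIES = 32

interface RawEvent {
  kind: string
  code?: string
  message?: string // may contain prompt or tool content, so it is never exported
  path?: string // may contain local file paths, so it is never exported
}

interface SafeEvent {
  kind: string
  code?: string
}

function sanitizeEvents(events: RawEvent[]): SafeEvent[] {
  // Keep only the newest MAX_ENTRIES events.
  return events.slice(-MAX_ENTRIES).map((e) => {
    // Only short snake_case kinds pass through; anything else collapses to "other".
    const safe: SafeEvent = { kind: /^[a-z_]{1,40}$/.test(e.kind) ? e.kind : "other" }
    // Error codes are copied only when they are on the allowlist.
    if (e.code !== undefined && SAFE_ERROR_CODES.has(e.code)) safe.code = e.code
    return safe
  })
}
```

Because the output type has no `message` or `path` field at all, free-form content cannot leak through by accident; a reviewer only has to audit the allowlist and the `kind` pattern.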

Risk Notes

  • Diagnostic-only change: no automatic retry, UI copy, provider credentials, or runtime replacement.
  • Generated SDK type changed because the assistant diagnostics schema gained run_observability?: unknown.
  • Visible UI/copy check skipped: no visible UI or user-facing copy changed.
  • Platform/packaging check skipped: no macOS/Windows packaging, updater, signing, shell, or permissions surface changed.

How To Verify

  • Run observability + export tests: bun test test/session/run-observability.test.ts test/session/export.test.ts --timeout 30000 (50 pass, 0 fail)
  • Typecheck: bun run typecheck from packages/opencode (passed)
  • Whitespace check: git diff --check (passed)
  • SDK schema generation: bun run --cwd ../../packages/sdk/js build from packages/opencode (passed after replacing z.custom with schema-safe z.any())

Screenshots or Recordings

Not applicable — no visible UI changes.

Checklist

How to use this checklist:

  • Tick a box by replacing [ ] with [x]. Do not edit, add, or remove items.
  • The bot-applied label items can only be honestly ticked AFTER the PR is opened and the labeler / priority-triage bots have run — return to the PR description and tick them then.
  • Most items are required. The few that are conditional are explicitly marked (conditional); for those, leave unticked if they truly do not apply and explain why in Risk Notes. All other items must be ticked before requesting human review.
  • Type label — this PR carries exactly one of bug, enhancement, task, documentation. Type labels are author-added; the labeler bot does NOT assign them. Add the label in the GitHub UI, then tick this.
  • Routing labels — this PR carries at least one of app, ui, platform, harness, ci. The labeler bot assigns these on PR open based on changed paths. Confirm the bot's choice (or override if wrong), then tick this.
  • Priority label — this PR carries exactly one of P0, P1, P2, P3. The priority-triage bot suggests one on PR open. Confirm or override, then tick this.
  • Human Review Status above is set to Pending, Approved by @<reviewer>, or Not required: <reason> (default is Pending; "not required" is restricted to bot-authored low-risk PRs).
  • I linked the related issue, or stated in Summary why there is no issue.
  • I described the review focus and any meaningful risks.
  • I replaced the example block in How To Verify with the real verification steps and the key result for each.
  • I did not introduce unrelated refactors, dependencies, generated files, or file changes beyond the stated scope.
  • (conditional) I manually checked visible UI or copy changes when needed, with screenshots or recordings. Leave unticked only if no visible UI or copy changed.
  • (conditional) I considered macOS and Windows impact for platform, packaging, updater, signing, paths, shell, or permissions changes. Leave unticked only if no platform/packaging surface was touched.
  • (conditional) I called out docs, release notes, dependencies, permissions, credentials, deletion behavior, generated content, or local file changes when relevant. Leave unticked only if none of those surfaces was touched.
  • I reviewed the final diff for unrelated changes and suspicious dependency changes.
  • I am targeting dev, and my PR title and commit messages use Conventional Commits in English.

coderabbitai Bot commented May 20, 2026

Contributor

Warning

Rate limit exceeded

@Astro-Han has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 54 minutes and 45 seconds before requesting another review.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 4a092be0-de80-4903-b758-082db5acf5e7

📥 Commits

Reviewing files that changed from the base of the PR and between aac7f10 and 3a00bf0.

⛔ Files ignored due to path filters (1)
  • packages/sdk/js/src/v2/gen/types.gen.ts is excluded by !**/gen/**
📒 Files selected for processing (10)
  • packages/opencode/src/session/export.ts
  • packages/opencode/src/session/message-v2.ts
  • packages/opencode/src/session/processor.ts
  • packages/opencode/src/session/prompt.ts
  • packages/opencode/src/session/run-observability/index.ts
  • packages/opencode/src/session/run-observability/recorder.ts
  • packages/opencode/src/session/run-observability/sanitize.ts
  • packages/opencode/src/session/run-observability/types.ts
  • packages/opencode/test/session/export.test.ts
  • packages/opencode/test/session/run-observability.test.ts

@Astro-Han added the enhancement (New feature or request), P2 (Medium priority), harness (Model harness, prompts, tool descriptions, and session mechanics), and tech-debt (Supplemental cleanup, maintainability, architecture, test, or quality debt context) labels May 20, 2026

@github-actions github-actions Bot left a comment

Suggested priority: P2 (includes non-doc, non-test paths outside the low-risk bucket).

P1/P0 are reserved for maintainer confirmation. Please relabel manually if this is a release blocker, security issue, data-loss risk, or updater/runtime failure.

@Astro-Han removed the tech-debt (Supplemental cleanup, maintainability, architecture, test, or quality debt context) label May 20, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a comprehensive run observability system designed to track, classify, and sanitize the execution of LLM runs and tool calls. It implements a RunObservability module that records provider progress, visible output, and tool lifecycle events to provide automated retry safety recommendations. Feedback from the review highlights the need to add a "success" state to the Classification enum and recorder logic to prevent successful runs from being mislabeled as failures. Additionally, the reviewer recommended replacing hardcoded tool names with centralized constants to adhere to naming conventions.

Comment thread packages/opencode/src/session/run-observability/types.ts
Comment thread packages/opencode/src/session/run-observability/recorder.ts
Comment thread packages/opencode/src/session/run-observability/recorder.ts
Comment thread packages/opencode/src/session/run-observability/sanitize.ts
@Astro-Han force-pushed the pawwork/issue-783-run-observability branch from 54069e4 to 8dab8b8 May 20, 2026 13:33
@Astro-Han merged commit 120fea0 into dev May 20, 2026
27 checks passed
@Astro-Han deleted the pawwork/issue-783-run-observability branch May 20, 2026 15:36
Astro-Han added a commit that referenced this pull request May 20, 2026
feat(session): trace lifecycle close provenance

Adds the second #783 run-observability slice after PR #788: local instance lifecycle closes now carry bounded parent provenance instead of stopping at generic scope-close diagnostics.

Change boundary:
- add local_instance_reload and local_instance_dispose run-observability classifications
- record bounded lifecycle action IDs for InstanceStore.reload, dispose, disposeDirectory, and disposeAll
- propagate lifecycle action metadata through SessionRunState interrupts and processor run diagnostics
- keep disposeAll fan-out under one parent action across affected in-flight runs
- harden review feedback by using a per-directory action stack for overlapping lifecycle operations and by capturing processor directory from InstanceState.context instead of static Instance.directory
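
The per-directory action stack mentioned in the last bullet can be pictured roughly as follows. This is a sketch under assumptions: `LifecycleAction`, `LifecycleActionStack`, and the `act_N` ID format are hypothetical, and only the four action kinds come from the commit message.

```typescript
type LifecycleKind = "reload" | "dispose" | "disposeDirectory" | "disposeAll"

interface LifecycleAction {
  id: string
  kind: LifecycleKind
  parentId?: string
}

// Hypothetical tracker: one stack of in-flight lifecycle actions per directory.
class LifecycleActionStack {
  private stacks = new Map<string, LifecycleAction[]>()
  private counter = 0

  // Bounded, content-free IDs (a counter here; the real scheme is unknown).
  nextId(): string {
    return `act_${++this.counter}`
  }

  begin(directory: string, kind: LifecycleKind, parentId?: string): LifecycleAction {
    const action: LifecycleAction = {
      id: this.nextId(),
      kind,
      ...(parentId !== undefined ? { parentId } : {}),
    }
    const stack = this.stacks.get(directory) ?? []
    stack.push(action)
    this.stacks.set(directory, stack)
    return action
  }

  // Removes by ID, so overlapping operations may finish in any order.
  end(directory: string, id: string): void {
    const stack = this.stacks.get(directory)
    if (!stack) return
    const index = stack.findIndex((a) => a.id === id)
    if (index >= 0) stack.splice(index, 1)
    if (stack.length === 0) this.stacks.delete(directory)
  }

  // The innermost in-flight action is the provenance attached to an interrupt.
  current(directory: string): LifecycleAction | undefined {
    const stack = this.stacks.get(directory)
    return stack ? stack[stack.length - 1] : undefined
  }
}
```

In this sketch a disposeAll fan-out would call `nextId()` once and pass the result as `parentId` to `begin` for each affected directory, so every interrupted run points back to the same parent action. A single "current action" slot would lose provenance as soon as two operations on one directory overlapped, which is the case the stack is meant to handle.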

Verification:
- bun test test/session/run-observability.test.ts test/session/run-state.test.ts test/session/export.test.ts --timeout 30000 (54 pass, 0 fail)
- bun run typecheck from packages/opencode
- git diff --check
- PR CI green, including unit-opencode, typecheck, CodeQL, desktop-smoke, and e2e-artifacts
- review threads resolved: 0 unresolved

Notes:
- Diagnostic-only change. No retry policy, provider credential, user-facing copy, UI, packaging, or release behavior changed.
- #721, #754, and #755 remain separate behavior/follow-up investigations; this PR only improves causal exports for local lifecycle closes.
@Astro-Han

Owner Author

Back-reference from #808.

This merged run-observability PR is the immediate foundation for the Run Incident Framework. #808 builds on the run facts added here and turns them into structured incident cause/phase/policy/export semantics.

Astro-Han added a commit that referenced this pull request May 21, 2026
Add the first #808 RunIncident diagnostic/export layer so provider transport disconnects, partial tool input interruptions, cleanup/finalizer evidence, and materialized-but-not-executed tool boundaries are derived from ordered append-only evidence instead of mutable summary overwrites.

This keeps the PR diagnostic-only: it adds structured sanitized run_incidents export while preserving legacy classification, summary_key, and retry_safety compatibility, without adding recovery UI, retry behavior, or provider SDK changes.

Verification:
- bun test test/session/run-observability.test.ts test/session/export.test.ts --timeout 30000 — passed
- bun run typecheck — passed
- git diff --check — passed
- PR CI for #812 — all checks passed

Review follow-ups:
- Preserved newest terminal/cleanup anchors when bounded evidence exceeds the export cap.
- Aligned stream phase derivation with provider progress and terminal cause.
- Resolved all Gemini and CodeRabbit review threads.
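
The first follow-up (keeping the newest terminal/cleanup anchors when evidence exceeds the export cap) could look something like the sketch below. The `Evidence` shape, the `boundEvidence` name, and the fill-from-newest policy are assumptions for illustration; the actual export logic may differ.

```typescript
interface Evidence {
  seq: number // position in the ordered, append-only log
  kind: "progress" | "tool" | "terminal" | "cleanup"
}

// Returns at most `cap` entries (cap is assumed to be at least 2), always keeping
// the newest "terminal" and newest "cleanup" entry. The input log is not mutated.
function boundEvidence(log: readonly Evidence[], cap: number): Evidence[] {
  if (log.length <= cap) return [...log]

  // Later entries overwrite earlier ones, so each kind maps to its newest index.
  const newestAnchor = new Map<string, number>()
  log.forEach((e, i) => {
    if (e.kind === "terminal" || e.kind === "cleanup") newestAnchor.set(e.kind, i)
  })

  const keep = new Set<number>()
  newestAnchor.forEach((i) => keep.add(i))

  // Spend the remaining budget on the newest entries, walking backwards.
  for (let i = log.length - 1; i >= 0 && keep.size < cap; i--) keep.add(i)

  // Filtering by index preserves the original ordering.
  return log.filter((_, i) => keep.has(i))
}
```

A plain "keep the last N" truncation would drop an early cleanup or terminal entry whenever enough later events followed it; reserving slots for the anchors first avoids that.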

Related: #808, #803, #788, #794

Labels

enhancement (New feature or request), harness (Model harness, prompts, tool descriptions, and session mechanics), P2 (Medium priority)

Development

Successfully merging this pull request may close these issues.

[Task] Design run-level LLM stream diagnostics for recurring terminated failures

1 participant