
Runtime: stabilize tool/run state transitions under compaction and backpressure #33826

Merged
Takhoffman merged 12 commits into main from task/workstream-5-state-machine
Mar 4, 2026

@Takhoffman
Contributor

Summary

What changed

  • add shared run-state machine for busy/activeRuns/heartbeat/deactivation lifecycle handling
  • use the shared state machine in Discord message handler queue execution
  • keep Anthropic turn validation resilient by stripping dangling toolUse blocks after compaction/replay
  • preserve stale-busy recovery semantics in channel health policy/monitor paths
  • add changelog entry for this synthesized runtime fix

Regression coverage

  • compaction + replay idempotency in Anthropic turn validation
  • Discord queue recovery after a failed long-running run
  • stale busy/inherited busy recovery in channel health policy + monitor
  • shared run-state machine unit tests

Verification

  • pnpm install --frozen-lockfile
  • pnpm build
  • pnpm check
  • pnpm test:macmini

Provenance

Kevin Shenghui and others added 11 commits March 3, 2026 20:24
Fixes #33621

When compaction trims conversation history, some tool_use blocks may lose
their corresponding tool_result blocks. This causes Anthropic to reject
the history with 'tool_use ids found without tool_result blocks' error.

This change adds stripDanglingAnthropicToolUses() which:
- Removes tool_use blocks from assistant messages when the following user
  message doesn't have a matching tool_result (by tool_use_id)
- Preserves non-tool content in assistant messages
- Inserts '[tool calls omitted]' fallback when all content would be removed
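
The stripping rule described in this commit message can be illustrated with a minimal sketch. The message and block shapes below are simplified assumptions, and `stripDanglingToolUsesSketch` is a hypothetical stand-in rather than the actual `stripDanglingAnthropicToolUses` implementation. It also assumes that an assistant message is only examined when the next message is a user message.

```typescript
// Hypothetical, simplified shapes; the real types in turns.ts may differ.
type Block =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown }
  | { type: "tool_result"; tool_use_id: string; content: string };

interface Message {
  role: "user" | "assistant";
  content: Block[];
}

const FALLBACK_TEXT = "[tool calls omitted]";

// Sketch: drop tool_use blocks that have no matching tool_result
// (by tool_use_id) in the immediately following user message.
function stripDanglingToolUsesSketch(messages: Message[]): Message[] {
  return messages.map((msg, i) => {
    if (msg.role !== "assistant") return msg;
    const next = messages[i + 1];
    // Assumption: only act when a user message follows.
    if (!next || next.role !== "user") return msg;

    const answered = new Set<string>();
    for (const block of next.content) {
      if (block.type === "tool_result") answered.add(block.tool_use_id);
    }

    // Keep every non-tool block, and tool_use blocks that were answered.
    const kept = msg.content.filter(
      (block) => block.type !== "tool_use" || answered.has(block.id),
    );
    if (kept.length === msg.content.length) return msg; // nothing dangling

    // Never emit an empty content array; fall back to a placeholder.
    const content: Block[] =
      kept.length > 0 ? kept : [{ type: "text", text: FALLBACK_TEXT }];
    return { ...msg, content };
  });
}
```

Returning the original message object when nothing is removed keeps the pass idempotent: running it a second time over its own output changes nothing.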
@openclaw-barnacle (bot) added the labels channel: discord, app: web-ui, gateway, agents, size: XL, and maintainer on Mar 4, 2026
@greptile-apps
Contributor

greptile-apps Bot commented Mar 4, 2026

Greptile Summary

This PR stabilizes the runtime state machine for Discord message handler runs and Anthropic tool-call lifecycle by synthesizing fixes from #33630 and #33583. It adds a new shared RunStateMachine that tracks busy/activeRuns/heartbeat state and clears stale inherited snapshots on startup, wires per-channel run serialization via KeyedAsyncQueue to prevent concurrent handler races, strips dangling tool_use blocks after compaction/replay in validateAnthropicTurns, and extends the channel health policy with busy/stuck states so that long-running but legitimately active channels are not restarted while truly stale ones are.

Key changes:

  • run-state-machine.ts – new shared factory with onRunStart/onRunEnd, abort/deactivate guards, and a 60-second heartbeat; emits an immediate { activeRuns: 0, busy: false } reset on init to overwrite stale inherited snapshots
  • message-handler.ts – replaces fire-and-forget processDiscordMessage calls with KeyedAsyncQueue keyed on session/channel, so queued messages for the same channel serialize without blocking the event loop for other channels
  • turns.ts – stripDanglingAnthropicToolUses removes tool_use blocks whose IDs have no corresponding tool_result in the immediately following user message, inserting a [tool calls omitted] fallback when the whole content array would otherwise become empty
  • channel-health-policy.ts – adds busy/stuck evaluation reasons; the busyStateInitializedForLifecycle guard (lastRunActivityAt >= lastStartAt) prevents stale busy flags inherited across a restart from suppressing stuck-channel recovery
  • protocol/schema/channels.ts + types.core.ts – extends ChannelAccountSnapshot with the three new run-state fields
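
As a rough illustration of the run-state machine described in the first bullet, the sketch below tracks `activeRuns`/`busy`, emits the initial reset, and runs a heartbeat while work is in flight. The factory name, option names, and the injectable clock are assumptions for illustration only; the real `run-state-machine.ts` API is not shown in this PR page.

```typescript
// Field names follow the PR text where it gives them; the rest is hypothetical.
interface RunStatusSketch {
  activeRuns: number;
  busy: boolean;
  lastRunActivityAt?: number;
}

interface RunStateMachineSketchOptions {
  setStatus: (status: RunStatusSketch) => void; // assumed status sink
  heartbeatMs?: number; // the PR mentions a 60-second heartbeat
  now?: () => number; // injectable clock, added here for testability
}

function createRunStateMachineSketch(opts: RunStateMachineSketchOptions) {
  const heartbeatMs = opts.heartbeatMs ?? 60_000;
  const now = opts.now ?? (() => Date.now());
  let activeRuns = 0;
  let active = true;
  let heartbeat: ReturnType<typeof setInterval> | undefined;

  const publish = (): void => {
    if (!active) return; // deactivation guard: no status writes after shutdown
    opts.setStatus({ activeRuns, busy: activeRuns > 0, lastRunActivityAt: now() });
  };

  const stopHeartbeat = (): void => {
    if (heartbeat !== undefined) {
      clearInterval(heartbeat);
      heartbeat = undefined;
    }
  };

  // Immediate reset so a stale snapshot inherited from an earlier
  // lifecycle is overwritten.
  opts.setStatus({ activeRuns: 0, busy: false });

  return {
    isActive: (): boolean => active,
    onRunStart(): void {
      if (!active) return;
      activeRuns += 1;
      publish();
      // Refresh lastRunActivityAt periodically while a run is in flight.
      if (heartbeat === undefined) heartbeat = setInterval(publish, heartbeatMs);
    },
    onRunEnd(): void {
      if (!active) return;
      activeRuns = Math.max(0, activeRuns - 1);
      publish();
      if (activeRuns === 0) stopHeartbeat();
    },
    deactivate(): void {
      active = false;
      stopHeartbeat();
    },
  };
}
```

A caller would typically wrap each run in `onRunStart()` / `onRunEnd()` (the latter in a `finally` block) and call `deactivate()` when the handler is torn down.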

Confidence Score: 4/5

  • Safe to merge; the PR implements a well-structured shared state machine with comprehensive test coverage and appropriate error handling across all changed paths.
  • The PR synthesizes two prior fixes into a cohesive solution with solid test coverage (run-state-machine unit tests, Discord queue tests, health-policy and health-monitor tests, Anthropic turn-validation tests). Key logic paths — queue-based run serialization, busyStateInitializedForLifecycle guard, and tool_use stripping — are all exercised. No functional issues identified. A score of 4 reflects confidence in the implementation with standard rigor for runtime state machinery.
  • No files require special attention.
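
The busyStateInitializedForLifecycle guard mentioned in this summary compares `lastRunActivityAt` with `lastStartAt`. A minimal sketch of how such a guard might feed a busy/stuck decision follows. The snapshot shape, the result labels, and the `staleAfterMs` threshold are assumptions for illustration; only the comparison itself comes from the summary.

```typescript
// Hypothetical snapshot shape; only the field names mentioned in the PR are used.
interface ChannelSnapshotSketch {
  busy?: boolean;
  activeRuns?: number;
  lastRunActivityAt?: number;
  lastStartAt?: number;
}

// Hypothetical result labels for this sketch.
type BusyEvaluationSketch = "not-busy" | "inherited-busy" | "busy" | "stuck";

// The guard from the summary: busy state counts only if it was
// written during the current lifecycle.
function busyStateInitializedForLifecycle(s: ChannelSnapshotSketch): boolean {
  return (
    s.lastRunActivityAt !== undefined &&
    s.lastStartAt !== undefined &&
    s.lastRunActivityAt >= s.lastStartAt
  );
}

function evaluateBusySketch(
  s: ChannelSnapshotSketch,
  now: number,
  staleAfterMs: number, // assumed threshold, not taken from the PR
): BusyEvaluationSketch {
  const claimsBusy = s.busy === true || (s.activeRuns ?? 0) > 0;
  if (!claimsBusy) return "not-busy";

  // A busy flag left over from before the last start must not
  // suppress stuck-channel recovery.
  if (!busyStateInitializedForLifecycle(s)) return "inherited-busy";

  const idleMs = now - (s.lastRunActivityAt ?? now);
  return idleMs > staleAfterMs ? "stuck" : "busy";
}
```

In this sketch a health monitor would leave "busy" channels alone, restart "stuck" ones, and treat "inherited-busy" as if the busy flag were absent.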

Last reviewed commit: 3299bd5

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3299bd5274

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/agents/pi-embedded-helpers/turns.ts Outdated
Comment thread src/discord/monitor/message-handler.ts
@Takhoffman
Contributor Author

Addressed the current review feedback in eccd84586.

Fixed items:

  • Guarded Anthropic assistant content normalization so non-array content values no longer throw during dangling toolUse cleanup.
  • Added regression coverage for malformed/legacy assistant content in validateAnthropicTurns.
  • Prevented queued Discord runs from executing after lifecycle deactivation/abort by gating queued task execution on runtime lifecycle activity.
  • Added regression coverage to verify queued follow-up runs are skipped after handler deactivation.
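
The queue gating described in the fixed items can be sketched as a per-key promise chain that re-checks lifecycle state just before each queued task runs. `KeyedQueueSketch` and its `enqueue` signature are hypothetical and are not the repository's `KeyedAsyncQueue` API. Swallowing errors here is a simplification; real code would presumably log them.

```typescript
// Hypothetical per-key serial queue; not the real KeyedAsyncQueue.
class KeyedQueueSketch {
  private tails = new Map<string, Promise<void>>();

  enqueue(
    key: string,
    task: () => Promise<void>,
    isActive: () => boolean, // lifecycle check, evaluated when the task is dequeued
  ): Promise<void> {
    const prev = this.tails.get(key) ?? Promise.resolve();
    const next = prev
      .then(async () => {
        // Skip work that was queued before deactivation/abort.
        if (!isActive()) return;
        await task();
      })
      .catch(() => {
        // A failed run must not block later runs for the same key.
      });

    this.tails.set(key, next);
    // Drop the map entry once this key's chain has drained.
    void next.then(() => {
      if (this.tails.get(key) === next) this.tails.delete(key);
    });
    return next;
  }
}
```

Tasks with the same key run one after another, while tasks with different keys proceed independently, which matches the per-channel serialization the PR describes.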

Validation rerun:

  • pnpm test -- src/agents/pi-embedded-helpers.validate-turns.test.ts src/discord/monitor/message-handler.queue.test.ts src/channels/run-state-machine.test.ts
  • pnpm check

@Takhoffman Takhoffman merged commit 9889c6d into main Mar 4, 2026
29 checks passed
@Takhoffman Takhoffman deleted the task/workstream-5-state-machine branch March 4, 2026 03:25
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eccd84586e

Comment on lines +193 to +197
  // First, strip dangling tool_use blocks from assistant messages
  const stripped = stripDanglingAnthropicToolUses(messages);

  return validateTurnsWithConsecutiveMerge({
-   messages,
+   messages: stripped,

P1: Merge user turns before stripping assistant toolUse blocks

validateAnthropicTurns currently runs stripDanglingAnthropicToolUses before mergeConsecutiveUserTurns, so it inspects only the immediately following user message when deciding whether a toolUse is dangling. If the matching toolResult sits in a second consecutive user turn (a case the validator is meant to normalize), the tool call is removed anyway. The later merged user turn then still contains that toolResult without a matching tool call, which can produce Anthropic turn-validation failures and lose valid tool context.
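
The ordering problem raised here can be reproduced with a small sketch. The turn shapes and both helper names are hypothetical. The point is only that a tool_result sitting in a second consecutive user turn becomes visible to a "next user message" check once the user turns are merged first.

```typescript
// Hypothetical, simplified shapes for illustration.
type TurnBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string }
  | { type: "tool_result"; tool_use_id: string };

interface Turn {
  role: "user" | "assistant";
  content: TurnBlock[];
}

// Sketch: collapse runs of consecutive user turns into a single turn.
function mergeConsecutiveUserTurnsSketch(turns: Turn[]): Turn[] {
  const merged: Turn[] = [];
  for (const turn of turns) {
    const last = merged[merged.length - 1];
    if (last && last.role === "user" && turn.role === "user") {
      merged[merged.length - 1] = {
        ...last,
        content: [...last.content, ...turn.content],
      };
    } else {
      merged.push(turn);
    }
  }
  return merged;
}

// IDs answered by a single user turn, as a "next message only" check sees them.
function toolResultIdsSketch(turn: Turn | undefined): Set<string> {
  const ids = new Set<string>();
  if (!turn || turn.role !== "user") return ids;
  for (const block of turn.content) {
    if (block.type === "tool_result") ids.add(block.tool_use_id);
  }
  return ids;
}

const replayed: Turn[] = [
  { role: "assistant", content: [{ type: "tool_use", id: "call-1" }] },
  { role: "user", content: [{ type: "text", text: "one moment" }] },
  { role: "user", content: [{ type: "tool_result", tool_use_id: "call-1" }] },
];

// Before merging, the next user turn has no result for call-1, so it looks dangling.
const looksAnsweredBeforeMerge = toolResultIdsSketch(replayed[1]).has("call-1");
// After merging, the single following user turn carries the result.
const looksAnsweredAfterMerge = toolResultIdsSketch(
  mergeConsecutiveUserTurnsSketch(replayed)[1],
).has("call-1");
```

Here `looksAnsweredBeforeMerge` is `false` and `looksAnsweredAfterMerge` is `true`, which is why stripping before merging would discard a valid tool call in this history.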

Comment on lines +82 to +84
result.push({
  ...assistantMsg,
  content: filteredContent,

P2: Preserve legacy assistant content when it is not an array

When assistantMsg.content is a legacy non-array value, originalContent is forced to [] and then written back as content: filteredContent, so the validator silently erases the original assistant content whenever the next message is a user turn. This is a regression from previous behavior (which left these messages intact), and it can both drop prompt context and emit empty assistant turns in replayed histories.
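
One way to avoid the regression described here is to pass non-array content through untouched instead of coercing it to an empty array. The sketch below shows that guard in isolation. The function name, the loose message type, and the `keep` predicate are hypothetical, and the assumption that legacy content is a plain string is mine, not the review's.

```typescript
// Hypothetical loose shape: legacy histories may carry non-array content.
interface LooseAssistantMessage {
  role: "assistant";
  content: unknown;
}

// Sketch: filter array content, but return legacy (non-array) content as-is.
function filterAssistantContentSketch(
  msg: LooseAssistantMessage,
  keep: (block: unknown) => boolean,
): LooseAssistantMessage {
  // Guard: never rewrite content we do not understand.
  if (!Array.isArray(msg.content)) return msg;

  const filtered = msg.content.filter(keep);
  // Return the same object when nothing was removed.
  return filtered.length === msg.content.length
    ? msg
    : { ...msg, content: filtered };
}
```

With this guard a legacy message such as `{ role: "assistant", content: "plain text" }` comes back as the same object, so no prompt context is dropped and no empty assistant turn is produced.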


Labels

agents, app: web-ui, channel: discord, gateway, maintainer, size: XL
