embedded agent tool-calling is unreliable across volcengine-plan/kimi failover after successful toolResult #40742

@woodenxyz

Description

Summary

Tool-calling is unreliable in embedded agent runs across provider failover.

In my local setup on OpenClaw 2026.3.7, I can reproduce all of the following within the same session:

  1. volcengine-plan/ark-code-latest produces a real toolCall + toolResult
  2. the same run later gets aborted / times out before a stable final assistant response
  3. OpenClaw then attempts to continue via fallback / retry paths
  4. kimi-coding/k2p5 may then:
    • reply TOOL_UNAVAILABLE without attempting a tool call, or
    • answer from prior context instead of issuing a fresh tool call, or
    • hit provider rate limits / timeouts during continuation

So this does not look like the older "tools parameter is never sent" bug from #8923.
It looks more like a session / embedded-run / failover reliability problem after tool execution has already started or completed.

Environment

  • OpenClaw version: 2026.3.7
  • OS: macOS arm64
  • Primary model during repro: volcengine-plan/ark-code-latest
  • Fallback model during repro: kimi-coding/k2p5
  • Provider API types involved:
    • volcengine-plan: openai-completions
    • kimi-coding: anthropic-messages

Minimal Prompt Used

Use exactly one available tool to inspect the current working directory. Do not simulate a tool call or reuse prior results. If tool invocation is unavailable, reply with TOOL_UNAVAILABLE.

What I Observed

1. Volcengine first fails with a connection error

Session file recorded:

{"stopReason":"error","errorMessage":"Connection error."}

2. Volcengine then successfully emits a real tool call

Recorded in session JSONL:

  • assistant emits toolCall with name:"exec"
  • tool result is recorded immediately after

Excerpt:

{"type":"toolCall","name":"exec","arguments":{"command":"pwd && ls -la"}}

followed by:

{"role":"toolResult","toolName":"exec","isError":false}
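The "successful call, then failure" pattern is easy to confirm mechanically from the session JSONL. Below is a minimal sketch that tallies toolCall, toolResult, and prompt-error events; the field names (`type`, `role`, `customType`) are taken from the excerpts quoted in this report and may not cover every event shape OpenClaw writes.

```python
import json

# Hypothetical session lines, using the event shapes quoted in this report.
session_jsonl = """
{"type":"toolCall","name":"exec","arguments":{"command":"pwd && ls -la"}}
{"role":"toolResult","toolName":"exec","isError":false}
{"customType":"openclaw:prompt-error","data":{"error":"aborted"}}
""".strip()

def pair_tool_events(lines):
    """Count toolCall, toolResult, and prompt-error events so the
    'successful tool call followed by aborted run' pattern is easy to spot."""
    calls = results = errors = 0
    for line in lines.splitlines():
        event = json.loads(line)
        if event.get("type") == "toolCall":
            calls += 1
        elif event.get("role") == "toolResult":
            results += 1
        elif event.get("customType") == "openclaw:prompt-error":
            errors += 1
    return calls, results, errors

print(pair_tool_events(session_jsonl))  # (1, 1, 1): call + result, yet an error still follows
```

Running this over a real session file (one `json.loads` per line) would show calls == results while errors > 0, which is exactly the inconsistency described here.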

3. Later tagged repro ([RUN V1]) also succeeds at toolCall + toolResult

Again, Volcengine emitted a real exec call and OpenClaw recorded the corresponding toolResult.

4. But the embedded run still ends as aborted / timed out

After the successful tool result, the same run was still marked aborted / timed out:

{"customType":"openclaw:prompt-error","data":{"error":"aborted"}}

and gateway logs showed:

[agent/embedded] embedded run timeout ... timeoutMs=45000
FailoverError: LLM request timed out.
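One plausible mechanism for "aborted after a successful toolResult" is a single fixed deadline measured from run start, so time spent on a successful tool step still counts against the final completion. The sketch below illustrates that failure mode with asyncio; the names and structure are illustrative, not OpenClaw's actual internals.

```python
import asyncio

async def embedded_run(timeout_s, tool_step_s, final_step_s):
    """Illustrative model of an embedded run with ONE overall deadline.

    tool_step_s stands in for the time the toolCall + toolResult took;
    final_step_s stands in for the final assistant completion.
    """
    async def run():
        await asyncio.sleep(tool_step_s)   # tool step succeeds here
        await asyncio.sleep(final_step_s)  # final assistant response
        return "completed"
    try:
        return await asyncio.wait_for(run(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return "aborted"  # matches the observed {"error":"aborted"}

# A fast-enough final step still aborts when the tool step ate the budget:
print(asyncio.run(embedded_run(0.2, 0.15, 0.15)))  # aborted
# The same final step completes if the clock effectively restarts after the tool step:
print(asyncio.run(embedded_run(0.2, 0.0, 0.1)))    # completed
```

If the real `timeoutMs=45000` budget works this way, extending or resetting the deadline after each recorded toolResult would be one candidate fix.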

5. Kimi fallback / continuation is inconsistent

In the same session history, after switching to / continuing with kimi-coding/k2p5, I observed:

  • one response that returned:
TOOL_UNAVAILABLE

with no fresh toolCall recorded for that turn

  • another response that answered from prior context instead of clearly issuing a fresh tool call
  • a provider-side 429 during continuation:
{"errorMessage":"429 {\"error\":{\"type\":\"rate_limit_error\",...}}"}
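Since the fallback behavior varies run to run, repeated repro attempts are easier to compare if each fallback turn is bucketed automatically. This is a hedged sketch of such a classifier; the event fields (`type`, `text`, `errorMessage`) follow the excerpts quoted in this report and are assumptions about the session format.

```python
def classify_fallback_turn(events):
    """Bucket a fallback turn into the behaviors observed in this report."""
    for e in events:
        if e.get("type") == "toolCall":
            return "fresh-tool-call"   # the desired behavior
        if "429" in str(e.get("errorMessage", "")):
            return "rate-limited"
        if e.get("text", "").strip() == "TOOL_UNAVAILABLE":
            return "tool-unavailable"
    return "answered-from-context"     # no tool call, no sentinel, no error

print(classify_fallback_turn([{"text": "TOOL_UNAVAILABLE"}]))          # tool-unavailable
print(classify_fallback_turn([{"type": "toolCall", "name": "exec"}]))  # fresh-tool-call
```

Tallying these buckets over, say, 20 runs would quantify how often the fallback actually issues a fresh tool call versus degrading into the other modes.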

Why This Seems Distinct From Existing Issues

Unlike #8923, where the tools parameter was never sent to the provider at all, here the primary model demonstrably emits a real toolCall and OpenClaw records a valid toolResult. The failure happens afterwards, in timeout handling and fallback continuation.

Expected Behavior

If a model successfully emits a tool call and OpenClaw records a valid tool result, then one of the following should happen deterministically:

  1. the same run completes cleanly with a final assistant response, or
  2. the run fails in a way that preserves coherent session state for the next retry / fallback attempt

Fallback / continuation should not degrade into:

  • aborted runs after successful tool execution
  • stale-context answers instead of fresh tool calls
  • TOOL_UNAVAILABLE from the fallback model when tools are in fact available in the session

Actual Behavior

Successful tool execution can still be followed by:

  • aborted
  • LLM request timed out
  • FailoverError: LLM request timed out
  • fallback continuation that no longer behaves consistently with available tools

Related Issues

  • #8923 — the older "tools parameter is never sent" bug (distinct from this report, as explained above)

Suggested Areas To Inspect

  • embedded run timeout behavior after a successful toolResult
  • failover / continuation serialization between provider adapters (openai-completions -> anthropic-messages)
  • whether tool availability / tool schema state is preserved correctly across aborted runs
  • whether continuation prompts after timeout are causing models to infer from context instead of issuing tool calls
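On the adapter-serialization point: the two providers use different tool-schema shapes (openai-completions nests the definition under `function`; anthropic-messages uses top-level `name` / `input_schema`), so a lossy conversion during failover would plausibly leave the fallback model without tools and explain the TOOL_UNAVAILABLE replies. The round-trip check below is an illustrative sketch, not OpenClaw's real adapter code.

```python
def openai_to_anthropic_tool(tool):
    """Convert an openai-completions tool definition to anthropic-messages form.

    If a conversion like this drops or mangles the schema during failover,
    the fallback model would reasonably report that no tools are available.
    """
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        "input_schema": fn["parameters"],
    }

openai_tool = {
    "type": "function",
    "function": {
        "name": "exec",
        "description": "Run a shell command",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}

converted = openai_to_anthropic_tool(openai_tool)
assert converted["name"] == "exec"
assert "command" in converted["input_schema"]["properties"]
print(converted["name"])  # exec
```

A unit test asserting this round trip for every registered tool, run at the failover boundary, would rule this area in or out quickly.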

Local Evidence

I can provide the exact session JSONL / timestamps if helpful, but the key repro facts are already visible locally:

  • real toolCall + toolResult for volcengine-plan/ark-code-latest
  • later aborted for the same run
  • subsequent fallback / continuation instability with kimi-coding/k2p5
