feat(delegate): diagnostic dump when a subagent times out with 0 API calls by teknium1 · Pull Request #15105 · NousResearch/hermes-agent

teknium1 · 2026-04-24T11:57:04Z

When a subagent in delegate_task times out without having made any API call, delegate_task now writes a structured diagnostic to ~/.hermes/logs/subagent-timeout-<sid>-<ts>.log and surfaces the path in the error message.

Why

Issue #14726 reports a specific hang pattern: with toolsets=["web", <secondary>] + long context + max_iterations >= ~20, the subagent times out at the configured limit (300s by default for some users) with zero API calls and no way to see what the child was doing. The old error string gave users nothing actionable.

Root-causing "why 0 API calls happened for this specific user against their specific provider" requires data we don't have. But we can give users (and ourselves) the data they need to self-diagnose — and surface it automatically only when the shape of the failure is opaque (no API call was ever made).

What the diagnostic captures

- timeout config vs actual duration
- goal (truncated)
- child config: model, provider, api_mode, base_url, max_iterations, platform, delegate_role, delegate_depth
- enabled_toolsets + loaded tool names
- system prompt byte/char count (catches oversized prompts that providers silently choke on)
- tool schema count + byte size (catches schema-size issues)
- get_activity_summary() snapshot
- Python stack of the worker thread at timeout, which reveals whether the hang is in credential resolution, transport, prompt construction, etc.
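As a rough illustration of the size fields in the list above, the following sketch computes character and byte counts for a system prompt and a count and byte size for a list of tool schemas. The helper name `measure_sizes`, the `system_prompt_chars` key, and the choice to serialize schemas with `json.dumps` are assumptions made for this example; the PR does not show how the real code measures these values.

```python
import json


def measure_sizes(system_prompt: str, tool_schemas: list) -> dict:
    """Hypothetical helper: size metrics similar to the diagnostic's fields."""
    # Assumption: schemas are measured by their JSON serialization.
    schema_json = json.dumps(tool_schemas, ensure_ascii=False)
    return {
        "system_prompt_chars": len(system_prompt),
        # Byte count differs from char count when the prompt has non-ASCII text.
        "system_prompt_bytes": len(system_prompt.encode("utf-8")),
        "tool_schema_count": len(tool_schemas),
        "tool_schema_bytes": len(schema_json.encode("utf-8")),
    }


sizes = measure_sizes(
    "You are a research agent. Café.",
    [{"name": "web_search", "parameters": {}}],
)
for key, value in sizes.items():
    print(f"  {key}: {value}")
```

Reporting both characters and bytes makes multi-byte content visible: a prompt that looks small in characters can still be large on the wire.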

Scope

- Fires only when is_timeout is true and child_api_calls == 0.
- Subagents that made at least one API call before hanging keep the old "stuck on slow API call" message, which is the correct diagnosis for that case.
- Non-timeout errors are unchanged.
- Successful subagent runs are unchanged.
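The scope rules above amount to a two-condition gate. A minimal sketch follows, assuming hypothetical helper names (`should_dump_diagnostic`, `timeout_error_message`) and a hypothetical wording for the 0-API-call message; only the old message text is taken from the commit description.

```python
from typing import Optional

# Old message quoted in the commit description, with N filled in at runtime.
OLD_MESSAGE = (
    "Subagent timed out after {n}s with no response. The child may be stuck "
    "on a slow API call or unresponsive network request."
)


def should_dump_diagnostic(is_timeout: bool, child_api_calls: int) -> bool:
    """Dump only for timeouts where the child never reached the API."""
    return is_timeout and child_api_calls == 0


def timeout_error_message(
    timeout_s: float, child_api_calls: int, diagnostic_path: Optional[str]
) -> str:
    """Hypothetical message builder for the timeout branch."""
    if child_api_calls > 0 or diagnostic_path is None:
        # One or more API calls (or no file written): keep the old phrasing.
        return OLD_MESSAGE.format(n=int(timeout_s))
    # Assumed wording; the real 0-API-call message is not shown in the PR.
    return (
        f"Subagent timed out after {int(timeout_s)}s before making any API "
        f"call. Diagnostic written to {diagnostic_path}"
    )


print(timeout_error_message(300.0, 0, "/tmp/subagent-timeout-example.log"))
print(timeout_error_message(300.0, 3, None))
```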

Validation

- tests/tools/test_delegate_subagent_timeout_diagnostic.py: 7 new cases
- tests/tools/ full delegate suite: 124/124 pass (117 existing + 7 new)
- E2E reproduction: simulated a subagent parked inside threading.Event.wait; the diagnostic correctly dumps the parked stack trace all the way down to waiter.acquire.
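The E2E reproduction can be approximated with the standard library alone. This sketch parks a thread inside `threading.Event.wait` and reads its Python stack from another thread using `sys._current_frames()` and `traceback.format_stack`. The function names `format_thread_stack` and `parked` are made up for this example, and whether the PR uses this exact mechanism is an assumption.

```python
import sys
import threading
import traceback
from typing import Optional


def format_thread_stack(thread: Optional[threading.Thread]) -> str:
    """Return the current Python stack of a live thread, or a placeholder."""
    if thread is None or thread.ident is None:
        return "<worker thread not available>"
    # sys._current_frames() maps thread idents to their topmost frame.
    frame = sys._current_frames().get(thread.ident)
    if frame is None:
        return "<worker thread already exited>"
    return "".join(traceback.format_stack(frame))


started = threading.Event()
gate = threading.Event()


def parked() -> None:
    started.set()  # signal that the worker is running
    gate.wait()    # block here until the main thread releases us


worker = threading.Thread(target=parked, daemon=True)
worker.start()
started.wait()

stack_text = format_thread_stack(worker)
print(stack_text)

gate.set()
worker.join()
after_exit = format_thread_stack(worker)
```

The placeholder branches correspond to the "missing / already-exited worker thread" cases that the test list mentions.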

Example output

# Subagent timeout diagnostic — issue #14726
# Generated: 2026-04-24T04:54:53

## Timeout
  task_index:        0
  subagent_id:       sa-0-testabc
  configured_timeout: 300.0s
  actual_duration:   300.01s

## Goal
Research Honcho architecture — long context blah blah blah

## Child config
  model: 'moonshotai/kimi-k2.6'
  provider: 'nous'
  base_url: 'https://inference-api.nousresearch.com/v1'
  max_iterations: 30
  ...

## Prompt / schema sizes
  system_prompt_bytes: 15020
  tool_schema_count: 20
  tool_schema_bytes: 4770

## Worker thread stack at timeout
    File "tools/delegate_tool.py", line ..., in _run_with_thread_capture
    File "run_agent.py", line ..., in run_conversation
    File ".../httpx/_client.py", line ..., in send
    ...

Refs #14726. Does not attempt to root-cause the underlying hang — just makes it debuggable.

…calls

When a subagent in delegate_task times out before making its first LLM request, write a structured diagnostic file under ~/.hermes/logs/subagent-timeout-<sid>-<ts>.log capturing enough state for the user (and us) to debug the hang.

The old error message ('Subagent timed out after Ns with no response. The child may be stuck on a slow API call or unresponsive network request.') gave no observability for the 0-API-call case, which is the hardest to reason about remotely.

The diagnostic captures:
- timeout config vs actual duration
- goal (truncated to 1000 chars)
- child config: model, provider, api_mode, base_url, max_iterations, quiet_mode, platform, _delegate_role, _delegate_depth
- enabled_toolsets + loaded tool names
- system prompt byte/char count (catches oversized prompts that providers silently choke on)
- tool schema count + byte size
- child's get_activity_summary() snapshot
- Python stack of the worker thread at the moment of timeout (reveals whether the hang is in credential resolution, transport, prompt construction, etc.)

Wiring:
- _run_single_child captures the worker thread via a small wrapper around child.run_conversation so we can look up its stack at timeout.
- After a FuturesTimeoutError, we pull child.get_activity_summary() to read api_call_count. If 0 AND it was a timeout (not a raise), _dump_subagent_timeout_diagnostic() is invoked.
- The returned path is surfaced in the error string so the parent agent (and therefore the user / gateway) sees exactly where to look.
- api_calls > 0 timeouts keep the old 'stuck on slow API call' phrasing since that's the correct diagnosis for those.

This does NOT change any behavior for successful subagent runs, non-timeout errors, or subagents that made at least one API call before hanging.

Tests: 7 cases (tests/tools/test_delegate_subagent_timeout_diagnostic.py)
- output format + required sections + field values
- long-goal truncation with [truncated] marker
- missing / already-exited worker thread branches
- unwritable HERMES_HOME/logs/ returns None without raising
- _run_single_child wiring: 0 API calls → dump + diagnostic_path in error
- _run_single_child wiring: N>0 API calls → no dump, old message

Refs: #14726
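The wiring described in the commit message can be sketched as follows: a small wrapper records which thread runs the child, and the caller catches the futures timeout and keeps a handle to that thread for later stack inspection. The names `run_with_thread_capture` and `run_child`, the `holder` dict, and the result shape are illustrative assumptions, not the code in tools/delegate_tool.py.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeoutError


def run_with_thread_capture(fn, holder: dict):
    """Record the executing thread in `holder`, then run the child callable."""
    holder["thread"] = threading.current_thread()
    return fn()


def run_child(fn, timeout_s: float) -> dict:
    """Run fn in a worker thread and report a timeout with the worker handle."""
    holder: dict = {}
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(run_with_thread_capture, fn, holder)
    try:
        return {"status": "ok", "result": future.result(timeout=timeout_s)}
    except FuturesTimeoutError:
        # The worker is still running; its stack can be read via the handle.
        return {"status": "timeout", "worker": holder.get("thread")}
    finally:
        # Do not block on a hung worker when tearing down the executor.
        executor.shutdown(wait=False)


stop = threading.Event()
timed_out = run_child(stop.wait, timeout_s=0.2)  # child parks until released
stop.set()                                       # let the worker finish
completed = run_child(lambda: 42, timeout_s=1.0)
print(timed_out["status"], completed["status"])
```

Capturing the thread inside the wrapper, rather than guessing it from the executor, gives the caller the exact worker that ran this child even when several children share a pool.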

When a subagent times out after making N>0 API calls, the lead agent previously received no visibility into what tool was last running or whether its result was partial. This made it impossible to distinguish "tool ran to completion, next LLM request stalled" from "tool itself is hanging".

Two prior commits addressed adjacent cases:
- NousResearch#1175 (commit 7997569): tool_trace + tokens added to normal-completion results
- NousResearch#15105 (commit 7634c13): diagnostic_path log written for 0-API-call timeouts

This patch fills the gap for N-API-call timeouts by:
1. Building tool_trace from result["messages"] in the timeout path (mirrors the logic already present in the normal-completion path at ~line 1556). Returns tool_trace, last_tool, and last_tool_status in the timeout result dict so the lead can inspect the final tool outcome without a second round-trip.
2. Enriching the error message for api_calls>0 timeouts to include the value of current_tool from get_activity_summary(), replacing the generic "stuck on a slow API call" message with e.g.: "Subagent timed out after 300s with 3 API call(s) completed — last tool was 'web_fetch' (likely slow response). The tool may have completed; check tool_trace for result_bytes."
3. Adding a SKILL.md that documents the full diagnosis procedure for lead agents: diagnosis matrix (api_calls × last_tool × last_tool_status), step-by-step workflow, common pitfalls, and verification checklist.

Also adds: skill/skills/software-development/subagent-timeout-diagnostics/

teknium1 merged commit 7634c13 into main Apr 24, 2026
11 of 12 checks passed

teknium1 deleted the feat/subagent-timeout-diagnostic-14726 branch April 24, 2026 11:58

teknium1 mentioned this pull request Apr 24, 2026

delegate_task hangs with 0 API calls when web + secondary toolset + long context + max_iterations >= ~20 #14726

Closed

alt-glitch added the labels type/feature (New feature or request), P2 (Medium: degraded but workaround exists), and tool/delegate (Subagent delegation) Apr 24, 2026

This was referenced Apr 29, 2026

[Bug]: N-API-call subagent timeout lacks tool_trace diagnostics — cannot identify last stuck tool #17308

Open

feat(delegate): extend timeout diagnostics for N-API-call subagents #17312

Open

Sanjays2402 mentioned this pull request Apr 29, 2026

fix(delegate): surface tool_trace on N-API-call subagent timeouts (#17308) #17329

Open

github-actions Bot mentioned this pull request May 1, 2026

chore: bump NousResearch/hermes-agent version from v2026.4.23 to v2026.4.30 Docker-Hub-sirmark/docker-hermes-agent#4

Merged

M3NT8L-One mentioned this pull request Jun 2, 2026

fix(delegate): improve subagent timeout diagnostics #37724

Open

teknium1 mentioned this pull request Jun 10, 2026

fix: bound delegate task wall-clock runtime #8537

Closed
