Skip to content

[Bug]: Plugin before_agent_start hook hang permanently blocks all agent runs with zero diagnostic output #48534

@VideoScape

Description

@VideoScape

Bug type

Regression (worked before, now fails)

Summary

If a plugin's before_agent_start hook awaits a network call that never resolves, the entire agent run pipeline blocks permanently with no timeout, no error logged, and no recovery path short of restarting the gateway.

Steps to reproduce

  1. Install memory-openviking plugin with autoRecall: true and mode: "local"
  2. Start the OpenClaw gateway — plugin loads and reports healthy
  3. OpenViking starts successfully (port 1933 up, health check passing)
  4. Send any message via Control UI or openclaw gateway call chat.send
  5. Observe status: started is returned
  6. Wait — no response ever arrives

The hang is reliably triggered when the plugin's Node.js client connection to the OpenViking Python subprocess is in a transient state (e.g. immediately after gateway start while OpenViking is still initialising, or after a gateway restart while the subprocess is warming up).

Expected behavior

The gateway should enforce a maximum execution time on all before_agent_start hook handlers. If a handler exceeds that limit, the run should proceed without the hook result and a warning should be logged. The user should receive a response regardless of plugin behaviour.

Actual behavior

The before_agent_start hook never completes. The gateway log stops at:
[hooks] running before_agent_start (1 handlers, sequential)
No further output at any log level below debug. No transcript is written, no Anthropic API call is made, no response is sent. The run stays permanently in started state. The gateway process remains alive and healthy — openclaw status, openclaw doctor, and all other diagnostics show nothing wrong. The only recovery is restarting the gateway service.
--expect-final on chat.send does not help — it returns started and exits with no indication of failure.

OpenClaw version

2026.3.13 (61d171a)

Operating system

Linux — Ubuntu 24.04, x64, systemd user service

Install method

npm install -g openclaw via npm global

Model

anthropic/claude-sonnet-4-6

Provider / routing chain

Anthropic API direct (auth profile: anthropic:manual, mode: token)

Config file / key location

~/.openclaw/openclaw.json — plugins.slots.memory = "memory-openviking", plugins.entries.memory-openviking.config.autoRecall = true

Additional provider/model setup details

No custom routing. Standard Anthropic provider with a personal API token stored in auth-profiles.json. The hang occurs before any model call is attempted — the issue is entirely in the plugin hook layer, upstream of the LLM request.

Logs, screenshots, and evidence

At --log-level debug, processing stops permanently at:
[hooks] running before_agent_start (1 handlers, sequential)
chat.send response:
json{ "runId": "test-...", "status": "started" }
Session entry in sessions.json is updated (new sessionId created) but no corresponding .jsonl transcript file is ever written to disk.
openclaw doctor output: all green, no errors. openclaw status: gateway running, sessions active, memory plugin enabled.
Root cause confirmed via --log-level debug on gateway service: execution enters runBeforeAgentStart and never exits. The memory-openviking plugin's before_agent_start handler calls getClient().then(client => client.find(...)) with no timeout wrapper, despite autoRecallTimeoutMs = 5_000 being defined in the same file.

Impact and severity

Affected users: Anyone using memory-openviking (or any plugin with a network-dependent before_agent_start hook) in local mode
Severity: Critical — total agent failure, not degraded performance
Frequency: Reliably triggered on gateway restart while OpenViking subprocess is warming up; also occurs on any transient connection issue between the Node client and Python subprocess
Consequence: All user messages silently dropped, no response ever sent, gateway restart required to recover. Practically undiagnosable without debug logging — every standard diagnostic tool reports healthy

Additional information

The immediate cause is in the memory-openviking plugin (volcengine/OpenViking#673) which defines autoRecallTimeoutMs and imports withTimeout but never uses them in the recall path. A fix has been submitted upstream.
However, the OpenClaw gateway should not allow any plugin hook to permanently block the run pipeline regardless of plugin behaviour. Related silent failure issues: #12595 (tool errors cause no response), #8288 (agent hangs on failed tool calls), #14797 (production reliability gaps). This is the same class of problem one layer earlier — in the plugin hook runner rather than the agent/tool layer.
Suggested fix: Add a configurable hook execution timeout (e.g. plugins.hookTimeoutMs, default 10000ms) enforced by the hook runner in src/plugins/hooks.ts. If runBeforeAgentStart exceeds the timeout, log a warning and proceed with the run. This prevents any single plugin from silently killing agent processing.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingregressionBehavior that previously worked and now fails

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions