fix(plugins): add timeout protection to before_agent_start hook#48741
fix(plugins): add timeout protection to before_agent_start hook#48741Jerry-Xin wants to merge 11 commits intoopenclaw:mainfrom
Conversation
Greptile SummaryThis PR adds per-handler timeout protection to all async plugin hook runners, preventing a single hanging plugin from permanently blocking agent runs. The approach — wrapping each call in a Key observations:
Confidence Score: 4/5
|
| async function callHandlerWithTimeout<T>( | ||
| fn: () => Promise<T>, | ||
| hookName: PluginHookName, | ||
| pluginId: string, | ||
| ): Promise<T> { | ||
| if (!hookTimeoutMs) { | ||
| return fn(); | ||
| } | ||
| return withTimeout(() => fn(), hookTimeoutMs, `${hookName} handler from ${pluginId}`); | ||
| } |
There was a problem hiding this comment.
Timed-out hook promise is leaked
When the timeout fires, withTimeout races the handler's promise against the abort promise and returns early — but the underlying hook-handler promise (fn()) continues executing in the background indefinitely. The AbortSignal that withTimeout exposes through its work callback is never forwarded to the actual hook handler because the wrapper is () => fn() (signal ignored).
For hooks doing I/O or significant async work (network calls, file I/O, DB queries) this means that work keeps running and consuming resources long after the agent has already moved on. Consider documenting this behavior, and if the plugin contract ever evolves to accept an AbortSignal, wiring it through here would allow cooperative cancellation:
async function callHandlerWithTimeout<T>(
fn: (signal?: AbortSignal) => Promise<T>,
hookName: PluginHookName,
pluginId: string,
): Promise<T> {
if (!hookTimeoutMs) {
return fn();
}
return withTimeout((signal) => fn(signal), hookTimeoutMs, `${hookName} handler from ${pluginId}`);
}Prompt To Fix With AI
This is a comment left during a code review.
Path: src/plugins/hooks.ts
Line: 179-188
Comment:
**Timed-out hook promise is leaked**
When the timeout fires, `withTimeout` races the handler's promise against the abort promise and returns early — but the underlying hook-handler promise (`fn()`) continues executing in the background indefinitely. The `AbortSignal` that `withTimeout` exposes through its `work` callback is never forwarded to the actual hook handler because the wrapper is `() => fn()` (signal ignored).
For hooks doing I/O or significant async work (network calls, file I/O, DB queries) this means that work keeps running and consuming resources long after the agent has already moved on. Consider documenting this behavior, and if the plugin contract ever evolves to accept an `AbortSignal`, wiring it through here would allow cooperative cancellation:
```ts
async function callHandlerWithTimeout<T>(
fn: (signal?: AbortSignal) => Promise<T>,
hookName: PluginHookName,
pluginId: string,
): Promise<T> {
if (!hookTimeoutMs) {
return fn();
}
return withTimeout((signal) => fn(signal), hookTimeoutMs, `${hookName} handler from ${pluginId}`);
}
```
How can I resolve this? If you propose a fix, please make it concise.|
Good point about the leaked promise. Added a documentation comment in commit 6de4b04 explaining the limitation — since the current plugin hook contract ( Changing the handler signature would be a breaking change to the plugin API, so keeping it as documented behavior for now seemed like the right scope for this PR. |
6de4b04 to
3d5b7a4
Compare
|
Rebased on latest upstream/main and resolved conflicts in |
3d5b7a4 to
8499e58
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8499e58e8b
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
src/plugins/hooks.ts
Outdated
| if (!hookTimeoutMs) { | ||
| return fn(); | ||
| } | ||
| return withTimeout(() => fn(), hookTimeoutMs, `${hookName} handler from ${pluginId}`); |
There was a problem hiding this comment.
Scope the timeout away from fail-open policy hooks
This helper now wraps every async hook in the same timeout, not just before_agent_start. That changes the semantics of gatekeeper hooks: runMessageSending() and runBeforeToolCall() both flow through runModifyingHook(), while their callers in src/infra/outbound/deliver.ts:424-465 and src/agents/pi-tools.before-tool-call.ts:148-190 allow the send/tool call whenever the hook returns nothing or throws. In any deployment where a plugin does outbound moderation, ownership checks, or tool authorization over the network, a >10s check will now silently fail open instead of enforcing the policy.
Useful? React with 👍 / 👎.
src/plugins/hooks.ts
Outdated
| /** Default timeout for async plugin hook handlers (ms). */ | ||
| const DEFAULT_HOOK_TIMEOUT_MS = 10_000; |
There was a problem hiding this comment.
Raise the default timeout above the memory recall budget
The new default is only 10s, but the bundled memory-lancedb plugin runs its before_agent_start hook by making a live embedding request and then searching LanceDB (extensions/memory-lancedb/index.ts:548-555). OpenClaw’s own memory stack already budgets 60s for remote embedding queries and 5 minutes for local ones (src/memory/manager-embedding-ops.ts:39-40), so this default will spuriously disable auto-recall on cold or slower providers. In those environments the regression is user-visible: the agent starts without recalled memories even though the underlying embedding request is still within the system’s normal timeout budget.
Useful? React with 👍 / 👎.
8499e58 to
b430cf6
Compare
|
Addressed both review points in commit b430cf6: P1 — Scope timeout away from gatekeeper hooks: Added a P2 — Raise default timeout: Bumped Added tests verifying that |
b430cf6 to
42dec4e
Compare
|
Rebased on latest upstream/main and resolved conflict in |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 42dec4e4d3
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| }; | ||
|
|
||
| /** Default timeout for async plugin hook handlers (ms). */ | ||
| const DEFAULT_HOOK_TIMEOUT_MS = 30_000; |
There was a problem hiding this comment.
Raise default hook timeout above remote memory recall budget
Fresh evidence in this revision: DEFAULT_HOOK_TIMEOUT_MS is now 30_000, but the bundled memory-lancedb plugin still does a live embedding request in extensions/memory-lancedb/index.ts:548-555, while core embedding queries allow up to 60s remotely in src/memory/manager-embedding-ops.ts:39-40. In deployments using slower remote embedders, a recall that is still within the supported memory budget will now time out here, so the agent starts without recalled memories even though the underlying request has not actually failed.
Useful? React with 👍 / 👎.
| await callHandlerWithTimeout( | ||
| () => (hook.handler as (event: unknown, ctx: unknown) => Promise<void>)(event, ctx), | ||
| hookName, | ||
| hook.pluginId, | ||
| ); |
There was a problem hiding this comment.
Exempt
agent_end from the blanket async-hook timeout
runVoidHook() now sends every async void hook through callHandlerWithTimeout(), so agent_end inherits the same 30s cap. That breaks the bundled memory-lancedb auto-capture path in extensions/memory-lancedb/index.ts:576-657: it can perform up to three sequential embeddings.embed() calls plus DB searches/stores, while the core embedding budget is 60s remote / 5m local in src/memory/manager-embedding-ops.ts:39-42. On slower providers this change will drop memory capture after a run even though each embedding request is still within the system’s normal timeout.
Useful? React with 👍 / 👎.
|
Addressed both P2 comments in commit 533235d: P2 — Raise default hook timeout / memory recall budget (comment #2970594620): Added P2 — Exempt Updated the comment above |
533235d to
b57da38
Compare
|
Rebased on latest upstream/main and resolved conflict in |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b57da3803d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
src/plugins/hooks.ts
Outdated
| const TIMEOUT_EXEMPT_HOOKS: ReadonlySet<PluginHookName> = new Set([ | ||
| "message_sending", | ||
| "before_tool_call", | ||
| "before_agent_start", |
There was a problem hiding this comment.
Restore timeout protection for
before_agent_start
Fresh evidence in this revision: adding "before_agent_start" to TIMEOUT_EXEMPT_HOOKS means the original deadlock is still possible on the main agent-start path. Both src/agents/pi-embedded-runner/run.ts:348-353 and src/agents/pi-embedded-runner/run/attempt.ts:1436-1444 await runBeforeAgentStart(), so any plugin hook that never resolves here will still stall the run indefinitely with no timeout or recovery. That leaves the bug this change is meant to fix unresolved for the very hook named in the PR.
Useful? React with 👍 / 👎.
src/config/zod-schema.ts
Outdated
| plugins: z | ||
| .object({ | ||
| enabled: z.boolean().optional(), | ||
| hookTimeoutMs: z.number().int().min(0).optional(), |
There was a problem hiding this comment.
Regenerate the base config schema for
hookTimeoutMs
Adding plugins.hookTimeoutMs here is not enough for config-schema consumers, because buildConfigSchema() serves GENERATED_BASE_CONFIG_SCHEMA from src/config/schema.base.generated.ts, and that generated file still has no plugins.hookTimeoutMs entry. In practice the gateway config-schema API and any settings/editor UI built on it will not expose or round-trip this new option even though runtime validation now accepts it.
Useful? React with 👍 / 👎.
b57da38 to
26c775e
Compare
|
Addressed both review points in commit 26c775e: P1 — Restore timeout protection for memory hooks: Removed P2 — Regenerate config schema: Ran |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 26c775e61c
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
src/plugins/hooks.ts
Outdated
| if (!hookTimeoutMs || TIMEOUT_EXEMPT_HOOKS.has(hookName)) { | ||
| return fn(); | ||
| } | ||
| const effectiveTimeout = MEMORY_HOOK_NAMES.has(hookName) ? MEMORY_HOOK_TIMEOUT_MS : hookTimeoutMs; |
There was a problem hiding this comment.
Honor configured timeout for memory hooks
callHandlerWithTimeout() hard-codes before_agent_start and agent_end to 120000 ms, so plugins.hookTimeoutMs cannot raise the cap for those hooks. That breaks the new config knob for deployments using memory-lancedb: recall/capture in extensions/memory-lancedb/index.ts:548-657 can legitimately ride on embedding operations budgeted up to 5 minutes for local providers in src/memory/manager-embedding-ops.ts:39-42. In that environment, even an explicit plugins.hookTimeoutMs: 300000 will still cut memory recall/capture off after 2 minutes.
Useful? React with 👍 / 👎.
26c775e to
29d22f2
Compare
|
Fixed in commit 29d22f2 — memory hooks now use Added a test verifying that a 4-minute handler completes successfully when |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 29d22f2308
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const effectiveTimeout = MEMORY_HOOK_NAMES.has(hookName) | ||
| ? Math.max(MEMORY_HOOK_TIMEOUT_MS, hookTimeoutMs) | ||
| : hookTimeoutMs; |
There was a problem hiding this comment.
Exempt Discord subagent binding hooks from the 30s timeout
subagent_spawning now flows through callHandlerWithTimeout(), so the generic 30s cap applies to channel thread-binding hooks too. The built-in Discord hook in extensions/discord/src/subagent-hooks.ts:67-84 awaits autoBindSpawnedDiscordSubagent(), which can make several REST calls in extensions/discord/src/monitor/thread-bindings.lifecycle.ts:173-223 and extensions/discord/src/monitor/thread-bindings.discord-api.ts:240-297 without any tighter timeout. In environments where those Discord API calls take longer than 30s, src/agents/subagent-spawn.ts:257-283 now reports thread=true as unavailable even though the bind may still finish later, so threaded subagent spawns regress on slow or unhealthy Discord links.
Useful? React with 👍 / 👎.
29d22f2 to
57ec97a
Compare
|
Good catch — added |
57ec97a to
ffeac52
Compare
|
Rebased on latest upstream/main and resolved conflict in |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cde5f14e77
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| "./plugin-sdk/ollama": { | ||
| "types": "./dist/plugin-sdk/ollama.d.ts", | ||
| "default": "./dist/plugin-sdk/ollama.js" | ||
| }, |
There was a problem hiding this comment.
Restore exported ollama-setup SDK subpath
Adding ./plugin-sdk/ollama here without keeping the previously exported ./plugin-sdk/ollama-setup subpath creates a runtime compatibility break: extensions built against openclaw/plugin-sdk/ollama-setup will now throw ERR_PACKAGE_PATH_NOT_EXPORTED when loaded. Since this commit is otherwise a hook-timeout bugfix, dropping an existing package export is an unintended breaking API regression; keep the old subpath as a compatibility alias while introducing the new one.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Verified — this is a false positive. The ./plugin-sdk/ollama-setup subpath does not exist in upstream/main and never has:
$ git show upstream/main:package.json | grep ollama-setup
(no matches)
Upstream only has ./plugin-sdk/ollama at line 483, and our branch's package.json is identical to upstream (zero diff). No subpath was dropped by this PR.
cde5f14 to
2a34e3b
Compare
2a34e3b to
4cb88ca
Compare
Add configurable timeout (default 30s) to plugin lifecycle hooks to prevent a single misbehaving plugin from permanently blocking all agent runs. - Add hookTimeoutMs config option under plugins section - Wrap hook execution in Promise.race with timeout - Log warning with plugin name when hook exceeds timeout - Agent continues gracefully after timeout instead of hanging Fixes openclaw#48534
Add a code comment explaining that timed-out handler promises continue running in the background since the current plugin hook API does not accept an AbortSignal for cooperative cancellation. Notes the path forward for when the plugin contract evolves.
…t for memory hooks Remove before_agent_start and agent_end from TIMEOUT_EXEMPT_HOOKS so hung handlers can no longer stall runs indefinitely. Instead, give them a dedicated MEMORY_HOOK_TIMEOUT_MS of 120 s — enough headroom for remote embed() calls while still protecting against genuinely stuck handlers. Regenerate base config schema and config baseline for hookTimeoutMs.
Use Math.max(MEMORY_HOOK_TIMEOUT_MS, hookTimeoutMs) for memory-critical hooks so that plugins.hookTimeoutMs can raise the cap above the 120s floor. Previously, memory hooks were hard-capped at 120s regardless of the configured value, preventing deployments with slow local embedding providers from increasing the budget via config. Fixes the case where hookTimeoutMs: 300000 (5 min) would still cut off memory recall/capture at 2 minutes.
Channel thread-binding hooks (e.g. Discord) make several REST calls during subagent_spawning that can legitimately exceed the generic 30s timeout. Add subagent_spawning to the elevated timeout set so it gets at least 120s (or the configured hookTimeoutMs, whichever is higher), preventing premature timeout failures for threaded subagent spawns on slow or congested Discord links.
Values above 2,147,483,647 overflow Node's setTimeout to ~1ms, silently inverting behavior. Add .max(2_147_483_647) to the Zod schema so invalid values are caught at config validation time.
hasExplicitPluginConfig did not consider plugins.hookTimeoutMs, so in
VITEST runs applyTestPluginDefaults would classify configs like
{ plugins: { hookTimeoutMs: 5000 } } as non-explicit and force
plugins.enabled = false. Recognize hookTimeoutMs as explicit plugin
configuration to prevent test/prod behavior drift.
4cb88ca to
36233b7
Compare
|
Rebased on latest upstream/main and resolved conflicts in |
Summary
Fixes #48534 — A plugin
before_agent_starthook that hangs permanently blocks all agent runs with zero diagnostic output.Root Cause
Plugin lifecycle hooks (
before_agent_start, etc.) were executed withPromise.allbut without any timeout protection. If a single plugin hook hangs indefinitely, it blocks all agent runs permanently with no diagnostic output to help identify the culprit.Changes
1.
src/plugins/hooks.tswithHookTimeout()wrapper that races hook execution against a configurable timeout2.
src/plugins/hook-runner-global.ts3.
src/plugins/loader.tshookTimeoutMsfrom plugin config and pass to hook runner4.
src/config/types.plugins.ts+zod-schema.ts+schema.help.ts+schema.labels.tshookTimeoutMsconfig option (default: 30000ms)5.
src/plugins/hooks.timeout.test.tsChange Type
Security Impact
Improves reliability — prevents a single misbehaving plugin from permanently blocking the agent runtime.
Human Verification