GatewayDrainingError should auto-retry, not surface to user #55412

@assimetria-ai

Problem

When the gateway restarts (e.g., after config.patch), any in-flight agent run that issues a new command during the drain window fails with GatewayDrainingError. The error falls through to the generic error handler in agent-runner.runtime and surfaces to the user as:

⚠️ Agent failed before reply: Gateway is draining for restart; new tasks are not accepted.
Logs: openclaw logs --follow

This is a transient error: the gateway comes back up seconds later. But the user sees an error and assumes something is broken.

Expected behavior

GatewayDrainingError should be treated like isTransientHttp errors: retried automatically after a short delay (e.g., wait for the restart to complete, then retry). The error should not surface to the user, since it resolves on its own once the gateway is back up.

Current behavior

In agent-runner.runtime, the error handling chain checks for billing, context overflow, role ordering, session corruption, and transient HTTP errors. GatewayDrainingError is not checked, so it falls through to the generic "Agent failed before reply" message.
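
To make the missing check concrete, here is a minimal sketch of a predicate that could sit alongside the existing checks in that chain. The function name `isGatewayDraining` and the constant are hypothetical (I have not checked how agent-runner.runtime structures its checks); only the error name and the message text come from the output shown above.

```javascript
// Hypothetical helper: the name and shape are assumptions, not OpenClaw code.
// Only 'GatewayDrainingError' and the message text come from this issue.
const DRAINING_MESSAGE = 'Gateway is draining';

function isGatewayDraining(error) {
  if (error == null) return false;
  // Prefer the error name; it survives changes to the message wording.
  if (error.name === 'GatewayDrainingError') return true;
  // Fall back to the message for errors that were serialized or re-wrapped.
  const message = typeof error === 'string' ? error : error.message;
  return typeof message === 'string' && message.includes(DRAINING_MESSAGE);
}
```

Matching on the name first and the message second means the check still works if the error crosses a process boundary and loses its class.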

Suggested fix

Add a check before the generic error handler:

if (error?.name === 'GatewayDrainingError' || message?.includes('Gateway is draining')) {
  // Transient: the gateway is restarting. Wait for the restart to complete
  // (poll gateway health, or fall back to a fixed delay), then retry.
  await new Promise((resolve) => setTimeout(resolve, 15000));
  continue; // retry the run
}
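
As an alternative to the fixed delay, the "poll gateway health" option could look roughly like the sketch below. Everything here is hypothetical: `runOnce` stands in for whatever executes a single agent run, `isGatewayHealthy` for whatever health probe the gateway exposes, and the option names and defaults are placeholders. The retry count is bounded so that a gateway that never comes back still produces an error instead of looping forever.

```javascript
// Hypothetical sketch: runOnce, isGatewayHealthy, and the option names are
// placeholders for illustration, not OpenClaw APIs.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

function isDraining(error) {
  return error?.name === 'GatewayDrainingError' ||
    String(error?.message ?? '').includes('Gateway is draining');
}

// Poll the health probe until it reports healthy or the deadline passes.
async function waitForGateway(isGatewayHealthy, { timeoutMs, pollMs }) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      if (await isGatewayHealthy()) return true;
    } catch {
      // A failing probe means "not up yet"; keep polling.
    }
    await sleep(pollMs);
  }
  return false;
}

async function runWithDrainRetry(runOnce, isGatewayHealthy, options = {}) {
  const { maxRetries = 3, timeoutMs = 30000, pollMs = 500 } = options;
  for (let attempt = 0; ; attempt++) {
    try {
      return await runOnce();
    } catch (error) {
      // Non-drain errors, and drain errors past the retry budget, go to the
      // caller's existing error handling unchanged.
      if (!isDraining(error) || attempt >= maxRetries) throw error;
      const healthy = await waitForGateway(isGatewayHealthy, { timeoutMs, pollMs });
      if (!healthy) throw error;
    }
  }
}
```

With this shape the user only sees the draining error if the restart takes longer than the retry budget allows, which seems like the right time to surface it.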

Environment

  • OpenClaw 2026.3.24
  • macOS, local gateway, config.patch triggered restart
  • Happens every time a restart occurs while agents are active
