
[Bug]: Gateway/provider race: stale heartbeat reconnect callback throws after disconnect #63387

@Megaplex1

Summary

I hit an internal invariant failure in the gateway/provider websocket heartbeat logic:

Attempted to reconnect zombie connection after disconnecting first (this shouldn't be possible)

This appears to be a race where a stale heartbeat reconnect callback fires after the websocket/session has already been closed or disconnected.

Error

Attempted to reconnect zombie connection after disconnecting first (this shouldn't be possible)

What I found

The installed code throws from the provider heartbeat/reconnect path in the built distribution:

  • file: dist/provider-DEWH9yd9.js

Relevant logic is effectively:

startHeartbeat(this, {
  interval,
  reconnectCallback: () => {
    if (closed) throw new Error("Attempted to reconnect zombie connection after disconnecting first (this shouldn't be possible)");
    closed = true;
    this.handleZombieConnection();
  }
});

And heartbeat scheduling looks like:

function startHeartbeat(manager, options) {
  stopHeartbeat(manager);
  const sendHeartbeat = () => {
    if (!manager.lastHeartbeatAck) {
      options.reconnectCallback();
      return;
    }
    manager.lastHeartbeatAck = false;
    manager.send({ op: GatewayOpcodes.Heartbeat, d: manager.sequence });
  };
  manager.firstHeartbeatTimeout = setTimeout(() => {
    sendHeartbeat();
    manager.heartbeatInterval = setInterval(sendHeartbeat, interval);
  }, initialDelay);
}

disconnect() does call stopHeartbeat(this), but a stale timer or an overlapping close/reconnect sequence still appears able to run reconnectCallback() on a connection object that has already been closed.
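One path that would produce this error can be sketched from the snippet above. It is a guess, not something I confirmed in the real code: it assumes handleZombieConnection() ends up calling stopHeartbeat() synchronously. In that case the first-heartbeat timeout callback installs the interval after cleanup has already run, so the interval is never cleared. The fake timer queue and the reproduce() helper are made up for illustration so the sequence runs deterministically.

```javascript
// Manual timer queue standing in for setTimeout/setInterval (illustration only).
function createFakeTimers() {
  let nextId = 1;
  const timers = new Map();
  return {
    setTimeout(fn) { const id = nextId++; timers.set(id, { fn, repeat: false }); return id; },
    setInterval(fn) { const id = nextId++; timers.set(id, { fn, repeat: true }); return id; },
    clear(id) { timers.delete(id); },
    // Fire each timer that is pending at the start of the tick, once.
    tick() {
      for (const [id, timer] of [...timers]) {
        if (!timers.has(id)) continue; // cleared by an earlier callback in this tick
        if (!timer.repeat) timers.delete(id);
        timer.fn();
      }
    },
  };
}

function stopHeartbeat(manager, timers) {
  if (manager.firstHeartbeatTimeout) timers.clear(manager.firstHeartbeatTimeout);
  if (manager.heartbeatInterval) timers.clear(manager.heartbeatInterval);
  manager.firstHeartbeatTimeout = undefined;
  manager.heartbeatInterval = undefined;
}

// Same shape as the scheduling logic quoted above, minus the actual send.
function startHeartbeat(manager, timers, reconnectCallback) {
  stopHeartbeat(manager, timers);
  const sendHeartbeat = () => {
    if (!manager.lastHeartbeatAck) {
      reconnectCallback();
      return;
    }
    manager.lastHeartbeatAck = false;
  };
  manager.firstHeartbeatTimeout = timers.setTimeout(() => {
    sendHeartbeat();
    // Runs even if sendHeartbeat() just triggered cleanup, so this interval outlives it.
    manager.heartbeatInterval = timers.setInterval(sendHeartbeat);
  });
}

function reproduce() {
  const timers = createFakeTimers();
  const manager = { lastHeartbeatAck: false };
  let closed = false;
  startHeartbeat(manager, timers, () => {
    if (closed) throw new Error("Attempted to reconnect zombie connection after disconnecting first (this shouldn't be possible)");
    closed = true;
    stopHeartbeat(manager, timers); // ASSUMPTION: stand-in for handleZombieConnection() -> disconnect()
  });
  timers.tick(); // first heartbeat: no ack, callback cleans up, then the interval is installed
  try {
    timers.tick(); // leaked interval fires on the already-closed connection
    return "no error";
  } catch (err) {
    return err.message;
  }
}
```

If the real handleZombieConnection() reconnects right away and startHeartbeat() runs again, the leaked interval may get cleared by the next stopHeartbeat() call, so this path would only matter when the connection stays down.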

Expected behavior

A stale heartbeat callback should exit quietly or no-op once the connection/session has already been closed, not throw an exception.

Actual behavior

An exception is thrown from reconnect logic for a connection that was already considered closed/disconnected.

Suspected cause

Race between:

  • websocket/session disconnect/cleanup
  • pending heartbeat timeout or interval callback
  • reconnect callback closure retaining stale closed state

Observed impact

  • noisy internal exception
  • possible gateway/provider instability after the event
  • openclaw status did not return cleanly around the same time, which may indicate daemon state disruption

Trigger conditions

Not 100% certain, but likely one of:

  • transient network flap
  • websocket close during heartbeat timing window
  • rapid gateway restart/reload while old provider session is unwinding
  • delayed/missed heartbeat ack followed by overlapping reconnect and close

Suggested fix

Defensively guard the reconnect callback and the heartbeat send path so that stale callbacks do not throw after disconnect. For example:

  • no-op if websocket/session is no longer current
  • no-op if manager is already disconnected
  • bind heartbeat callbacks to a connection generation/token and ignore stale generations
  • avoid throwing on closed === true; log/debug and return instead
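A minimal sketch of the generation/token idea, to make the suggestion concrete. Everything here is hypothetical: createHeartbeat and its option names (isAcked, sendHeartbeat, onZombie, log) are not OpenClaw's API, just one way the guard could look. Each start() captures the current generation, stop() bumps it, and any callback from an older generation logs and returns instead of throwing.

```javascript
// Hypothetical helper; names and option shape are illustrative, not the real provider code.
function createHeartbeat({ interval, initialDelay, isAcked, sendHeartbeat, onZombie, log = () => {} }) {
  let generation = 0;
  let firstTimeout;
  let intervalTimer;

  function stop() {
    generation += 1; // invalidates every callback scheduled before this point
    clearTimeout(firstTimeout);
    clearInterval(intervalTimer);
    firstTimeout = undefined;
    intervalTimer = undefined;
  }

  function start() {
    stop();
    const myGeneration = generation;
    const beat = () => {
      if (myGeneration !== generation) {
        log("ignoring stale heartbeat callback");
        return false; // stale: no-op instead of throwing
      }
      if (!isAcked()) {
        stop(); // retire this generation before handing off to reconnect logic
        onZombie();
        return false;
      }
      sendHeartbeat();
      return true;
    };
    firstTimeout = setTimeout(() => {
      // Only install the interval if the first beat left this generation alive.
      if (beat()) intervalTimer = setInterval(beat, interval);
    }, initialDelay);
    return beat; // returned so a caller or test can trigger a beat by hand
  }

  return { start, stop };
}

// Usage: a callback that outlives its connection becomes a quiet no-op.
const events = [];
const heartbeat = createHeartbeat({
  interval: 1000,
  initialDelay: 100,
  isAcked: () => false,
  sendHeartbeat: () => events.push("beat"),
  onZombie: () => events.push("zombie"),
});
const staleBeat = heartbeat.start();
heartbeat.stop();
staleBeat(); // ignored: the generation changed when stop() ran
```

Checking the first beat's result before calling setInterval also means the interval is never installed after the zombie path has already run.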

Environment

  • OpenClaw installed via npm on Windows
  • observed in built dist file: dist/provider-DEWH9yd9.js

If useful, I can provide a fuller stack trace/log context.
