[Bug] Subagent tasks delayed 44 minutes after Gateway restart - drainLane may get stuck with draining=true #27407

@fryccerGit

Bug Description

After a Gateway restart (SIGUSR1), subagent tasks sat in the queue for ~44 minutes before they executed. Investigation suggests that drainLane() may get stuck with draining=true in edge cases, so every later drainLane() call returns early and the queue stays blocked.

Environment

  • OpenClaw version: 2026.2.25
  • Platform: Ubuntu 24.04
  • Node.js: v22+

Observed Behavior

Timeline:

  • 14:50: 3 subagent tasks dispatched
  • 14:53: Gateway restarted via SIGUSR1
  • 14:53-15:34: Tasks queued but not executed
  • 15:34: Tasks finally started executing

Configuration:

{
  "agents": {
    "defaults": {
      "subagents": {
        "maxConcurrent": 5,
        "archiveAfterMinutes": 30
      }
    }
  }
}

Root Cause Analysis

Code Location

src/process/command-queue.ts - drainLane() function

Problematic Code

function drainLane(lane: string) {
  const state = getLaneState(lane);
  if (state.draining) {
    return;  // 🚨 If draining=true, permanently blocked
  }
  state.draining = true;

  const pump = () => {
    while (state.activeTaskIds.size < state.maxConcurrent && state.queue.length > 0) {
      // ... task execution in async IIFE
    }
    state.draining = false;  // 🚨 If pump() throws synchronously, this won't execute
  };

  pump();  // No try/finally
}

Potential Bug Scenario

  1. drainLane() is called after resetAllLanes() during SIGUSR1 restart
  2. During pump() execution, an unexpected synchronous error occurs (edge case)
  3. state.draining remains true
  4. All subsequent drainLane() calls return immediately because draining=true
  5. Queue is permanently blocked until Gateway process restart
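
The scenario above can be reproduced in isolation. This is a minimal sketch, not the OpenClaw source: LaneState, drainLaneUnguarded, and the "subagent" lane name are hypothetical stand-ins, and tasks run synchronously here to keep the example short (the real code runs them in an async IIFE).

```typescript
// Hypothetical, simplified lane state. Not the real OpenClaw types.
interface LaneState {
  draining: boolean;
  queue: Array<() => void>;
}

const lanes = new Map<string, LaneState>();

function getLaneState(lane: string): LaneState {
  let laneState = lanes.get(lane);
  if (!laneState) {
    laneState = { draining: false, queue: [] };
    lanes.set(lane, laneState);
  }
  return laneState;
}

// Same shape as the reported drainLane(): the flag is reset only on the
// normal exit path. Returns true if the lane was pumped, false if skipped.
function drainLaneUnguarded(lane: string): boolean {
  const laneState = getLaneState(lane);
  if (laneState.draining) {
    return false;
  }
  laneState.draining = true;
  while (laneState.queue.length > 0) {
    const task = laneState.queue.shift()!;
    task(); // a synchronous throw here skips the reset below
  }
  laneState.draining = false;
  return true;
}

const demoLane = getLaneState("subagent");
let executedCount = 0;
demoLane.queue.push(() => {
  throw new Error("simulated synchronous failure");
});
demoLane.queue.push(() => {
  executedCount += 1;
});

let firstCallThrew = false;
try {
  drainLaneUnguarded("subagent");
} catch {
  firstCallThrew = true;
}

// The second call is skipped because draining was never reset.
const secondCallPumped = drainLaneUnguarded("subagent");
console.log({
  firstCallThrew,
  draining: demoLane.draining,
  secondCallPumped,
  stillQueued: demoLane.queue.length,
  executedCount,
});
```

In this sketch the second queued task never runs: the first call throws, draining stays true, and the second call returns without touching the queue.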

Suggested Fix

Add try/finally to ensure draining is always reset:

function drainLane(lane: string) {
  const state = getLaneState(lane);
  if (state.draining) {
    return;
  }
  state.draining = true;

  const pump = () => {
    try {
      while (state.activeTaskIds.size < state.maxConcurrent && state.queue.length > 0) {
        // ... task execution
      }
    } finally {
      state.draining = false;  // ✅ Always reset, even on error
    }
  };

  pump();
}
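
To check that the guard behaves as intended, the same failure can be replayed against a try/finally version. Again a sketch under assumptions: GuardedLaneState and drainGuarded are hypothetical names, and tasks are synchronous for brevity.

```typescript
// Hypothetical, simplified lane state. Not the real OpenClaw types.
interface GuardedLaneState {
  draining: boolean;
  queue: Array<() => void>;
}

// Returns true if the lane was pumped, false if the call was skipped.
function drainGuarded(laneState: GuardedLaneState): boolean {
  if (laneState.draining) {
    return false;
  }
  laneState.draining = true;
  try {
    while (laneState.queue.length > 0) {
      const task = laneState.queue.shift()!;
      task();
    }
  } finally {
    // Runs on both the normal and the throwing path.
    laneState.draining = false;
  }
  return true;
}

const guardedLane: GuardedLaneState = { draining: false, queue: [] };
const completed: string[] = [];
guardedLane.queue.push(() => {
  throw new Error("simulated synchronous failure");
});
guardedLane.queue.push(() => {
  completed.push("second task");
});

let guardedCallThrew = false;
try {
  drainGuarded(guardedLane);
} catch {
  guardedCallThrew = true;
}
const drainingAfterError = guardedLane.draining;

// Because the flag was reset, a later call can pump the remaining task.
const retryPumped = drainGuarded(guardedLane);
console.log({ guardedCallThrew, drainingAfterError, retryPumped, completed });
```

Note that try/finally does not swallow the error: it still propagates to the caller of drainLane(), so the caller may also need to catch and log it. The remaining tasks only run once something calls drainLane() again.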

Additional Recommendation

Add diagnostic logging when drainLane() is blocked:

function drainLane(lane: string) {
  const state = getLaneState(lane);
  if (state.draining) {
    diag.warn(`drainLane blocked: lane=${lane} draining=true queue=${state.queue.length}`);
    return;
  }
  // ...
}
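
One possible refinement, offered as a sketch only: record when draining was set and include the elapsed time in the warning, so a flag held for minutes stands out from a brief reentrant call. The drainingSince field, the warnIfBlocked helper, and the diag stub below are all hypothetical; the real diag API is not shown in this issue.

```typescript
// Hypothetical logger stub that also records messages for inspection.
const diagMessages: string[] = [];
const diag = {
  warn: (message: string): void => {
    diagMessages.push(message);
    console.warn(message);
  },
};

// Hypothetical lane state with an extra timestamp (ms since epoch),
// set whenever draining flips to true and cleared when it flips back.
interface TimedLaneState {
  draining: boolean;
  drainingSince: number | null;
  queue: unknown[];
}

// Logs and returns true when the lane is blocked, false otherwise.
function warnIfBlocked(
  lane: string,
  laneState: TimedLaneState,
  now: number = Date.now(),
): boolean {
  if (!laneState.draining) {
    return false;
  }
  const heldMs =
    laneState.drainingSince === null
      ? "unknown"
      : String(now - laneState.drainingSince);
  diag.warn(
    `drainLane blocked: lane=${lane} draining=true ` +
      `queue=${laneState.queue.length} heldMs=${heldMs}`,
  );
  return true;
}

const stuckLane: TimedLaneState = {
  draining: true,
  drainingSince: 1_000,
  queue: ["a", "b", "c"],
};
const wasBlocked = warnIfBlocked("subagent", stuckLane, 61_000);
```

drainLane() would call such a helper at the top and return early when it reports a blocked lane.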

Impact

  • Severity: Medium
  • Affected: Subagent task reliability after Gateway restart
  • Recovery: Requires Gateway process restart (not just SIGUSR1 in-process restart)

Workaround

A full Gateway process restart (not a SIGUSR1 in-process restart) resets the in-memory lane state.

Related Code

  • resetAllLanes() in src/process/command-queue.ts - Called during SIGUSR1 restart
  • applyGatewayLaneConcurrency() in src/gateway/server-lanes.ts - Sets lane concurrency

Note: I initially suspected the maxConcurrent config wasn't being applied, but after reviewing the source, I confirmed that applyGatewayLaneConcurrency() is correctly called at Gateway startup. The real issue appears to be the draining flag getting stuck.
