Closed
Description
Bug Description
After Gateway restart (SIGUSR1), subagent tasks were delayed by ~44 minutes before execution. Investigation suggests that drainLane() may get stuck with draining=true in edge cases, causing the queue to be permanently blocked.
Environment
- OpenClaw version: 2026.2.25
- Platform: Ubuntu 24.04
- Node.js: v22+
Observed Behavior
Timeline:
- 14:50: 3 subagent tasks dispatched
- 14:53: Gateway restarted via SIGUSR1
- 14:53-15:34: Tasks queued but not executed
- 15:34: Tasks finally started executing
Configuration:
{
  "agents": {
    "defaults": {
      "subagents": {
        "maxConcurrent": 5,
        "archiveAfterMinutes": 30
      }
    }
  }
}
Root Cause Analysis
Code Location
src/process/command-queue.ts - drainLane() function
Problematic Code
function drainLane(lane: string) {
  const state = getLaneState(lane);
  if (state.draining) {
    return; // 🚨 If draining=true, permanently blocked
  }
  state.draining = true;
  const pump = () => {
    while (state.activeTaskIds.size < state.maxConcurrent && state.queue.length > 0) {
      // ... task execution in async IIFE
    }
    state.draining = false; // 🚨 If pump() throws synchronously, this won't execute
  };
  pump(); // No try/finally
}
Potential Bug Scenario
- drainLane() is called after resetAllLanes() during SIGUSR1 restart
- During pump() execution, an unexpected synchronous error occurs (edge case)
- state.draining remains true
- All subsequent drainLane() calls return immediately because draining=true
- Queue is permanently blocked until Gateway process restart
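The scenario above can be sketched in isolation. This is a minimal stand-in, not the real OpenClaw code: `LaneState`, `drainLaneUnsafe`, and the synchronous task runner are hypothetical simplifications (the actual pump runs tasks in an async IIFE and tracks `activeTaskIds` and `maxConcurrent`). It only shows how a synchronous throw leaves the flag set.

```typescript
// Hypothetical, simplified lane state; not the real OpenClaw types.
interface LaneState {
  draining: boolean;
  queue: Array<() => void>;
}

const lane: LaneState = { draining: false, queue: [] };

// Same shape as the reported drainLane(): the flag is reset only on the
// normal exit path, with no try/finally around the loop.
function drainLaneUnsafe(state: LaneState): boolean {
  if (state.draining) {
    return false; // early return: nothing is dequeued
  }
  state.draining = true;
  while (state.queue.length > 0) {
    const task = state.queue.shift()!;
    task(); // a synchronous throw here skips the reset below
  }
  state.draining = false;
  return true;
}

lane.queue.push(() => {
  throw new Error("simulated synchronous failure");
});
lane.queue.push(() => console.log("second task ran"));

let firstCallThrew = false;
try {
  drainLaneUnsafe(lane);
} catch {
  firstCallThrew = true;
}

// draining is still true here, so this call returns false and the
// second task stays in the queue.
const secondCallDrained = drainLaneUnsafe(lane);
console.log({
  firstCallThrew,
  draining: lane.draining,
  secondCallDrained,
  remaining: lane.queue.length,
});
```

Whether the real pump() can actually throw synchronously after resetAllLanes() is not confirmed by this sketch; it only illustrates the consequence if it does.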
Suggested Fix
Add try/finally to ensure draining is always reset:
function drainLane(lane: string) {
  const state = getLaneState(lane);
  if (state.draining) {
    return;
  }
  state.draining = true;
  const pump = () => {
    try {
      while (state.activeTaskIds.size < state.maxConcurrent && state.queue.length > 0) {
        // ... task execution
      }
    } finally {
      state.draining = false; // ✅ Always reset, even on error
    }
  };
  pump();
}
Additional Recommendation
Add diagnostic logging when drainLane() is blocked:
function drainLane(lane: string) {
  const state = getLaneState(lane);
  if (state.draining) {
    diag.warn(`drainLane blocked: lane=${lane} draining=true queue=${state.queue.length}`);
    return;
  }
  // ...
}
Impact
- Severity: Medium
- Affected: Subagent task reliability after Gateway restart
- Recovery: Requires Gateway process restart (not just SIGUSR1 in-process restart)
Workaround
Full Gateway process restart (not SIGUSR1) will reset the in-memory lane state.
Related Code
- resetAllLanes() in src/process/command-queue.ts - Called during SIGUSR1 restart
- applyGatewayLaneConcurrency() in src/gateway/server-lanes.ts - Sets lane concurrency
Note: I initially suspected the maxConcurrent config wasn't being applied, but after reviewing the source, I confirmed that applyGatewayLaneConcurrency() is correctly called at Gateway startup. The real issue appears to be the draining flag getting stuck.
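For anyone who wants to check the suggested try/finally fix and the diagnostic warning in isolation, here is a minimal sketch. The names `LaneState`, `drainLaneSafe`, and the `warnings` array (standing in for `diag.warn`) are hypothetical, and tasks run synchronously here rather than in an async IIFE as in the real pump.

```typescript
// Hypothetical, simplified lane state; not the real OpenClaw types.
interface LaneState {
  draining: boolean;
  queue: Array<() => void>;
}

// Stand-in for diag.warn so the sketch has no external dependencies.
const warnings: string[] = [];

function drainLaneSafe(name: string, state: LaneState): void {
  if (state.draining) {
    warnings.push(
      `drainLane blocked: lane=${name} draining=true queue=${state.queue.length}`,
    );
    return;
  }
  state.draining = true;
  try {
    while (state.queue.length > 0) {
      const task = state.queue.shift()!;
      task();
    }
  } finally {
    state.draining = false; // reset on both normal and error exits
  }
}

const ran: string[] = [];
const state: LaneState = { draining: false, queue: [] };
state.queue.push(() => {
  throw new Error("simulated synchronous failure");
});
state.queue.push(() => ran.push("second"));

try {
  drainLaneSafe("demo", state);
} catch {
  // The error still propagates to the caller; only the flag is protected.
}
const drainingAfterError = state.draining;

// The flag was reset, so this call drains the remaining task.
drainLaneSafe("demo", state);

// A re-entrant call while a drain is in progress hits the warning path.
state.queue.push(() => drainLaneSafe("demo", state));
drainLaneSafe("demo", state);

console.log({ drainingAfterError, ran, warnings });
```

Note that finally does not swallow the error, so the caller of drainLane() would still need to handle or log it. The warning path also fires on legitimate re-entrant calls, so in practice it may be noisy unless it is rate-limited or only emitted when the queue is non-empty.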