Multi-agent orchestration is unstable: concurrent agents add/config overwrites, session-lock failures, and detached child work #43367

@waliddafif

Summary

I tried to orchestrate a small parallel coding batch from the OpenClaw CLI on 2026.3.8 and hit a cluster of failures that make multi-agent runs unreliable in practice:

  1. openclaw agents add appears unsafe when invoked concurrently: config gets overwritten repeatedly and only a subset of agents persist.
  2. Concurrent openclaw agent runs hit session lock timeouts even with isolated agents/workspaces.
  3. After a lock failure, the model failover chain eventually reaches openai-codex and hits OAuth refresh races (refresh_token_reused).
  4. Some agent runs appear to fail at the CLI layer but continue detached in the background, leaving session locks and child work (for example next build, npm install) running without a clean handle.

This makes it hard to use OpenClaw as an orchestrator for parallel coding tasks, even when each agent has its own workspace.

Version / environment

  • OpenClaw CLI: 2026.3.8 (3caab92)
  • Host: Linux
  • Gateway mode: local
  • Main model default: openai-codex/gpt-5.3-codex
  • Per-agent test model: anthropic/claude-sonnet-4-6

Reproduction

I created 4 isolated git worktrees for the same repo, then tried to create and run 4 isolated agents for parallel work.

1. Concurrent agent creation

Commands like:

openclaw agents add lane125 --non-interactive --workspace /path/to/worktree1 --json
openclaw agents add lane128 --non-interactive --workspace /path/to/worktree2 --json
openclaw agents add lane129 --non-interactive --workspace /path/to/worktree3 --json
openclaw agents add lane130 --non-interactive --workspace /path/to/worktree4 --json

Observed behavior:

  • repeated output like:
Config overwrite: /home/user/.openclaw/openclaw.json (... -> ..., backup=...)
  • after the parallel adds completed, openclaw agents list --json showed only a subset of the agents.
  • creating the same agents sequentially worked much more reliably.
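
The symptom above is consistent with a lost-update race on a shared JSON file: each process reads the config, appends its agent, and writes the whole file back, so the last writer wins. Whether openclaw agents add actually works this way is an assumption and is not confirmed against the OpenClaw source. A minimal Python sketch of a guarded read-modify-write, using a hypothetical {"agents": [...]} layout and a hypothetical sidecar lock file (neither is OpenClaw's real schema):

```python
import fcntl
import json
import os
import tempfile


def add_agent(config_path: str, name: str, workspace: str) -> None:
    """Append one agent entry under an exclusive advisory lock (Linux/POSIX)."""
    lock_path = config_path + ".lock"  # hypothetical sidecar lock file
    with open(lock_path, "w") as lock_file:
        # Blocks until no other process or thread holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            # Re-read inside the lock so earlier writers are not clobbered.
            try:
                with open(config_path) as f:
                    config = json.load(f)
            except FileNotFoundError:
                config = {}

            # "agents" / "id" / "workspace" are a hypothetical layout.
            config.setdefault("agents", []).append(
                {"id": name, "workspace": workspace}
            )

            # Write a temp file in the same directory, then rename atomically
            # so readers never observe a half-written config.
            directory = os.path.dirname(os.path.abspath(config_path))
            fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
            with os.fdopen(fd, "w") as tmp:
                json.dump(config, tmp, indent=2)
                tmp.flush()
                os.fsync(tmp.fileno())
            os.replace(tmp_path, config_path)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

The lock covers the read as well as the write; locking only the write would still let two callers start from the same stale snapshot.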

2. Concurrent agent runs

After recreating the agents sequentially, I launched multiple runs in parallel, for example:

openclaw agent --agent lane125 --message "..." --json --timeout 2400
openclaw agent --agent lane128 --message "..." --json --timeout 2400
openclaw agent --agent lane129 --message "..." --json --timeout 2400
openclaw agent --agent lane130 --message "..." --json --timeout 2400

I also tested --local.

Observed failures included:

Gateway agent failed; falling back to embedded: Error: Error: All models failed (3): anthropic/claude-sonnet-4-6: session file locked (timeout 10000ms): pid=57884 /home/user/.openclaw/agents/lane129/sessions/2eb87792-30e5-4284-9d30-50f4b384f884.jsonl.lock (timeout) | google-gemini-cli/gemini-3-pro-preview: session file locked (timeout 10000ms): pid=57884 /home/user/.openclaw/agents/lane129/sessions/2eb87792-30e5-4284-9d30-50f4b384f884.jsonl.lock (timeout) | openai-codex/gpt-5.3-codex: OAuth token refresh failed for openai-codex: Failed to refresh OAuth token for openai-codex. Please try again or re-authenticate. (auth)

and:

[openai-codex] Token refresh failed: 401 {
  "error": {
    "message": "Your refresh token has already been used to generate a new access token. Please try signing in again.",
    "type": "invalid_request_error",
    "param": null,
    "code": "refresh_token_reused"
  }
}
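
A refresh_token_reused response is what a provider that rotates refresh tokens returns when the same refresh token is presented twice. One way to avoid that inside a single process is to serialize refreshes and let late callers reuse the token the first caller obtained. A Python sketch, assuming a hypothetical refresh_fn that exchanges a refresh token for a new token pair; this is not OpenClaw's auth code, and a cross-process version would also need a file lock around the credential store:

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tokens:
    access: str
    refresh: str
    expires_at: float  # Unix timestamp


class TokenStore:
    """Hands out access tokens; at most one refresh runs at a time."""

    def __init__(self, tokens: Tokens, refresh_fn: Callable[[str], Tokens]):
        self._tokens = tokens
        self._refresh_fn = refresh_fn  # hypothetical provider call
        self._lock = threading.Lock()

    def access_token(self, skew: float = 30.0) -> str:
        with self._lock:
            # Re-check under the lock: another caller may already have
            # refreshed, in which case the stored token is fresh again.
            if time.time() + skew >= self._tokens.expires_at:
                self._tokens = self._refresh_fn(self._tokens.refresh)
            return self._tokens.access
```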

The lock files were agent-specific, for example:

  • /home/user/.openclaw/agents/lane129/sessions/...jsonl.lock
  • /home/user/.openclaw/agents/lane130/sessions/...jsonl.lock
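
The lock error names the holding PID and the lock path, so an operator can check whether the holder is still alive before choosing between waiting and cleanup. A small Python helper, with a regular expression written against the message format quoted above (it is an assumption that this format is stable across versions):

```python
import os
import re
from typing import NamedTuple, Optional

# Matches e.g. "session file locked (timeout 10000ms): pid=57884 /path/x.jsonl.lock"
LOCK_ERROR = re.compile(
    r"session file locked \(timeout (?P<timeout>\d+)ms\): "
    r"pid=(?P<pid>\d+) (?P<path>\S+\.lock)"
)


class LockHolder(NamedTuple):
    pid: int
    path: str
    timeout_ms: int


def parse_lock_error(message: str) -> Optional[LockHolder]:
    """Extract the first lock holder mentioned in an error message, if any."""
    m = LOCK_ERROR.search(message)
    if m is None:
        return None
    return LockHolder(int(m["pid"]), m["path"], int(m["timeout"]))


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence and permission
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but is owned by another user
    return True
```

A live PID does not prove the lock is legitimate, since PIDs can be reused, but a dead PID is a strong hint that the lock is stale.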

Unexpected detached work

In at least one case, the CLI path reported failure or became unusable, but the underlying agent had clearly kept working in the background:

  • session lock file remained active
  • openclaw-agent processes were still running
  • child processes in the isolated worktrees were still running, for example:
    • next build --turbopack
    • npm install

So from the operator perspective:

  • the command looked failed or unrecoverable
  • but the agent was still mutating the worktree in the background
  • cleanup had to be manual (kill, worktree cleanup, agent deletion)
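
Until the CLI offers a cancel path, an orchestrating script can at least keep a handle on what it launches by starting each command in its own process group and signalling the whole group on timeout. A Python sketch using only the standard library; it covers descendants of the CLI process itself and would not reach work spawned by a separately running gateway daemon (where the detached work actually lives is an assumption, not something confirmed here):

```python
import os
import signal
import subprocess
from typing import Sequence, Tuple


def _signal_group(pgid: int, sig: int) -> None:
    try:
        os.killpg(pgid, sig)
    except ProcessLookupError:
        pass  # the group is already gone


def run_in_group(
    cmd: Sequence[str], timeout: float, grace: float = 5.0
) -> Tuple[int, bool]:
    """Run cmd in a new session; on timeout, terminate its whole process group.

    Returns (returncode, timed_out).
    """
    # start_new_session=True makes the child a session and group leader,
    # so its process group ID equals its PID.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout), False
    except subprocess.TimeoutExpired:
        pass

    # Ask the whole group (CLI plus npm, next build, ...) to stop.
    _signal_group(proc.pid, signal.SIGTERM)
    try:
        return proc.wait(timeout=grace), True
    except subprocess.TimeoutExpired:
        # Escalate if the group ignored SIGTERM.
        _signal_group(proc.pid, signal.SIGKILL)
        return proc.wait(), True
```

For example, run_in_group(["openclaw", "agent", "--agent", "lane129", "--message", "...", "--json", "--timeout", "2400"], timeout=2500) gives the orchestrator one place to enforce a deadline. Children that move themselves into a different session or process group would still escape this.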

Expected behavior

  • Concurrent agents add should not race on the global config file.
  • Isolated agents/workspaces should be able to run in parallel without tripping per-agent session locks for normal orchestration scenarios.
  • If a run truly fails, the CLI should retain control and ensure child work is canceled or clearly surfaced.
  • Model failover should not end up surfacing unrelated provider auth races when the primary failure is local session locking.
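
The last point can be made concrete as a failover loop that separates local failures from provider failures. A session lock timeout is a property of the local session, so retrying the same locked session against another provider cannot succeed and only adds noise. A Python sketch with hypothetical exception and function names (not OpenClaw's internals):

```python
from typing import Callable, List, Sequence


class LocalSessionError(Exception):
    """Failure unrelated to the provider, e.g. a session lock timeout."""


class ProviderError(Exception):
    """Failure specific to one provider, e.g. auth or rate limiting."""


def run_with_failover(models: Sequence[str], attempt: Callable[[str], str]) -> str:
    """Try each model in order, but stop immediately on a local error."""
    errors: List[str] = []
    for model in models:
        try:
            return attempt(model)
        except LocalSessionError:
            # Switching providers cannot fix a local lock; surface it as-is.
            raise
        except ProviderError as exc:
            errors.append(f"{model}: {exc}")
    raise ProviderError(
        f"All models failed ({len(errors)}): " + " | ".join(errors)
    )
```

With this shape, the lock timeout would be reported once, and the openai-codex refresh would never be attempted for a run that could not have started anyway.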

Actual behavior

  • concurrent config writes race
  • parallel agent runs hit lock timeouts
  • failover path cascades into OAuth refresh race noise
  • some failed runs continue detached in the background

Related issues

I found partial overlap with existing issues, especially around session locks and OAuth refresh races, for example:

  • #42160 Session store monolithic JSON with global lock causes ...
  • #32799 Session file lock not released when holding process dies ...
  • #26322 OAuth token refresh race condition causes spurious failover ...

But the scenario here is specifically the end-to-end multi-agent orchestration path: create multiple isolated agents + launch multiple coding runs + observe config races, session locks, auth noise, and detached child work.

What would help

A fix or guardrail in any of these areas would help a lot:

  • serialize or file-lock agents add config writes
  • make concurrent isolated-agent runs not compete on the wrong session locks
  • avoid falling through to unrelated providers after a local lock error
  • add a clear kill/cancel path for detached agent work when the CLI command fails
  • improve operator visibility when a run is still active in the background

If helpful, I can also provide the exact command transcript / cleanup steps I used.

Metadata

    Labels

    • P1: High-priority user-facing bug, regression, or broken workflow.
    • clawsweeper:fix-shape-clear: ClawSweeper found a clear likely implementation shape for this issue.
    • clawsweeper:linked-pr-open: ClawSweeper found an open linked pull request for this issue.
    • clawsweeper:needs-live-repro: ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
    • clawsweeper:needs-maintainer-review: ClawSweeper marked this issue as needing maintainer review before automation.
    • clawsweeper:needs-product-decision: ClawSweeper marked this issue as needing a product or behavior decision.
    • clawsweeper:no-new-fix-pr: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
    • impact:auth-provider: Auth, provider routing, model choice, or SecretRef resolution may break.
    • impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
    • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
    • issue-rating: 🐚 platinum hermit: Good issue quality with a plausible reproduction path needing some confirmation.
