Multi-agent orchestration is unstable: concurrent `agents add` config overwrites, session-lock failures, and detached child work #43367

### Summary

I tried to orchestrate a small parallel coding batch from the OpenClaw CLI on `2026.3.8` and hit a cluster of failures that make multi-agent runs unreliable in practice:

1. Concurrent `openclaw agents add` invocations appear unsafe: the config file is overwritten repeatedly and only a subset of the agents persists.
2. Concurrent `openclaw agent` runs hit session-lock timeouts even with isolated agents/workspaces.
3. After the lock failure, the model failover chain eventually reaches `openai-codex` and hits OAuth refresh races (`refresh_token_reused`).
4. Some agent runs appear to fail at the CLI layer but continue detached in the background, leaving session locks and child work (for example `next build`, `npm install`) running without a clean handle.

This makes it hard to use OpenClaw as an orchestrator for parallel coding tasks, even when each agent has its own workspace.

### Version / environment

- OpenClaw CLI: `2026.3.8 (3caab92)`
- Host: Linux
- Gateway mode: local
- Main model default: `openai-codex/gpt-5.3-codex`
- Per-agent test model: `anthropic/claude-sonnet-4-6`

### Reproduction

I created 4 isolated git worktrees for the same repo, then tried to create and run 4 isolated agents for parallel work.

#### 1. Concurrent agent creation

Commands like:

```bash
openclaw agents add lane125 --non-interactive --workspace /path/to/worktree1 --json
openclaw agents add lane128 --non-interactive --workspace /path/to/worktree2 --json
openclaw agents add lane129 --non-interactive --workspace /path/to/worktree3 --json
openclaw agents add lane130 --non-interactive --workspace /path/to/worktree4 --json
```

Observed behavior:

- repeated output like:

```text
Config overwrite: /home/user/.openclaw/openclaw.json (... -> ..., backup=...)
```

- after the parallel adds completed, `openclaw agents list --json` showed only a subset of the agents.
- creating the same agents sequentially worked much more reliably.
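
For reference, a minimal sketch of one way to launch the adds above concurrently from a single shell, assuming plain background jobs plus `wait`. The function name `add_agents_concurrently` is hypothetical; the lane names and worktree paths are the placeholders from the commands above.

```bash
# Repro sketch: start every `agents add` at once, then wait for all of them.
# Lane names and /path/to/worktreeN are placeholders, not real paths.
add_agents_concurrently() {
  i=1
  for lane in lane125 lane128 lane129 lane130; do
    # Each add runs as a background job so all four overlap in time.
    openclaw agents add "$lane" --non-interactive \
      --workspace "/path/to/worktree$i" --json &
    i=$((i + 1))
  done
  wait  # block until all four background adds have exited
}
```

Running `add_agents_concurrently` followed by `openclaw agents list --json` is the comparison point for the subset-only result described above.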

#### 2. Concurrent agent runs

After recreating the agents sequentially, I launched multiple runs in parallel, for example:

```bash
openclaw agent --agent lane125 --message "..." --json --timeout 2400
openclaw agent --agent lane128 --message "..." --json --timeout 2400
openclaw agent --agent lane129 --message "..." --json --timeout 2400
openclaw agent --agent lane130 --message "..." --json --timeout 2400
```

I also tested `--local`.

Observed failures included:

```text
Gateway agent failed; falling back to embedded: Error: Error: All models failed (3): anthropic/claude-sonnet-4-6: session file locked (timeout 10000ms): pid=57884 /home/user/.openclaw/agents/lane129/sessions/2eb87792-30e5-4284-9d30-50f4b384f884.jsonl.lock (timeout) | google-gemini-cli/gemini-3-pro-preview: session file locked (timeout 10000ms): pid=57884 /home/user/.openclaw/agents/lane129/sessions/2eb87792-30e5-4284-9d30-50f4b384f884.jsonl.lock (timeout) | openai-codex/gpt-5.3-codex: OAuth token refresh failed for openai-codex: Failed to refresh OAuth token for openai-codex. Please try again or re-authenticate. (auth)
```

and:

```text
[openai-codex] Token refresh failed: 401 {
  "error": {
    "message": "Your refresh token has already been used to generate a new access token. Please try signing in again.",
    "type": "invalid_request_error",
    "param": null,
    "code": "refresh_token_reused"
  }
}
```

The lock files were agent-specific, for example:

- `/home/user/.openclaw/agents/lane129/sessions/...jsonl.lock`
- `/home/user/.openclaw/agents/lane130/sessions/...jsonl.lock`
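
The lock error embeds the PID of the lock holder (`pid=57884` in the log above), so a quick triage step is to check whether that process is still alive. A minimal sketch follows; `lock_holder_pid` and `holder_is_alive` are hypothetical helper names, and the parsing assumes the exact message format shown in the log above.

```bash
# Extract the holder PID from a "session file locked ... pid=NNN" message.
# Prints nothing if the message does not match the assumed format.
lock_holder_pid() {
  printf '%s\n' "$1" |
    sed -n 's/.*session file locked[^:]*: pid=\([0-9][0-9]*\).*/\1/p' |
    head -n 1
}

# Succeeds if the PID exists and the current user is allowed to signal it.
# `kill -0` performs the check without delivering any signal.
holder_is_alive() {
  kill -0 "$1" 2>/dev/null
}
```

For example, `holder_is_alive "$(lock_holder_pid "$error_line")"` distinguishes a stale lock from one held by a run that is still active.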

### Unexpected detached work

In at least one case, the CLI path reported failure or became unusable, but the underlying agent had clearly kept working in the background:

- session lock file remained active
- `openclaw-agent` processes were still running
- child processes in the isolated worktrees were still running, for example:
  - `next build --turbopack`
  - `npm install`

So from the operator's perspective:

- the command appeared to have failed or become unrecoverable
- the agent was still mutating the worktree in the background
- cleanup had to be manual (`kill`, worktree cleanup, agent deletion)
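
A rough sketch of how the leftover state can be inspected after a failed run. `list_session_locks` and `list_agent_processes` are hypothetical helpers; the directory layout is inferred from the lock paths above, and `pgrep -af` assumes a Linux host with procps installed.

```bash
# List session lock files under the OpenClaw state directory.
# $1: state directory; defaults to ~/.openclaw, as seen in the lock paths above.
list_session_locks() {
  find "${1:-$HOME/.openclaw}/agents" -type f -name '*.jsonl.lock' 2>/dev/null
}

# Show agent processes and the child work observed in this report.
# -f matches against the full command line; -a prints it next to the PID.
list_agent_processes() {
  pgrep -af 'openclaw-agent|next build|npm install'
}
```

Anything reported by either helper after the CLI has returned an error is a candidate for the manual cleanup described above.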

### Expected behavior

- Concurrent `agents add` should not race on the global config file.
- Isolated agents/workspaces should be able to run in parallel without tripping per-agent session locks in normal orchestration scenarios.
- If a run truly fails, the CLI should retain control and ensure child work is canceled or clearly surfaced.
- Model failover should not end up surfacing unrelated provider auth races when the primary failure is local session locking.

### Actual behavior

- concurrent config writes race
- parallel agent runs hit lock timeouts
- failover path cascades into OAuth refresh race noise
- some failed runs continue detached in the background

### Related issues

I found partial overlap with existing issues, especially around session locks and OAuth refresh races, for example:

- `#42160` Session store monolithic JSON with global lock causes ...
- `#32799` Session file lock not released when holding process dies ...
- `#26322` OAuth token refresh race condition causes spurious failover ...

But the scenario here is specifically the end-to-end multi-agent orchestration path: create multiple isolated agents + launch multiple coding runs + observe config races, session locks, auth noise, and detached child work.

### What would help

A fix or guardrail in any of these areas would help a lot:

- serialize or file-lock `agents add` config writes
- keep concurrent isolated-agent runs from contending on the wrong session locks
- avoid falling through to unrelated providers after a local lock error
- add a clear kill/cancel path for detached agent work when the CLI command fails
- improve operator visibility when a run is still active in the background
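
Until config writes are serialized in the CLI itself, a caller-side workaround for the first bullet might be to wrap each `agents add` in an exclusive `flock`, so only one add touches the config at a time. This is a sketch that assumes util-linux `flock` is available; `add_agent_serialized` and `LOCK_FILE` are names invented here, and it is untested against OpenClaw itself.

```bash
# LOCK_FILE is a hypothetical path chosen for this sketch, not an OpenClaw file.
LOCK_FILE="${LOCK_FILE:-${TMPDIR:-/tmp}/openclaw-agents-add.lock}"

# Run one `agents add` while holding an exclusive lock for its whole duration.
# $1: agent name, $2: workspace path
add_agent_serialized() {
  (
    flock 9 || exit 1   # block until this subshell holds the lock on fd 9
    openclaw agents add "$1" --non-interactive --workspace "$2" --json
  ) 9>"$LOCK_FILE"      # the lock is released when the subshell exits
}
```

Callers can still start `add_agent_serialized lane125 /path/to/worktree1 &` and the other lanes in parallel; the lock makes the adds execute one after another.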

If helpful, I can also provide the exact command transcript / cleanup steps I used.

