JSON file reads should retry on transient race conditions #83657

@bottenbenny

Bug Description

JsonFileReadError: File changed during read occurs when paired.json is rewritten while another component is reading it. This is a transient race condition that should be handled gracefully rather than causing gateway failures.

Steps to Reproduce

  1. Have multiple gateway clients active (Discord, Control UI, CLI)
  2. Trigger a paired.json rewrite (e.g., open openclaw logs --follow)
  3. Another component reads paired.json during the temp-file + rename operation
  4. Observe error: JsonFileReadError: Failed to read JSON file: /home/simon/.openclaw/devices/paired.json
  5. Followed by: gateway closed (1000)

Expected Behavior

JSON file reads should be resilient to concurrent rewrites:

  • Retry on File changed during read (max 3 attempts, 50ms backoff)
  • Distinguish between "file changed" (retry) and "file corrupted" (fail)
  • Do not surface transient read races as gateway failures
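The retry/fail distinction above could be sketched as a small classifier. This is only a sketch under assumptions: the name classifyReadError is hypothetical, matching on the text "File changed during read" follows the error string quoted in this issue rather than any confirmed OpenClaw API, and treating SyntaxError as corruption assumes the content is parsed with JSON.parse.

```javascript
// Hypothetical helper: decide whether a JSON read error is worth retrying.
// Assumption: the transient race is identified by the message text quoted in
// this issue; a real implementation would more likely check an error code.
function classifyReadError(err) {
  const message = String(err && err.message ? err.message : err);
  if (message.includes('File changed during read')) {
    return 'retry'; // file was replaced mid-read; a fresh read should succeed
  }
  if (err instanceof SyntaxError) {
    return 'corrupt'; // JSON.parse rejected the content; retrying will not help
  }
  return 'fail'; // missing file, permissions, and anything else unexpected
}
```

Only the 'retry' outcome would feed a bounded retry loop; 'corrupt' and 'fail' would surface immediately, as they do today.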

Actual Behavior

  • Single read attempt fails immediately
  • Error propagates as gateway failure
  • Triggers websocket reconnect
  • Reconnect may trigger another paired.json rewrite → amplification loop

Environment

  • OpenClaw version: 2026.5.12
  • File write pattern: temp-file + fsync + rename (atomic)
  • Read pattern: single attempt, no retry
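For readers unfamiliar with the write pattern listed above, a minimal Node.js sketch of temp-file + fsync + rename follows. The function name writeJsonAtomic and the temp-file naming scheme are hypothetical and not taken from the OpenClaw source; only the node:fs/promises calls are real APIs.

```javascript
const fs = require('node:fs/promises');
const path = require('node:path');

// Hypothetical helper illustrating temp-file + fsync + rename.
async function writeJsonAtomic(targetPath, value) {
  // Write next to the target so the rename stays on one filesystem.
  const tmpPath = path.join(
    path.dirname(targetPath),
    `.${path.basename(targetPath)}.${process.pid}.${Date.now()}.tmp`
  );
  const handle = await fs.open(tmpPath, 'w');
  try {
    await handle.writeFile(JSON.stringify(value, null, 2));
    await handle.sync(); // flush file contents to disk before the rename
  } finally {
    await handle.close();
  }
  // rename() swaps the new file into place in one step on POSIX filesystems,
  // so a reader sees either the old content or the new content.
  await fs.rename(tmpPath, targetPath);
}
```

Because the rename installs a different file under the same path, a reader that checks whether the file changed between opening and finishing its read can plausibly observe a change even though neither version is corrupt. That would be consistent with the transient error described here, though the issue does not show the reader's source.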

Related Issues

Recommendation

Implement bounded retry logic for JSON file reads:

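// Sketch: fs.readJson stands in for the existing single-attempt JSON reader.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function readJsonFile(path, maxRetries = 3) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fs.readJson(path);
    } catch (err) {
      // Retry only the transient race; anything else fails immediately.
      const transient = String(err?.message).includes('File changed during read');
      if (!transient || attempt >= maxRetries - 1) throw err;
      await sleep(50 * 2 ** attempt); // exponential backoff: 50ms, 100ms, ...
    }
  }
}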

Bounded retry is a common pattern for readers that can race with temp-file + rename writers, and it should be applied to all JSON config reads, not just paired.json.
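To apply the same retry to every JSON config read, the loop above could be factored into a wrapper that takes the existing single-attempt reader as a function. This is a sketch, not a proposed final API: readWithRetry and its option names (maxAttempts, baseDelayMs, isTransient, wait) are all hypothetical, and the default isTransient check again relies on the message text quoted in this issue.

```javascript
// Hypothetical wrapper: `readOnce` is whatever single-attempt reader already
// exists; `isTransient` decides which errors are retried.
const defaultWait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function readWithRetry(readOnce, options = {}) {
  const {
    maxAttempts = 3,
    baseDelayMs = 50,
    isTransient = (err) =>
      String(err && err.message).includes('File changed during read'),
    wait = defaultWait, // injectable so tests do not have to sleep
  } = options;

  for (let attempt = 1; ; attempt++) {
    try {
      return await readOnce();
    } catch (err) {
      // Give up on non-transient errors or once the attempt budget is spent.
      if (attempt >= maxAttempts || !isTransient(err)) throw err;
      await wait(baseDelayMs * 2 ** (attempt - 1)); // 50ms, then 100ms
    }
  }
}
```

A call site might then look like readWithRetry(() => readJsonFile(pairedPath)), where readJsonFile is the current single-attempt reader and pairedPath is a placeholder. Keeping the retry in one wrapper means the attempt limit and backoff are defined once for every config file.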

Metadata

    Labels

    • P1 - High-priority user-facing bug, regression, or broken workflow.
    • clawsweeper:fix-shape-clear - ClawSweeper found a clear likely implementation shape for this issue.
    • clawsweeper:queueable-fix - ClawSweeper marked this issue as an existing queue_fix_pr work candidate.
    • clawsweeper:source-repro - ClawSweeper found a high-confidence source-level issue reproduction.
    • impact:crash-loop - Crash, hang, restart loop, or process-level availability failure.
    • issue-rating: 🦞 diamond lobster - Very strong issue quality with high-confidence source-level or clear reproduction.
