Environment
- OS: Windows 11 with WSL2 (gateway at 127.0.0.1:18850)
- Config: %USERPROFILE%\.openclaw
- Session storage: %USERPROFILE%\.openclaw\sessions\*.jsonl
Symptoms (Recurring)
- Stale session lock wedge: Session file stays locked by a dead PID; gateway returns 'All models failed' until *.jsonl.lock is cleaned up manually
- Cron EPERM on jobs.json: On Windows, renaming jobs.json.*.tmp → jobs.json fails with EPERM (permission denied) in background tasks
- Multiple gateway instances: No guard prevents starting multiple gateways on the same port, which leads to port conflicts
- Skills listing drift: Skills API returns inconsistent endpoint expectations vs. actual agent capabilities
Impact
- Gateway listening but agent unusable; requires manual rm *.jsonl.lock
- Cron tasks back up; scheduled jobs fail silently
- Users can accidentally spawn multiple gateways without a clear error message
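For the multi-gateway case, a pre-bind probe is one way to turn the silent conflict into a clear error. The following is a minimal sketch only, not the gateway's actual startup code: the function names are hypothetical, and the default port is simply the one from the Environment section.

```python
import socket


def gateway_already_running(host="127.0.0.1", port=18850, timeout=0.5):
    """Return True if something already accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused or timed out: nothing is listening there.
        return False


def ensure_single_instance(host="127.0.0.1", port=18850):
    """Hypothetical startup hook: exit with a clear message instead of double-binding."""
    if gateway_already_running(host, port):
        raise SystemExit(
            f"A gateway already appears to be listening on {host}:{port}; "
            "stop it or choose another port."
        )
```

A probe like this only detects a listener that is already accepting connections; two gateways started at the same instant could still race, so it complements an exclusive bind or lock file rather than replacing one.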
Local Mitigations Applied (by NEXAR team)
- Watchdog cleanup script (TTL-based stale lock removal)
- Single-instance guard in gateway.cmd (port lock file)
- Atomic write helper (safer cron rename pattern)
- Skills doc alignment (manual endpoint stub)
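As an illustration of the atomic write helper idea listed above, here is a minimal sketch. It is not the NEXAR helper itself: the function name, retry count, and backoff are assumptions. It writes to a temp file in the same directory, then swaps it in with os.replace, retrying when Windows reports a permission error because another process briefly holds the target open.

```python
import json
import os
import tempfile
import time


def atomic_write_json(path, data, retries=5, delay=0.05):
    """Write JSON to a temp file next to `path`, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(
        prefix=os.path.basename(path) + ".", suffix=".tmp", dir=directory
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(data, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk before the swap
        for attempt in range(retries):
            try:
                os.replace(tmp_path, path)  # overwrites an existing target
                return
            except PermissionError:
                # Typically a transient sharing violation on Windows.
                if attempt == retries - 1:
                    raise
                time.sleep(delay * (2 ** attempt))  # exponential backoff
    finally:
        # Only present if the swap never succeeded; avoid leaving *.tmp debris.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

Usage would be along the lines of `atomic_write_json("jobs.json", {"jobs": []})`. The retry does not remove the underlying contention; it only rides out short-lived handles such as antivirus or indexer scans.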
Requested Upstream Fixes
Priority 1 (Critical)
a) Crash-safe session lock ownership: Store PID in lock file; auto-recover stale locks (TTL-based expiry, e.g., 1 hour)
b) Built-in single-instance guard: Detect running gateway before bind; fail fast with clear message (not silent multi-spawn)
c) Windows-safe atomic write for cron/jobs.json: Atomic rename fallback (e.g., write-to-temp + in-process swap)
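To make request (a) concrete, here is a rough sketch of a PID-owned lock with stale recovery. Everything in it is an assumption for illustration: the JSON lock format, the function names, and the 1 hour default TTL taken from the example above. It is not how the project currently stores locks.

```python
import json
import os
import sys
import time


def pid_alive(pid):
    """Best-effort check that a process with this PID currently exists."""
    if pid <= 0:
        return False
    if sys.platform == "win32":
        # Do not use os.kill(pid, 0) here: on Windows it terminates the process.
        import ctypes
        from ctypes import wintypes

        k32 = ctypes.WinDLL("kernel32", use_last_error=True)
        k32.OpenProcess.restype = wintypes.HANDLE
        k32.OpenProcess.argtypes = [wintypes.DWORD, wintypes.BOOL, wintypes.DWORD]
        k32.GetExitCodeProcess.argtypes = [wintypes.HANDLE, ctypes.POINTER(wintypes.DWORD)]
        k32.CloseHandle.argtypes = [wintypes.HANDLE]
        handle = k32.OpenProcess(0x1000, False, pid)  # PROCESS_QUERY_LIMITED_INFORMATION
        if not handle:
            return ctypes.get_last_error() == 5  # ERROR_ACCESS_DENIED: exists, not ours
        try:
            code = wintypes.DWORD()
            ok = k32.GetExitCodeProcess(handle, ctypes.byref(code))
            return bool(ok) and code.value == 259  # STILL_ACTIVE
        finally:
            k32.CloseHandle(handle)
    try:
        os.kill(pid, 0)  # POSIX: signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def is_stale(lock_path, ttl_seconds):
    """A lock is stale if its owner is gone or it is older than the TTL."""
    try:
        with open(lock_path, encoding="utf-8") as f:
            info = json.load(f)
        pid, created = int(info["pid"]), float(info["created"])
    except (OSError, ValueError, KeyError, TypeError):
        # Unreadable or half-written lock: fall back to the file's age.
        try:
            return time.time() - os.path.getmtime(lock_path) > ttl_seconds
        except OSError:
            return True  # the lock vanished in the meantime
    return not pid_alive(pid) or time.time() - created > ttl_seconds


def try_acquire(lock_path, ttl_seconds=3600):
    """Return True if this process now owns the lock, False if a live owner holds it."""
    for _ in range(2):  # the second pass runs only after a stale lock was removed
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            if not is_stale(lock_path, ttl_seconds):
                return False
            try:
                os.remove(lock_path)
            except FileNotFoundError:
                pass
            continue
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump({"pid": os.getpid(), "created": time.time()}, f)
        return True
    return False
```

Two caveats with this sketch: the stale check and the removal are not one atomic step, so two recovering processes can still race, and PIDs can be reused, which is the reason to keep the TTL as a second condition alongside the liveness check.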
Priority 2 (QoL)
d) Skills API stability: Stable endpoint contract or backward-compatible aliases to reduce drift
References
Note: No secrets/credentials included. These are core reliability gaps blocking production Windows use.