Gateway: harden singleton lock startup with stale recovery telemetry #29118

Open
cfregly wants to merge 3 commits into openclaw:main from cfregly:gateway-singleton-lock-hardening
Conversation

@cfregly

@cfregly cfregly commented Feb 27, 2026

Summary

  • Problem: under contention, the gateway singleton lock gave little startup observability and weak stale-lock recovery signaling, which made startup storms and stale-lock incidents hard to triage.
  • Why it matters: parallel starts and restarts happen in real use (daemon restarts, local relaunches, CI/e2e), and lock ambiguity causes more false "already running" failures and operator confusion.
  • What changed:
    • Added structured gateway lock telemetry events for acquire start/contention/stale recovery/success/timeout.
    • Hardened stale-lock recovery and lock lifecycle cleanup paths in acquireGatewayLock.
    • Wired explicit startup telemetry logs into gateway run-loop lock acquisition/reacquisition paths.
    • Added regression tests for macOS/Linux stale recovery paths and parallel startup storms.
    • Added AGENTS.local.md to .gitignore so local guardrails are never staged.
  • What did NOT change (scope boundary):
    • No auth model or gateway network bind behavior changes.
    • No config schema/user-facing flags added.
    • No channel/business-logic routing changes.
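
For readers who want a concrete picture of the lifecycle events listed above, here is a minimal sketch of how they could be modeled and rendered as single-line logs. The type name `GatewayLockEvent`, the field names, and the `formatGatewayLockEvent` helper are assumptions made for illustration; the real definitions live in src/infra/gateway-lock.ts and may differ.

```typescript
// Hypothetical event shape; the PR text does not show the real type definitions.
type GatewayLockEvent =
  | { kind: "acquire-start"; lockPath: string }
  | { kind: "contention"; lockPath: string; ownerPid?: number }
  | { kind: "stale-recovered"; lockPath: string; reason: "owner-dead" | "stale-timeout" }
  | { kind: "acquired"; lockPath: string; waitedMs: number }
  | { kind: "timeout"; lockPath: string; waitedMs: number };

// Render one event as a single concise log line.
function formatGatewayLockEvent(event: GatewayLockEvent): string {
  const prefix = `gateway-lock ${event.kind} path=${event.lockPath}`;
  switch (event.kind) {
    case "acquire-start":
      return prefix;
    case "contention":
      // The owner PID is optional because a lock payload may be malformed.
      return `${prefix} owner=${event.ownerPid ?? "unknown"}`;
    case "stale-recovered":
      return `${prefix} reason=${event.reason}`;
    case "acquired":
    case "timeout":
      return `${prefix} waitedMs=${event.waitedMs}`;
  }
}
```

A discriminated union keeps each event's fields explicit, so a formatter like this can be checked for exhaustiveness by the compiler.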

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Related #

User-visible / Behavior Changes

  • Gateway startup now logs explicit singleton-lock lifecycle telemetry (acquire/wait/recovered/acquired/timeout).
  • Under stale-lock conditions, lock recovery behavior is more deterministic and diagnosable.

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No) (existing localhost port probe behavior only)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS (authoring), Linux lock paths covered in tests.
  • Runtime/container: Node 22 + pnpm.
  • Model/provider: N/A.
  • Integration/channel (if any): Gateway core lock path.
  • Relevant config (redacted): default local config path hashing + lock dir behavior.

Steps

  1. pnpm format:check -- .gitignore src/infra/gateway-lock.ts src/infra/gateway-lock.test.ts src/cli/gateway-cli/run-loop.ts src/cli/gateway-cli/run-loop.test.ts
  2. pnpm test -- src/infra/gateway-lock.test.ts src/cli/gateway-cli/run-loop.test.ts
  3. pnpm build

Expected

  • New telemetry tests pass.
  • Parallel startup storm regression passes.
  • Gateway lock and run-loop changes build, and their tests are green.

Actual

  • Format check passed for touched files.
  • Targeted tests passed (21/21).
  • Build passed.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Evidence included via new/updated tests:

  • src/infra/gateway-lock.test.ts
  • src/cli/gateway-cli/run-loop.test.ts

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
    • lock telemetry flow (start -> contention -> stale recovery -> success)
    • lock reacquire path during restart in run-loop
    • stale lock handling on darwin-style and linux-style paths
    • parallel startup storm serialization and lock cleanup
  • Edge cases checked:
    • malformed/unknown owner + stale timeout reclaim
    • owner dead reclaim path
    • startup timeout telemetry emission path
  • What you did not verify:
    • full end-to-end multi-process gateway startup under external daemon orchestration beyond unit-level coverage.
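
The serialization and cleanup scenarios above can be pictured with a simplified, single-process stand-in. This sketch assumes the lock is taken by exclusive file creation (the `"wx"` flag), which this PR text does not confirm. `openExclusive`, `tryAcquire`, and `runStorm` are hypothetical names, and the contenders here take turns rather than running truly in parallel.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Open the lock file with exclusive create; null means someone else holds it.
function openExclusive(lockPath: string): number | null {
  try {
    return fs.openSync(lockPath, "wx"); // "wx" fails with EEXIST if the file exists
  } catch (err) {
    if ((err as { code?: string }).code === "EEXIST") return null;
    throw err;
  }
}

// Try once to take the lock; returns a release function, or null when contended.
function tryAcquire(lockPath: string, pid: number): (() => void) | null {
  const fd = openExclusive(lockPath);
  if (fd === null) return null;
  try {
    fs.writeSync(fd, JSON.stringify({ pid, createdAt: Date.now() }));
  } catch (err) {
    // Write failed: close the handle and remove the partial lock before rethrowing.
    fs.closeSync(fd);
    fs.rmSync(lockPath, { force: true });
    throw err;
  }
  fs.closeSync(fd);
  return () => fs.rmSync(lockPath, { force: true });
}

// Simulated storm: each contender takes the lock in turn, and a rival that
// arrives while the lock is held must be turned away.
function runStorm(contenders: number): { acquired: number; rejected: number } {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "gateway-lock-"));
  const lockPath = path.join(dir, "gateway.lock");
  let acquired = 0;
  let rejected = 0;
  for (let i = 0; i < contenders; i++) {
    const release = tryAcquire(lockPath, process.pid);
    if (release === null) {
      rejected++;
      continue;
    }
    acquired++;
    if (tryAcquire(lockPath, process.pid) === null) rejected++;
    release();
  }
  fs.rmSync(dir, { recursive: true, force: true });
  return { acquired, rejected };
}
```

Calling `runStorm(20)` mirrors the contender count mentioned in the review. A real regression test would race separate asynchronous contenders or processes instead.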

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly:
    • revert commits 3c4533239 and 4525a5a23
  • Files/config to restore:
    • .gitignore
    • src/infra/gateway-lock.ts
    • src/infra/gateway-lock.test.ts
    • src/cli/gateway-cli/run-loop.ts
    • src/cli/gateway-cli/run-loop.test.ts
  • Known bad symptoms reviewers should watch for:
    • unexpected lock timeout spam during normal single-instance startup
    • lock telemetry logs appearing without corresponding contention

Risks and Mitigations

  • Risk:
    • False stale-lock reclaim in edge environments with unusual FS/stat behavior.
    • Mitigation:
      • Conservative stat fallback; reclaim only on explicit stale/dead signals; added regression coverage.
  • Risk:
    • Increased startup log volume from telemetry.
    • Mitigation:
      • Concise single-line events emitted only on lock lifecycle transitions.
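
To make the "reclaim only on explicit stale/dead signals" mitigation concrete, here is a hedged sketch of one conservative decision rule. The payload shape, `probeOwner`, and `shouldReclaim` are assumptions for illustration, not the helpers in this PR. The one fact relied on is that Node's `process.kill(pid, 0)` tests for a process's existence without sending a signal.

```typescript
// Hypothetical payload shape; the real lock file format is not shown in this PR.
interface LockPayload {
  pid?: number;
  createdAt?: number;
}

type OwnerStatus = "alive" | "dead" | "unknown";

// Probe a PID with signal 0: nothing is delivered, only existence is checked.
function probeOwner(pid: number | undefined): OwnerStatus {
  // Reject missing, non-integer, or non-positive PIDs (<= 0 would address process groups).
  if (typeof pid !== "number" || !Number.isInteger(pid) || pid <= 0) return "unknown";
  try {
    process.kill(pid, 0);
    return "alive";
  } catch (err) {
    const code = (err as { code?: string }).code;
    if (code === "ESRCH") return "dead"; // no such process
    if (code === "EPERM") return "alive"; // exists, but we may not signal it
    return "unknown";
  }
}

// Reclaim only on an explicit signal: a dead owner, or an unknown owner whose
// lock is older than the stale window. A live owner is never reclaimed here.
function shouldReclaim(
  payload: LockPayload | null,
  mtimeMs: number,
  nowMs: number,
  staleMs: number,
): boolean {
  const status = probeOwner(payload?.pid);
  if (status === "dead") return true;
  if (status === "alive") return false;
  // Fall back to the file mtime when the payload carries no usable timestamp.
  const reference = payload?.createdAt ?? mtimeMs;
  return nowMs - reference > staleMs;
}
```

Treating a live owner as authoritative regardless of lock age is a design choice of this sketch. The actual code may weigh payload age and mtime differently.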

AI-assisted: Yes (Codex). Author reviewed resulting code and behavior.
Testing level: Fully tested for touched units + build.

@openclaw-barnacle openclaw-barnacle bot added the cli (CLI command changes) and size: M labels Feb 27, 2026
@greptile-apps
Contributor

greptile-apps bot commented Feb 27, 2026

Greptile Summary

Enhanced gateway singleton lock startup behavior with comprehensive telemetry and hardened stale-lock recovery.

Key improvements:

  • Added structured telemetry events covering full lock lifecycle (acquire-start, contention, stale recovery, success, timeout)
  • Hardened error handling: lock write failures now properly clean up file handles and attempt lock removal
  • Refactored stale detection logic into well-named helper functions (isLockPayloadStale, isLockMtimeStale, tryRemoveLockFile)
  • Integrated telemetry into run-loop with formatted log messages for better startup observability
  • Added comprehensive test coverage including parallel startup storm test with 20 concurrent contenders
  • Added AGENTS.local.md to .gitignore for local agent guardrails

Code quality:

  • Defensive programming with try-catch around telemetry callbacks (best-effort only)
  • Proper type guards for optional PIDs throughout
  • contentionReported flag prevents telemetry spam during repeated lock attempts
  • Lock cleanup happens in all error paths
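
The best-effort callback and the `contentionReported` flag described above could look roughly like the following sketch. Only the flag name comes from the review. `emitSafely`, `acquireWithTelemetry`, and the event shape are hypothetical, and the polling loop is simplified to a fixed attempt count with no delay.

```typescript
type TelemetryEvent = { kind: "contention" | "acquired"; attempt: number };
type TelemetrySink = (event: TelemetryEvent) => void;

// Telemetry is best-effort: a throwing sink must never break lock acquisition.
function emitSafely(sink: TelemetrySink | undefined, event: TelemetryEvent): void {
  if (!sink) return;
  try {
    sink(event);
  } catch {
    // Intentionally ignored; observability must not affect startup.
  }
}

// Retry a lock attempt, reporting contention at most once per acquisition.
function acquireWithTelemetry(
  tryLock: () => boolean,
  maxAttempts: number,
  sink?: TelemetrySink,
): boolean {
  let contentionReported = false;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (tryLock()) {
      emitSafely(sink, { kind: "acquired", attempt });
      return true;
    }
    if (!contentionReported) {
      contentionReported = true; // suppress repeats on later failed attempts
      emitSafely(sink, { kind: "contention", attempt });
    }
  }
  return false;
}
```

With a lock that frees up on the fourth attempt, the sink sees one contention event and one acquired event, not three contention events.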

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk - it improves observability and error handling without changing core lock behavior
  • Score reflects thorough testing (21 tests including parallel startup storm), backward-compatible changes, defensive error handling with proper cleanup, and no security concerns. The refactoring maintains existing lock semantics while adding valuable telemetry.
  • No files require special attention

Last reviewed commit: 4525a5a

@cfregly
Author

cfregly commented Feb 27, 2026

CI note: current check failure on this gateway PR comes from upstream baseline (src/agents/pi-embedded-runner-extraparams.test.ts TS2352), not from the gateway lock changes.

Opened a tiny standalone fix PR here: #29131

Plan: keep this gateway PR focused, merge #29131 first, then rerun this PR.

@cfregly
Author

cfregly commented Feb 28, 2026

Update: the upstream typing unblocker has been replaced with a cleaner, rebased PR: #29244.

I closed #29131 to avoid ambiguity.

@cfregly cfregly force-pushed the gateway-singleton-lock-hardening branch from 4525a5a to 8d28087 February 28, 2026 02:48

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8d28087940

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@cfregly
Author

cfregly commented Feb 28, 2026

Rebased this PR onto current main (afa7ac1f6) and pushed updated head (8d2808794).

Local re-validation on rebased branch:

  • pnpm check
  • pnpm test -- src/infra/gateway-lock.test.ts src/cli/gateway-cli/run-loop.test.ts
  • pnpm build

Fresh CI run is now in progress.

@openclaw-barnacle

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@openclaw-barnacle openclaw-barnacle bot added the stale (Marked as stale due to inactivity) label Mar 5, 2026

Labels

  • cli (CLI command changes)
  • size: L
  • stale (Marked as stale due to inactivity)

1 participant