fix: resolve gateway infinite restart loop (zombie PID + lock race) #23416
Merged
steipete merged 5 commits into openclaw:main on Feb 22, 2026
kill(pid, 0) succeeds for zombie processes, causing the gateway lock to treat a zombie lock owner as alive. Read /proc/<pid>/status on Linux to check for 'Z' (zombie) state before reporting the process as alive. This prevents the lock from being held indefinitely by a zombie process during gateway restart. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
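The check described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from src/shared/pid-alive.ts: the names `parseProcState` and `isPidAliveSketch` are hypothetical, and treating an `EPERM` error as "process exists" is an assumption of this sketch.

```typescript
import { readFileSync } from "node:fs";

// Hypothetical helper: pull the one-letter state out of /proc/<pid>/status text.
// The "State:" line looks like "State:\tZ (zombie)" or "State:\tR (running)".
export function parseProcState(statusText: string): string | null {
  const match = /^State:\s+(\S)/m.exec(statusText);
  return match?.[1] ?? null;
}

// Hypothetical zombie-aware liveness check (assumption: not the PR's real code).
export function isPidAliveSketch(pid: number): boolean {
  if (!Number.isInteger(pid) || pid <= 0) return false;
  try {
    // Signal 0 sends nothing; it only probes existence and permissions.
    process.kill(pid, 0);
  } catch (err) {
    // EPERM means the process exists but belongs to someone else.
    if ((err as NodeJS.ErrnoException).code !== "EPERM") return false;
  }
  if (process.platform === "linux") {
    try {
      const state = parseProcState(readFileSync(`/proc/${pid}/status`, "utf8"));
      if (state === "Z") return false; // zombie: exited, waiting to be reaped
    } catch {
      // /proc entry unreadable or already gone: keep the kill() result.
    }
  }
  return true;
}
```

On non-Linux platforms this sketch falls back to the plain `kill(pid, 0)` result, since it only knows how to read Linux's /proc.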
process.exit() called from inside an async IIFE bypasses the outer try/finally block that releases the gateway lock. This leaves a stale lock file pointing to a zombie PID, preventing the spawned child or systemctl restart from acquiring the lock. Release the lock explicitly before calling exit in both the restart-spawned and stop code paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
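A small sketch of the "release, then exit" pattern this commit describes. The `GatewayLockLike` interface, the `releaseThenExit` name, and the injectable `exit` parameter are all assumptions made for illustration; the actual run-loop code is not reproduced on this page.

```typescript
// Hypothetical lock shape; the real gateway lock type is not shown here.
interface GatewayLockLike {
  release(): Promise<void>;
}

type ExitFn = (code: number) => void;

// Release the lock explicitly and only then exit. An outer try/finally is not
// enough, because process.exit() terminates the process without unwinding the
// stack, so pending finally blocks never run.
export async function releaseThenExit(
  lock: GatewayLockLike | null,
  code: number,
  exit: ExitFn = (c) => process.exit(c),
): Promise<void> {
  try {
    await lock?.release();
  } catch (err) {
    // Assumption of this sketch: a failed release should not block shutdown.
    console.error("lock release failed:", err);
  }
  exit(code);
}
```

Passing the exit function as a parameter is only a testing convenience of the sketch; it lets the ordering be observed without terminating the process.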
Move lock.release() before restartGatewayProcessWithFreshPid() so the spawned child can immediately acquire the lock without racing against a zombie parent. This eliminates the root cause of the restart loop where the child times out waiting for a lock held by its now-dead parent. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
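The ordering change can be illustrated with a short sketch. Here `spawnChild` is a hypothetical stand-in for restartGatewayProcessWithFreshPid(), whose real signature is not shown in this PR text, and `GatewayLockLike` is likewise an assumed shape.

```typescript
// Hypothetical lock shape, assumed for illustration only.
interface GatewayLockLike {
  release(): Promise<void>;
}

// Hypothetical restart step: hand the lock back before the child exists,
// so the child never has to wait on (or race against) its parent.
export async function restartWithFreshPidSketch(
  lock: GatewayLockLike | null,
  spawnChild: () => void,
): Promise<void> {
  // 1. Release first: the lock file no longer points at this process.
  await lock?.release();
  // 2. Only now start the replacement process, which can lock immediately.
  spawnChild();
}
```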
…cate gateway start The bundler exports shared symbols from dist/entry.js, so other chunks import it as a dependency. When dist/index.js is the actual entry point (e.g. systemd service), lazy module loading eventually imports entry.js, triggering its unguarded top-level code which calls runCli(process.argv) a second time. This starts a duplicate gateway that fails on lock/port contention and crashes the process with exit(1), causing a restart loop. Wrap all top-level executable code in an isMainModule() check so it only runs when entry.ts is the actual main module, not when imported as a shared dependency by the bundler.
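One common way to write such a guard is to compare the module's own URL with the script Node was started with. The sketch below assumes that approach; the name `isMainModuleSketch` and its parameters are hypothetical, and the PR's real isMainModule() may be implemented differently.

```typescript
import { resolve } from "node:path";
import { fileURLToPath } from "node:url";

// Hypothetical guard: true only when `moduleUrl` is the file Node was launched
// with (process.argv[1] by default). Symlinked entry points are not resolved
// here, which a production check would likely need to handle.
export function isMainModuleSketch(
  moduleUrl: string,
  argv1: string | undefined = process.argv[1],
): boolean {
  if (!argv1) return false;
  return resolve(fileURLToPath(moduleUrl)) === resolve(argv1);
}

// In an ES module entry file the guard would wrap the top-level call, e.g.:
//   if (isMainModuleSketch(import.meta.url)) {
//     void runCli(process.argv);
//   }
// When another chunk merely imports the file, the condition is false and the
// CLI is not started a second time.
```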
aad980c to 5573517
hughmadden pushed a commit to turquoisebaydev/openclaw that referenced this pull request on Feb 23, 2026
gabrielkoo pushed a commit to gabrielkoo/openclaw that referenced this pull request on Feb 23, 2026
mreedr pushed a commit to mreedr/openclaw-custom that referenced this pull request on Feb 24, 2026
mylukin pushed a commit to mylukin/openclaw that referenced this pull request on Feb 26, 2026
zooqueen pushed a commit to hanzoai/bot that referenced this pull request on Mar 6, 2026
Problem
When a gateway restart is triggered via SIGUSR1, the process spawns a detached child and calls process.exit(0). This bypasses the outer finally block, leaving a stale lock file on disk that points to the now-exiting parent PID. The spawned child tries to acquire the lock and calls isPidAlive(), which returns true for zombie processes (because kill(pid, 0) succeeds on zombies), so the child times out after 5 seconds. Meanwhile, systemd's Restart=always starts a competing process, creating a self-sustaining loop.

Fixes #21685
Fixes #7999

Root Causes & Fixes
Four independent, layered fixes: each breaks the loop on its own, and together they provide defense in depth.

1. Zombie process detection in isPidAlive (src/shared/pid-alive.ts)
kill(pid, 0) succeeds for zombie processes, causing the lock to treat a zombie lock owner as alive. Added a /proc/<pid>/status check on Linux to detect the Z (zombie) state and return false.

2. Release lock before process.exit() (src/cli/gateway-cli/run-loop.ts)
process.exit() inside an async IIFE bypasses the outer try/finally that releases the gateway lock. Added an explicit await lock?.release() before calling exit(0) in both the restart-spawned and stop code paths.

3. Release lock before spawning the child (src/cli/gateway-cli/run-loop.ts)
Moved lock.release() to before restartGatewayProcessWithFreshPid() so the child can immediately acquire the lock without any race window against the parent.

4. Guard entry.ts top-level code with isMainModule (src/entry.ts)
The bundler exports shared symbols from dist/entry.js, causing lazy imports to re-execute its unguarded top-level runCli(process.argv) call. That call starts a duplicate gateway, which fails on lock/port contention and exits with code 1, triggering another restart cycle.

Impact
- isPidAlive: all lock consumers
- run-loop.ts: restart + stop paths
- run-loop.ts: restart path only
- entry.ts: prevents duplicate gateway on import

Tests
- src/shared/pid-alive.test.ts: covers zombie detection, invalid PIDs, running processes
- src/cli/gateway-cli/run-loop.test.ts: lock release assertions for restart and stop paths

Greptile Summary
This PR fixes a critical infinite restart loop caused by three interconnected race conditions during gateway restart. The fix implements defense in depth with four independent changes:
- pid-alive.ts: Added a Linux /proc/<pid>/status check to detect zombie processes, preventing isPidAlive from incorrectly returning true for zombies.
- run-loop.ts: Explicitly releases the gateway lock before process.exit(0) in both restart and stop paths, preventing the lock file from persisting after process exit.
- run-loop.ts: Releases the lock before spawning the restart child process, eliminating the race window where the child waits for the parent's zombie to be reaped.
- entry.ts: Wraps top-level code with an isMainModule check to prevent duplicate gateway startup when entry.js is imported as a shared dependency.

All fixes are well-tested with comprehensive unit tests covering zombie detection, lock release ordering, and edge cases. The implementation is clean, follows the codebase patterns, and each fix independently breaks the restart loop while together providing robust defense.

Confidence Score: 5/5
The isMainModule guard prevents duplicate gateway startup without affecting normal execution. Comprehensive test coverage validates all critical paths including zombie detection, lock release ordering, and edge cases.

Last reviewed commit: aad980c