test: fix crash-loop gateway PID detection#4045

Merged
cv merged 2 commits into NVIDIA:main from jyaunches:fix/2478-gateway-pid-diagnostics
May 22, 2026
Conversation

@jyaunches

@jyaunches jyaunches commented May 22, 2026

Contributor

Summary

Why

Run 26262708948 failed with "Gateway never came up after onboard", but diagnostics showed the gateway was actually healthy: pgrep found `311 openclaw`, gateway.log contained `http server listening` and `gateway ready`, and nemoclaw status reported sandbox Phase: Ready with healthy inference. The E2E matcher only looked for `openclaw gateway` / `openclaw-gateway`, so the current OpenClaw process title (plain `openclaw`) produced a false negative.
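
The false negative can be reproduced with a small stand-in for the old matcher. This is only an illustrative sketch: the function name `old_gateway_match`, the `<pid> <command line>` input format, and the exact regex are assumptions, not the pattern used in the test script.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the old matcher (assumed pattern, not the real one).
# Reads "<pid> <command line>" lines on stdin and prints the PIDs whose
# command line contains "openclaw gateway" or "openclaw-gateway".
old_gateway_match() {
  awk '{
    pid = $1
    $1 = ""                      # drop the PID so only the command line remains
    if ($0 ~ /openclaw[ -]gateway/) print pid
  }'
}

# Explicit gateway invocation: matched.
printf '%s\n' '42 openclaw gateway run' | old_gateway_match   # prints 42

# Process title as seen in the failing run: not matched, so the PID comes back empty.
printf '%s\n' '311 openclaw' | old_gateway_match              # prints nothing
```

A healthy gateway whose title is plain `openclaw` is invisible to a matcher of this shape, which is consistent with the failure message above.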

Validation

  • bash -n test/e2e/test-issue-2478-crash-loop-recovery.sh
  • git diff --check
  • synthetic parser checks for explicit openclaw gateway run, retitled openclaw-gateway, plain openclaw with ready gateway log, and plain openclaw without ready log

Summary by CodeRabbit

  • Tests
    • Improved gateway process identification during crash-loop recovery to more reliably pick the correct gateway instance, using explicit label/argument patterns and a readiness-based fallback.
    • Updated sandbox diagnostic capture to produce a clearer sandbox state snapshot for troubleshooting.

@coderabbitai

coderabbitai Bot commented May 22, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4a25148e-74b5-48d5-951f-161511f37c31

📥 Commits

Reviewing files that changed from the base of the PR and between 5b5e769 and 8a7e1c9.

📒 Files selected for processing (1)
  • test/e2e/test-issue-2478-crash-loop-recovery.sh

📝 Walkthrough

This PR hardens the E2E crash-loop recovery test by replacing a single pgrep-based gateway PID lookup with an in-sandbox ps/awk selection that prefers explicit gateway argv/comm matches and conditionally falls back to an older openclaw process, and changes sandbox diagnostics to use openshell sandbox get.

Changes

Test Helper Improvements

| Layer / File(s) | Summary |
|---|---|
| Gateway PID detection resilience (`test/e2e/test-issue-2478-crash-loop-recovery.sh`) | `gateway_pid()` now uses in-sandbox ps/awk to match gateway argv/comm variants (`openclaw-gateway`, `openclaw ... gateway`), selects the oldest matching PID, and falls back to the oldest `openclaw` PID only when `/tmp/gateway.log` shows gateway-ready/HTTP-listening markers. |
| Sandbox diagnostics command (`test/e2e/test-issue-2478-crash-loop-recovery.sh`) | `gateway_diagnostics()` prints an `openshell sandbox get <name>` snapshot instead of `openshell sandbox info --name <name>`. |
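
The selection rule summarized above can be sketched as a pure function over process-listing lines. This is a minimal sketch under stated assumptions, not the PR's code: the names `pick_gateway_pid` and `log_shows_ready`, the `<pid> <elapsed-seconds> <comm> <args...>` input format, the use of elapsed time to define "oldest", and both regexes are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of "prefer explicit gateway matches, else fall back to
# the oldest plain openclaw process only if the log shows readiness".

# Succeeds if the given log file contains a readiness marker (assumed regex).
log_shows_ready() {
  grep -Eq 'http server listening|gateway\]? ready' "$1" 2>/dev/null
}

# stdin: "<pid> <elapsed-seconds> <comm> <args...>" per line.
# $1:    "1" if the gateway log shows readiness, anything else otherwise.
pick_gateway_pid() {
  awk -v log_ready="$1" '
    {
      pid = $1
      age = $2 + 0
      comm = $3
      # Rebuild the argv text from field 4 onward.
      args = ""
      for (i = 4; i <= NF; i++) args = args (i > 4 ? " " : "") $i

      # Linux truncates comm to 15 characters, so match on the prefix.
      explicit = (comm ~ "^openclaw-gatewa") ||
                 (args ~ "(^|[/ ])openclaw-gateway( |$)") ||
                 (args ~ "(^|[/ ])openclaw( [^ ]+)* gateway( |$)")
      plain = (comm == "openclaw") || (args ~ "(^|[/ ])openclaw( |$)")

      if (explicit) {
        # Keep the longest-running explicit match.
        if (best_explicit == "" || age > explicit_age) {
          best_explicit = pid; explicit_age = age
        }
      } else if (plain) {
        # Keep the longest-running plain openclaw process as a fallback.
        if (best_plain == "" || age > plain_age) {
          best_plain = pid; plain_age = age
        }
      }
    }
    END {
      if (best_explicit != "") print best_explicit
      else if (log_ready == "1" && best_plain != "") print best_plain
    }
  '
}

# Plain "openclaw" is accepted only when the log indicates readiness.
printf '%s\n' '311 120 openclaw openclaw' | pick_gateway_pid 1   # prints 311
printf '%s\n' '311 120 openclaw openclaw' | pick_gateway_pid 0   # prints nothing
```

Against a live system the input could come from something like `ps -eo pid=,etimes=,comm=,args=` (procps-ng), with the flag derived from `log_shows_ready /tmp/gateway.log`; how the PR actually invokes ps inside the sandbox is not shown in this thread.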

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

Integration: OpenClaw, E2E, OpenShell, fix

Suggested reviewers

  • cv

Poem

🐇 In sandbox shadows I do peep,
I sniff the ps, the awk, the heap,
I find the gateway, old or new,
Fallback steady, steady true.
Sandbox snapshot, squeak of joy—tests hue!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'test: fix crash-loop gateway PID detection' directly addresses the main change: improving gateway PID detection in the crash-loop recovery test by handling current OpenClaw process labels and adding fallback logic. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@jyaunches jyaunches requested a review from cv May 22, 2026 02:35
@cv cv merged commit a907d28 into NVIDIA:main May 22, 2026
28 checks passed
ericksoa added a commit that referenced this pull request May 22, 2026
## Summary
- Revert the crash-loop E2E PID-detector change from #4045.
- #4045 adapted `test-issue-2478-crash-loop-recovery.sh` for the
OpenClaw 2026.5.x process-title shape seen after #3820.
- #4051 reverted #3820 and the latest sandbox is back on OpenClaw
2026.4.24, where the pre-#3820 detector is the known-good path.

## Why
- The known-good pre-#3820 run at
`80ee341686d695147c5cd118d1049c32f52d5af9` passed
`issue-2478-crash-loop-recovery-e2e`.
- The current failing run on reverted main showed `openclaw-gateway`
alive, `[gateway] ready`, sandbox `Ready`, and `Agent: OpenClaw
v2026.4.24`, but the #4045 detector still returned empty.

## Validation
- `bash -n test/e2e/test-issue-2478-crash-loop-recovery.sh`
- `git diff --check`
- `git diff --stat 80ee341 --
test/e2e/test-issue-2478-crash-loop-recovery.sh` is empty; the test file
now matches the known-good pre-#3820 version.


## Summary by CodeRabbit

* **Tests**
* Enhanced crash-loop recovery testing with simplified process detection
and improved diagnostic capabilities for system troubleshooting.

@wscurran wscurran added the chore Build, CI, dependency, or tooling maintenance label Jun 8, 2026
Labels

chore Build, CI, dependency, or tooling maintenance

3 participants