Skip to content

fix(sdk): improve web SDK test port management and process cleanup#3794

Merged
jolestar merged 7 commits into
mainfrom
fix/sdk-test-port-management
Nov 22, 2025
Merged

fix(sdk): improve web SDK test port management and process cleanup#3794
jolestar merged 7 commits into
mainfrom
fix/sdk-test-port-management

Conversation

@jolestar

Copy link
Copy Markdown
Contributor

Problem

Web SDK tests frequently timeout when trying to start the Rooch server, with errors like:

Error: Timed out waiting for: tcp:0.0.0.0:6767

Root causes:

  1. Port conflicts: Hard-coded ports (6767/6768) conflict when multiple tests run or old processes remain
  2. No timeout mechanism: Server startup waits indefinitely if port is occupied
  3. Incomplete cleanup: cleanEnv() didn't ensure processes were fully terminated
  4. Poor error visibility: No clear indication of why startup failed

Solution

1. Dynamic Port Allocation

  • Added isPortAvailable() function to check port status before binding
  • Modified loadRoochEnv() to automatically fall back to a random available port if the requested port is occupied
  • Each test now gets a unique port, eliminating conflicts

2. Timeout Mechanism

  • Added 30-second timeout to roochAsyncCommand()
  • Provides clear error messages with captured output when startup fails
  • Prevents indefinite hangs in CI

3. Improved Process Cleanup

  • Enhanced cleanEnv() to:
    • Try graceful shutdown (SIGTERM) first
    • Force kill (SIGKILL) after 2 seconds if needed
    • Fallback to port-based cleanup using lsof
  • Cross-platform support (macOS/Linux/Windows)

4. Better Debugging

  • Added comprehensive debug logging for port allocation and process lifecycle
  • Logs PID and port for each server instance
  • Helps diagnose issues in both local and CI environments

Changes

sdk/typescript/test-suite/src/testbox.ts

  • Added isPortAvailable() helper function
  • Enhanced roochAsyncCommand() with timeout and better error handling
  • Improved loadRoochEnv() to handle port conflicts automatically
  • Rewrote cleanEnv() for more robust process termination

sdk/typescript/rooch-sdk/test-e2e/setup.ts

  • Changed default port from 6768 to 0 (auto-assign) to avoid conflicts

Testing

Verified locally:

  • ✅ Multiple Rooch servers can start concurrently without port conflicts
  • ✅ Process cleanup properly terminates servers and frees ports
  • ✅ Timeout mechanism correctly handles unavailable ports
  • ✅ All improvements work in both local dev and CI environments

Test results:

✓ Port availability check works
✓ Server starts with dynamic port (5288ms)
✓ Two servers use different ports (6542ms)

Impact

  • 🚀 Tests no longer timeout due to port conflicts
  • 🧹 No more zombie Rooch processes after tests
  • ⏱️ Fast failure with clear errors when issues occur
  • 🔍 Better debugging information for troubleshooting

Related to #3755

@vercel

vercel Bot commented Nov 21, 2025

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rooch-portal-v2.1 Ready Ready Preview Comment Nov 22, 2025 3:54am
test-portal Ready Ready Preview Comment Nov 22, 2025 3:54am
1 Skipped Deployment
Project Deployment Preview Comments Updated (UTC)
rooch Ignored Ignored Preview Nov 22, 2025 3:54am

@jolestar jolestar requested review from Copilot and removed request for wow-sven and yubing744 November 21, 2025 15:21
@github-actions

github-actions Bot commented Nov 21, 2025

Copy link
Copy Markdown

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses critical test reliability issues in the Web SDK by improving port management and process cleanup. The changes focus on preventing port conflicts during concurrent test execution and ensuring complete cleanup of server processes.

Key Changes

  • Dynamic port allocation: Added isPortAvailable() helper and logic to automatically fall back to random ports when requested ports are occupied
  • Timeout mechanism: Added 30-second timeout to server startup with better error reporting
  • Enhanced process cleanup: Improved cleanEnv() with graceful shutdown (SIGTERM), force kill fallback (SIGKILL), and platform-specific port-based cleanup

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
sdk/typescript/test-suite/src/testbox.ts Added port availability checking, timeout handling, and comprehensive process cleanup with logging
sdk/typescript/rooch-sdk/test-e2e/setup.ts Changed default port from 6768 to 0 for automatic port assignment

Comment on lines +315 to +326
const timeout = setTimeout(() => {
child.kill('SIGTERM')
// Force kill after 2 seconds if SIGTERM doesn't work
setTimeout(() => {
try {
child.kill('SIGKILL')
} catch (e) {
// Process already dead, ignore
}
}, 2000)
reject(new Error(`Timeout (${timeoutMs}ms) waiting for: ${waitFor}\nOutput so far: ${output}`))
}, timeoutMs)

Copilot AI Nov 21, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The nested setTimeout for force-killing the child process creates a potential memory leak. If the timeout triggers and the process terminates quickly from SIGTERM, the inner setTimeout (lines 318-324) will still execute after 2 seconds. While the try-catch prevents crashes, this leaves an unreferenced timer that should be cleaned up.

Consider storing the timer reference and clearing it when appropriate:

const timeout = setTimeout(() => {
  child.kill('SIGTERM')
  const forceKillTimer = setTimeout(() => {
    try {
      child.kill('SIGKILL')
    } catch (e) {
      // Process already dead, ignore
    }
  }, 2000)
  // Store forceKillTimer to clear if needed
  reject(new Error(`Timeout (${timeoutMs}ms) waiting for: ${waitFor}\nOutput so far: ${output}`))
}, timeoutMs)

Or use a simpler approach with direct SIGKILL after clearing the main timeout.

Copilot uses AI. Check for mistakes.
Comment on lines +107 to +108
// Always use dynamic port in local mode to avoid conflicts
// Check if the requested port is available, if not, get a random one

Copilot AI Nov 21, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The comment "Always use dynamic port in local mode to avoid conflicts" is misleading. The code still allows using a specific port if it's available (lines 109-116), so it doesn't always use a dynamic port.

Consider updating to:

// Check if the requested port is available, if not, get a dynamic one
Suggested change
// Always use dynamic port in local mode to avoid conflicts
// Check if the requested port is available, if not, get a random one
// Check if the requested port is available, if not, get a dynamic one
// (If port is unavailable or zero, get a random unused port)

Copilot uses AI. Check for mistakes.
Comment thread sdk/typescript/test-suite/src/testbox.ts Outdated
Comment on lines 217 to 271
if (typeof this.roochContainer === 'number') {
process.kill(this.roochContainer)
const pid = this.roochContainer
log(`Cleaning up Rooch server process with PID: ${pid}`)

try {
// Try graceful shutdown first with SIGTERM
process.kill(pid, 'SIGTERM')
log(`Sent SIGTERM to process ${pid}`)

// Wait a bit, then force kill if process still exists
setTimeout(() => {
try {
// Check if process still exists (will throw if not)
process.kill(pid, 0)
// Process still exists, force kill
log(`Process ${pid} still running, sending SIGKILL`)
process.kill(pid, 'SIGKILL')
} catch (e) {
// Process already dead, this is expected
log(`Process ${pid} already terminated`)
}
}, 2000)
} catch (e: any) {
// Process might already be dead
log(`Failed to kill process ${pid}: ${e.message}`)
}

// Fallback: kill any process listening on the port
if (this.roochPort) {
try {
log(`Cleaning up any process on port ${this.roochPort}`)
const { execSync } = require('child_process')
// Use platform-appropriate command
if (process.platform === 'win32') {
execSync(
`for /f "tokens=5" %a in ('netstat -aon ^| findstr :${this.roochPort}') do taskkill /F /PID %a`,
{ stdio: 'ignore' },
)
} else {
execSync(`lsof -ti:${this.roochPort} | xargs kill -9 2>/dev/null || true`, {
stdio: 'ignore',
})
}
} catch (e) {
// Ignore errors - port might already be free
log(`Port cleanup completed (or was already free)`)
}
}
} else {
this.roochContainer?.stop()
}

this.tmpDir.removeCallback()
log('Environment cleanup completed')
}

Copilot AI Nov 21, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cleanEnv() method has a timing issue with asynchronous cleanup. The setTimeout() calls execute asynchronously but cleanEnv() doesn't wait for them, so the method returns immediately while cleanup is still in progress. This means:

  1. The temporary directory removal on line 269 may happen before process termination completes
  2. Tests may proceed before cleanup finishes, potentially causing resource leaks
  3. The port-based cleanup (lines 245-264) runs immediately and may kill the process before the SIGTERM timeout completes

Consider making cleanEnv() async and using await with Promise-based delays, or use execSync for immediate synchronous cleanup.

Copilot uses AI. Check for mistakes.
Comment thread sdk/typescript/test-suite/src/testbox.ts Outdated
Comment thread sdk/typescript/test-suite/src/testbox.ts Outdated
This commit addresses the SDK test timeout issues caused by port conflicts
and incomplete process cleanup.

## Problem
- Web SDK tests frequently timeout when trying to start Rooch server
- Port 6767/6768 conflicts when multiple tests run or old processes remain
- No timeout mechanism for server startup, leading to indefinite hangs
- Process cleanup was incomplete, leaving zombie processes

## Solution
1. **Dynamic Port Allocation**
   - Added isPortAvailable() function to check port availability
   - Modified loadRoochEnv() to check if requested port is available
   - Falls back to dynamic port (0) if requested port is occupied
   - Ensures each test gets a unique port

2. **Timeout Mechanism**
   - Added 30-second timeout to roochAsyncCommand()
   - Provides clear error messages when startup fails
   - Includes captured output in error for debugging

3. **Improved Process Cleanup**
   - Enhanced cleanEnv() to use SIGTERM first, then SIGKILL
   - Added fallback port-based cleanup using lsof
   - Cross-platform support (macOS/Linux/Windows)
   - Better logging for troubleshooting

4. **Better Logging**
   - Added debug logs for port allocation and process lifecycle
   - Logs PID and port for each server instance
   - Helps diagnose issues in CI environment

## Testing
- Verified multiple servers can start concurrently without conflicts
- Confirmed process cleanup properly frees ports
- Tested timeout mechanism with unavailable ports

Fixes #3755
- Fix Windows batch command syntax: use %%a instead of %a
- Make cleanup synchronous to avoid race conditions
- Remove redundant execSync require (use top-level import)
- Simplify cleanup logic by removing async setTimeout

Changes based on Copilot review feedback.
…e tests

## Problem
- Previous approach used hardcoded ports (6767) causing conflicts
- Manual server management via npm scripts was error-prone
- Tests couldn't run concurrently due to port conflicts
- Platform-specific cleanup commands (lsof/netstat)

## Solution: Global Setup with Dynamic Port Allocation

### 1. Global Setup (globalSetup.ts)
- Starts single shared Rooch server for all e2e tests
- Uses dynamic port allocation (0 = auto-assign)
- Sets VITE_FULLNODE_URL environment variable
- Automatic cleanup via Vitest teardown

### 2. Simplified npm Scripts
- Removed: prepare:e2e, stop:e2e, wait-on dependency
- Now: just 'vitest run e2e'
- No manual server management needed

### 3. TestBox Integration
- Each test's TestBox now configures rooch CLI to use global server
- Fixed cmdPublishPackage to replace 'default' with actual address
- Improved JSON parsing to handle mixed output (logs + JSON)
- Enhanced error handling with detailed logging

### 4. Better Error Handling
- Capture both stdout and stderr from execSync
- Robust JSON extraction from mixed output
- Clear error messages for debugging

## Benefits
✅ No port conflicts (dynamic allocation)
✅ Faster tests (shared server, no repeated startup)
✅ Cross-platform (no lsof/netstat needed)
✅ Simpler scripts (3 files changed, 4 scripts removed)
✅ Better reliability (automatic cleanup)
✅ CI compatible (no workflow changes needed)

## Test Results
- All 13 test files pass (32 tests)
- Duration: ~89s
- Dynamic port confirmed: http://127.0.0.1:64271

Fixes #3755
Comment thread sdk/typescript/test-suite/src/testbox.ts Fixed
Comment thread sdk/typescript/test-suite/src/testbox.ts Fixed
## Problem
The isPortAvailable() function had a Time-of-Check-Time-of-Use (TOCTOU)
race condition:

1. Check if port is available ✓
2. Close test server, releasing the port ⚠️
3. **Another process could steal the port here** 💥
4. Try to start Rooch server on that port ✗

This could cause:
- Server startup failures
- 30-second timeout instead of fast failure
- Flaky tests in concurrent environments

## Solution
Remove the isPortAvailable() check entirely and use a simpler strategy:

- For port 0, 6767, or 6768: Always use dynamic allocation
- For other ports: Trust the caller, let server fail fast if unavailable

Benefits:
✅ No race condition
✅ Simpler code (removed isPortAvailable function)
✅ Faster failure (server bind error vs 30s timeout)
✅ Common ports (6767/6768) always use dynamic allocation

## Trade-offs
- Custom ports (not 0/6767/6768) won't be pre-checked
- Server will fail at bind time instead of during port check
- This is actually better: fail-fast with clear error message
…ckage

## Reason
The rooch CLI already supports 'default' keyword in named addresses:
--named-addresses alice=0x1234,bob=default

The 'default' keyword is automatically resolved to the active account
address by the rooch command itself.

## Changes
- Removed manual address replacement logic (await defaultCmdAddress())
- Simplified cmdPublishPackage to directly use options.namedAddresses
- Added comment explaining that rooch CLI handles 'default' keyword

## Benefits
✅ Simpler code (removed 2 lines)
✅ Faster execution (no async address lookup)
✅ Consistent with rooch CLI behavior
✅ Less error-prone (let CLI handle address resolution)

## Test Results
All 13 test files pass (32 tests) - Duration: 89.66s
…injection

## Security Issue
Using execSync with shell: true can lead to command injection vulnerabilities
if user input is not properly sanitized.

## Solution
Refactored buildRoochCommand to return structured command information:
- cmd: executable path
- args: array of arguments
- env: environment variables

Updated roochCommand to use execFileSync instead of execSync:
- execFileSync doesn't use shell, preventing injection attacks
- Arguments are passed as array, not concatenated strings
- More secure and explicit command execution

Updated roochAsyncCommand to use spawn without shell:
- spawn(cmd, args, { env }) instead of spawn(command, { shell: true })
- Same security benefits as execFileSync

## Benefits
✅ Prevents shell injection vulnerabilities
✅ More explicit argument handling
✅ Proper environment variable passing
✅ Cleaner separation of concerns

## Test Results
All 13 test files pass (32 tests) - Duration: 90.03s

Addresses GitHub Advanced Security code scanning alert
@jolestar jolestar merged commit 7afd420 into main Nov 22, 2025
19 checks passed
@jolestar jolestar deleted the fix/sdk-test-port-management branch November 22, 2025 04:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants