fix(sdk): improve web SDK test port management and process cleanup#3794
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
There was a problem hiding this comment.
Pull request overview
This PR addresses critical test reliability issues in the Web SDK by improving port management and process cleanup. The changes focus on preventing port conflicts during concurrent test execution and ensuring complete cleanup of server processes.
Key Changes
- Dynamic port allocation: Added
isPortAvailable()helper and logic to automatically fall back to random ports when requested ports are occupied - Timeout mechanism: Added 30-second timeout to server startup with better error reporting
- Enhanced process cleanup: Improved
cleanEnv()with graceful shutdown (SIGTERM), force kill fallback (SIGKILL), and platform-specific port-based cleanup
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| sdk/typescript/test-suite/src/testbox.ts | Added port availability checking, timeout handling, and comprehensive process cleanup with logging |
| sdk/typescript/rooch-sdk/test-e2e/setup.ts | Changed default port from 6768 to 0 for automatic port assignment |
| const timeout = setTimeout(() => { | ||
| child.kill('SIGTERM') | ||
| // Force kill after 2 seconds if SIGTERM doesn't work | ||
| setTimeout(() => { | ||
| try { | ||
| child.kill('SIGKILL') | ||
| } catch (e) { | ||
| // Process already dead, ignore | ||
| } | ||
| }, 2000) | ||
| reject(new Error(`Timeout (${timeoutMs}ms) waiting for: ${waitFor}\nOutput so far: ${output}`)) | ||
| }, timeoutMs) |
There was a problem hiding this comment.
[nitpick] The nested setTimeout for force-killing the child process creates a potential memory leak. If the timeout triggers and the process terminates quickly from SIGTERM, the inner setTimeout (lines 318-324) will still execute after 2 seconds. While the try-catch prevents crashes, this leaves an unreferenced timer that should be cleaned up.
Consider storing the timer reference and clearing it when appropriate:
const timeout = setTimeout(() => {
child.kill('SIGTERM')
const forceKillTimer = setTimeout(() => {
try {
child.kill('SIGKILL')
} catch (e) {
// Process already dead, ignore
}
}, 2000)
// Store forceKillTimer to clear if needed
reject(new Error(`Timeout (${timeoutMs}ms) waiting for: ${waitFor}\nOutput so far: ${output}`))
}, timeoutMs)Or use a simpler approach with direct SIGKILL after clearing the main timeout.
| // Always use dynamic port in local mode to avoid conflicts | ||
| // Check if the requested port is available, if not, get a random one |
There was a problem hiding this comment.
[nitpick] The comment "Always use dynamic port in local mode to avoid conflicts" is misleading. The code still allows using a specific port if it's available (lines 109-116), so it doesn't always use a dynamic port.
Consider updating to:
// Check if the requested port is available, if not, get a dynamic one| // Always use dynamic port in local mode to avoid conflicts | |
| // Check if the requested port is available, if not, get a random one | |
| // Check if the requested port is available, if not, get a dynamic one | |
| // (If port is unavailable or zero, get a random unused port) |
| if (typeof this.roochContainer === 'number') { | ||
| process.kill(this.roochContainer) | ||
| const pid = this.roochContainer | ||
| log(`Cleaning up Rooch server process with PID: ${pid}`) | ||
|
|
||
| try { | ||
| // Try graceful shutdown first with SIGTERM | ||
| process.kill(pid, 'SIGTERM') | ||
| log(`Sent SIGTERM to process ${pid}`) | ||
|
|
||
| // Wait a bit, then force kill if process still exists | ||
| setTimeout(() => { | ||
| try { | ||
| // Check if process still exists (will throw if not) | ||
| process.kill(pid, 0) | ||
| // Process still exists, force kill | ||
| log(`Process ${pid} still running, sending SIGKILL`) | ||
| process.kill(pid, 'SIGKILL') | ||
| } catch (e) { | ||
| // Process already dead, this is expected | ||
| log(`Process ${pid} already terminated`) | ||
| } | ||
| }, 2000) | ||
| } catch (e: any) { | ||
| // Process might already be dead | ||
| log(`Failed to kill process ${pid}: ${e.message}`) | ||
| } | ||
|
|
||
| // Fallback: kill any process listening on the port | ||
| if (this.roochPort) { | ||
| try { | ||
| log(`Cleaning up any process on port ${this.roochPort}`) | ||
| const { execSync } = require('child_process') | ||
| // Use platform-appropriate command | ||
| if (process.platform === 'win32') { | ||
| execSync( | ||
| `for /f "tokens=5" %a in ('netstat -aon ^| findstr :${this.roochPort}') do taskkill /F /PID %a`, | ||
| { stdio: 'ignore' }, | ||
| ) | ||
| } else { | ||
| execSync(`lsof -ti:${this.roochPort} | xargs kill -9 2>/dev/null || true`, { | ||
| stdio: 'ignore', | ||
| }) | ||
| } | ||
| } catch (e) { | ||
| // Ignore errors - port might already be free | ||
| log(`Port cleanup completed (or was already free)`) | ||
| } | ||
| } | ||
| } else { | ||
| this.roochContainer?.stop() | ||
| } | ||
|
|
||
| this.tmpDir.removeCallback() | ||
| log('Environment cleanup completed') | ||
| } |
There was a problem hiding this comment.
The cleanEnv() method has a timing issue with asynchronous cleanup. The setTimeout() calls execute asynchronously but cleanEnv() doesn't wait for them, so the method returns immediately while cleanup is still in progress. This means:
- The temporary directory removal on line 269 may happen before process termination completes
- Tests may proceed before cleanup finishes, potentially causing resource leaks
- The port-based cleanup (lines 245-264) runs immediately and may kill the process before the SIGTERM timeout completes
Consider making cleanEnv() async and using await with Promise-based delays, or use execSync for immediate synchronous cleanup.
This commit addresses the SDK test timeout issues caused by port conflicts and incomplete process cleanup. ## Problem - Web SDK tests frequently timeout when trying to start Rooch server - Port 6767/6768 conflicts when multiple tests run or old processes remain - No timeout mechanism for server startup, leading to indefinite hangs - Process cleanup was incomplete, leaving zombie processes ## Solution 1. **Dynamic Port Allocation** - Added isPortAvailable() function to check port availability - Modified loadRoochEnv() to check if requested port is available - Falls back to dynamic port (0) if requested port is occupied - Ensures each test gets a unique port 2. **Timeout Mechanism** - Added 30-second timeout to roochAsyncCommand() - Provides clear error messages when startup fails - Includes captured output in error for debugging 3. **Improved Process Cleanup** - Enhanced cleanEnv() to use SIGTERM first, then SIGKILL - Added fallback port-based cleanup using lsof - Cross-platform support (macOS/Linux/Windows) - Better logging for troubleshooting 4. **Better Logging** - Added debug logs for port allocation and process lifecycle - Logs PID and port for each server instance - Helps diagnose issues in CI environment ## Testing - Verified multiple servers can start concurrently without conflicts - Confirmed process cleanup properly frees ports - Tested timeout mechanism with unavailable ports Fixes #3755
- Fix Windows batch command syntax: use %%a instead of %a - Make cleanup synchronous to avoid race conditions - Remove redundant execSync require (use top-level import) - Simplify cleanup logic by removing async setTimeout Changes based on Copilot review feedback.
d9bdf16 to
bdd036c
Compare
…e tests ## Problem - Previous approach used hardcoded ports (6767) causing conflicts - Manual server management via npm scripts was error-prone - Tests couldn't run concurrently due to port conflicts - Platform-specific cleanup commands (lsof/netstat) ## Solution: Global Setup with Dynamic Port Allocation ### 1. Global Setup (globalSetup.ts) - Starts single shared Rooch server for all e2e tests - Uses dynamic port allocation (0 = auto-assign) - Sets VITE_FULLNODE_URL environment variable - Automatic cleanup via Vitest teardown ### 2. Simplified npm Scripts - Removed: prepare:e2e, stop:e2e, wait-on dependency - Now: just 'vitest run e2e' - No manual server management needed ### 3. TestBox Integration - Each test's TestBox now configures rooch CLI to use global server - Fixed cmdPublishPackage to replace 'default' with actual address - Improved JSON parsing to handle mixed output (logs + JSON) - Enhanced error handling with detailed logging ### 4. Better Error Handling - Capture both stdout and stderr from execSync - Robust JSON extraction from mixed output - Clear error messages for debugging ## Benefits ✅ No port conflicts (dynamic allocation) ✅ Faster tests (shared server, no repeated startup) ✅ Cross-platform (no lsof/netstat needed) ✅ Simpler scripts (3 files changed, 4 scripts removed) ✅ Better reliability (automatic cleanup) ✅ CI compatible (no workflow changes needed) ## Test Results - All 13 test files pass (32 tests) - Duration: ~89s - Dynamic port confirmed: http://127.0.0.1:64271 Fixes #3755
## Problem The isPortAvailable() function had a Time-of-Check-Time-of-Use (TOCTOU) race condition: 1. Check if port is available ✓ 2. Close test server, releasing the port⚠️ 3. **Another process could steal the port here** 💥 4. Try to start Rooch server on that port ✗ This could cause: - Server startup failures - 30-second timeout instead of fast failure - Flaky tests in concurrent environments ## Solution Remove the isPortAvailable() check entirely and use a simpler strategy: - For port 0, 6767, or 6768: Always use dynamic allocation - For other ports: Trust the caller, let server fail fast if unavailable Benefits: ✅ No race condition ✅ Simpler code (removed isPortAvailable function) ✅ Faster failure (server bind error vs 30s timeout) ✅ Common ports (6767/6768) always use dynamic allocation ## Trade-offs - Custom ports (not 0/6767/6768) won't be pre-checked - Server will fail at bind time instead of during port check - This is actually better: fail-fast with clear error message
…ckage ## Reason The rooch CLI already supports 'default' keyword in named addresses: --named-addresses alice=0x1234,bob=default The 'default' keyword is automatically resolved to the active account address by the rooch command itself. ## Changes - Removed manual address replacement logic (await defaultCmdAddress()) - Simplified cmdPublishPackage to directly use options.namedAddresses - Added comment explaining that rooch CLI handles 'default' keyword ## Benefits ✅ Simpler code (removed 2 lines) ✅ Faster execution (no async address lookup) ✅ Consistent with rooch CLI behavior ✅ Less error-prone (let CLI handle address resolution) ## Test Results All 13 test files pass (32 tests) - Duration: 89.66s
…injection
## Security Issue
Using execSync with shell: true can lead to command injection vulnerabilities
if user input is not properly sanitized.
## Solution
Refactored buildRoochCommand to return structured command information:
- cmd: executable path
- args: array of arguments
- env: environment variables
Updated roochCommand to use execFileSync instead of execSync:
- execFileSync doesn't use shell, preventing injection attacks
- Arguments are passed as array, not concatenated strings
- More secure and explicit command execution
Updated roochAsyncCommand to use spawn without shell:
- spawn(cmd, args, { env }) instead of spawn(command, { shell: true })
- Same security benefits as execFileSync
## Benefits
✅ Prevents shell injection vulnerabilities
✅ More explicit argument handling
✅ Proper environment variable passing
✅ Cleaner separation of concerns
## Test Results
All 13 test files pass (32 tests) - Duration: 90.03s
Addresses GitHub Advanced Security code scanning alert
Problem
Web SDK tests frequently timeout when trying to start the Rooch server, with errors like:
Root causes:
cleanEnv()didn't ensure processes were fully terminatedSolution
1. Dynamic Port Allocation
isPortAvailable()function to check port status before bindingloadRoochEnv()to automatically fall back to a random available port if the requested port is occupied2. Timeout Mechanism
roochAsyncCommand()3. Improved Process Cleanup
cleanEnv()to:lsof4. Better Debugging
Changes
sdk/typescript/test-suite/src/testbox.tsisPortAvailable()helper functionroochAsyncCommand()with timeout and better error handlingloadRoochEnv()to handle port conflicts automaticallycleanEnv()for more robust process terminationsdk/typescript/rooch-sdk/test-e2e/setup.ts6768to0(auto-assign) to avoid conflictsTesting
Verified locally:
Test results:
Impact
Related to #3755