fix(cdp): prevent indefinite hangs on remote browsers via DOMWatchdog timeouts #4875
Hey, I dug into #4579, where remote CDP sessions hang indefinitely on degraded websocket connections. The main issue seemed to be that several DOMWatchdog CDP calls were awaited with no timeout, so once a remote connection went half-open the watchdog could block the event bus and stall parallel sessions. I pushed a fix that bounds those calls with timeouts and adds connection health tracking with auto-reconnect.
Also added tests around timeout handling + connection recovery flow. Would appreciate feedback on the timeout/reconnect strategy, especially around whether you'd prefer more aggressive session invalidation or softer recovery behavior for pooled remote browsers.
This is awesome. Can you remove the unrelated changes from your PR, like spacing and stuff? They don't make sense here.
Yeah, some extra formatting changes slipped in locally. I’ll clean the diff up and keep it scoped to the timeout/recovery fix.
… timeouts

- Add _is_healthy flag to BrowserSession for connection state tracking
- Add verify_connection_health() lightweight CDP ping (1.5s timeout)
- Guard is_cdp_connected with _is_healthy check
- Wrap _get_pending_network_requests CDP call in asyncio.wait_for (5s)
- Add TimeoutError catch blocks to 3 critical DOMWatchdog methods
- On timeout: log, mark session unhealthy, raise ConnectionError
- Add proactive health check in watchdog event handler circuit breaker
- Trigger auto-reconnect when health check fails
- Add 4 targeted tests for health verification and timeout handling

Fixes browser-use#4579
3e0b578 to 4bc538d
Cleaned up the unrelated formatting changes and kept the diff scoped to the timeout/recovery logic + tests.
So I tracked down the hang issue around remote CDP sessions getting stuck indefinitely when the websocket silently degrades in Docker/cloud setups.
The main problem was that a few DOMWatchdog calls were awaiting CDP responses forever with no timeout boundaries, so once the socket went half-open the event bus lock never got released and parallel browser sessions just piled up behind it.
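To illustrate the pattern (this is a minimal sketch, not the actual diff): wrap the awaited call in asyncio.wait_for, and on timeout log, mark the session unhealthy, and raise ConnectionError. FakeSession, bounded_cdp_call, and never_responds are hypothetical names used only for this example; the 5s default mirrors the value mentioned in the commit message.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class FakeSession:
    """Hypothetical stand-in for a browser session; it only tracks health."""

    def __init__(self) -> None:
        self._is_healthy = True


async def bounded_cdp_call(session, make_call, timeout=5.0):
    """Await a CDP-style call with a deadline instead of waiting forever.

    `make_call` is a zero-argument function returning an awaitable
    (an assumption of this sketch, not the project's real signature).
    """
    try:
        return await asyncio.wait_for(make_call(), timeout=timeout)
    except asyncio.TimeoutError:
        # Log, flag the session, and surface a clear error to the caller.
        logger.warning("CDP call timed out after %.2fs", timeout)
        session._is_healthy = False
        raise ConnectionError(f"CDP call timed out after {timeout}s") from None


async def never_responds():
    """Simulates a half-open socket: the reply never arrives."""
    await asyncio.Event().wait()


if __name__ == "__main__":
    demo = FakeSession()
    try:
        asyncio.run(bounded_cdp_call(demo, never_responds, timeout=0.1))
    except ConnectionError as exc:
        print("failed fast:", exc, "| healthy:", demo._is_healthy)
```

Raising ConnectionError rather than letting the timeout propagate lets callers treat a stalled socket the same way as a dropped one.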
What's working now:
One annoying thing I hit while testing was reproducing the half-open websocket state consistently. Local runs were fine because TCP teardown happens immediately on localhost. Ended up simulating packet drops inside Docker/WSL to reproduce the actual hanging behavior reliably.
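For a quick local check that does not need Docker, a rough stand-in for the stalled connection is a peer that accepts the connection and then never replies. This is a different technique from dropping packets (the TCP connection stays fully healthy here), so it only exercises the timeout path, not real half-open behaviour. silent_peer and request_times_out are hypothetical helpers written for this sketch.

```python
import asyncio


async def silent_peer(reader, writer):
    """Accept the connection, swallow all input, and never send a reply."""
    try:
        while await reader.read(1024):
            pass
    finally:
        writer.close()


async def request_times_out(timeout=0.2) -> bool:
    """Return True if a request to the silent peer hits the deadline."""
    # Port 0 asks the OS for any free port on loopback.
    server = await asyncio.start_server(silent_peer, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    try:
        writer.write(b"ping\n")
        await writer.drain()
        try:
            # Without wait_for this read would block forever.
            await asyncio.wait_for(reader.readline(), timeout=timeout)
            return False
        except asyncio.TimeoutError:
            return True
    finally:
        writer.close()
        await writer.wait_closed()
        server.close()


if __name__ == "__main__":
    print("timed out:", asyncio.run(request_times_out()))
```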
Also hit a small WSL issue with pre-commit expecting python3.11 specifically, but that was just local env noise; ruff + formatting + targeted tests are all passing.
Core fix looks like this now:
I also added tests around timeout handling + connection health verification so we don't regress on this later.
Summary by cubic
Prevents remote CDP sessions from hanging by adding bounded timeouts and proactive health checks, so half‑open WebSockets in Docker/cloud no longer stall the event loop. Parallel sessions now recover or fail fast. Fixes #4579.
- _is_healthy flag that gates is_cdp_connected, plus verify_connection_health() (1.5s Browser.getVersion ping).
- Bounded timeouts (get_or_create_cdp_session, network request checks, DOM build, screenshot); on timeout, mark the session unhealthy and raise ConnectionError.
- Tests in tests/ci/test_connection_health.py cover health flag lifecycle, ping success/timeout, and DOMWatchdog timeout handling.

Written for commit 4bc538d. Summary will update on new commits.
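As a rough sketch of how a health flag and a bounded ping can fit together (under stated assumptions, not the code in this PR): HealthTrackedSession is a hypothetical class, and its ping argument stands in for whatever actually sends Browser.getVersion over the websocket. Only the names _is_healthy, is_cdp_connected, verify_connection_health, and the 1.5s timeout come from the summary.

```python
import asyncio


class HealthTrackedSession:
    """Hypothetical session wrapper used only for this illustration."""

    def __init__(self, ping, connected=True):
        self._ping = ping            # async callable; assumed to send Browser.getVersion
        self._connected = connected  # whether a transport exists at all
        self._is_healthy = True

    @property
    def is_cdp_connected(self) -> bool:
        # The health flag gates the connection check: an open but
        # unresponsive socket no longer counts as connected.
        return self._connected and self._is_healthy

    async def verify_connection_health(self, timeout: float = 1.5) -> bool:
        """Send a lightweight ping with a deadline and update the flag."""
        try:
            await asyncio.wait_for(self._ping(), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError, OSError):
            self._is_healthy = False
        else:
            self._is_healthy = True
        return self._is_healthy


async def fast_ping():
    """Stand-in for a browser that answers immediately."""
    return {"product": "example"}


async def stalled_ping():
    """Stand-in for a half-open socket that never answers."""
    await asyncio.Event().wait()


if __name__ == "__main__":
    good = HealthTrackedSession(fast_ping)
    bad = HealthTrackedSession(stalled_ping)
    print(asyncio.run(good.verify_connection_health(timeout=0.1)))
    print(asyncio.run(bad.verify_connection_health(timeout=0.1)))
```

A caller could run verify_connection_health() before dispatching work and trigger a reconnect when it returns False; how the real reconnect is wired is not shown here.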