Skip to content

🐛 BUG: Unexplainable behavior with x-dev-env - worker hangs after first request in GH Actions #7635

@theKashey

Description

@theKashey

Which Cloudflare product(s) does this pertain to?

Wrangler

What versions are you using?

3.86.0 [Wrangler]

What operating system and version are you using?

Mac Sonoma 14.7.1

Please provide a link to a minimal reproduction

No response

Describe the Bug

We were trying to update wrangler for quite a while and here are some very weird problems we've experienced.

  • first of all - everything works (almost) fine on local and in the Cloudflare. The only problem is Playwright running tests in GitHubActions (via mcr.microsoft.com/playwright:v1.48.1-focal)
  • locally problem can be reproduced by changing worker code - eventually after another reload it will stop processing requests, but will accept connection (so socket will hang). Any other change will restore the functionality
  • everything works fine on 3.78.10
  • any version bump will require --x-dev-env=false, or it would not work
  • as this flag was removed at 3.86.1 any newer version dont work.
  • unless you use --log-level=debug which somehow resolve the problem (tests are passing 9 out of 10)

We run 7 workers in parallel, with the last one - "gateway" starting after all. Here are logs from the area of interest

v3.78.10

[WebServer] [1] [GATEWAY]  ⛅️ wrangler 3.78.10 (update available 3.95.0)
[WebServer] [1] [GATEWAY] --------------------------------------------------------
[WebServer] [1] [GATEWAY] Using vars defined in .dev.vars
[WebServer] [1] [GATEWAY] Your worker has access to the following bindings:
[WebServer] [1] [GATEWAY] - Services:
[WebServer] [1] [GATEWAY]   - IMPRESSIONS: thanks-impressions
[WebServer] [1] [GATEWAY]   - IMPRESSIONS-V2: thanks-impressions-v2
[WebServer] [1] [GATEWAY]   - PROMOTIONS: thanks-promotion-service
[WebServer] [1] [GATEWAY]   - SUBSCRIPTIONS: thanks-subscription-service
[WebServer] [1] [GATEWAY]   - NODE_SERVICES: thanks-node-services
[WebServer] [1] [GATEWAY]   - TEMPLATE: thanks-template-service
[WebServer] [1] [GATEWAY] [wrangler:inf] Ready on http://localhost:8081
[WebServer] [1] [GATEWAY] ⎔ Starting local server...
[WebServer] [1] [GATEWAY] @service:gateway@ : [get]/ : dev,US,IAD
[WebServer] [1] [GATEWAY] [wrangler:inf] GET / 400 Bad Request (598ms)
/// ⬆️ Playwright probing URL and ⬇️ running tests
Running 18 tests using 1 worker
[WebServer] [1] [GATEWAY] [wrangler:inf] GET /v1/widget.js 200 OK (28ms)
/// ⬆️ working

v3.86.0 and --x-dev-env=false

[WebServer] [1] [GATEWAY]  ⛅️ wrangler 3.86.0 (update available 3.99.0)
[WebServer] [1] [GATEWAY] -------------------------------------------------------
[WebServer] [1] [GATEWAY] Using vars defined in .dev.vars
[WebServer] [1] [GATEWAY] Your worker has access to the following bindings:
[WebServer] [1] [GATEWAY] - Services:
[WebServer] [1] [GATEWAY]   - IMPRESSIONS: thanks-impressions [not connected]
[WebServer] [1] [GATEWAY]   - IMPRESSIONS-V2: thanks-impressions-v2 [not connected]
[WebServer] [1] [GATEWAY]   - PROMOTIONS: thanks-promotion-service [not connected]
[WebServer] [1] [GATEWAY]   - SUBSCRIPTIONS: thanks-subscription-service [not connected]
[WebServer] [1] [GATEWAY]   - NODE_SERVICES: thanks-node-services [not connected]
[WebServer] [1] [GATEWAY]   - TEMPLATE: thanks-template-service [not connected]
[WebServer] [1] [GATEWAY] 
[WebServer] [1] [GATEWAY] Service bindings & durable object bindings connect to other `wrangler dev` processes running locally, with their connection status indicated by [connected] or [not connected]. For more details, refer to https://developers.cloudflare.com/workers/runtime-apis/bindings/service-bindings/#local-development
[WebServer] [1] [GATEWAY] [wrangler:inf] Ready on http://localhost:8081
[WebServer] [1] [WebServer] [GATEWAY] ⎔ Starting local server...
[WebServer] [1] [GATEWAY] [wrangler:inf] GET / 400 Bad Request (497ms)
/// ⬆️ Playwright probing URL and ⬇️ running tests
Running 27 tests using 1 worker
[WebServer] [1] [GATEWAY] ⎔ Reloading local server...
[WebServer] [1] [GATEWAY] [wrangler:inf] GET /v1/widget.js 200 OK (24ms)
/// ⬆️ working

3.86.1 (without -x-dev-env)

[WebServer] [1] [GATEWAY]  ⛅️ wrangler 3.86.1 (update available 3.99.0)
[WebServer] [1] [GATEWAY] -------------------------------------------------------
[WebServer] [1] [GATEWAY] Your worker has access to the following bindings:
[WebServer] [1] [GATEWAY] - Services:
[WebServer] [1] [GATEWAY]   - IMPRESSIONS: thanks-impressions [not connected]
[WebServer] [1] [GATEWAY]   - IMPRESSIONS-V2: thanks-impressions-v2 [not connected]
[WebServer] [1] [GATEWAY]   - PROMOTIONS: thanks-promotion-service [not connected]
[WebServer] [1] [GATEWAY]   - SUBSCRIPTIONS: thanks-subscription-service [not connected]
[WebServer] [1] [GATEWAY]   - NODE_SERVICES: thanks-node-services [not connected]
[WebServer] [1] [GATEWAY]   - TEMPLATE: thanks-template-service [not connected]
[WebServer] [1] [GATEWAY] 
[WebServer] [1] [GATEWAY] Service bindings & durable object bindings connect to other `wrangler dev` processes running locally, with their connection status indicated by [connected] or [not connected]. For more details, refer to https://developers.cloudflare.com/workers/runtime-apis/bindings/service-bindings/#local-development
[WebServer] [1] [GATEWAY] 
[WebServer] [1] [GATEWAY] [wrangler:inf] Ready on http://localhost:8081
[WebServer] [1] [GATEWAY] Using vars defined in .dev.vars
[WebServer] [1] [GATEWAY] - Services:
[WebServer] [1] [GATEWAY]   - IMPRESSIONS: thanks-impressions [connected]
[WebServer] [1] [GATEWAY]   - IMPRESSIONS-V2: thanks-impressions-v2 [connected]
[WebServer] [1] [GATEWAY]   - PROMOTIONS: thanks-promotion-service [connected]
[WebServer] [1] [GATEWAY]   - SUBSCRIPTIONS: thanks-subscription-service [connected]
[WebServer] [1] [GATEWAY]   - NODE_SERVICES: thanks-node-services [connected]
[WebServer] [1] [GATEWAY]   - TEMPLATE: thanks-template-service [connected]
[WebServer] [1] [GATEWAY] ⎔ Starting local server...
[WebServer] [1] [GATEWAY] [wrangler:inf] GET / 400 Bad Request (502ms)
/// ⬆️ Playwright probing URL and ⬇️ running tests
Running 27 tests using 1 worker
/// 😵 hangs

It does not matter which URL to try. It will handle the first request, but none other.
"hangs" mean that Playwright can establish a connection to get a resource (widget.js) but will never get anything.

v3.99.0

totally equal to 3.86.1

v3.99.0 and --log-level=debug

output is too verbose, but I found an interesting difference between adding debug for everybody or just gateway

  • everybody
[WebServer] [1] [GATEWAY] [InspectorProxyWorker] RUNTIME INCOMING MESSAGE { id: 100000003, method: 'Network.enable', result: {} }
[WebServer] [1] [SERVICES] [@theway/template-service] [InspectorProxyWorker] SEND TO RUNTIME {"method":"Runtime.getIsolateId","id":100000011}
[WebServer] [1] [SERVICES] [@theway/template-service] [InspectorProxyWorker] RUNTIME INCOMING MESSAGE { id: 100000011, result: { id: '7296e37d6767447e' } }
[WebServer] [1] [SERVICES] [@theway/impressions-service-v2] [InspectorProxyWorker] SEND TO RUNTIME {"method":"Runtime.getIsolateId","id":100000011}
[WebServer] [1] [SERVICES] [@theway/impressions-service-v2] [InspectorProxyWorker] RUNTIME INCOMING MESSAGE { id: 100000011, result: { id: '66815b1649719b4f' } }
/// ⬆️ Playwright probing URL and ⬇️ running tests
Running 27 tests using 1 worker
[WebServer] [1] [SERVICES] [@theway/node-service] [InspectorProxyWorker] SEND TO RUNTIME {"method":"Runtime.getIsolateId","id":100000011}
[WebServer] [1] [SERVICES] [@theway/node-service] [InspectorProxyWorker] RUNTIME INCOMING MESSAGE { id: 100000011, result: { id: '85b82413505896e2' } }
[WebServer] [1] [GATEWAY] @service:gateway@ : [get]/v1/widget.js : dev,US,IAD 
⬆️⬆️ its working! ⬆️⬆️
  • just gateway
[WebServer] [1] [GATEWAY] [InspectorProxyWorker] RUNTIME INCOMING MESSAGE { id: 100000003, method: 'Network.enable', result: {} }
/// ⬆️ Playwright probing URL and ⬇️ running tests
Running 27 tests using 1 worker
[WebServer] [1] [GATEWAY] [wrangler-ProxyWorker:inf] GET /cdn-cgi/ProxyWorker/pause 204 No Content (1ms) 
                                                              ⬆️ What is that? ⬆️
[WebServer] [1] [GATEWAY] [InspectorProxyWorker] SEND TO RUNTIME {"method":"Runtime.getIsolateId","id":100000004}
[WebServer] [1] [GATEWAY] [InspectorProxyWorker] RUNTIME INCOMING MESSAGE { id: 100000004, result: { id: 'd413491f66c9cc25' } }
⬆️ "no reaction" ⬆️
[WebServer] [1] [GATEWAY] [InspectorProxyWorker] SEND TO RUNTIME {"method":"Runtime.getIsolateId","id":100000005} 
⬆️ that is already another request ⬆️, there would be "no reaction" as well

Adding logs makes test just a little slower, so we are ok to processed with this solution, but I dont really understand what can be a problem here

My main consern here is a call to /cdn-cgi/ProxyWorker/pause
In a working scenario the last message related to ProxyWorker is play

[WebServer] [1] [GATEWAY] [wrangler-ProxyWorker:inf] GET /cdn-cgi/ProxyWorker/play 204 No Content (14ms)

We had a long history with wrangler flakes in CI, so it runs quite "reliable"

  • first run a "dummy" worker, waits for healthcheck
  • later starts all leaf services, waits for healthcheck
  • then starts the gateway, waits for healthcheck

Every service will eventually produce the /cdn-cgi/ProxyWorker/pause message, and it's somehow a matter of luck of the last message is /cdn-cgi/ProxyWorker/play

In the "good scenario" this is what I can see in the logs right after gateway start

[WebServer] [1] [GATEWAY] [wrangler-ProxyWorker:inf] Ready on http://localhost:8081
[WebServer] [1] [GATEWAY] [wrangler-ProxyWorker:inf] GET /cdn-cgi/ProxyWorker/pause 204 No Content (6ms)
[WebServer] [1] [GATEWAY] [wrangler-ProxyWorker:inf] GET /cdn-cgi/ProxyWorker/pause 204 No Content (1ms)
[WebServer] [1] [WebServer] [GATEWAY] [wrangler-ProxyWorker:inf] GET /cdn-cgi/ProxyWorker/pause 204 No Content (1ms)
[WebServer] [1] [GATEWAY] [wrangler-ProxyWorker:inf] GET /cdn-cgi/ProxyWorker/pause 204 No Content (2ms)
[WebServer] [1] [SERVICES] [@theway/promotion-service] [InspectorProxyWorker] SEND TO RUNTIME {"method":"Runtime.getIsolateId","id":100000011}
[WebServer] [1] [SERVICES] [@theway/promotion-service] [InspectorProxyWorker] RUNTIME INCOMING MESSAGE { id: 100000011, result: { id: '1107c602f204561f' } }
[WebServer] [1] [GATEWAY] RemoteRuntimeController teardown beginning...
[WebServer] [1] [GATEWAY] RemoteRuntimeController teardown complete
[WebServer] [1] [GATEWAY] ⎔ Starting local server...
[WebServer] [1] [GATEWAY] [wrangler-ProxyWorker:inf] GET /cdn-cgi/ProxyWorker/pause 204 No Content (1ms)
[WebServer] [1] [GATEWAY] [InspectorProxyWorker] handleProxyControllerIncomingMessage {"type":"reloadStart"}
[WebServer] [1] [GATEWAY] ✘ [ERROR] workerd/server/server.c++:4017: info: Inspector is listening
[WebServer] [1] [GATEWAY] 
[WebServer] [1] [GATEWAY] 
[WebServer] [1] [GATEWAY] [wrangler-ProxyWorker:inf] GET /cdn-cgi/ProxyWorker/pause 204 No Content (0ms)
[WebServer] [1] [GATEWAY] [wrangler-ProxyWorker:inf] GET /cdn-cgi/ProxyWorker/play 204 No Content (14ms)

A log of pause requests, but the very last one is play.

Please provide any relevant error logs

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    awaiting-response:reporterAwaiting response from OP for further clarificationregressionBreak in existing functionality as a result of a recent change

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status
    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions