Skip to content

Fix WebSocket BatchLimit Test Timeout Error on MacOS#16431

Closed
grandchildrice wants to merge 1 commit into
erigontech:mainfrom
grandchildrice:gohan/fix-ws-batchlimit-closing-error-on-mac
Closed

Fix WebSocket BatchLimit Test Timeout Error on MacOS#16431
grandchildrice wants to merge 1 commit into
erigontech:mainfrom
grandchildrice:gohan/fix-ws-batchlimit-closing-error-on-mac

Conversation

@grandchildrice

Copy link
Copy Markdown
Contributor

Issue

#16382

Cause

The WebSocket batch limit test was timing out on macOS because when the batch limit was exceeded, the server would close the connection abruptly without sending a proper WebSocket close frame. This caused the client to hang waiting for a response, leading to test timeouts.

Fix

Added proper WebSocket connection closing when batch limit is exceeded:

  1. Send the batch limit error response before closing
  2. For WebSocket connections specifically, send a proper close message with websocket.CloseInternalServerErr status
  3. Set a write deadline to ensure the close message doesn't block
  4. Then close the connection as before

Test Execution Command:

go test -v ./rpc -run TestBatchLimit_WebSocket_Exceeded

Note

Since I didn't have a Mac OS environment at hand, I am conducting operational checks on the CI of the forked repository. If anyone has a Mac OS environment, I think it would be best to test it on your own machine.

https://github.com/grandchildrice/erigon/actions/runs/16712583476

@yperbasis yperbasis requested review from anacrolix and canepat August 5, 2025 12:35
@anacrolix

Copy link
Copy Markdown
Contributor

I have no capacity to check this right now, can you look @canepat ?

@canepat

canepat commented Aug 14, 2025

Copy link
Copy Markdown
Member

I have no capacity to check this right now, can you look @canepat ?

Yep, I'm on it

@anacrolix anacrolix removed their request for review August 15, 2025 02:55
@canepat

canepat commented Aug 26, 2025

Copy link
Copy Markdown
Member

I've been able to reproduce the deadlock in TestBatchLimit_WebSocket_Exceeded on macOS, both with and without the changes of this PR, by adding a delay at the beginning of the requestOp.wait function:

func (op *requestOp) wait(ctx context.Context, c *Client) (*jsonrpcMessage, error) {
	time.Sleep(30 * time.Millisecond)
...
}

and executing the test repeatedly:

-test.count=10000 -test.timeout=30m

This is the output:

=== RUN   TestBatchLimit_WebSocket_Exceeded
=== PAUSE TestBatchLimit_WebSocket_Exceeded
=== CONT  TestBatchLimit_WebSocket_Exceeded
[WARN] [08-26|12:27:53.991] Cannot register RPC callback [invalidRets1] - error must the last return value
[WARN] [08-26|12:27:53.991] Cannot register RPC callback [invalidRets2] - error must the last return value
[WARN] [08-26|12:27:53.991] Cannot register RPC callback [invalidRets3] - maximum 2 return values are allowed, got 3
[WARN] [08-26|12:27:53.996] [rpc] batch limit exceeded               limit=10 requested=20
[WARN] [08-26|12:27:53.996] [rpc handler] cancelAllRequests called   err="batch limit 10 exceeded (can increase by --rpc.batch.limit). Requested: 20" pending_requests=0
[WARN] [08-26|12:27:53.996] [rpc handler] cancelAllRequests called   err="client is closed" pending_requests=0
[WARN] [08-26|12:27:53.996] [rpc handler] cancelAllRequests called   err="websocket: close 1011 (internal server error): batch limit exceeded" pending_requests=20
panic: test timed out after 30m0s
	running tests:
		TestBatchLimit_WebSocket_Exceeded (25m51s)

goroutine 68154 [running]:
testing.(*M).startAlarm.func1()
	<go_install_dir>/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.2.darwin-arm64/src/testing/testing.go:2484 +0x4d8
created by time.goFunc
	<go_install_dir>/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.2.darwin-arm64/src/time/sleep.go:215 +0x44

goroutine 1 [chan receive, 25 minutes]:
testing.tRunner.func1()
	<go_install_dir>/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.2.darwin-arm64/src/testing/testing.go:1753 +0x660
testing.tRunner(0xc0005c48c0, 0xc000013ad8)
	<go_install_dir>/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.2.darwin-arm64/src/testing/testing.go:1798 +0x1ac
testing.runTests(0xc0000fc9a8, {0x1011f1360, 0x28, 0x28}, {0xfd5947c7c1f95ee4?, 0x0?, 0x10124a480?})
	<go_install_dir>/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.2.darwin-arm64/src/testing/testing.go:2277 +0x780
testing.(*M).Run(0xc0001da6e0)
	<go_install_dir>/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.2.darwin-arm64/src/testing/testing.go:2142 +0xb6c
main.main()
	_testmain.go:129 +0x114

goroutine 68371 [select, 25 minutes]:
github.com/erigontech/erigon/rpc.(*Client).dispatch(0xc0004c5860, {0x100d57f18, 0xc000171050})
	<erigon_repo_dir>/rpc/client.go:567 +0x338
created by github.com/erigontech/erigon/rpc.initClient in goroutine 68314
	<erigon_repo_dir>/rpc/client.go:235 +0x570

goroutine 68314 [select, 25 minutes]:
github.com/erigontech/erigon/rpc.(*requestOp).wait(0xc00046ee80, {0x100d55d48, 0x1021b84c0}, 0xc0004c5860)
	<erigon_repo_dir>/rpc/client.go:147 +0xc4
github.com/erigontech/erigon/rpc.(*Client).BatchCallContext(0xc0004c5860, {0x100d55d48, 0x1021b84c0}, {0xc000470008, 0x14, 0x0?})
	<erigon_repo_dir>/rpc/client.go:381 +0x3c4
github.com/erigontech/erigon/rpc.(*Client).BatchCall(...)
	<erigon_repo_dir>/rpc/client.go:344
github.com/erigontech/erigon/rpc.TestBatchLimit_WebSocket_Exceeded(0xc0005c5dc0)
	<erigon_repo_dir>/rpc/batch_limit_ws_test.go:47 +0x460
testing.tRunner(0xc0005c5dc0, 0x100d48a70)
	<go_install_dir>/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.2.darwin-arm64/src/testing/testing.go:1792 +0x184
created by testing.(*T).Run in goroutine 1
	<go_install_dir>/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.2.darwin-arm64/src/testing/testing.go:1851 +0x688

goroutine 68315 [IO wait, 25 minutes]:
internal/poll.runtime_pollWait(0x10288aa80, 0x72)
	<go_install_dir>/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.2.darwin-arm64/src/runtime/netpoll.go:351 +0xa0
internal/poll.(*pollDesc).wait(0xc0005dea20, 0x72, 0x0)
	<go_install_dir>/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.2.darwin-arm64/src/internal/poll/fd_poll_runtime.go:84 +0xb8
internal/poll.(*pollDesc).waitRead(...)
	<go_install_dir>/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.2.darwin-arm64/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc0005dea00)
	<go_install_dir>/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.2.darwin-arm64/src/internal/poll/fd_unix.go:620 +0x3a8
net.(*netFD).accept(0xc0005dea00)
	<go_install_dir>/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.2.darwin-arm64/src/net/fd_unix.go:172 +0x38
net.(*TCPListener).accept(0xc000284d40)
	<go_install_dir>/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.2.darwin-arm64/src/net/tcpsock_posix.go:159 +0x40
net.(*TCPListener).Accept(0xc000284d40)
	<go_install_dir>/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.2.darwin-arm64/src/net/tcpsock.go:380 +0x68
net/http.(*Server).Serve(0xc000224700, {0x100d54d10, 0xc000284d40})
	<go_install_dir>/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.2.darwin-arm64/src/net/http/server.go:3424 +0x488
net/http/httptest.(*Server).goServe.func1()
	<go_install_dir>/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.2.darwin-arm64/src/net/http/httptest/server.go:311 +0xb4
created by net/http/httptest.(*Server).goServe in goroutine 68314
	<go_install_dir>/pkg/mod/golang.org/toolchain@v0.0.1-go1.24.2.darwin-arm64/src/net/http/httptest/server.go:309 +0x9c

[WARN] [08-26|12:23:55.553] [rpc handler] cancelAllRequests called   err="read tcp 127.0.0.1:59759->127.0.0.1:59761: use of closed network connection" pending_requests=0
[WARN] [08-26|12:24:23.800] [rpc handler] cancelAllRequests called   err="client is closed" pending_requests=0
[WARN] [08-26|12:24:47.357] [rpc handler] cancelAllRequests called   err="client is closed" pending_requests=0
[WARN] [08-26|12:25:29.978] [rpc handler] cancelAllRequests called   err="read tcp 127.0.0.1:63873->127.0.0.1:63875: use of closed network connection" pending_requests=0
[WARN] [08-26|12:25:33.462] [rpc handler] cancelAllRequests called   err="client is closed" pending_requests=0
[WARN] [08-26|12:25:36.588] [rpc handler] cancelAllRequests called   err="read tcp 127.0.0.1:64158->127.0.0.1:64160: use of closed network connection" pending_requests=0
[WARN] [08-26|12:25:59.092] [rpc handler] cancelAllRequests called   err="read tcp 127.0.0.1:65133->127.0.0.1:65135: use of closed network connection" pending_requests=0
[WARN] [08-26|12:26:40.976] [rpc handler] cancelAllRequests called   err="read tcp 127.0.0.1:50578->127.0.0.1:50580: use of closed network connection" pending_requests=0

Process finished with the exit code 1

which also contains an additional log added in handler.cancelAllRequests function to track down the number of pending requests i.e. len(h.respWait) on both client and server sides.

@canepat

canepat commented Aug 26, 2025

Copy link
Copy Markdown
Member

Replaced by #16829

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants