-
Notifications
You must be signed in to change notification settings - Fork 897
Flaky Test: TestAllocationMetrics fails with "address already in use" on port 3001 #4444
Description
The TestAllocationMetrics test in pkg/gameserverallocations/metrics_test.go intermittently fails due to a port conflict on port 3001.
Date: 2026-02-11
Error Details
{"error":"Could not listen on :3001: listen tcp :3001: bind: address already in use","message":"","severity":"error","test":"TestAllocationMetrics","time":"2026-02-11T18:08:21.059633335Z"}
--- FAIL: TestAllocationMetrics (1.41s)
metrics_test.go:209:
Error Trace:/go/src/agones.dev/agones/pkg/gameserverallocations/metrics_test.go:209
Error: Received unexpected error:
Get "http://localhost:3001/metrics": dial tcp 127.0.0.1:3001: connect: connection refused
Test: TestAllocationMetrics
Messages: Failed to GET metrics endpoint
Root Cause
The test has multiple issues that contribute to its flakiness:
-
Hardcoded port (
3001): The HTTP server uses a fixed port that may already be in use from:- Previous test runs that didn't clean up properly
- Other concurrent test processes (if tests run in parallel)
- Other services on the CI/test environment
-
Race condition in server startup: The test uses a naive
time.Sleep(300 * time.Millisecond)to wait for the server to start (line 206), which is unreliable:- If the system is under load, 300ms may not be enough for the server to bind
- If the port is already in use, the server fails immediately but the test still waits and then tries to connect
- There's no verification that the server actually started successfully
-
No cleanup verification: While
defer cancelHTTP()is used, there's no guarantee the server has fully released the port before the test exits
Affected Code
// pkg/gameserverallocations/metrics_test.go:148-206
server := &httpserver.Server{
Port: "3001", // ❌ Hardcoded port
Logger: framework.TestLogger(t),
}
// ...
go func() {
_ = server.Run(ctxHTTP, 0)
}()
time.Sleep(300 * time.Millisecond) // ❌ Race condition
resp, err := http.Get("http://localhost:3001/metrics")
require.NoError(t, err, "Failed to GET metrics endpoint")Suggested Fixes
Option 1: Use Dynamic Port Allocation (Recommended)
Use port 0 to let the OS assign an available port, then retrieve the actual port from the listener:
// Modify httpserver.Server to support dynamic port and expose the listener
listener, err := net.Listen("tcp", ":0")
require.NoError(t, err)
defer listener.Close()
actualPort := listener.Addr().(*net.TCPAddr).Port
server := &httpserver.Server{
Port: fmt.Sprintf("%d", actualPort),
Logger: framework.TestLogger(t),
}
// Start server with the pre-created listener
go func() {
srv := &http.Server{Handler: server}
_ = srv.Serve(listener)
}()
// Now we know the server is ready because Listen succeeded
resp, err := http.Get(fmt.Sprintf("http://localhost:%d/metrics", actualPort))Option 2: Poll for Server Readiness
Replace the sleep with a retry loop that polls the endpoint:
go func() {
_ = server.Run(ctxHTTP, 0)
}()
// Wait for server to be ready
err := wait.PollUntilContextTimeout(context.Background(), 50*time.Millisecond, 5*time.Second, true,
func(_ context.Context) (bool, error) {
resp, err := http.Get("http://localhost:3001/metrics")
if err != nil {
return false, nil // Keep trying
}
resp.Body.Close()
return true, nil
})
require.NoError(t, err, "Server failed to become ready")Option 3: Use t.Parallel() with Unique Ports
If tests are meant to run in parallel, assign unique ports per test or use a port allocation mechanism.
Option 4: Enhance httpserver.Server
Add a readiness channel to the Server struct that closes when ListenAndServe successfully starts:
type Server struct {
http.ServeMux
Port string
Logger *logrus.Entry
Ready chan struct{} // New field
}
func (s *Server) Run(ctx context.Context, _ int) error {
// ...
srv := &http.Server{Addr: ":" + s.Port, Handler: s}
go func() {
<-ctx.Done()
_ = srv.Shutdown(context.Background())
}()
listener, err := net.Listen("tcp", srv.Addr)
if err != nil {
return err
}
close(s.Ready) // Signal that we're ready
return srv.Serve(listener)
}Priority
This is a flaky test that can cause false negatives in CI, potentially blocking legitimate PRs.
Additional Context
The test validates that allocation metrics are properly exported to the Prometheus endpoint. The functionality being tested is important, but the test implementation needs to be more robust to handle various CI/test environments.