Skip to content

Flaky Test: TestAllocationMetrics fails with "address already in use" on port 3001 #4444

@markmandel

Description

@markmandel

The TestAllocationMetrics test in pkg/gameserverallocations/metrics_test.go intermittently fails due to a port conflict on port 3001.

Build Link: https://console.cloud.google.com/cloud-build/builds/4704c5a8-7a2a-452c-82b8-9676ddae0327\;step\=20\?project\=agones-images

Date: 2026-02-11

Error Details

{"error":"Could not listen on :3001: listen tcp :3001: bind: address already in use","message":"","severity":"error","test":"TestAllocationMetrics","time":"2026-02-11T18:08:21.059633335Z"}
--- FAIL: TestAllocationMetrics (1.41s)
    metrics_test.go:209: 
        Error Trace:/go/src/agones.dev/agones/pkg/gameserverallocations/metrics_test.go:209
        Error:      Received unexpected error:
                    Get "http://localhost:3001/metrics": dial tcp 127.0.0.1:3001: connect: connection refused
        Test:       TestAllocationMetrics
        Messages:   Failed to GET metrics endpoint

Root Cause

The test has multiple issues that contribute to its flakiness:

  1. Hardcoded port (3001): The HTTP server uses a fixed port that may already be in use from:

    • Previous test runs that didn't clean up properly
    • Other concurrent test processes (if tests run in parallel)
    • Other services on the CI/test environment
  2. Race condition in server startup: The test uses a naive time.Sleep(300 * time.Millisecond) to wait for the server to start (line 206), which is unreliable:

    • If the system is under load, 300ms may not be enough for the server to bind
    • If the port is already in use, the server fails immediately but the test still waits and then tries to connect
    • There's no verification that the server actually started successfully
  3. No cleanup verification: While defer cancelHTTP() is used, there's no guarantee the server has fully released the port before the test exits

Affected Code

// pkg/gameserverallocations/metrics_test.go:148-206
server := &httpserver.Server{
    Port:   "3001",  // ❌ Hardcoded port
    Logger: framework.TestLogger(t),
}
// ...
go func() {
    _ = server.Run(ctxHTTP, 0)
}()
time.Sleep(300 * time.Millisecond)  // ❌ Race condition

resp, err := http.Get("http://localhost:3001/metrics")
require.NoError(t, err, "Failed to GET metrics endpoint")

Suggested Fixes

Option 1: Use Dynamic Port Allocation (Recommended)

Use port 0 to let the OS assign an available port, then retrieve the actual port from the listener:

// Modify httpserver.Server to support dynamic port and expose the listener
listener, err := net.Listen("tcp", ":0")
require.NoError(t, err)
defer listener.Close()

actualPort := listener.Addr().(*net.TCPAddr).Port
server := &httpserver.Server{
    Port:   fmt.Sprintf("%d", actualPort),
    Logger: framework.TestLogger(t),
}

// Start server with the pre-created listener
go func() {
    srv := &http.Server{Handler: server}
    _ = srv.Serve(listener)
}()

// Now we know the server is ready because Listen succeeded
resp, err := http.Get(fmt.Sprintf("http://localhost:%d/metrics", actualPort))

Option 2: Poll for Server Readiness

Replace the sleep with a retry loop that polls the endpoint:

go func() {
    _ = server.Run(ctxHTTP, 0)
}()

// Wait for server to be ready
err := wait.PollUntilContextTimeout(context.Background(), 50*time.Millisecond, 5*time.Second, true, 
    func(_ context.Context) (bool, error) {
        resp, err := http.Get("http://localhost:3001/metrics")
        if err != nil {
            return false, nil // Keep trying
        }
        resp.Body.Close()
        return true, nil
    })
require.NoError(t, err, "Server failed to become ready")

Option 3: Use t.Parallel() with Unique Ports

If tests are meant to run in parallel, assign unique ports per test or use a port allocation mechanism.

Option 4: Enhance httpserver.Server

Add a readiness channel to the Server struct that closes when ListenAndServe successfully starts:

type Server struct {
    http.ServeMux
    Port   string
    Logger *logrus.Entry
    Ready  chan struct{}  // New field
}

func (s *Server) Run(ctx context.Context, _ int) error {
    // ...
    srv := &http.Server{Addr: ":" + s.Port, Handler: s}
    
    go func() {
        <-ctx.Done()
        _ = srv.Shutdown(context.Background())
    }()
    
    listener, err := net.Listen("tcp", srv.Addr)
    if err != nil {
        return err
    }
    close(s.Ready)  // Signal that we're ready
    
    return srv.Serve(listener)
}

Priority

This is a flaky test that can cause false negatives in CI, potentially blocking legitimate PRs.

Additional Context

The test validates that allocation metrics are properly exported to the Prometheus endpoint. The functionality being tested is important, but the test implementation needs to be more robust to handle various CI/test environments.

Metadata

Metadata

Assignees

No one assigned

    Labels

    area/testsUnit tests, e2e tests, anything to make sure things don't breakgood first issueThese are great first issues. If you are looking for a place to start, start here!help wantedWe would love help on these issues. Please come help us!kind/bugThese are bugs.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions