Skip to content

Livelog "address already in use" issue #8318

@matt-boris

Description

@matt-boris

Summary

Livelog intermittently fails with listen tcp :60099: bind: address already in use, caused by an orphaned connectInputStream goroutine from a previous failed livelog.New() call sending a second PUT request to the next livelog process.

Root Cause

There are two bugs that work together to produce this failure:

Bug 1: Goroutine leak in livelog.New() (primary)

In workers/generic-worker/livelog/livelog.go, when New() returns via the case pr := <-putResult: path (the livelog binary exits early, e.g. because the port is in TIME_WAIT from a previous task), the connectInputStream() goroutine is not cancelled. That goroutine continues calling WaitForLocalTCPListener(60098, 1 minute), polling every 100ms.

When the next task starts a new livelog process that successfully binds to port 60098, the orphaned goroutine connects and sends a second PUT request to the new process.

Bug 2: Broken handlingPut guard in livelog binary (secondary)

In tools/livelog/main.go, the handlingPut flag is checked but never set to true when the first PUT is accepted:

mutex.Lock()
if handlingPut {
    // reject...
    mutex.Unlock()
    return
}
mutex.Unlock() // handlingPut is never set to true!

So the second PUT from the orphaned goroutine passes the guard. Each PUT triggers go startLogServe(stream, getAddr), and the second call to bind port 60099 fails with "address already in use".

Additionally: orphaned process on connectInputStream failure

When connectInputStream() returns an error in the case err := <-inputStreamConnectionResult: path, New() returns nil without killing the already-started livelog process, leaving it holding the ports.

Failure Sequence

  1. Previous task finishes → livelog process killed → ports 60098/60099 enter TCP TIME_WAIT (~60s)
  2. Task A starts → New() starts livelog binary → binary fails to bind port 60098 (TIME_WAIT) → binary exits → New() returns error → orphaned goroutine keeps polling port 60098 for up to 1 minute
  3. Task B starts (~30s later) → New() starts a new livelog binary → binary binds port 60098 successfully → two goroutines race to PUT: task B's connectInputStream + the orphaned goroutine from task A → livelog binary accepts both PUTs (broken guard) → second startLogServe fails to bind port 60099

Evidence from Logs

The livelog output shows two PUT requests with two different temp directories from a single livelog process:

input PUT /log
created at path /tmp/livelog3099053653/stream    ← first PUT
Begin consuming...
Output server listening...  (without TLS)        ← GET server binds 60099 OK

input PUT /log
created at path /tmp/livelog1891219331/stream    ← second PUT (orphaned goroutine!)
Begin consuming...
Output server listening...  (without TLS)
listen tcp :60099: bind: address already in use  ← second GET server fails

Fix

Three changes:

  1. Cancel orphaned goroutine (livelog.go): Create a cancellable context in New() and pass it to connectInputStream(). Cancel the context when the livelog process exits early, stopping the goroutine from polling and connecting to a future livelog process.

  2. Fix handlingPut guard (tools/livelog/main.go): Set handlingPut = true after the check passes, so concurrent/duplicate PUT requests are rejected.

  3. Kill orphaned process (livelog.go): When connectInputStream() fails, kill the livelog process before returning to avoid leaving it holding the ports.

Metadata

Metadata

Assignees

Labels

bugSomething isn't workinggeneric workerGeneric Worker

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions