-
Notifications
You must be signed in to change notification settings - Fork 270
Description
Summary
Livelog intermittently fails with listen tcp :60099: bind: address already in use, caused by an orphaned connectInputStream goroutine from a previous failed livelog.New() call sending a second PUT request to the next livelog process.
Root Cause
There are two bugs that work together to produce this failure:
Bug 1: Goroutine leak in livelog.New() (primary)
In workers/generic-worker/livelog/livelog.go, when New() returns via the case pr := <-putResult: path (the livelog binary exits early, e.g. because the port is in TIME_WAIT from a previous task), the connectInputStream() goroutine is not cancelled. That goroutine continues calling WaitForLocalTCPListener(60098, 1 minute), polling every 100ms.
When the next task starts a new livelog process that successfully binds to port 60098, the orphaned goroutine connects and sends a second PUT request to the new process.
Bug 2: Broken handlingPut guard in livelog binary (secondary)
In tools/livelog/main.go, the handlingPut flag is checked but never set to true when the first PUT is accepted:
mutex.Lock()
if handlingPut {
// reject...
mutex.Unlock()
return
}
mutex.Unlock() // handlingPut is never set to true!So the second PUT from the orphaned goroutine passes the guard. Each PUT triggers go startLogServe(stream, getAddr), and the second call to bind port 60099 fails with "address already in use".
Additionally: orphaned process on connectInputStream failure
When connectInputStream() returns an error in the case err := <-inputStreamConnectionResult: path, New() returns nil without killing the already-started livelog process, leaving it holding the ports.
Failure Sequence
- Previous task finishes → livelog process killed → ports 60098/60099 enter TCP TIME_WAIT (~60s)
- Task A starts →
New()starts livelog binary → binary fails to bind port 60098 (TIME_WAIT) → binary exits →New()returns error → orphaned goroutine keeps polling port 60098 for up to 1 minute - Task B starts (~30s later) →
New()starts a new livelog binary → binary binds port 60098 successfully → two goroutines race to PUT: task B'sconnectInputStream+ the orphaned goroutine from task A → livelog binary accepts both PUTs (broken guard) → secondstartLogServefails to bind port 60099
Evidence from Logs
The livelog output shows two PUT requests with two different temp directories from a single livelog process:
input PUT /log
created at path /tmp/livelog3099053653/stream ← first PUT
Begin consuming...
Output server listening... (without TLS) ← GET server binds 60099 OK
input PUT /log
created at path /tmp/livelog1891219331/stream ← second PUT (orphaned goroutine!)
Begin consuming...
Output server listening... (without TLS)
listen tcp :60099: bind: address already in use ← second GET server fails
Fix
Three changes:
-
Cancel orphaned goroutine (
livelog.go): Create a cancellable context inNew()and pass it toconnectInputStream(). Cancel the context when the livelog process exits early, stopping the goroutine from polling and connecting to a future livelog process. -
Fix
handlingPutguard (tools/livelog/main.go): SethandlingPut = trueafter the check passes, so concurrent/duplicate PUT requests are rejected. -
Kill orphaned process (
livelog.go): WhenconnectInputStream()fails, kill the livelog process before returning to avoid leaving it holding the ports.