spanner: read deadlock on session initialization #14212
Client
Spanner
Description
A service using the spanner library v1.87.0 failed to start up. Investigating a goroutine dump, we found one goroutine performing a read-only transaction that had been stuck inside the spanner library for multiple hours at:
goroutine 7138 [chan send, 1058 minutes]:
cloud.google.com/go/spanner.(*sessionPool).takeMultiplexed(0x4001be6600, {0x4a77190, 0x40098a5650})
cloud.google.com/go/spanner@v1.87.0/session.go:1309 +0x10c
cloud.google.com/go/spanner.(*ReadOnlyTransaction).acquireSingleUse(0x4001c42680, {0x4a77190, 0x40098a5650})
cloud.google.com/go/spanner@v1.87.0/transaction.go:1010 +0x278
cloud.google.com/go/spanner.(*ReadOnlyTransaction).acquire(0x4001c42680, {0x4a77190, 0x40098a5650})
cloud.google.com/go/spanner@v1.87.0/transaction.go:987 +0x4c
cloud.google.com/go/spanner.(*txReadOnly).prepareExecuteSQL(0x4001c42688, {0x4a77190?, 0x40098a5650?}, {{0x400af2f4a0, 0x1d8}, 0x40098a55c0}, {0x0, 0x400054f950, 0x3, {0x4002396090, ...}, ...})
cloud.google.com/go/spanner@v1.87.0/transaction.go:750 +0x58
cloud.google.com/go/spanner.(*txReadOnly).query(0x4001c42688, {0x4a77190?, 0x40098a5590?}, {{0x400af2f4a0?, 0x2c?}, 0x40098a55c0?}, {0x0, 0x400054f950, 0x3, {0x4002396090, ...}, ...})
cloud.google.com/go/spanner@v1.87.0/transaction.go:689 +0xf8
cloud.google.com/go/spanner.(*txReadOnly).QueryWithOptions(0x4001c42688, {0x4a77190, 0x40098a5590}, {{0x400af2f4a0?, 0x43c87483d?}, 0x40098a55c0?}, {0x0, 0x0, 0x3, {0x4002396090, ...}, ...})
cloud.google.com/go/spanner@v1.87.0/transaction.go:645 +0x134
We suspect the cause is that the goroutine responsible for creating multiplexed sessions exited here. That left no goroutine receiving on the channel for the same sessionPool instance, so the blocked send could never complete. The issue reproduces in the test detailed below.
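To make the suspected mechanism concrete, here is a minimal, self-contained sketch of the failure mode. The names (`reqCh`, the consumer loop) are illustrative and are not the library's actual code: a single consumer goroutine serves an unbuffered request channel, and if it returns when one request carries a cancelled context, every later sender blocks forever.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// reproduceDeadlock sketches the suspected bug: the sole consumer of an
// unbuffered channel exits on a cancelled context, so a subsequent send
// with a valid context blocks forever.
func reproduceDeadlock() bool {
	reqCh := make(chan context.Context) // unbuffered; exactly one consumer

	go func() {
		for ctx := range reqCh {
			select {
			case <-ctx.Done():
				// Returning here kills the sole consumer: the suspected bug.
				return
			default:
				// Normal path: hand out the multiplexed session.
			}
		}
	}()

	cancelled, cancel := context.WithCancel(context.Background())
	cancel()
	reqCh <- cancelled // the consumer receives this request and exits

	// A second request with a valid context now has no consumer.
	select {
	case reqCh <- context.Background():
		return false // would mean the consumer is still alive
	case <-time.After(200 * time.Millisecond):
		return true // send blocked: deadlock reproduced
	}
}

func main() {
	if reproduceDeadlock() {
		fmt.Println("deadlock: send blocked, consumer goroutine gone")
	}
}
```

The fix direction this suggests is that the consumer must never return on a per-request context cancellation; it should drop that request and keep serving the channel.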
Environment
Doesn't seem relevant
Code and Dependencies
// TestMultiplexedSessionCreationGoroutineDeadlockOnContextCancel verifies that
// the createMultiplexedSession goroutine does not exit when a request's context
// is cancelled. If the goroutine exits, the multiplexedSessionReq channel loses
// its sole consumer and all subsequent sends deadlock.
//
// The scenario reproduces a production issue observed at startup:
// 1. The initial multiplexed session creation is delayed (slow CreateSession RPC).
// 2. During this delay, two reads are issued concurrently. Both call
// takeMultiplexed, see multiplexedSession==nil, and block sending on the
// unbuffered multiplexedSessionReq channel behind the in-flight force request.
// 3. The first read uses an already-cancelled context. When the force request
// completes (setting the session), the goroutine receives this request,
// sees sess!=nil, and enters a select. Because the context is cancelled,
// it takes the Done branch and returns — permanently killing the goroutine.
// 4. The second read (valid context) is still blocked at the channel send.
// With no consumer goroutine, it stays blocked forever — a deadlock.
func TestMultiplexedSessionCreationGoroutineDeadlockOnContextCancel(t *testing.T) {
t.Parallel()
// Create the server and add a delay to CreateSession so the initial
// multiplexed session creation takes long enough for us to queue
// requests behind it.
server, opts, serverTeardown := NewMockedSpannerInMemTestServer(t)
defer serverTeardown()
server.TestSpanner.PutExecutionTime(MethodCreateSession,
SimulatedExecutionTime{MinimumExecutionTime: 500 * time.Millisecond})
ctx := context.Background()
db := fmt.Sprintf("projects/%s/instances/%s/databases/%s", "[PROJECT]", "[INSTANCE]", "[DATABASE]")
client, err := NewClientWithConfig(ctx, db, ClientConfig{
DisableNativeMetrics: true,
}, opts...)
if err != nil {
t.Fatal(err)
}
defer client.Close()
// Issue a read with an already-cancelled context. The call reaches
// takeMultiplexed, sees multiplexedSession==nil (the initial creation is
// still in flight), and sends a non-force request on multiplexedSessionReq.
// That send blocks until the goroutine finishes the initial force request
// (~500ms), regardless of the cancelled context.
cancelledCtx, cancel := context.WithCancel(ctx)
cancel()
go client.Single().ReadRow(cancelledCtx, "Albums", Key{"foo"}, []string{"SingerId", "AlbumId", "AlbumTitle"})
// Wait briefly so the cancelled ReadRow blocks on the channel send
// first (FIFO). This ensures the goroutine receives it before the
// valid ReadRow below.
time.Sleep(50 * time.Millisecond)
// Issue a second read with a valid context. This also blocks at the
// channel send behind the cancelled request. If the goroutine dies
// after processing the cancelled request, this read is stuck forever.
done := make(chan error, 1)
go func() {
defer func() {
if r := recover(); r != nil {
// The goroutine may panic with "send on closed channel"
// when client.Close() closes multiplexedSessionReq while
// this goroutine is still blocked on the send.
done <- fmt.Errorf("panic: %v", r)
}
}()
readCtx, readCancel := context.WithTimeout(ctx, 15*time.Second)
defer readCancel()
_, err := client.Single().ReadRow(readCtx, "Albums", Key{"foo"}, []string{"SingerId", "AlbumId", "AlbumTitle"})
done <- err
}()
// Remove the delay after the initial session is created so the second
// read's session creation (if needed) is not slowed down.
waitFor(t, func() error {
client.sm.mu.Lock()
defer client.sm.mu.Unlock()
if client.sm.multiplexedSession == nil {
return errInvalidSession
}
return nil
})
server.TestSpanner.Freeze()
server.TestSpanner.PutExecutionTime(MethodCreateSession, SimulatedExecutionTime{})
server.TestSpanner.Unfreeze()
select {
case err := <-done:
if err != nil {
t.Fatalf("ReadRow returned unexpected error: %v", err)
}
case <-time.After(5 * time.Second):
t.Fatal("ReadRow deadlocked: the createMultiplexedSession goroutine " +
"exited after processing a request with a cancelled context, " +
"leaving a concurrent reader permanently blocked on the " +
"multiplexedSessionReq channel send")
}
}
go.mod
module modname
go 1.25.0
require (
// ...
cloud.google.com/go/spanner v1.87.0
// ...
)
Expected behavior
No deadlock; the read call returns.
Actual behavior
Deadlock; the read call never returns.
Additional context
We observed this once after updating from v1.85.1 to v1.87.0. Multiplexed sessions were enabled by setting the GOOGLE_CLOUD_SPANNER_MULTIPLEXED_SESSIONS environment variable to true for the container.