Skip to content

spanner: read deadlock on session initialization #14212

@laurynaslubys

Description

@laurynaslubys

Client

Spanner

Description

A service using the spanner library version v1.87.0 failed to start up. After investigating the goroutine dump we discovered that one goroutine performing a readonly transaction was stuck inside the spanner library for multiple hours at:

goroutine 7138 [chan send, 1058 minutes]:
cloud.google.com/go/spanner.(*sessionPool).takeMultiplexed(0x4001be6600, {0x4a77190, 0x40098a5650})
        cloud.google.com/go/spanner@v1.87.0/session.go:1309 +0x10c
cloud.google.com/go/spanner.(*ReadOnlyTransaction).acquireSingleUse(0x4001c42680, {0x4a77190, 0x40098a5650})
        cloud.google.com/go/spanner@v1.87.0/transaction.go:1010 +0x278
cloud.google.com/go/spanner.(*ReadOnlyTransaction).acquire(0x4001c42680, {0x4a77190, 0x40098a5650})
        cloud.google.com/go/spanner@v1.87.0/transaction.go:987 +0x4c
cloud.google.com/go/spanner.(*txReadOnly).prepareExecuteSQL(0x4001c42688, {0x4a77190?, 0x40098a5650?}, {{0x400af2f4a0, 0x1d8}, 0x40098a55c0}, {0x0, 0x400054f950, 0x3, {0x4002396090, ...}, ...})
        cloud.google.com/go/spanner@v1.87.0/transaction.go:750 +0x58
cloud.google.com/go/spanner.(*txReadOnly).query(0x4001c42688, {0x4a77190?, 0x40098a5590?}, {{0x400af2f4a0?, 0x2c?}, 0x40098a55c0?}, {0x0, 0x400054f950, 0x3, {0x4002396090, ...}, ...})
        cloud.google.com/go/spanner@v1.87.0/transaction.go:689 +0xf8
cloud.google.com/go/spanner.(*txReadOnly).QueryWithOptions(0x4001c42688, {0x4a77190, 0x40098a5590}, {{0x400af2f4a0?, 0x43c87483d?}, 0x40098a55c0?}, {0x0, 0x0, 0x3, {0x4002396090, ...}, ...})
        cloud.google.com/go/spanner@v1.87.0/transaction.go:645 +0x134

We suspect that the cause is that the goroutine responsible for creating multiplexed sessions exited here. There was no corresponding goroutine waiting on the channel for the same sessionPool instance. This does reproduce in a test detailed below.

Environment

Doesn't seem relevant

Code and Dependencies

// TestMultiplexedSessionCreationGoroutineDeadlockOnContextCancel verifies that
// the createMultiplexedSession goroutine does not exit when a request's context
// is cancelled. If the goroutine exits, the multiplexedSessionReq channel loses
// its sole consumer and all subsequent sends deadlock.
//
// The scenario reproduces a production issue observed at startup:
//  1. The initial multiplexed session creation is delayed (slow CreateSession RPC).
//  2. During this delay, two reads are issued concurrently. Both call
//     takeMultiplexed, see multiplexedSession==nil, and block sending on the
//     unbuffered multiplexedSessionReq channel behind the in-flight force request.
//  3. The first read uses an already-cancelled context. When the force request
//     completes (setting the session), the goroutine receives this request,
//     sees sess!=nil, and enters a select. Because the context is cancelled,
//     it takes the Done branch and returns — permanently killing the goroutine.
//  4. The second read (valid context) is still blocked at the channel send.
//     With no consumer goroutine, it stays blocked forever — a deadlock.
func TestMultiplexedSessionCreationGoroutineDeadlockOnContextCancel(t *testing.T) {
	t.Parallel()

	// Create the server and add a delay to CreateSession so the initial
	// multiplexed session creation takes long enough for us to queue
	// requests behind it.
	server, opts, serverTeardown := NewMockedSpannerInMemTestServer(t)
	defer serverTeardown()
	server.TestSpanner.PutExecutionTime(MethodCreateSession,
		SimulatedExecutionTime{MinimumExecutionTime: 500 * time.Millisecond})

	ctx := context.Background()
	db := fmt.Sprintf("projects/%s/instances/%s/databases/%s", "[PROJECT]", "[INSTANCE]", "[DATABASE]")
	client, err := NewClientWithConfig(ctx, db, ClientConfig{
		DisableNativeMetrics: true,
	}, opts...)
	if err != nil {
		t.Fatal(err)
	}
	defer client.Close()

	// Issue a read with an already-cancelled context. The call reaches
	// takeMultiplexed, sees multiplexedSession==nil (the initial creation is
	// still in flight), and sends a non-force request on multiplexedSessionReq.
	// That send blocks until the goroutine finishes the initial force request
	// (~500ms), regardless of the cancelled context.
	cancelledCtx, cancel := context.WithCancel(ctx)
	cancel()
	go client.Single().ReadRow(cancelledCtx, "Albums", Key{"foo"}, []string{"SingerId", "AlbumId", "AlbumTitle"})

	// Wait briefly so the cancelled ReadRow blocks on the channel send
	// first (FIFO). This ensures the goroutine receives it before the
	// valid ReadRow below.
	time.Sleep(50 * time.Millisecond)

	// Issue a second read with a valid context. This also blocks at the
	// channel send behind the cancelled request. If the goroutine dies
	// after processing the cancelled request, this read is stuck forever.
	done := make(chan error, 1)
	go func() {
		defer func() {
			if r := recover(); r != nil {
				// The goroutine may panic with "send on closed channel"
				// when client.Close() closes multiplexedSessionReq while
				// this goroutine is still blocked on the send.
				done <- fmt.Errorf("panic: %v", r)
			}
		}()
		readCtx, readCancel := context.WithTimeout(ctx, 15*time.Second)
		defer readCancel()
		_, err := client.Single().ReadRow(readCtx, "Albums", Key{"foo"}, []string{"SingerId", "AlbumId", "AlbumTitle"})
		done <- err
	}()

	// Remove the delay after the initial session is created so the second
	// read's session creation (if needed) is not slowed down.
	waitFor(t, func() error {
		client.sm.mu.Lock()
		defer client.sm.mu.Unlock()
		if client.sm.multiplexedSession == nil {
			return errInvalidSession
		}
		return nil
	})
	server.TestSpanner.Freeze()
	server.TestSpanner.PutExecutionTime(MethodCreateSession, SimulatedExecutionTime{})
	server.TestSpanner.Unfreeze()

	select {
	case err := <-done:
		if err != nil {
			t.Fatalf("ReadRow returned unexpected error: %v", err)
		}
	case <-time.After(5 * time.Second):
		t.Fatal("ReadRow deadlocked: the createMultiplexedSession goroutine " +
			"exited after processing a request with a cancelled context, " +
			"leaving a concurrent reader permanently blocked on the " +
			"multiplexedSessionReq channel send")
	}
}
go.mod
module modname

go 1.25.0

require (
   // ...
	cloud.google.com/go/spanner v1.87.0
   // ...
)

Expected behavior

No deadlock, the read call returning.

Actual behavior

Deadlock, the read call never returns.

Additional context

We observed this one time after updating from v1.85.1 to v1.87.0. We did have multiplexed sessions enabled by setting the GOOGLE_CLOUD_SPANNER_MULTIPLEXED_SESSIONS env variable to true for the container.

Metadata

Metadata

Assignees

Labels

api: spannerIssues related to the Spanner API.priority: p1Important issue which blocks shipping the next release. Will be fixed prior to next release.triage meI really want to be triaged.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions