Skip to content

fix(spanner): replace multiplexed session request loop with shared in-flight creation#14215

Merged
rahul2393 merged 6 commits intomainfrom
fix-mux-session-deadlock
Mar 20, 2026
Merged

fix(spanner): replace multiplexed session request loop with shared in-flight creation#14215
rahul2393 merged 6 commits intomainfrom
fix-mux-session-deadlock

Conversation

@rahul2393
Copy link
Copy Markdown
Contributor

@rahul2393 rahul2393 commented Mar 19, 2026

Fixes: #14212

Reworked multiplexed session initialization to use a shared in-flight creation latch instead of the old request
goroutine and unbuffered channels. This fixes the startup deadlock triggered by a cancelled read-only transaction
waiter and adds regression coverage for cancellation and client shutdown while session creation is in flight.

sessionManager used an unbuffered multiplexedSessionReq channel and a single goroutine to coordinate multiplexed session creation. A cancelled waiter could cause that goroutine to return, leaving future callers blocked forever trying to send on the request channel during takeMultiplexed.

Replace that request/ack flow with a shared in-flight creation record on sessionManager. When no multiplexed session exists, the manager starts one background creation with a manager-owned context and exposes a done channel for all waiters in that creation epoch. Waiters now block on done or their own context, and completion is broadcast by closing the shared channel. This removes the deadlock where one cancelled context could kill shared coordination, and it also lets Client.Close release any waiters blocked behind an in-flight initial create.

@rahul2393 rahul2393 requested review from a team as code owners March 19, 2026 17:39
@product-auto-label product-auto-label bot added the api: spanner Issues related to the Spanner API. label Mar 19, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds two new tests to verify a fix for multiplexed session creation, specifically addressing potential deadlocks when a request is cancelled or the client is closed. The tests cover important concurrency scenarios. I've found a logical issue in one of the new tests, TestMultiplexedSessionCreationCloseReleasesWaiters, where the test setup would lead to a timeout instead of correctly testing the intended scenario. My feedback includes a correction for this test.

@rahul2393 rahul2393 requested a review from olavloite March 19, 2026 17:42
@rahul2393 rahul2393 added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Mar 19, 2026
@kokoro-team kokoro-team removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Mar 19, 2026
@rahul2393
Copy link
Copy Markdown
Contributor Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request provides a solid fix for the deadlock issue in multiplexed session creation. The move from a channel-based request loop to a shared in-flight creation latch (multiplexedSessionCreation) significantly simplifies the concurrency logic and makes it more robust against context cancellations and client shutdown. The new implementation is much easier to follow and reason about.

I have one suggestion regarding a new test case in spanner/session_test.go, which appears to have some stale comments and a misleading name from a previous implementation. Cleaning this up would improve the long-term maintainability of the test suite.

Overall, this is an excellent improvement.

Copy link
Copy Markdown
Contributor

@olavloite olavloite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with an open question regarding the upper bound for the timeout for CreateSession. I'm fine with leaving this as it is for now, as CreateSession should never take more than 30s. But I do think that we should at least do a follow-up to fix it, as it is not logical.

if !force && p.multiplexedSession != nil {
return nil
}
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will effectively make 30 seconds the upper bound for a timeout for the CreateSession RPC. That is: If someone configures CreateSession to have a 60 second timeout, then that will be ignored, and this 30s value will effectively be used instead. Is that intentional?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out, this got copied from the old code which used 30s for the startup/refresh force create of mux session, the new code used 30s for every shared multiplexed-session creation epoch.

Removed it now,

  • the manager still owns cancellation for shutdown
  • CreateSession timeout policy now comes only from the configured GAPIC call options
  • default behavior remains 30s(default gax), but a user-configured longer CreateSession timeout is no longer capped at 30s by the manager layer.

@rahul2393 rahul2393 requested a review from olavloite March 20, 2026 08:50
@rahul2393 rahul2393 merged commit 3e3bd2d into main Mar 20, 2026
10 checks passed
@rahul2393 rahul2393 deleted the fix-mux-session-deadlock branch March 20, 2026 09:38
rahul2393 added a commit that referenced this pull request Mar 26, 2026
PR created by the Librarian CLI to initialize a release. Merging this PR
will auto trigger a release.

Librarian Version: v0.8.4
Language Image:
us-central1-docker.pkg.dev/cloud-sdk-librarian-prod/images-prod/librarian-go@sha256:ac1efa3ad3c6d99efed878535b3a0fe63d0cce6701c335f03811d8b49004d652
<details><summary>spanner: v1.89.0</summary>

##
[v1.89.0](spanner/v1.88.0...spanner/v1.89.0)
(2026-03-26)

### Features

* Add E2E fallback to the spanner client. (#13518)
([16af6a1](16af6a1c))

* include cache updates and routing hint into BeginTransaction and
Commit request/response respectively (PiperOrigin-RevId: 878019893)
([9c80b8b](9c80b8b4))

* add SI, adapt, split point related proto (PiperOrigin-RevId:
871366927)
([d3eb851](d3eb851d))

* Add gRPC A66/A94 metrics (#13825)
([d695802](d695802a))

* support Scan from string to NullUUID (#14128)
([d897b6d](d897b6db))

### Bug Fixes

* replace multiplexed session request loop with shared in-flight
creation (#14215)
([3e3bd2d](3e3bd2d3))

* guard rollback when aborted commit cleared session handle (#14218)
([6315105](63151055))

### Documentation

* A comment for field `routing_hint` in messages
`.google.spanner.v1.ResultSet` and `.google.spanner.v1.PartialResultSet`
are changed (PiperOrigin-RevId: 878019893)
([9c80b8b](9c80b8b4))

* A comment in message
`.google.spanner.v1.TransactionOptions.ReadWrite.ReadLockMode` is
changed (PiperOrigin-RevId: 878019893)
([9c80b8b](9c80b8b4))

* A comment for message `ListCloudInstancesAction` is changed
(PiperOrigin-RevId: 871366927)
([d3eb851](d3eb851d))

* A comment for field `execution_options` in message
`.google.spanner.executor.v1.StartTransactionAction` is changed
(PiperOrigin-RevId: 871366927)
([d3eb851](d3eb851d))

* A comment for message `TransactionExecutionOptions` is changed
(PiperOrigin-RevId: 871366927)
([d3eb851](d3eb851d))

</details>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api: spanner Issues related to the Spanner API.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

spanner: read deadlock on session initialization

3 participants