fix(datastore): enable gRPC keepalives and expand retries by bhshkh · Pull Request #14411 · googleapis/google-cloud-go

bhshkh · 2026-04-09T19:38:31Z

This change improves the library's resilience to abrupt connection losses, such as those caused by Cloud Front End (CFE) host failures or kernel panics. These events can currently lead to extended application hangs (e.g., the 6-minute outage observed in b/473841984) because the library lacks proactive health checks and does not retry on certain transient errors

…ent calls

gemini-code-assist

Code Review

This pull request updates the datastore client to improve reliability and connectivity. It introduces idempotent retry logic for specific gRPC error codes (DeadlineExceeded, Internal, and ResourceExhausted) and configures gRPC keepalive parameters to maintain active connections. I have no further feedback to provide.

shollyman · 2026-04-09T20:13:24Z

+	if code == codes.Unavailable {
+		return true
+	}
+	if isIdempotent && (code == codes.DeadlineExceeded || code == codes.Internal || code == codes.ResourceExhausted) {


This feels like a missed opportunity for the datastore team to provide a more correct retry policy e.g. https://github.com/googleapis/googleapis/blob/master/google/datastore/v1/datastore_grpc_service_config.json

I agree that the service config is the ideal place for these policies.

However, because the Go Datastore client's data plane is currently a handwritten veneer and not a GAPIC-generated client, it does not automatically consume the service config's retry parameters for data operations

shollyman · 2026-04-09T20:15:49Z

 			internaloption.EnableNewAuthLibrary(),
 			option.WithScopes(ScopeDatastore),
 			option.WithUserAgent(userAgent),
+			option.WithGRPCDialOption(grpc.WithKeepaliveParams(keepalive.ClientParameters{


Can we do this automatically as part of gapic generation and set default options? The interesting thing is how we pick the timeout values, but maybe the retry config can be consumed here as well to find the slowest RPC and set based on that?

That would be the right long-term approach.

Regarding the timeout values: the service config currently specifies a 60s timeout for data RPCs, and the proposed 20s keepalive.Timeout ensures we detect connection loss well before a typical RPC would time out.

shollyman

Thanks for slogging through the iterations here. Something for a future FR, but we might want to investigate making the retry predicate changes upstream and finding a way to propagate the service config file automatically.

**Summary** This PR refines the Datastore client's resilience logic to fix regressions identified during TAP testing, specifically regarding emulator startup races, Unix domain sockets, and IPv6 literals. It ensures that the resilience benefits for bug **[b/473841984](http://b/473841984)** are applied consistently to both production and emulator environments without breaking existing tests. **Why these changes are needed** 1. **Handling Emulator Startup Races**: Many tests (e.g., `exchange:datastore_test`) experience a "connection refused" error if an RPC is attempted immediately after starting the emulator process. By adding **`"waitForReady": true`** to the gRPC Service Config, the client will now block and wait for the emulator to finish binding its port instead of failing the RPC immediately. 2. **Preventing Infinite Hangs**: With `waitForReady` enabled, an unreachable host could cause a test to hang indefinitely. This PR adds a canonical **`"timeout": "60s"`** to the generic `methodConfig` to ensure RPCs eventually fail with a clear error. 3. **Robust Scheme Handling (Unix Sockets & IPv6)**: Internal test environments like `tin` often use **Unix domain sockets** (`unix:///`) or IPv6 literals (`[::1]`) for emulator addresses. Previous logic would mangle these into invalid targets (e.g., `passthrough:///unix:///...`), leading to "too many colons" errors or connection hangs. The updated logic defensively checks for existing valid schemes before applying the `passthrough:///` optimization. 4. **Idempotent Retries for `Commit`**: A catch-all configuration block is added to the Service Config to ensure that **`UNAVAILABLE`** errors are retried for all methods, including `Commit` (used by `PutMulti`), which was previously omitted. **Why PR #14411 was insufficient** While PR #14411 introduced Keepalives and expanded retries for idempotent calls, it had several gaps: * It only applied the new configuration to the **production** code path, leaving the emulator path without resilience logic or standardized retries. * By removing the manual 100ms retry loop without adding `waitForReady: true` to the Service Config, it removed the "wait" period needed for emulators to become ready. * The aggressive prepending of `passthrough:///` did not account for `unix:///` schemes or bracketed IPv6 literals, causing dialer errors in specialized test environments. (Even though the passthrough was not added in PR #14411, the current google3 imported code did not have this logic. While trying to import [cl/906545069](http://cl/906545069) latest code with psssthrough, this issue was discovered. ) **Related Bug** [b/473841984](http://b/473841984)

fix(datastore): enable gRPC keepalives and expand retries for idempot…

7d94303

…ent calls

bhshkh requested review from a team as code owners April 9, 2026 19:38

product-auto-label Bot added the api: datastore Issues related to the Datastore API. label Apr 9, 2026

gemini-code-assist Bot reviewed Apr 9, 2026

View reviewed changes

BeginTransaction is not idempotent

50c9340

shollyman reviewed Apr 9, 2026

View reviewed changes

bhshkh requested a review from shollyman April 16, 2026 18:19

bhshkh enabled auto-merge (squash) April 16, 2026 18:20

refactor: remove manual retry logic in favor of gRPC service config

f76dcbf

bhshkh force-pushed the perf/datastore-keepalive branch from 4a7138a to f76dcbf Compare April 22, 2026 20:49

shollyman approved these changes Apr 22, 2026

View reviewed changes

bhshkh merged commit 4c489a6 into googleapis:main Apr 22, 2026
11 checks passed

bhshkh deleted the perf/datastore-keepalive branch April 22, 2026 20:56

csilvers mentioned this pull request May 1, 2026

datastore: rejected grpc credentials results in hang #14459

Closed

bhshkh mentioned this pull request May 14, 2026

fix(datastore): add retries to emulator #14591

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(datastore): enable gRPC keepalives and expand retries #14411

fix(datastore): enable gRPC keepalives and expand retries #14411
bhshkh merged 3 commits into
googleapis:mainfrom
bhshkh:perf/datastore-keepalive

bhshkh commented Apr 9, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

shollyman Apr 9, 2026

Uh oh!

bhshkh Apr 16, 2026

Uh oh!

shollyman Apr 9, 2026

Uh oh!

bhshkh Apr 16, 2026

Uh oh!

shollyman left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

bhshkh commented Apr 9, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

shollyman Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

bhshkh Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

shollyman Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

bhshkh Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

shollyman left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants