Skip to content

fix(datastore): enable gRPC keepalives and expand retries #14411

Merged
bhshkh merged 3 commits into
googleapis:mainfrom
bhshkh:perf/datastore-keepalive
Apr 22, 2026
Merged

fix(datastore): enable gRPC keepalives and expand retries #14411
bhshkh merged 3 commits into
googleapis:mainfrom
bhshkh:perf/datastore-keepalive

Conversation

@bhshkh

@bhshkh bhshkh commented Apr 9, 2026

Copy link
Copy Markdown
Contributor

This change improves the library's resilience to abrupt connection losses, such as those caused by Cloud Front End (CFE) host failures or kernel panics. These events can currently lead to extended application hangs (e.g., the 6-minute outage observed in b/473841984) because the library lacks proactive health checks and does not retry on certain transient errors

@bhshkh bhshkh requested review from a team as code owners April 9, 2026 19:38
@product-auto-label product-auto-label Bot added the api: datastore Issues related to the Datastore API. label Apr 9, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the datastore client to improve reliability and connectivity. It introduces idempotent retry logic for specific gRPC error codes (DeadlineExceeded, Internal, and ResourceExhausted) and configures gRPC keepalive parameters to maintain active connections. I have no further feedback to provide.

Comment thread datastore/client.go Outdated
if code == codes.Unavailable {
return true
}
if isIdempotent && (code == codes.DeadlineExceeded || code == codes.Internal || code == codes.ResourceExhausted) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like a missed opportunity for the datastore team to provide a more correct retry policy e.g. https://github.com/googleapis/googleapis/blob/master/google/datastore/v1/datastore_grpc_service_config.json

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the service config is the ideal place for these policies.

However, because the Go Datastore client's data plane is currently a handwritten veneer and not a GAPIC-generated client, it does not automatically consume the service config's retry parameters for data operations

Comment thread datastore/datastore.go
internaloption.EnableNewAuthLibrary(),
option.WithScopes(ScopeDatastore),
option.WithUserAgent(userAgent),
option.WithGRPCDialOption(grpc.WithKeepaliveParams(keepalive.ClientParameters{

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do this automatically as part of gapic generation and set default options? The interesting thing is how we pick the timeout values, but maybe the retry config can be consumed here as well to find the slowest RPC and set based on that?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be the right long-term approach.

Regarding the timeout values: the service config currently specifies a 60s timeout for data RPCs, and the proposed 20s keepalive.Timeout ensures we detect connection loss well before a typical RPC would time out.

@bhshkh bhshkh requested a review from shollyman April 16, 2026 18:19
@bhshkh bhshkh enabled auto-merge (squash) April 16, 2026 18:20
@bhshkh bhshkh force-pushed the perf/datastore-keepalive branch from 4a7138a to f76dcbf Compare April 22, 2026 20:49

@shollyman shollyman left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for slogging through the iterations here. Something for a future FR, but we might want to investigate making the retry predicate changes upstream and finding a way to propagate the service config file automatically.

@bhshkh bhshkh merged commit 4c489a6 into googleapis:main Apr 22, 2026
11 checks passed
@bhshkh bhshkh deleted the perf/datastore-keepalive branch April 22, 2026 20:56
bhshkh added a commit that referenced this pull request May 15, 2026
**Summary**
This PR refines the Datastore client's resilience logic to fix
regressions identified during TAP testing, specifically regarding
emulator startup races, Unix domain sockets, and IPv6 literals. It
ensures that the resilience benefits for bug
**[b/473841984](http://b/473841984)** are applied consistently to both
production and emulator environments without breaking existing tests.

**Why these changes are needed**
1. **Handling Emulator Startup Races**: Many tests (e.g.,
`exchange:datastore_test`) experience a "connection refused" error if an
RPC is attempted immediately after starting the emulator process. By
adding **`"waitForReady": true`** to the gRPC Service Config, the client
will now block and wait for the emulator to finish binding its port
instead of failing the RPC immediately.
2. **Preventing Infinite Hangs**: With `waitForReady` enabled, an
unreachable host could cause a test to hang indefinitely. This PR adds a
canonical **`"timeout": "60s"`** to the generic `methodConfig` to ensure
RPCs eventually fail with a clear error.
3. **Robust Scheme Handling (Unix Sockets & IPv6)**: Internal test
environments like `tin` often use **Unix domain sockets** (`unix:///`)
or IPv6 literals (`[::1]`) for emulator addresses. Previous logic would
mangle these into invalid targets (e.g., `passthrough:///unix:///...`),
leading to "too many colons" errors or connection hangs. The updated
logic defensively checks for existing valid schemes before applying the
`passthrough:///` optimization.
4. **Idempotent Retries for `Commit`**: A catch-all configuration block
is added to the Service Config to ensure that **`UNAVAILABLE`** errors
are retried for all methods, including `Commit` (used by `PutMulti`),
which was previously omitted.

**Why PR #14411 was insufficient**
While PR #14411 introduced Keepalives and expanded retries for
idempotent calls, it had several gaps:
* It only applied the new configuration to the **production** code path,
leaving the emulator path without resilience logic or standardized
retries.
* By removing the manual 100ms retry loop without adding `waitForReady:
true` to the Service Config, it removed the "wait" period needed for
emulators to become ready.
* The aggressive prepending of `passthrough:///` did not account for
`unix:///` schemes or bracketed IPv6 literals, causing dialer errors in
specialized test environments. (Even though the passthrough was not
added in PR #14411, the current google3 imported code did not have this
logic. While trying to import [cl/906545069](http://cl/906545069) latest
code with psssthrough, this issue was discovered. )

**Related Bug**
[b/473841984](http://b/473841984)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api: datastore Issues related to the Datastore API.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants