fix(datastore): enable gRPC keepalives and expand retries #14411
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates the datastore client to improve reliability and connectivity. It introduces idempotent retry logic for specific gRPC error codes (DeadlineExceeded, Internal, and ResourceExhausted) and configures gRPC keepalive parameters to maintain active connections. I have no further feedback to provide.
| if code == codes.Unavailable { | ||
| return true | ||
| } | ||
| if isIdempotent && (code == codes.DeadlineExceeded || code == codes.Internal || code == codes.ResourceExhausted) { |
There was a problem hiding this comment.
This feels like a missed opportunity for the datastore team to provide a more correct retry policy e.g. https://github.com/googleapis/googleapis/blob/master/google/datastore/v1/datastore_grpc_service_config.json
There was a problem hiding this comment.
I agree that the service config is the ideal place for these policies.
However, because the Go Datastore client's data plane is currently a handwritten veneer and not a GAPIC-generated client, it does not automatically consume the service config's retry parameters for data operations
| internaloption.EnableNewAuthLibrary(), | ||
| option.WithScopes(ScopeDatastore), | ||
| option.WithUserAgent(userAgent), | ||
| option.WithGRPCDialOption(grpc.WithKeepaliveParams(keepalive.ClientParameters{ |
There was a problem hiding this comment.
Can we do this automatically as part of gapic generation and set default options? The interesting thing is how we pick the timeout values, but maybe the retry config can be consumed here as well to find the slowest RPC and set based on that?
There was a problem hiding this comment.
That would be the right long-term approach.
Regarding the timeout values: the service config currently specifies a 60s timeout for data RPCs, and the proposed 20s keepalive.Timeout ensures we detect connection loss well before a typical RPC would time out.
4a7138a to
f76dcbf
Compare
shollyman
left a comment
There was a problem hiding this comment.
Thanks for slogging through the iterations here. Something for a future FR, but we might want to investigate making the retry predicate changes upstream and finding a way to propagate the service config file automatically.
**Summary** This PR refines the Datastore client's resilience logic to fix regressions identified during TAP testing, specifically regarding emulator startup races, Unix domain sockets, and IPv6 literals. It ensures that the resilience benefits for bug **[b/473841984](http://b/473841984)** are applied consistently to both production and emulator environments without breaking existing tests. **Why these changes are needed** 1. **Handling Emulator Startup Races**: Many tests (e.g., `exchange:datastore_test`) experience a "connection refused" error if an RPC is attempted immediately after starting the emulator process. By adding **`"waitForReady": true`** to the gRPC Service Config, the client will now block and wait for the emulator to finish binding its port instead of failing the RPC immediately. 2. **Preventing Infinite Hangs**: With `waitForReady` enabled, an unreachable host could cause a test to hang indefinitely. This PR adds a canonical **`"timeout": "60s"`** to the generic `methodConfig` to ensure RPCs eventually fail with a clear error. 3. **Robust Scheme Handling (Unix Sockets & IPv6)**: Internal test environments like `tin` often use **Unix domain sockets** (`unix:///`) or IPv6 literals (`[::1]`) for emulator addresses. Previous logic would mangle these into invalid targets (e.g., `passthrough:///unix:///...`), leading to "too many colons" errors or connection hangs. The updated logic defensively checks for existing valid schemes before applying the `passthrough:///` optimization. 4. **Idempotent Retries for `Commit`**: A catch-all configuration block is added to the Service Config to ensure that **`UNAVAILABLE`** errors are retried for all methods, including `Commit` (used by `PutMulti`), which was previously omitted. **Why PR #14411 was insufficient** While PR #14411 introduced Keepalives and expanded retries for idempotent calls, it had several gaps: * It only applied the new configuration to the **production** code path, leaving the emulator path without resilience logic or standardized retries. * By removing the manual 100ms retry loop without adding `waitForReady: true` to the Service Config, it removed the "wait" period needed for emulators to become ready. * The aggressive prepending of `passthrough:///` did not account for `unix:///` schemes or bracketed IPv6 literals, causing dialer errors in specialized test environments. (Even though the passthrough was not added in PR #14411, the current google3 imported code did not have this logic. While trying to import [cl/906545069](http://cl/906545069) latest code with psssthrough, this issue was discovered. ) **Related Bug** [b/473841984](http://b/473841984)
This change improves the library's resilience to abrupt connection losses, such as those caused by Cloud Front End (CFE) host failures or kernel panics. These events can currently lead to extended application hangs (e.g., the 6-minute outage observed in b/473841984) because the library lacks proactive health checks and does not retry on certain transient errors