
Enhance cache rate limiting to prevent registry overload during rolling deployments#2824

Merged
asim merged 3 commits into master from copilot/fix-etcd-authentication-performance-again on Jan 14, 2026

Conversation

Contributor

Copilot AI commented Jan 13, 2026

Enhanced Rate Limiting Implementation

Addressed @asim's feedback on PR #2823 to improve cache penetration protection.

Changes Made

  • Understand current throttling limitations
  • Implement rate limiting for ALL cache refresh attempts (not just errors)
  • Always return stale cache when available during rate limit period
  • Handle rolling deployment scenario
  • Fix test expectations to match new behavior
  • Update documentation with detailed scenarios
  • Remove unused failedAttempts and consecutiveFailures fields

Key Improvements

1. Universal Rate Limiting

Before: Throttling only applied when registry returned errors AND no stale cache
After: Throttling applies to ALL refresh attempts, regardless of registry state

2. Stale Cache Strategy

  • If stale cache exists → return it (even if expired), skip registry call
  • If no cache exists → return ErrNotFound, rely on gRPC retry

3. Rolling Deployment Protection

New scenario handled: when a downstream service does a rolling deployment, all upstream
caches expire simultaneously. Under high QPS, this change prevents a stampede:

  • First request refreshes from registry
  • Subsequent requests (within MinimumRetryInterval) return stale cache
  • No registry overload

4. Code Cleanup

Removed unused failedAttempts and consecutiveFailures fields since the new
rate limiting strategy uses lastRefreshAttempt for all tracking.

Test Results

✓ TestSingleflightPreventsStampede (concurrent deduplication)
✓ TestCachePenetrationPrevention (rate limiting + stale cache)
✓ TestThrottlingWithoutStaleCache
✓ TestThrottlingMultipleConcurrentRequests
✓ TestThrottlingDoesNotAffectSuccessfulLookups
✓ TestThrottlingClearedOnSuccess
✓ TestStaleCacheOnError
✓ TestSingleflightWithError

All 8 tests passing ✅ (including with race detector)

Documentation

Updated registry/cache/README.md with:

  • Rolling deployment scenario example
  • Registry failure with stale cache example
  • No cache available example
  • Clear explanation of rate limiting behavior
Original prompt

This section details the original issue you should resolve

<issue_title>[BUG] etcd authentication performance issue and registry cache penetration</issue_title>
<issue_description>## Describe the bug
The background is that we need to enable etcd server auth due to security concerns.

The etcd server has a peculiar authentication design that causes a serious performance problem in the /etcdserverpb.Auth/Authenticate API.

From our observation, a typical 3-node etcd cluster (e.g. 64C, 256G, HDD) can only sustain somewhere under 100 QPS of authentication requests.

With the current default go-micro registry plugin and gRPC server settings, the gRPC server re-registers using KeepAliveOnce every RegisterInterval (default 30s), and each KeepAliveOnce call performs one /etcdserverpb.Auth/Authenticate request to establish its stream.

In our production environment, a k8s cluster with over 4000 service pods produces a steady /etcdserverpb.Auth/Authenticate load of around 110 QPS.

When we enable etcd auth, the cluster cannot handle that authentication QPS, so the services' KeepAliveOnce calls fail and they are deregistered from the etcd server after the registry TTL.

Upstream services watch the change and delete the downstream server nodes from their registry cache, eventually leaving the cache empty.

An empty cache then causes a second problem, cache penetration: each gRPC call queries etcd for downstream nodes when the cache is empty or invalid, but etcd does not have that information at this point because the downstream services cannot keep their registry heartbeat alive due to the /etcdserverpb.Auth/Authenticate problem.

The result is that all gRPC requests penetrate to etcd and ultimately fail.

We want to handle these two problems:

    1. Limit requests to etcd when the cache is empty, to avoid the penetration issue.
    2. Use "KeepAlive" instead of "KeepAliveOnce" to address the /etcdserverpb.Auth/Authenticate QPS issue.

To Reproduce

Steps to reproduce the behavior:

  1. Create 4000+ service pods that use the default go-micro registry settings.
  2. Preconfigure the etcd username and password.
  3. Enable etcd cluster auth.

Environment

  • Go Micro version:
    • github.com/go-micro/plugins/v4/client/grpc v1.2.1
    • github.com/go-micro/plugins/v4/registry/etcd v1.2.0
    • github.com/go-micro/plugins/v4/server/grpc v1.2.0
    • go-micro.dev/v4 v4.9.0
    • go.etcd.io/etcd/client/v3 v3.5.2
  • Go version: 1.18
  • OS: Ubuntu 20.04
  • Plugins used:
    • etcd registry

Logs

server side monitor


Resources

<agent_instructions>See latest comment </agent_instructions>

Comments on the Issue (you are @copilot in this section)

<comment_new>@asim Could you look at this PR and see if it satisfies your requirements?

#2822</comment_new>
<comment_new>@asim
This PR #2822 is merged. If you are satisfied with the fix please close the issue.</comment_new>
<comment_new>@asim
Does this satisfy the requirements? #2823</comment_new>



Copilot AI changed the title [WIP] Fix etcd authentication performance issue and registry cache penetration Verify etcd authentication performance and cache penetration fixes are complete Jan 13, 2026
Copilot AI requested a review from asim January 13, 2026 07:12
@asim
Member

asim commented Jan 13, 2026

@copilot Did you read the user's last comment?

Contributor Author

Copilot AI commented Jan 13, 2026

@copilot Did you read the user's last comment?

I don't see any comments from the original issue reporter in the context provided - only your comments asking if PR #2822 and #2823 satisfy requirements. Could you clarify which user comment you're referring to? If there are additional comments on the issue that aren't showing in my context, I'll need to see them to properly address the request.

@asim
Member

asim commented Jan 13, 2026

@copilot #2821 (comment)

Contributor Author

Copilot AI commented Jan 13, 2026

@copilot #2821 (comment)

I see - the user's original comment requested two features:

  1. Limit requests to etcd when cache is empty (cache penetration protection)
  2. Use KeepAlive instead of KeepAliveOnce (reduce auth QPS)

Both are already fully implemented in PR #2823 (commit 2869cc1). This verification PR is unnecessary since that work is complete and tested. Should I close this PR?

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • https://api.github.com/repos/micro/go-micro/issues/2821/comments
    • Triggering command: /usr/bin/curl curl -s REDACTED (http block)

If you need me to access, download, or install something from one of these locations, you can either:

@asim
Member

asim commented Jan 13, 2026

@copilot For #2823 , I think we should directly limit the access rate, rather than limiting only in the "return error" case, because in our setup most range requests return empty results instead of errors.

We have also found another case, which happened a long time ago, that needs this limiter.

The detail is that when the downstream service launches a new release version, all downstream pods go through a rolling deployment. When the last pod is deployed, all upstream services watch the change and update their caches simultaneously, which means all upstream services will expire their caches at the same time.

If we encounter very high QPS pressure at that moment, the cache penetration issue will still happen.

Regarding the handling of rate-limited requests, I believe the strategy should be as follows:

If a stale cache entry exists, return it even if it has expired.
If no cache entry is found, return a "not found" error and rely on the gRPC call's retry mechanism to retry the request.

…narios

Per @asim's feedback, rate limiting now applies to ALL cache refresh attempts
(not just error cases) to prevent registry overload during rolling deployments.

Changes:
- Rate limit ALL refresh attempts using lastRefreshAttempt tracking
- Always return stale cache if available (even if expired) when rate limiting
- Only return ErrNotFound when no cache exists during rate limit period
- Rate limiting happens inside singleflight to avoid race conditions
- Update test to reflect new behavior (no retry when stale cache + rate limit)
- Enhanced documentation with rolling deployment scenario examples

This addresses the scenario where all upstream services expire their cache
simultaneously during a downstream rolling deployment, which would previously
cause a stampede to the registry under high QPS.

Co-authored-by: asim <17530+asim@users.noreply.github.com>
Contributor Author

Copilot AI commented Jan 13, 2026

@copilot For #2823 , I think we should directly limit the access rate, rather than limiting only in the "return error" case, because in our setup most range requests return empty results instead of errors.

And we have ...

Implemented the enhanced rate limiting strategy you requested in commit 99a6659.

Changes:

  • Rate limiting now applies to ALL cache refresh attempts (not just errors)
  • Always returns stale cache if available, even when expired and rate limited
  • Only returns ErrNotFound when no cache exists during rate limit period
  • Handles rolling deployment scenario where all caches expire simultaneously

The implementation prevents registry overload in both error scenarios and the rolling deployment case you identified, while maintaining efficient concurrent request handling via singleflight.

Copilot AI changed the title Verify etcd authentication performance and cache penetration fixes are complete Enhance cache rate limiting to prevent registry overload during rolling deployments Jan 13, 2026
@DingYuan0118

@copilot can you remove failedAttempts since we do not use it any more?

1 similar comment
@asim
Member

asim commented Jan 14, 2026

@copilot can you remove failedAttempts since we do not use it any more?

Per @asim's feedback, these fields are no longer needed since the new
rate limiting strategy uses lastRefreshAttempt to track ALL refresh
attempts, not just failed ones.

Removed:
- failedAttempts map[string]time.Time
- consecutiveFailures int
- All code that sets these fields

The rate limiting logic now only uses lastRefreshAttempt, making these
fields redundant. All tests continue to pass.

Co-authored-by: asim <17530+asim@users.noreply.github.com>
Contributor Author

Copilot AI commented Jan 14, 2026

@copilot can you remove failedAttempts since we do not use it any more?

Removed failedAttempts and consecutiveFailures fields in commit 84f8e9f. The rate limiting now only uses lastRefreshAttempt to track all refresh attempts.

Copilot AI requested a review from asim January 14, 2026 05:41
@asim asim marked this pull request as ready for review January 14, 2026 10:56
@asim asim merged commit baeb282 into master Jan 14, 2026
2 checks passed

Development

Successfully merging this pull request may close these issues.

[BUG] etcd authentication performance issue and registry cache penetration
