
Enhance cache rate limiting to prevent registry overload during rolling deployments#2824

Merged
asim merged 3 commits into master from copilot/fix-etcd-authentication-performance-again on Jan 14, 2026

Conversation

Contributor

Copilot AI commented Jan 13, 2026

Enhanced Rate Limiting Implementation

Addressed @asim's feedback on PR #2823 to improve cache penetration protection.

Changes Made

  • Understand current throttling limitations
  • Implement rate limiting for ALL cache refresh attempts (not just errors)
  • Always return stale cache when available during rate limit period
  • Handle rolling deployment scenario
  • Fix test expectations to match new behavior
  • Update documentation with detailed scenarios
  • Remove unused failedAttempts and consecutiveFailures fields

Key Improvements

1. Universal Rate Limiting

Before: Throttling only applied when registry returned errors AND no stale cache
After: Throttling applies to ALL refresh attempts, regardless of registry state

2. Stale Cache Strategy

  • If stale cache exists → return it (even if expired), skip registry call
  • If no cache exists → return ErrNotFound, rely on gRPC retry

3. Rolling Deployment Protection

New scenario handled: when a downstream service does a rolling deployment, all upstream
caches expire simultaneously. Under high QPS, this change prevents a stampede:

  • First request refreshes from registry
  • Subsequent requests (within MinimumRetryInterval) return stale cache
  • No registry overload

4. Code Cleanup

Removed unused failedAttempts and consecutiveFailures fields since the new
rate limiting strategy uses lastRefreshAttempt for all tracking.

Test Results

✓ TestSingleflightPreventsStampede (concurrent deduplication)
✓ TestCachePenetrationPrevention (rate limiting + stale cache)
✓ TestThrottlingWithoutStaleCache
✓ TestThrottlingMultipleConcurrentRequests
✓ TestThrottlingDoesNotAffectSuccessfulLookups
✓ TestThrottlingClearedOnSuccess
✓ TestStaleCacheOnError
✓ TestSingleflightWithError

All 8 tests passing ✅ (including with race detector)

Documentation

Updated registry/cache/README.md with:

  • Rolling deployment scenario example
  • Registry failure with stale cache example
  • No cache available example
  • Clear explanation of rate limiting behavior
Original prompt

This section details the original issue you should resolve

<issue_title>[BUG] etcd authentication performance issue and registry cache penetration</issue_title>
<issue_description>## Describe the bug
The background is that we need to enable etcd server auth due to security concerns.

The etcd server has a peculiar authentication design that causes a serious performance problem in the /etcdserverpb.Auth/Authenticate API.

From our observation, a typical 3-node etcd cluster (e.g. 64C, 256G, HDD) can only sustain somewhere under 100 QPS of authentication requests.

With the current default go-micro registry plugin and gRPC server settings, the gRPC server re-registers using KeepAliveOnce every RegisterInterval (default 30s), and each KeepAliveOnce call performs one /etcdserverpb.Auth/Authenticate request to establish its stream.

In our production environment, a k8s cluster with over 4000 service pods produces a steady /etcdserverpb.Auth/Authenticate load of around 110 QPS.

When we enable etcd auth, the cluster cannot handle that authentication QPS, so the services' KeepAliveOnce calls fail and they are deregistered from the etcd server after the registry TTL.

Upstream services watch the change and delete the downstream server nodes from their registry cache, eventually leaving the cache empty.

An empty cache then causes a second problem, cache penetration: each gRPC call queries etcd for downstream nodes when the cache is empty or invalid, but etcd does not have that information at this point because the downstream services cannot keep their registry heartbeat alive due to the /etcdserverpb.Auth/Authenticate problem.

The result is that all gRPC requests penetrate to etcd and ultimately fail.

We want to handle these two problems:

    1. Limit requests to etcd when the cache is empty, to avoid the penetration issue.
    2. Use "KeepAlive" instead of "KeepAliveOnce" to address the /etcdserverpb.Auth/Authenticate QPS issue.

To Reproduce

Steps to reproduce the behavior:

  1. Create 4000+ service pods that use the default go-micro registry settings.
  2. Preconfigure the etcd username and password.
  3. Enable etcd cluster auth.

Environment

  • Go Micro version:
    • github.com/go-micro/plugins/v4/client/grpc v1.2.1
    • github.com/go-micro/plugins/v4/registry/etcd v1.2.0
    • github.com/go-micro/plugins/v4/server/grpc v1.2.0
    • go-micro.dev/v4 v4.9.0
    • go.etcd.io/etcd/client/v3 v3.5.2
  • Go version: 1.18
  • OS: Ubuntu 20.04
  • Plugins used:
    • etcd registry

Logs

server side monitor


Resources

<agent_instructions>See latest comment </agent_instructions>

Comments on the Issue (you are @copilot in this section)

<comment_new>@asim Could you look at this PR and see if it satisfies your requirements?

#2822</comment_new>
<comment_new>@asim
This PR #2822 is merged. If you are satisfied with the fix please close the issue.</comment_new>
<comment_new>@asim
Does this satisfy the requirements? #2823</comment_new>



Copilot AI changed the title [WIP] Fix etcd authentication performance issue and registry cache penetration Verify etcd authentication performance and cache penetration fixes are complete Jan 13, 2026
Copilot AI requested a review from asim January 13, 2026 07:12
@asim
Member

asim commented Jan 13, 2026

@copilot Did you read the user's last comment?

Contributor Author

Copilot AI commented Jan 13, 2026

@copilot Did you read the user's last comment?

I don't see any comments from the original issue reporter in the context provided - only your comments asking if PR #2822 and #2823 satisfy requirements. Could you clarify which user comment you're referring to? If there are additional comments on the issue that aren't showing in my context, I'll need to see them to properly address the request.

@asim
Member

asim commented Jan 13, 2026

@copilot #2821 (comment)

Contributor Author

Copilot AI commented Jan 13, 2026

@copilot #2821 (comment)

I see - the user's original comment requested two features:

  1. Limit requests to etcd when cache is empty (cache penetration protection)
  2. Use KeepAlive instead of KeepAliveOnce (reduce auth QPS)

Both are already fully implemented in PR #2823 (commit 2869cc1). This verification PR is unnecessary since that work is complete and tested. Should I close this PR?

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • https://api.github.com/repos/micro/go-micro/issues/2821/comments
    • Triggering command: /usr/bin/curl curl -s REDACTED (http block)

If you need me to access, download, or install something from one of these locations, you can either:

@asim
Member

asim commented Jan 13, 2026

@copilot For #2823 , I think we should directly limit the access rate, rather than limiting only in the "return error" case, because in our setup most range requests return empty results instead of errors.

We have also found another case, which happened a long time ago, that needs this limiter.

The detail is that when the downstream service launches a new release version, all downstream pods go through a rolling deployment. When the last pod is deployed, all upstream services watch the change and update their caches simultaneously, which means all upstream services will expire their caches at the same time.

If we encounter very high QPS pressure at that moment, the cache penetration issue will still happen.

Regarding the handling of rate-limited requests, I believe the strategy should be as follows:

If a stale cache entry exists, return it even if it has expired.
If no cache entry is found, return a "not found" error and rely on the gRPC call's retry mechanism to retry the request.

…narios

Per @asim's feedback, rate limiting now applies to ALL cache refresh attempts
(not just error cases) to prevent registry overload during rolling deployments.

Changes:
- Rate limit ALL refresh attempts using lastRefreshAttempt tracking
- Always return stale cache if available (even if expired) when rate limiting
- Only return ErrNotFound when no cache exists during rate limit period
- Rate limiting happens inside singleflight to avoid race conditions
- Update test to reflect new behavior (no retry when stale cache + rate limit)
- Enhanced documentation with rolling deployment scenario examples

This addresses the scenario where all upstream services expire their cache
simultaneously during a downstream rolling deployment, which would previously
cause a stampede to the registry under high QPS.

Co-authored-by: asim <17530+asim@users.noreply.github.com>
Contributor Author

Copilot AI commented Jan 13, 2026

@copilot For #2823 , I think we should directly limit the access rate, rather than limiting only in the "return error" case, because in our setup most range requests return empty results instead of errors.

And we have ...

Implemented the enhanced rate limiting strategy you requested in commit 99a6659.

Changes:

  • Rate limiting now applies to ALL cache refresh attempts (not just errors)
  • Always returns stale cache if available, even when expired and rate limited
  • Only returns ErrNotFound when no cache exists during rate limit period
  • Handles rolling deployment scenario where all caches expire simultaneously

The implementation prevents registry overload in both error scenarios and the rolling deployment case you identified, while maintaining efficient concurrent request handling via singleflight.

Copilot AI changed the title Verify etcd authentication performance and cache penetration fixes are complete Enhance cache rate limiting to prevent registry overload during rolling deployments Jan 13, 2026
@DingYuan0118

@copilot can you remove failedAttempts since we do not use it any more?

1 similar comment
@asim
Member

asim commented Jan 14, 2026

@copilot can you remove failedAttempts since we do not use it any more?

Per @asim's feedback, these fields are no longer needed since the new
rate limiting strategy uses lastRefreshAttempt to track ALL refresh
attempts, not just failed ones.

Removed:
- failedAttempts map[string]time.Time
- consecutiveFailures int
- All code that sets these fields

The rate limiting logic now only uses lastRefreshAttempt, making these
fields redundant. All tests continue to pass.

Co-authored-by: asim <17530+asim@users.noreply.github.com>
Contributor Author

Copilot AI commented Jan 14, 2026

@copilot can you remove failedAttempts since we do not use it any more?

Removed failedAttempts and consecutiveFailures fields in commit 84f8e9f. The rate limiting now only uses lastRefreshAttempt to track all refresh attempts.

Copilot AI requested a review from asim January 14, 2026 05:41
@asim asim marked this pull request as ready for review January 14, 2026 10:56
@asim asim merged commit baeb282 into master Jan 14, 2026
2 checks passed

Development

Successfully merging this pull request may close these issues.

[BUG] etcd authentication performance issue and registry cache penetration
