Add adaptive throttling to registry cache to prevent penetration when no stale cache exists#2823

Merged
asim merged 4 commits into master from copilot/fix-etcd-authentication-performance
Jan 9, 2026

Contributor

Copilot AI commented Jan 9, 2026

Enhanced Cache Penetration Protection for Registry

Problem

While PR #2822 added singleflight protection, it's insufficient when:

  1. Cache is empty (no stale data to return)
  2. etcd is overloaded/failing
  3. Multiple sequential requests arrive after singleflight completes

Current behavior: Each singleflight attempt fails → returns error → next request tries again → continuous etcd hammering

Solution: Adaptive Request Throttling

Add rate limiting when registry is in error state to prevent cache penetration:

  • Analyze current cache implementation and understand failure scenarios
  • Add rate limiting/throttling mechanism to prevent excessive etcd calls during failures
  • Implement adaptive backoff when registry is in error state
  • Add configuration options for rate limit parameters (MinimumRetryInterval)
  • Write comprehensive tests for the new protection mechanism
  • Fix unrelated WaitGroup race condition in events tests
  • Update documentation
  • Run existing tests to ensure backward compatibility
  • Request code review (no issues)
  • Run security scan (no vulnerabilities)

Changes Summary

Registry Cache Throttling (registry/cache/)

  • Added MinimumRetryInterval option (default 5s)
  • Track last failed attempt time per service
  • Throttle requests within retry interval when no cache exists
  • Clear throttling state on successful lookup
  • Comprehensive test coverage (5 new tests)
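The throttling rules listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code merged in this PR: the type `throttledLookup`, the error `ErrThrottled`, the injectable clock, and the `simulate` helper are all hypothetical names introduced for the example. Only the idea (record the last failed attempt per service, reject lookups inside the retry interval, clear the state on success) and the 5s default come from the description.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrThrottled is a hypothetical sentinel error, not taken from go-micro.
var ErrThrottled = errors.New("registry lookup throttled")

// throttledLookup wraps a backend lookup and remembers, per service,
// when the last failed attempt happened.
type throttledLookup struct {
	mu          sync.Mutex
	minInterval time.Duration        // plays the role of MinimumRetryInterval
	lastFailed  map[string]time.Time // service name -> time of last failure
	now         func() time.Time     // injectable clock (assumption, for the demo)
	lookup      func(service string) ([]string, error)
}

func newThrottledLookup(interval time.Duration, now func() time.Time,
	lookup func(string) ([]string, error)) *throttledLookup {
	return &throttledLookup{
		minInterval: interval,
		lastFailed:  map[string]time.Time{},
		now:         now,
		lookup:      lookup,
	}
}

// Get models the path taken when no cached or stale entry exists.
func (t *throttledLookup) Get(service string) ([]string, error) {
	t.mu.Lock()
	last, failed := t.lastFailed[service]
	t.mu.Unlock()

	// Inside the retry interval after a failure: do not touch the backend.
	if failed && t.now().Sub(last) < t.minInterval {
		return nil, ErrThrottled
	}

	nodes, err := t.lookup(service)

	t.mu.Lock()
	defer t.mu.Unlock()
	if err != nil {
		t.lastFailed[service] = t.now() // remember the failure time
		return nil, err
	}
	delete(t.lastFailed, service) // success clears the throttling state
	return nodes, nil
}

// simulate sends ten lookups, one per simulated second, to a backend that
// always fails, using a 5s retry interval.
func simulate() (backendCalls, throttled int) {
	clock := time.Unix(0, 0)
	t := newThrottledLookup(5*time.Second,
		func() time.Time { return clock },
		func(string) ([]string, error) {
			backendCalls++
			return nil, errors.New("etcd unavailable")
		})
	for i := 0; i < 10; i++ {
		if _, err := t.Get("greeter"); errors.Is(err, ErrThrottled) {
			throttled++
		}
		clock = clock.Add(time.Second)
	}
	return backendCalls, throttled
}

func main() {
	calls, throttled := simulate()
	fmt.Printf("backend calls: %d, throttled: %d\n", calls, throttled) // backend calls: 2, throttled: 8
}
```

In this sketch, ten requests spread over ten seconds reach the failing backend only twice (at t=0s and t=5s); the other eight are rejected locally. A real implementation would also have to decide how throttling interacts with singleflight and stale-cache fallback, which the sketch leaves out.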

Bug Fix (events/stream_test.go)

  • Fixed WaitGroup race condition causing "negative WaitGroup counter" panics
  • Moved wg.Add() calls before goroutine starts
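As a rough illustration of the pattern behind this fix (not the actual code in events/stream_test.go; `runWorkers` is a hypothetical helper), the counter is incremented before each goroutine is launched, so `Done()` can never run ahead of the matching `Add()`:

```go
package main

import (
	"fmt"
	"sync"
)

// runWorkers starts n goroutines and returns how many of them finished.
func runWorkers(n int) int {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		finished int
	)
	for i := 0; i < n; i++ {
		// Add is called before the goroutine starts. If it were called
		// afterwards, the goroutine could reach Done first and drive the
		// counter below zero ("negative WaitGroup counter").
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			finished++
			mu.Unlock()
		}()
	}
	wg.Wait()
	return finished
}

func main() {
	fmt.Println(runWorkers(8)) // 8
}
```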

Documentation

  • Updated README with feature overview and usage examples
  • Documented throttling behavior and configuration options
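
For readers unfamiliar with how such a setting is typically exposed in Go, the following is a generic functional-options sketch. It is an assumption-laden illustration only: `Options`, `Option`, `WithMinimumRetryInterval`, and `newOptions` are hypothetical names defined locally, and the real signature in registry/cache/ may differ. Only the setting name MinimumRetryInterval and its 5s default are taken from this PR.

```go
package main

import (
	"fmt"
	"time"
)

// Options is a hypothetical stand-in for the cache options struct.
type Options struct {
	MinimumRetryInterval time.Duration
}

// Option mutates Options; this is the usual functional-options shape.
type Option func(*Options)

// WithMinimumRetryInterval is a hypothetical setter for the retry interval.
func WithMinimumRetryInterval(d time.Duration) Option {
	return func(o *Options) { o.MinimumRetryInterval = d }
}

// newOptions applies the given options on top of the documented 5s default.
func newOptions(opts ...Option) Options {
	o := Options{MinimumRetryInterval: 5 * time.Second}
	for _, apply := range opts {
		apply(&o)
	}
	return o
}

func main() {
	fmt.Println(newOptions().MinimumRetryInterval)                                          // 5s
	fmt.Println(newOptions(WithMinimumRetryInterval(10 * time.Second)).MinimumRetryInterval) // 10s
}
```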
Original prompt

This section details the original issue you should resolve.

<issue_title>[BUG] etcd authentication performance issue and registry cache penetration</issue_title>
<issue_description>## Describe the bug
The background is that we need to enable etcd server authentication because of security concerns.

The etcd server's authentication design causes a serious performance problem with the /etcdserverpb.Auth/Authenticate API.

From our observation, a typical 3-node etcd cluster (spec: 64 cores, 256 GB memory, HDD) can sustain fewer than roughly 100 QPS of authentication requests.

With the default go-micro registry plugin and gRPC server settings, the gRPC server re-registers by calling KeepAliveOnce every RegisterInterval (default 30s). Each KeepAliveOnce call issues one /etcdserverpb.Auth/Authenticate request to establish the stream.

In our production environment, a k8s cluster with over 4000 service pods produces a steady /etcdserverpb.Auth/Authenticate rate of around 110 QPS.

Once we enabled etcd auth, the etcd cluster could not handle that rate, so the services' KeepAliveOnce calls failed and the services were deregistered from the etcd server after registryTTL.

Upstream services Watch these changes and delete the downstream server nodes from their registry cache, which eventually leaves the cache empty.

Once the cache has been cleared, a second problem appears: cache penetration. When the cache is empty or invalid, every gRPC call queries etcd for the downstream nodes, but etcd does not have that information at this point because the downstream services cannot keep their registry heartbeat alive due to the /etcdserverpb.Auth/Authenticate problem.

As a result, all gRPC requests penetrate to etcd and ultimately fail.

We want to address these two problems:

    1. Limit requests to etcd when the cache is empty, to avoid the penetration issue.
    2. Use "KeepAlive" instead of "KeepAliveOnce" to address the /etcdserverpb.Auth/Authenticate QPS issue.

To Reproduce

Steps to reproduce the behavior:

  1. Create 4000+ service pods that use the default go-micro registry settings.
  2. Preconfigure the etcd username and password.
  3. Enable etcd cluster auth.

Environment

  • Go Micro version:
    • github.com/go-micro/plugins/v4/client/grpc v1.2.1
    • github.com/go-micro/plugins/v4/registry/etcd v1.2.0
    • github.com/go-micro/plugins/v4/server/grpc v1.2.0
    • go-micro.dev/v4 v4.9.0
    • go.etcd.io/etcd/client/v3 v3.5.2
  • Go version: 1.18
  • OS: Ubuntu 20.04
  • Plugins used:
    • etcd registry

Logs

server side monitor

Image

Resources

<agent_instructions>The user is now saying single flight protection is not enough</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@asim Could you look at this PR and see if it satisfies your requirements?

#2822</comment_new>
<comment_new>@asim
This PR #2822 is merged. If you are satisfied with the fix please close the issue.</comment_new>


Implement rate limiting mechanism to protect etcd when:
- Cache is empty (no stale data available)
- Registry is failing/overloaded
- Multiple sequential requests arrive

Changes:
- Add MinimumRetryInterval option (default 5s)
- Track last failed attempt time per service
- Throttle requests within retry interval when no cache exists
- Clear throttling on successful lookup
- Add comprehensive tests validating throttling behavior

Co-authored-by: asim <17530+asim@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix etcd authentication performance issue" to "Add adaptive throttling to registry cache to prevent penetration when no stale cache exists" Jan 9, 2026
Copilot AI requested a review from asim January 9, 2026 12:02
Member

asim commented Jan 9, 2026

@copilot unit test fails

Copilot stopped work on behalf of asim due to an error January 9, 2026 12:06
Member

asim commented Jan 9, 2026

@copilot Unit test fails:

START| Stream
START| Stream/memory
START| Stream/memory/TestMissingTopic
START| Stream/memory/TestConsumeTopic
| panic: sync: negative WaitGroup counter
| goroutine 22 [running]:
| sync.(*WaitGroup).Add(0xc00018a540, 0xffffffffffffffff)
| /opt/hostedtoolcache/go/1.24.11/x64/src/sync/waitgroup.go:64 +0x19a
| sync.(*WaitGroup).Done(0xc00018a540)
| /opt/hostedtoolcache/go/1.24.11/x64/src/sync/waitgroup.go:89 +0x2e
| go-micro.dev/v5/events.runTestStream.func2.1()
| /home/runner/work/go-micro/go-micro/events/stream_test.go:70 +0x505
| created by go-micro.dev/v5/events.runTestStream.func2 in goroutine 21
| /home/runner/work/go-micro/go-micro/events/stream_test.go:57 +0x32e
FAIL | go-micro.dev/v5/events 0.032s

The WaitGroup.Add() calls were happening after goroutines started,
causing a race where Done() could be called before Add(), resulting
in "negative WaitGroup counter" panics.

Fixes:
- Move wg.Add(1) before goroutine in TestConsumeTopic
- Move wg.Add(2) before first goroutine in TestConsumeGroup

Co-authored-by: asim <17530+asim@users.noreply.github.com>
Contributor Author

Copilot AI commented Jan 9, 2026

@copilot Unit test fails:

START| Stream...

Fixed in 2850d50. The WaitGroup.Add() calls were racing with goroutine execution - moved them before the goroutine starts.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • 224.0.0.251
    • Triggering command: REDACTED, pid is -1 (packet block)

If you need me to access, download, or install something from one of these locations, you can either:

Document the new adaptive throttling feature that prevents cache
penetration when registry is failing and no stale cache exists.

Includes:
- Feature overview
- Usage examples with configuration options
- Explanation of throttling behavior
- Example scenario demonstrating protection

Co-authored-by: asim <17530+asim@users.noreply.github.com>
@asim asim marked this pull request as ready for review January 9, 2026 12:27
@asim asim merged commit 2869cc1 into master Jan 9, 2026
2 checks passed
Development

Successfully merging this pull request may close these issues.

[BUG] etcd authentication performance issue and registry cache penetration