Apply rate limiting before singleflight to prevent goroutine blocking on etcd timeout #2841

Merged
asim merged 3 commits into master from copilot/fix-etcd-authentication-performance-another-one
Feb 3, 2026

Conversation

Copilot AI commented Feb 3, 2026

At scale (4000+ pods), etcd authentication bottlenecks cause registry heartbeat failures. When services deregister, the cache is cleared and all goroutines block in singleflight waiting on a slow or failing etcd, even when a stale cache is available.

Changes

Move rate limiting check before singleflight entry

  • Check rate limit while holding read lock
  • If rate-limited AND stale cache exists → return immediately
  • Otherwise proceed to singleflight for stampede prevention

Re-check rate limiting inside singleflight

  • Handles race where another goroutine just completed refresh
  • Prevents redundant registry calls within retry interval

Impact

Before:

// All goroutines block in singleflight, waiting for etcd timeout
c.sg.Do(service, func() (interface{}, error) {
    if rateLimited() { /* check happens here - too late */ }
    return c.Registry.GetService(service) // 5s timeout blocks everyone
})

After:

// Rate-limited goroutines with stale cache return immediately
if rateLimited() && len(staleCache) > 0 {
    return staleCache // no blocking
}
// Only requests that are not rate-limited, or that have no stale cache, wait here
c.sg.Do(service, func() (interface{}, error) { ... })

Goroutines can serve stale cache without blocking on etcd timeouts during outages.
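
For readers who want something to run, here is a minimal sketch of the pattern described above. It is not the go-micro implementation: the `cache`, `group`, `snapshot`, and `demo` names are hypothetical, `group` is a small stand-in for `golang.org/x/sync/singleflight` so that only the standard library is needed, and setting `lastRefresh` after every attempt (successful or not) is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// call and group are a minimal stand-in for golang.org/x/sync/singleflight.
type call struct {
	wg    sync.WaitGroup
	nodes []string
	err   error
}
type group struct {
	mu sync.Mutex
	m  map[string]*call
}

func (g *group) do(key string, fn func() ([]string, error)) ([]string, error) {
	g.mu.Lock()
	if c, ok := g.m[key]; ok {
		g.mu.Unlock()
		c.wg.Wait() // duplicate callers block here until fn returns
		return c.nodes, c.err
	}
	c := &call{}
	c.wg.Add(1)
	g.m[key] = c
	g.mu.Unlock()
	c.nodes, c.err = fn()
	g.mu.Lock()
	delete(g.m, key)
	g.mu.Unlock()
	c.wg.Done()
	return c.nodes, c.err
}

// cache is a hypothetical, simplified registry cache (not go-micro's type).
type cache struct {
	mu          sync.RWMutex
	nodes       map[string][]string  // possibly stale nodes per service
	lastRefresh map[string]time.Time // last registry attempt per service
	minRetry    time.Duration        // minimum gap between attempts
	sg          group
	fetch       func(service string) ([]string, error) // registry lookup
}

// snapshot reads cached nodes and rate-limit state under the read lock.
func (c *cache) snapshot(service string) (stale []string, limited bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	last, seen := c.lastRefresh[service]
	return c.nodes[service], seen && time.Since(last) < c.minRetry
}

func (c *cache) get(service string) ([]string, error) {
	// Fast path: rate-limited callers with stale data skip singleflight.
	if stale, limited := c.snapshot(service); limited && len(stale) > 0 {
		return stale, nil
	}
	return c.sg.do(service, func() ([]string, error) {
		// Re-check: a refresh may have finished since the check above.
		if stale, limited := c.snapshot(service); limited && len(stale) > 0 {
			return stale, nil
		}
		nodes, err := c.fetch(service) // may block until the etcd timeout
		c.mu.Lock()
		defer c.mu.Unlock()
		c.lastRefresh[service] = time.Now()
		if err == nil {
			c.nodes[service] = nodes
		}
		return nodes, err
	})
}

// demo simulates an outage: lookups fail, but a stale entry and a recent attempt exist.
func demo() (nodes []string, registryCalls int, err error) {
	c := &cache{
		nodes:       map[string][]string{"svc": {"10.0.0.1:8080"}},
		lastRefresh: map[string]time.Time{"svc": time.Now()},
		minRetry:    time.Minute,
		sg:          group{m: map[string]*call{}},
		fetch: func(string) ([]string, error) {
			registryCalls++
			return nil, errors.New("etcd unavailable")
		},
	}
	nodes, err = c.get("svc")
	return nodes, registryCalls, err
}

func main() {
	fmt.Println(demo())
}
```

In this sketch the call returns the stale node list without invoking the failing registry lookup at all. Because the fast path only takes a read lock, many goroutines can take it concurrently instead of queueing behind one in-flight lookup.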

Original prompt

This section details the original issue you should resolve

<issue_title>[BUG] etcd authentication performance issue and registry cache penetration</issue_title>
<issue_description>## Describe the bug
The background is that we need to enable etcd server auth because of security concerns.

The etcd server's authentication design causes a serious performance problem with the /etcdserverpb.Auth/Authenticate API.

From our observation, a typical 3-node etcd cluster (64 cores, 256 GB RAM, HDD) can only sustain fewer than about 100 QPS of authentication requests.

With the default go-micro registry plugin and gRPC server settings, the gRPC server re-registers using KeepAliveOnce every RegisterInterval (default 30s). Each KeepAliveOnce call invokes /etcdserverpb.Auth/Authenticate once to establish the stream.

In our production environment, a k8s cluster with over 4000 service pods produces a steady /etcdserverpb.Auth/Authenticate QPS of around 110.
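
As a rough cross-check of that figure, here is a back-of-the-envelope estimate. It assumes every pod issues exactly one Authenticate call per RegisterInterval and ignores jitter, retries, and any other etcd clients; `authQPS` is a hypothetical helper, not part of go-micro or etcd.

```go
package main

import (
	"fmt"
	"time"
)

// authQPS estimates steady-state Authenticate calls per second, assuming
// (hypothetically) one call per pod per register interval.
func authQPS(pods int, registerInterval time.Duration) float64 {
	return float64(pods) / registerInterval.Seconds()
}

func main() {
	// 4000 pods re-registering every 30s.
	fmt.Printf("%.1f\n", authQPS(4000, 30*time.Second)) // prints 133.3
}
```

Under these assumptions the estimate is about 133 calls per second, the same order as the ~110 QPS reported and already beyond the ~100 QPS the cluster could sustain.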

When we enabled etcd auth, the etcd cluster could not handle this /etcdserverpb.Auth/Authenticate QPS, so the services' KeepAliveOnce calls failed and they were deregistered from the etcd server after registryTTL.

Upstream services Watch the change and delete the downstream server nodes from the registry cache, which eventually leaves the cache empty.

Once the cache has been cleared, a second problem appears: cache penetration. When the cache is empty or invalid, each gRPC call queries etcd for the downstream nodes, but etcd does not have them at that point because the downstream services cannot keep up the registry heartbeat due to the /etcdserverpb.Auth/Authenticate problem.

As a result, all gRPC requests penetrate to etcd and ultimately fail.

We want to address these two problems:

    1. Limit the requests to etcd when the cache is empty to avoid the penetration issue.
    2. Use "KeepAlive" instead of "KeepAliveOnce" to address the /etcdserverpb.Auth/Authenticate QPS issue.

To Reproduce

Steps to reproduce the behavior:

  1. Create 4000+ service pods that use the default go-micro registry settings.
  2. Preconfigure the etcd username and password.
  3. Enable etcd cluster auth.

Environment

  • Go Micro version:
    • github.com/go-micro/plugins/v4/client/grpc v1.2.1
    • github.com/go-micro/plugins/v4/registry/etcd v1.2.0
    • github.com/go-micro/plugins/v4/server/grpc v1.2.0
    • go-micro.dev/v4 v4.9.0
    • go.etcd.io/etcd/client/v3 v3.5.2
  • Go version: 1.18
  • OS: Ubuntu 20.04
  • Plugins used:
    • etcd registry

Logs

server side monitor

Image

Resources

<agent_instructions>For #2824, I think the rate limiting should be applied BEFORE singleflight to avoid multiple goroutines blocking on the dial timeout when etcd is down, so that the stale cache can be used directly.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@asim Could you look at this PR and see if it satisfies your requirements?

#2822</comment_new>
<comment_new>@asim
This PR #2822 is merged. If you are satisfied with the fix please close the issue.</comment_new>
<comment_new>@asim
Does this satisfy the requirements? #2823</comment_new>
<comment_new>@asim
New pr to address the issues you raised #2824

I think it highlights that you or anyone could do this via an LLM. My contribution is limited here.</comment_new>
<comment_new>@asim
I will merge the pr and that will be my last edit</comment_new>
<comment_new>@asim
Ok feel free to PR now. All the pieces are there</comment_new>


Copilot AI and others added 2 commits February 3, 2026 06:56
- Check rate limiting BEFORE entering singleflight
- If rate-limited AND stale cache exists, return stale cache immediately
- This prevents all goroutines from blocking when etcd is down/slow
- Maintains stampede prevention via singleflight for non-rate-limited requests
- All existing tests pass

Co-authored-by: asim <17530+asim@users.noreply.github.com>
- Rename shadowed variables to currentLastRefresh and currentMinimumRetryInterval
- Improves code clarity and prevents potential bugs
- All tests still pass

Co-authored-by: asim <17530+asim@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix etcd authentication performance issue and registry cache penetration" to "Apply rate limiting before singleflight to prevent goroutine blocking on etcd timeout" Feb 3, 2026
Copilot AI requested a review from asim February 3, 2026 07:01
asim marked this pull request as ready for review February 3, 2026 11:35
asim merged commit adc90b4 into master Feb 3, 2026
2 checks passed

Development

Successfully merging this pull request may close these issues.

[BUG] etcd authentication performance issue and registry cache penetration

2 participants