Apply rate limiting before singleflight to prevent goroutine blocking on etcd timeout by Copilot · Pull Request #2841 · micro/go-micro

Copilot · 2026-02-03T06:49:35Z

At scale (4000+ pods), etcd authentication bottlenecks cause registry heartbeat failures. When services deregister, cache clears and all goroutines block in singleflight waiting for slow/failing etcd, even when stale cache is available.

Changes

Move rate limiting check before singleflight entry

- Check rate limit while holding read lock
- If rate-limited AND stale cache exists → return immediately
- Otherwise proceed to singleflight for stampede prevention

Re-check rate limiting inside singleflight

- Handles race where another goroutine just completed refresh
- Prevents redundant registry calls within retry interval

Impact

Before:

// All goroutines block in singleflight, waiting for etcd timeout
c.sg.Do(service, func() (interface{}, error) {
    if rateLimited() { /* check happens here - too late */ }
    return c.Registry.GetService(service) // 5s timeout blocks everyone
})

After:

// Rate-limited goroutines with stale cache return immediately
if rateLimited() && len(staleCache) > 0 {
    return staleCache // no blocking
}
c.sg.Do(service, func() { ... }) // only non-rate-limited requests wait

Goroutines can serve stale cache without blocking on etcd timeouts during outages.
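The flow described above can be sketched as a small self-contained Go program. This is a minimal illustration under stated assumptions, not the code merged in this PR: the `cache` type, its fields (`nodes`, `lastAttempt`, `minRetry`), and the helper `staleIfLimited` are hypothetical names, `do` stands in for `singleflight.Group.Do`, and `fetch` stands in for the registry call that may block on an etcd timeout. The sketch also assumes the refresh timestamp is recorded per attempt, whether or not the attempt succeeds.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// cache is hypothetical; its fields are illustrative, not go-micro's.
type cache struct {
	mu          sync.RWMutex
	nodes       map[string][]string  // cached entries, possibly stale
	lastAttempt map[string]time.Time // last refresh attempt per service (assumption)
	minRetry    time.Duration        // minimum interval between refresh attempts
	// do stands in for singleflight.Group.Do (golang.org/x/sync/singleflight).
	do func(key string, fn func() ([]string, error)) ([]string, error)
	// fetch stands in for the registry lookup that may block on an etcd timeout.
	fetch func(service string) ([]string, error)
}

// staleIfLimited returns the cached nodes when a refresh was attempted within
// minRetry and a non-empty cached entry exists. It only takes the read lock.
func (c *cache) staleIfLimited(service string) ([]string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	last, seen := c.lastAttempt[service]
	cached := c.nodes[service]
	if seen && time.Since(last) < c.minRetry && len(cached) > 0 {
		return cached, true
	}
	return nil, false
}

func (c *cache) get(service string) ([]string, error) {
	// Fast path: rate-limit check before entering singleflight.
	if cached, ok := c.staleIfLimited(service); ok {
		return cached, nil
	}
	return c.do(service, func() ([]string, error) {
		// Re-check: another goroutine may have refreshed just before us.
		if cached, ok := c.staleIfLimited(service); ok {
			return cached, nil
		}
		c.mu.Lock()
		c.lastAttempt[service] = time.Now()
		c.mu.Unlock()
		fresh, err := c.fetch(service)
		if err != nil {
			return nil, err
		}
		c.mu.Lock()
		c.nodes[service] = fresh
		c.mu.Unlock()
		return fresh, nil
	})
}

// demo simulates an etcd outage shortly after a previous refresh attempt.
func demo() (nodes []string, registryCalls int, err error) {
	c := &cache{
		nodes:       map[string][]string{"svc": {"10.0.0.1:8080"}},
		lastAttempt: map[string]time.Time{"svc": time.Now()},
		minRetry:    time.Minute,
	}
	// Trivial stand-in: a real cache would deduplicate concurrent callers here.
	c.do = func(_ string, fn func() ([]string, error)) ([]string, error) { return fn() }
	c.fetch = func(string) ([]string, error) {
		registryCalls++
		return nil, errors.New("etcd: context deadline exceeded")
	}
	nodes, err = c.get("svc")
	return nodes, registryCalls, err
}

func main() {
	nodes, calls, err := demo()
	fmt.Println(nodes, calls, err) // [10.0.0.1:8080] 0 <nil>
}
```

In the demo the caller is rate-limited and a stale entry exists, so `get` returns it without ever invoking `fetch`; the failing registry is never reached.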

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

224.0.0.251
- Triggering command: REDACTED, pid is -1 (packet block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

Original prompt

This section details the original issue you should resolve

<issue_title>[BUG] etcd authentication performance issue and registry cache penetration</issue_title>
<issue_description>## Describe the bug
The background is that we need to enable etcd server auth due to security concerns.

The etcd server has a special authentication design that causes a serious performance problem with the /etcdserverpb.Auth/Authenticate API.

From our observation, a normal 3-node etcd cluster (64 cores, 256 GB memory, HDD) can only support fewer than about 100 QPS of authentication requests.

With the current default go-micro registry plugin and gRPC server settings, the gRPC server re-registers using KeepAliveOnce every RegisterInterval (default 30s). Each KeepAliveOnce call issues one /etcdserverpb.Auth/Authenticate request to establish the stream.

In our production environment, we have a k8s cluster with over 4000 service pods, which results in a steady /etcdserverpb.Auth/Authenticate rate of around 110 QPS.

When we enabled etcd auth, the etcd cluster could not handle that /etcdserverpb.Auth/Authenticate QPS, so the services' KeepAliveOnce calls failed and the services were deregistered from the etcd server after registryTTL.

The upstream services Watch these changes and delete the downstream server nodes from the registry cache, which eventually leaves the cache empty.

Once the cache has been cleared, a second problem appears: cache penetration. When the cache is empty or invalid, each gRPC call queries etcd for the downstream nodes, but etcd does not have that information at this point because the downstream services cannot keep up their registry heartbeat due to the /etcdserverpb.Auth/Authenticate problem.

As a result, every gRPC request penetrates to etcd and ultimately fails.

We want to handle these two problems:

Limit requests to etcd when the cache is empty to avoid the penetration issue.

Use "KeepAlive" instead of "KeepAliveOnce" to address the /etcdserverpb.Auth/Authenticate QPS issue.

To Reproduce

Steps to reproduce the behavior:

Create 4000+ service pods that use the default go-micro registry settings.

Preconfigure the etcd username and password.

Enable etcd cluster auth.

Environment

Go Micro version:

github.com/go-micro/plugins/v4/client/grpc v1.2.1

github.com/go-micro/plugins/v4/registry/etcd v1.2.0

github.com/go-micro/plugins/v4/server/grpc v1.2.0

go-micro.dev/v4 v4.9.0

go.etcd.io/etcd/client/v3 v3.5.2

Go version: 1.18

OS: Ubuntu 20.04

Plugins used:

etcd registry

Logs

server side monitor

</issue_description>

<agent_instructions>For #2824, I think rate limiting should be applied BEFORE singleflight to avoid multiple goroutines blocking on dial timeouts when etcd is down, so that they can use the stale cache directly.</agent_instructions>

Comments on the Issue (you are @copilot in this section)
@asim Could you look at this PR and see if it satisfies your requirements?
#2822</comment_new>
<comment_new>@asim
This PR #2822 is merged. If you are satisfied with the fix please close the issue.</comment_new>
<comment_new>@asim
Does this satisfy the requirements? #2823</comment_new>
<comment_new>@asim
New pr to address the issues you raised #2824

I think it highlights that you or anyone could do this via an LLM. My contribution is limited here.</comment_new>
<comment_new>@asim
I will merge the pr and that will be my last edit</comment_new>
<comment_new>@asim
Ok feel free to PR now. All the pieces are there</comment_new>

Fixes [BUG] etcd authentication performance issue and registry cache penetration #2821


- Check rate limiting BEFORE entering singleflight
- If rate-limited AND stale cache exists, return stale cache immediately
- This prevents all goroutines from blocking when etcd is down/slow
- Maintains stampede prevention via singleflight for non-rate-limited requests
- All existing tests pass

Co-authored-by: asim <17530+asim@users.noreply.github.com>


Initial plan

993f7db

Copilot AI assigned Copilot and asim Feb 3, 2026

Copilot started work on behalf of asim February 3, 2026 06:49

Copilot AI and others added 2 commits February 3, 2026 06:56

Fix variable shadowing in rate limiting check

840640c

- Rename shadowed variables to currentLastRefresh and currentMinimumRetryInterval
- Improves code clarity and prevents potential bugs
- All tests still pass

Co-authored-by: asim <17530+asim@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Fix etcd authentication performance issue and registry cache penetration~~ Apply rate limiting before singleflight to prevent goroutine blocking on etcd timeout Feb 3, 2026

Copilot AI requested a review from asim February 3, 2026 07:01

Copilot finished work on behalf of asim February 3, 2026 07:01

asim marked this pull request as ready for review February 3, 2026 11:35

asim merged commit adc90b4 into master Feb 3, 2026
2 checks passed

asim merged 3 commits into master from copilot/fix-etcd-authentication-performance-another-one
