Skip to content

Memory leak when redis.Client is configured with StreamingCredentialsProvider #3812

@Kagamia-MS

Description

@Kagamia-MS

Expected Behavior

When redis.Options.StreamingCredentialsProvider is set on a client, the client should not retain any state for connections that have been closed and removed from the pool.

Concretely, when a connection is removed from the pool through any path (including stale-conn discard, idle/lifetime expiry, and health-check failure), registered pool hooks should observe the removal via PoolHook.OnRemove. In particular, streaming.ReAuthPoolHook.OnRemove should fire so that streaming.Manager.RemoveListener(connID) runs and the corresponding entry in streamingCredentialsManager.credentialsListeners.listeners is deleted. The map size should track the live pool size, not the cumulative number of stale-conn closures.

Current Behavior

Any client created with StreamingCredentialsProvider accumulates one orphan *ConnReAuthCredentialsListener (and the closed *pool.Conn it captures) per stale-conn close, because (*ConnPool).CloseConn does not invoke PoolHook.OnRemove.

(*ConnPool).CloseConn (internal/pool/pool.go:1449) does not call the pool hook’s ProcessOnRemove. As a result, every connection that exits the pool through CloseConn (idle expiry, lifetime expiry, health-check failure, stale-conn discard from Get) leaves a permanent entry in streaming.Manager.credentialsListeners.listeners.

removeConnInternal and the Remove* paths do invoke the hook, so the gap is specifically CloseConn. The two functions sit next to each other in the same file:

// internal/pool/pool.go (v9.19.0)
func (p *ConnPool) removeConnInternal(ctx context.Context, cn *Conn, reason error, freeTurn bool) {
    hookManager := p.hookManager.Load()
    if hookManager != nil {
        hookManager.ProcessOnRemove(ctx, cn, reason)   // ✅ fires
    }
    ...
}

func (p *ConnPool) CloseConn(ctx context.Context, cn *Conn, reason string, fromState string) error {
    // ❌ ProcessOnRemove never called
    removed := p.removeConnWithLock(cn)
    if removed {
        p.recordConnectionMetrics(ctx, cn, reason, fromState)
    }
    return p.closeConn(cn)
}

Because ReAuthPoolHook.OnRemove is the only path that calls Manager.RemoveListener(connID), each leaked map entry is a *ConnReAuthCredentialsListener that holds a conn *pool.Conn field — so the closed connection’s read buffer + write buffer + TLS state stay reachable for the lifetime of the process.

Possible Solution

One-hunk fix that mirrors the existing call in removeConnInternal:

diff --git a/internal/pool/pool.go b/internal/pool/pool.go
--- a/internal/pool/pool.go
+++ b/internal/pool/pool.go
@@ -1447,6 +1447,10 @@ func (p *ConnPool) RemoveWithoutTurn(ctx context.Context, cn *Conn, reason error
 //   - reason: why the connection is being closed (use CloseReason* constants)
 //   - fromState: the metric state the connection was in (use MetricState* constants)
 func (p *ConnPool) CloseConn(ctx context.Context, cn *Conn, reason string, fromState string) error {
+	if hookManager := p.hookManager.Load(); hookManager != nil {
+		hookManager.ProcessOnRemove(ctx, cn, errors.New(reason))
+	}
+
     removed := p.removeConnWithLock(cn)

     // Only emit UpDownCounter decrements if we actually removed the connection.

The errors.New(reason) matches the error argument expected by PoolHook.OnRemove. After the patch, internal_listener_count stays equal to the live pool size across thousands of stale-conn cycles in the repro below.

Steps to Reproduce

A minimum single file repro environment is below.

package main

import (
	"context"
	"fmt"
	"reflect"
	"runtime"
	"sync"
	"time"

	"github.com/redis/go-redis/v9"
	"github.com/redis/go-redis/v9/auth"
)

type provider struct{}

func (provider) Subscribe(auth.CredentialsListener) (auth.Credentials, auth.UnsubscribeFunc, error) {
	return auth.NewBasicCredentials("", ""), func() error { return nil }, nil
}

func listeners(c *redis.Client) int {
	m := reflect.ValueOf(c).Elem().FieldByName("baseClient").Elem().FieldByName("streamingCredentialsManager")
	if m.IsNil() {
		return -1
	}
	return m.Elem().FieldByName("credentialsListeners").Elem().FieldByName("listeners").Len()
}

func main() {
	c := redis.NewClient(&redis.Options{
		Addr:                         "redis:6379",
		StreamingCredentialsProvider: provider{},
		PoolSize:                     20,
		MinIdleConns:                 20,
		ConnMaxIdleTime:              150 * time.Millisecond,
		ReadBufferSize:               32 * 1024,
		WriteBufferSize:              32 * 1024,
	})
	defer c.Close()

	ctx := context.Background()
	start := time.Now()
	var wg sync.WaitGroup
	for i := 1; i <= 500; i++ {
		wg.Add(20)
		for j := 0; j < 20; j++ {
			go func() {
				defer wg.Done()
				_ = c.Set(ctx, "k", "v", time.Minute).Err()
			}()
		}
		wg.Wait()
		time.Sleep(250 * time.Millisecond)

		if i == 1 || i%25 == 0 || i == 500 {
			var m runtime.MemStats
			runtime.ReadMemStats(&m)
			s := c.PoolStats()
			fmt.Printf("iter=%d stale=%d total=%d listeners=%d heap_MB=%d secs=%.0f\n",
				i, s.StaleConns, s.TotalConns, listeners(c), m.HeapInuse/1024/1024, time.Since(start).Seconds())
		}
	}
}

example output

repro-1  | iter=1 stale=0 total=20 listeners=20 heap_MB=3 secs=0
repro-1  | iter=25 stale=482 total=20 listeners=482 heap_MB=35 secs=6
repro-1  | iter=50 stale=982 total=20 listeners=982 heap_MB=66 secs=13
... ...
repro-1  | iter=475 stale=9365 total=20 listeners=9330 heap_MB=610 secs=120
repro-1  | iter=500 stale=9865 total=20 listeners=9830 heap_MB=644 secs=127

Context (Environment)

  • go-redis version: v9.19.0 (also master at the time of writing)
  • Go runtime: go1.25.x
  • Auth path: redis.Options.StreamingCredentialsProvider (we use https://github.com/redis/go-redis-entraid actually)
  • Trigger: any workload that uses streaming credentials together with stale-conn churn (short ConnMaxIdleTime, ConnMaxLifetime expiry, or MinIdleConns warming after health-check failures).

The bug is independent of #3772 / #3785. Those fixed wrappedOnClose chain growth on baseClient.onClose in v9.19.0, removing the dominant leak in the streaming-credentials path. This issue is a separate retainer:

For long-running services using streaming credentials, the residual leak is large enough that pprof + map inspection are necessary to detect it after the v9.19.0 upgrade.

Detailed Description

The streaming-credentials manager owns a credentialsListeners.listeners map keyed by connection ID, populated in Manager.Listener(...) during (*baseClient).initConn. The corresponding cleanup is wired through the pool hook contract:

*ConnPool.removeConnInternal --(hookManager.ProcessOnRemove)--> ReAuthPoolHook.OnRemove --> Manager.RemoveListener(connID)

(*ConnPool).CloseConn is the only removal path that does not invoke this hook chain, even though it's the path used by:

  • stale-conn discard inside (*ConnPool).Get when isHealthyConn returns false
  • ConnMaxIdleTime expiry
  • ConnMaxLifetime expiry
  • TCP probe / health-check failure

This is asymmetric with the surrounding code:

  • removeConnInternal and RemoveWithoutTurn invoke ProcessOnRemove.
  • Remove, CloseStaleConns-driven removals, and pool shutdown all flow through one of the above and therefore call the hook.
  • Only CloseConn skips it.

The behavioral consequence is that hook-managed state (currently: streaming credentials listeners; in principle: any future per-connection bookkeeping that uses PoolHook) is leaked whenever CloseConn is the exit path.

Possible Implementation

I think the explanation above is already quite clear.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No fields configured for Bug.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions