Skip to content

Zombie connection in ConnPool.idleConns #3703

@alexey-s-aksenov

Description

@alexey-s-aksenov

Connections closed on +switch-master notification from Sentinel remains in ConnPool.idleConns forever

Expected Behavior

Idle connections with state "CLOSED" should be cleaned up from ConnPool.idleConns and new connections should be reestablished.

Current Behavior

Idle connections remains in ConnPool.idleConns forever.
Currently we use v9.17.2.

Possible Solution

On v9.16.0 it works as expected.
First, on +switch-master event for every connection in pool p.closeConn(cn) is called. The difference between versions is a new line cn.stateMachine.Transition(StateClosed) in Conn.Close() method, which changes the state to "CLOSED".
Second, ConnPool.popIdle() behavior was changed.
In v9.16.0 the following code returns cn, which later is being tested in p.isHealthyConn(cn, now). As a result closed connection is being removed from pool.


		if cn.CompareAndSwapUsed(false, true) {
			if cn.IsUsable() {
				p.idleConnsLen.Add(-1)
				break
			}
			cn.SetUsed(false)
		}

In v9.17.2 the code above was replaced with the following:


		// Hot path optimization: try IDLE → IN_USE or CREATED → IN_USE transition
		// Using inline TryAcquire() method for better performance (avoids pointer dereference)
		if cn.TryAcquire() {
			// Successfully acquired the connection
			p.idleConnsLen.Add(-1)
			break
		}

cn.TryAcquire() will give True if and only if the stated is in "IDLE" or "CREATED". We have state "CLOSED" here, so as a result we have cn=nil. There are no remaining available connection in pool, so we go through the branch of code where for every upcoming commands a new connection is being created.

I believe we need to return closed cn from popIdle method to be cleared in isHealthyConn as before, or remove from pool here.

Steps to Reproduce

  1. Run redis in sentinel mode
  2. Create a Failover Client with poolSize=1

        client = goredis.NewFailoverClient(&goredis.FailoverOptions{
		MasterName:      sentinelName,
		SentinelAddrs:   strings.Split(sentinels, ","),
		PoolSize:        1,
		DisableIdentity: true,
	})

  1. Send at least one command to fill the pool
  2. Trigger switch master
  3. To easier catch the bug we can use multi-exec
	
	set, err := client.Set(ctx, "testRadishKeyMulti2", "5", 0).Result()
	assert.NoError(t, err)
	assert.Equal(t, "OK", set)

	multi, err := client.Do(ctx, "MULTI").Result()
	assert.NoError(t, err)
	assert.Equal(t, "OK", multi.(string))

	incr, err := client.Do(ctx, "INCR", "testRadishKeyMulti2").Result()
	assert.NoError(t, err)
	assert.Equal(t, "QUEUED", incr.(string))

	decr, err := client.Do(ctx, "DECR", "testRadishKeyMulti2").Result()
	assert.NoError(t, err)
	assert.Equal(t, "QUEUED", decr.(string))

	exec, err := client.Do(ctx, "EXEC").Result()
	assert.NoError(t, err)
	execArray, ok := exec.([]any)
	assert.True(t, ok)
	assert.EqualValues(t, int64(6), execArray[0])
	assert.EqualValues(t, int64(5), execArray[1])

Context (Environment)

We moved from v9.16.0 due to that maintenance bug

Detailed Description

Possible Implementation

Metadata

Metadata

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions