Skip to content

Dial timeout error is not redirectable #3786

@vladisa88

Description

@vladisa88

Issue tracker is used for reporting bugs and discussing new features. Please use
stackoverflow for supporting issues.

The MaxRedirects mechanism does not work if a dial tcp: i/o timeout error is received

Expected Behavior

When using ClusterClient, it seems to me that a strange behavior occurs. When accessing a server that is unavailable, I get the error dial tcp: I/o timeout. I expect the library to use the MaxRedirects mechanism, repeat the request, and then, in case of an error, redirect the request to another slave.

Current Behavior

However, now the shouldRetry function returns false in case of a dial tcp: i/o timeout error and the request is not repeated.

The possible problem is here

// error.go
func isTimeoutError(err error) (isTimeout bool, hasTimeoutFlag bool) {
	// Check for timeoutError interface (works with wrapped errors)
	var te timeoutError
	if errors.As(err, &te) {
		return true, te.Timeout()
	}

	// Check for net.Error specifically (common case for network timeouts)
	var netErr net.Error
	if errors.As(err, &netErr) {
		return true, netErr.Timeout()
	}

	return false, false
}

func shouldRetry(err error, retryTimeout bool) bool {
	...

	// Check for timeout errors (works with wrapped errors)
	if isTimeout, hasTimeoutFlag := isTimeoutError(err); isTimeout {
		if hasTimeoutFlag {
			return retryTimeout
		}
		return true
	}

//osscluster.go
func (c *ClusterClient) process(ctx context.Context, cmd Cmder) error {
...
if shouldRetry(lastErr, cmd.readTimeout() == nil) && !cmd.NoRetry() {
			// First retry the same node.
			if attempt == 0 {
				continue
			}

			// Second try another node.
			node.MarkAsFailing()
			node = nil
			continue
		}
...
}

So when you have readTimeout you always will get false on dial tcp errors and request will not be repeated because this function from net package will always return true

func (e *OpError) Timeout() bool {
	if ne, ok := e.Err.(*os.SyscallError); ok {
		t, ok := ne.Err.(timeout)
		return ok && t.Timeout()
	}
	t, ok := e.Err.(timeout)
	return ok && t.Timeout()
}

Possible Solution

Add an additional check for the dial tcp: i/o timeout error to the shouldRetry method and repeat it on one and then on the other node.

As a fix I can

// errors.go

// isDialError reports whether err is a connection-establishment error,
// meaning the TCP dial failed before any data was sent to the server.
func isDialError(err error) bool {
	var opErr *net.OpError
	return errors.As(err, &opErr) && opErr.Op == "dial"
}

func shouldRetry(err error, retryTimeout bool) bool {
	...

	// Check for timeout errors (works with wrapped errors)
	if isTimeout, hasTimeoutFlag := isTimeoutError(err); isTimeout {
		if hasTimeoutFlag {
			// A dial error means the TCP connection was never established and the
			// command was never sent to the server, so retry is always safe
			if isDialError(err) {
				return true
			}
			return retryTimeout
		}
		return true
        ...
}

Steps to Reproduce

Use a redis cluster, make key requests from certain slaves, and disable one of them so that requests start falling by error dial tcp: i/o timeout

Context (Environment)

In the case of updating redis machines or in the case when one slave crashes for some reason, I expect that the MaxRedirects mechanism will turn on, requests will be repeated first to the same slave, then to another (healthy) slave. But in the case of a dial tcp: i/o timeout error, this does not happen and the code has to go to the database, thereby increasing the load on it and reducing the overall response time. Although the dial tcp: i/o timeout error seems completely safe to repeat.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions