Issue tracker is used for reporting bugs and discussing new features. Please use
stackoverflow for supporting issues.
The MaxRedirects mechanism does not work if a dial tcp: i/o timeout error is received
Expected Behavior
When using ClusterClient, it seems to me that a strange behavior occurs. When accessing a server that is unavailable, I get the error dial tcp: I/o timeout. I expect the library to use the MaxRedirects mechanism, repeat the request, and then, in case of an error, redirect the request to another slave.
Current Behavior
However, now the shouldRetry function returns false in case of a dial tcp: i/o timeout error and the request is not repeated.
The possible problem is here
// error.go
func isTimeoutError(err error) (isTimeout bool, hasTimeoutFlag bool) {
// Check for timeoutError interface (works with wrapped errors)
var te timeoutError
if errors.As(err, &te) {
return true, te.Timeout()
}
// Check for net.Error specifically (common case for network timeouts)
var netErr net.Error
if errors.As(err, &netErr) {
return true, netErr.Timeout()
}
return false, false
}
func shouldRetry(err error, retryTimeout bool) bool {
...
// Check for timeout errors (works with wrapped errors)
if isTimeout, hasTimeoutFlag := isTimeoutError(err); isTimeout {
if hasTimeoutFlag {
return retryTimeout
}
return true
}
//osscluster.go
func (c *ClusterClient) process(ctx context.Context, cmd Cmder) error {
...
if shouldRetry(lastErr, cmd.readTimeout() == nil) && !cmd.NoRetry() {
// First retry the same node.
if attempt == 0 {
continue
}
// Second try another node.
node.MarkAsFailing()
node = nil
continue
}
...
}
So when you have readTimeout you always will get false on dial tcp errors and request will not be repeated because this function from net package will always return true
func (e *OpError) Timeout() bool {
if ne, ok := e.Err.(*os.SyscallError); ok {
t, ok := ne.Err.(timeout)
return ok && t.Timeout()
}
t, ok := e.Err.(timeout)
return ok && t.Timeout()
}
Possible Solution
Add an additional check for the dial tcp: i/o timeout error to the shouldRetry method and repeat it on one and then on the other node.
As a fix I can
// errors.go
// isDialError reports whether err is a connection-establishment error,
// meaning the TCP dial failed before any data was sent to the server.
func isDialError(err error) bool {
var opErr *net.OpError
return errors.As(err, &opErr) && opErr.Op == "dial"
}
func shouldRetry(err error, retryTimeout bool) bool {
...
// Check for timeout errors (works with wrapped errors)
if isTimeout, hasTimeoutFlag := isTimeoutError(err); isTimeout {
if hasTimeoutFlag {
// A dial error means the TCP connection was never established and the
// command was never sent to the server, so retry is always safe
if isDialError(err) {
return true
}
return retryTimeout
}
return true
...
}
Steps to Reproduce
Use a redis cluster, make key requests from certain slaves, and disable one of them so that requests start falling by error dial tcp: i/o timeout
Context (Environment)
In the case of updating redis machines or in the case when one slave crashes for some reason, I expect that the MaxRedirects mechanism will turn on, requests will be repeated first to the same slave, then to another (healthy) slave. But in the case of a dial tcp: i/o timeout error, this does not happen and the code has to go to the database, thereby increasing the load on it and reducing the overall response time. Although the dial tcp: i/o timeout error seems completely safe to repeat.
Issue tracker is used for reporting bugs and discussing new features. Please use
stackoverflow for supporting issues.
The MaxRedirects mechanism does not work if a dial tcp: i/o timeout error is received
Expected Behavior
When using ClusterClient, it seems to me that a strange behavior occurs. When accessing a server that is unavailable, I get the error dial tcp: I/o timeout. I expect the library to use the MaxRedirects mechanism, repeat the request, and then, in case of an error, redirect the request to another slave.
Current Behavior
However, now the shouldRetry function returns false in case of a dial tcp: i/o timeout error and the request is not repeated.
The possible problem is here
So when you have readTimeout you always will get false on dial tcp errors and request will not be repeated because this function from net package will always return true
Possible Solution
Add an additional check for the dial tcp: i/o timeout error to the shouldRetry method and repeat it on one and then on the other node.
As a fix I can
Steps to Reproduce
Use a redis cluster, make key requests from certain slaves, and disable one of them so that requests start falling by error dial tcp: i/o timeout
Context (Environment)
In the case of updating redis machines or in the case when one slave crashes for some reason, I expect that the MaxRedirects mechanism will turn on, requests will be repeated first to the same slave, then to another (healthy) slave. But in the case of a dial tcp: i/o timeout error, this does not happen and the code has to go to the database, thereby increasing the load on it and reducing the overall response time. Although the dial tcp: i/o timeout error seems completely safe to repeat.