
Lettuce cannot recover from connection problems #1428

@adrianpasternak


Bug Report

Current Behavior

During troubleshooting of our production issues with Lettuce and Redis Cluster, we have discovered issues with re-connection of Pub/Sub subscriptions after network problems.

Lettuce does not send keep-alive packets on TCP connections dedicated to Pub/Sub subscriptions. Without keep-alives, in the rare case of a sudden connection loss to a Redis node, Lettuce cannot detect that the connection is no longer working. With the default OS configuration, it waits for hours until the OS closes the connection. In the meantime, all messages published to the channel are lost.
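To put a number on "hours": a rough worked calculation, assuming a Linux host with stock kernel settings and SO_KEEPALIVE enabled on the socket. The kernel waits tcp_keepalive_time seconds of idleness before the first probe, then sends up to tcp_keepalive_probes probes spaced tcp_keepalive_intvl seconds apart. The class name `KeepAliveMath` and the tuned values in the second call are illustrative assumptions, not taken from this report.

```java
import java.time.Duration;

/** Estimates how long the kernel needs to give up on an idle, silently dead TCP peer. */
final class KeepAliveMath {

    /** Idle time before the first probe, plus one interval per unanswered probe. */
    static Duration detectionTime(int keepAliveTimeSec, int keepAliveIntvlSec, int keepAliveProbes) {
        long seconds = keepAliveTimeSec + (long) keepAliveIntvlSec * keepAliveProbes;
        return Duration.ofSeconds(seconds);
    }

    public static void main(String[] args) {
        // Linux defaults: tcp_keepalive_time=7200, tcp_keepalive_intvl=75, tcp_keepalive_probes=9
        System.out.println(detectionTime(7200, 75, 9)); // PT2H11M15S

        // Illustrative tuned values (an assumption, not the values used in this report)
        System.out.println(detectionTime(15, 5, 3));    // PT30S
    }
}
```

With the defaults, a subscriber can sit on a dead connection for a little over two hours before the OS reports an error.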

Input Code

The minimal code from the Lettuce docs is enough to reproduce the issue.

        RedisClusterClient clusterClient = RedisClusterClient.create(Arrays.asList(node1, node2, node3));

        ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofSeconds(15))
                .enableAllAdaptiveRefreshTriggers()
                .build();

        clusterClient.setOptions(ClusterClientOptions.builder()
                .topologyRefreshOptions(topologyRefreshOptions)
                .build());

        StatefulRedisPubSubConnection<String, String> connection = clusterClient.connectPubSub();
        connection.addListener(new RedisPubSubListener<String, String>() { ... } );

        RedisPubSubCommands<String, String> sync = connection.sync();
        sync.subscribe("broadcast");

To reproduce the issue:

  • Start Redis Cluster.
  • Connect to the cluster and subscribe to the channel using the code above.
  • Find which server the client is connected to, using tcpdump or by running redis-cli PUBSUB CHANNELS *.
  • Block all network traffic on that server using iptables (killing the Redis process is not enough: the OS will send FIN packets, and Lettuce will detect the problem and recover the subscription).
  • Redis Cluster recovers by promoting one of the replicas to master.
  • Lettuce does not detect that the connection is no longer working and does not receive messages published to the channel. The OS closes the unused connection after a couple of hours, and only then might Lettuce be able to fix the problem.

We were also able to reproduce the issue with Redis Standalone:

  • Connect to Pub/Sub using Lettuce.
  • Kill traffic on the master using iptables. Restart the VM running Redis and restore traffic.
  • Lettuce does not detect the issue and keeps listening on a dead connection.

Expected behavior/code

Lettuce should detect a broken connection so that it can re-establish Pub/Sub subscriptions.

Environment

  • Lettuce version(s): 5.3.4.RELEASE
  • Redis version: 5.0.5

Possible Solution

We ran similar tests with redis-cli. The official client sends keep-alive packets every 15 seconds and is able to detect connection loss.

It would be best if Lettuce could send keep-alive packets on a Pub/Sub connection to detect network problems. That should enable Lettuce to fix Pub/Sub subscriptions.
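Until something like that exists, an application could approximate the detection itself. The following is a minimal sketch of the idea, not Lettuce's mechanism: `StalenessWatchdog` is a hypothetical helper that records when traffic was last seen (for example a message from a heartbeat channel the application publishes to itself) and reports the connection as stale once it has been silent for longer than a timeout. The clock is injected so the logic can be exercised without waiting.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper: flags a connection as stale when nothing has arrived for too long. */
final class StalenessWatchdog {

    private final LongSupplier clockMillis; // injected time source, e.g. System::currentTimeMillis
    private final long timeoutMillis;
    private volatile long lastSeenMillis;

    StalenessWatchdog(LongSupplier clockMillis, long timeoutMillis) {
        this.clockMillis = clockMillis;
        this.timeoutMillis = timeoutMillis;
        this.lastSeenMillis = clockMillis.getAsLong();
    }

    /** Call from the Pub/Sub listener whenever any message or heartbeat arrives. */
    void onTraffic() {
        lastSeenMillis = clockMillis.getAsLong();
    }

    /** True once the silence has lasted strictly longer than the timeout. */
    boolean isStale() {
        return clockMillis.getAsLong() - lastSeenMillis > timeoutMillis;
    }

    public static void main(String[] args) {
        long[] now = {0L}; // fake clock for the demonstration
        StalenessWatchdog watchdog = new StalenessWatchdog(() -> now[0], 30_000L);

        now[0] = 10_000L;
        watchdog.onTraffic();                   // heartbeat received at t = 10 s
        now[0] = 35_000L;
        System.out.println(watchdog.isStale()); // false: 25 s of silence
        now[0] = 45_000L;
        System.out.println(watchdog.isStale()); // true: 35 s of silence
    }
}
```

A scheduled task would poll `isStale()` and, when it returns true, close the connection and subscribe again. This only works if the channel is guaranteed to carry traffic more often than the timeout, which is why a dedicated heartbeat is assumed here.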

Workarounds

We found a workaround by tuning OS parameters (tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes), but we would like to avoid changing OS parameters on every machine that uses Lettuce as a Redis client.
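For comparison, the same three timings can in principle be set per socket instead of system-wide. The sketch below uses a plain `java.net.Socket` and the JDK's `jdk.net.ExtendedSocketOptions` (Java 11+, platform-dependent) purely as a stand-in: Lettuce runs on Netty, and whether and how it exposes these options is not covered here. `PerSocketKeepAlive` and the values 15/5/3 are illustrative assumptions.

```java
import java.io.IOException;
import java.net.Socket;
import java.net.SocketOption;
import java.net.StandardSocketOptions;
import java.util.Set;
import jdk.net.ExtendedSocketOptions;

/** Sketch: per-socket equivalents of tcp_keepalive_time / _intvl / _probes. */
final class PerSocketKeepAlive {

    /**
     * Enables SO_KEEPALIVE and, where the platform supports it, overrides the timings.
     * Returns true if the timings were applied, false if only SO_KEEPALIVE was set.
     */
    static boolean configure(Socket socket, int idleSec, int intervalSec, int probes) throws IOException {
        socket.setOption(StandardSocketOptions.SO_KEEPALIVE, true);

        Set<SocketOption<?>> supported = socket.supportedOptions();
        if (!supported.contains(ExtendedSocketOptions.TCP_KEEPIDLE)
                || !supported.contains(ExtendedSocketOptions.TCP_KEEPINTERVAL)
                || !supported.contains(ExtendedSocketOptions.TCP_KEEPCOUNT)) {
            return false; // fall back to the OS-wide timings
        }
        socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, idleSec);         // like tcp_keepalive_time
        socket.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, intervalSec); // like tcp_keepalive_intvl
        socket.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, probes);         // like tcp_keepalive_probes
        return true;
    }

    public static void main(String[] args) throws IOException {
        // The options can be set before the socket is connected.
        try (Socket socket = new Socket()) {
            boolean tuned = configure(socket, 15, 5, 3);
            System.out.println("keep-alive on: " + socket.getKeepAlive() + ", timings applied: " + tuned);
        }
    }
}
```

Checking `supportedOptions()` first keeps the sketch from throwing on platforms where the extended options are unavailable.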


Labels: for: stackoverflow (A question that is better suited to stackoverflow.com)
