[BUG] Migrating hashes with millions of keys times out and causes failover #13122

@javedsha

Describe the bug

During a reshard operation, if the key being migrated is a hash, the MIGRATE operation can time out and cause Redis to fail over, depending on the size of the hash and its number of fields.

In our case, the hash is 300 MB in size and has 3 million fields.
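
For anyone trying to check their own keys before a reshard, here is a minimal sketch of how a hash like this could be measured. It assumes a redis-py style client exposing `hlen` and `memory_usage` (which wrap HLEN and MEMORY USAGE). The `HashSize` type, the `is_risky` helper, and both thresholds are hypothetical and only for illustration; they are not Redis limits.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HashSize:
    key: str
    fields: int
    bytes_used: Optional[int]  # None if MEMORY USAGE returned nil (key missing)


def measure_hash(client, key: str) -> HashSize:
    """Collect the field count and approximate memory footprint of one hash.

    `client` is assumed to behave like redis-py's Redis object.
    MEMORY USAGE samples nested values by default, so the byte count
    is an estimate rather than an exact size.
    """
    fields = client.hlen(key)
    bytes_used = client.memory_usage(key)
    return HashSize(key=key, fields=fields, bytes_used=bytes_used)


def is_risky(size: HashSize,
             max_fields: int = 1_000_000,
             max_bytes: int = 100 * 1024 * 1024) -> bool:
    """Flag hashes above illustrative thresholds (assumed values, tune as needed)."""
    too_many_fields = size.fields > max_fields
    too_large = size.bytes_used is not None and size.bytes_used > max_bytes
    return too_many_fields or too_large
```

With redis-py this might be called as `is_risky(measure_hash(redis.Redis(host, port), "some_key"))`; a hash like the one described here (3 million fields, about 300 MB) would be flagged under the default thresholds.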

The main problem is that MIGRATE is a blocking command: it blocks the primary, so the rest of the cluster considers the primary down and triggers a failover.

StackExchange.Redis error logs:

TimeStamp: 2024-03-08T16:53:32.534223Z
Timeout awaiting response (outbound=0KiB, inbound=0KiB, 4100ms elapsed, timeout is 4000ms), command=MIGRATE, next: some_random_key, inst: 0, qu: 0, qs: 0, aw: False, bw: SpinningDown, rs: ReadAsync, ws: Idle, in: 0, last-in: 2, cur-in: 0, sync-ops: 0, async-ops: 2193844, serverEndpoint: 172.20.0.6:6380, conn-sec: 670.01, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: mtcache000002(SE.Redis-v2.6.116.40240), PerfCounterHelperkeyHashSlot: 9271, IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=32766,Min=8,Max=32767), POOL: (Threads=6,QueuedItems=0,CompletedItems=6635139), v: 2.6.116.40240 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts),

We get two such errors before Redis fails over. In Redis, the node timeout (cluster-node-timeout) is set to 5 seconds.

Redis Logs:

16:53:35.041 * FAIL message received from 1a52537ed371931ec4436e02afdaae61fd061c17 about 42b37d2039622543514545a6cba3807e4db0b776
16:53:35.133 # Start of election delayed for 805 milliseconds (rank #0, offset 1269560724769).
16:53:36.736 # Configuration change detected. Reconfiguring myself as a replica of 42ac8718f2670e442fe598923f6069a66148e717
16:53:36.739 * Before turning into a replica, using my own master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.

SLOWLOG output (screenshot) showing that MIGRATE took 9.3 seconds.
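
To make the timing relationship explicit, here is a small sketch using the numbers from this report: a 4 second client timeout, a 5 second node timeout, and a 9.3 second MIGRATE. The function names are hypothetical. The client-timeout count is a rough model that assumes the client issues its next command as soon as the previous one times out; it is not a description of how StackExchange.Redis actually schedules commands.

```python
def blocks_past_node_timeout(command_seconds: float,
                             node_timeout_seconds: float) -> bool:
    """True if a single blocking command outlasts the cluster node timeout.

    While the command runs, the primary cannot answer cluster pings,
    so other nodes may flag it as failing once the timeout elapses.
    """
    return command_seconds > node_timeout_seconds


def client_timeouts_during(command_seconds: float,
                           client_timeout_seconds: float) -> int:
    """Rough count of back-to-back client timeouts that fit inside the
    blocking window (assumes an immediate follow-up command each time)."""
    return int(command_seconds // client_timeout_seconds)


# Values taken from this report.
MIGRATE_SECONDS = 9.3
NODE_TIMEOUT_SECONDS = 5.0
CLIENT_TIMEOUT_SECONDS = 4.0

print(blocks_past_node_timeout(MIGRATE_SECONDS, NODE_TIMEOUT_SECONDS))
print(client_timeouts_during(MIGRATE_SECONDS, CLIENT_TIMEOUT_SECONDS))
```

Under these assumptions the 9.3 second MIGRATE exceeds the 5 second node timeout, and the blocking window is long enough for two 4 second client timeouts, which is consistent with the two errors seen before the failover.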

We are looking for:

  • How to migrate large hashes (hashsets)?
  • How to prevent a MIGRATE timeout from causing a failover of the Redis primary?
