Skip to content

[BUG] valkey-server slows down with spop <key> and srandmember <key> spinning in hashtableScanDefrag after a few hours #2019

Description

@Fusl

After a few hours of uptime, one of our valkey-server processes (Docker 8.1.1-alpine and 8.1.1 images) starts consuming 100% CPU usage and only sporadically responds to application requests.

It appears that one of our applications continuously running SPOP <set> <n> (while another application is continuously filling this same set with new data) is working just fine until a few hours later where CPU usage suddenly increases from the expected 25% to 100% and Valkey is responding to requests only very slowly and only after 30-60+ seconds.

I captured a couple gdb backtraces that show valkey is spending a lot of time in hashtableScanDefrag when this issue happens. All of the gdb backtraces I captured are the exact same except for the function argument pointer and cursor values varying:

#0  0x000059218e456671 in hashtableScanDefrag (defragfn=0x0, flags=0, privdata=<synthetic pointer>, fn=<optimized out>, cursor=1972326, ht=0x725cbcb93ab0) at /usr/src/valkey/src/hashtable.c:1722
#1  hashtableScanDefrag (flags=<optimized out>, defragfn=<optimized out>, privdata=<optimized out>, fn=<optimized out>, cursor=<optimized out>, ht=<optimized out>) at /usr/src/valkey/src/hashtable.c:1707
#2  hashtableScan (fn=<optimized out>, privdata=<synthetic pointer>, cursor=1972326, ht=0x725cbcb93ab0) at /usr/src/valkey/src/hashtable.c:1686
#3  hashtableScan (privdata=<synthetic pointer>, fn=<optimized out>, cursor=1972326, ht=0x725cbcb93ab0) at /usr/src/valkey/src/hashtable.c:1685
#4  hashtableSampleEntries (ht=<optimized out>, dst=dst@entry=0x7ffe67b59750, count=count@entry=280) at /usr/src/valkey/src/hashtable.c:2031
#5  0x000059218e4571c2 in hashtableFairRandomEntry (ht=<optimized out>, found=0x7ffe67b5a058) at /usr/src/valkey/src/hashtable.c:2007
#6  0x000059218e4cc7e5 in setTypeRandomElement (setobj=setobj@entry=0x725d06994530, str=str@entry=0x7ffe67b5a0c8, len=len@entry=0x7ffe67b5a0c0, llele=llele@entry=0x7ffe67b5a0b8) at /usr/src/valkey/src/t_set.c:424
#7  0x000059218e4cc97d in setTypePopRandom (set=set@entry=0x725d06994530) at /usr/src/valkey/src/t_set.c:463
#8  0x000059218e4ceb61 in spopWithCountCommand (c=0x726078f57700) at /usr/src/valkey/src/t_set.c:853
#9  0x000059218e4633de in call (c=c@entry=0x726078f57700, flags=flags@entry=3) at /usr/src/valkey/src/server.c:3742
#10 0x000059218e464b4c in processCommand (c=0x726078f57700) at /usr/src/valkey/src/server.c:4366
#11 0x000059218e48bdf1 in processCommandAndResetClient (c=0x726078f57700) at /usr/src/valkey/src/networking.c:3118
#12 processInputBuffer (c=c@entry=0x726078f57700) at /usr/src/valkey/src/networking.c:3246
#13 0x000059218e48d928 in readQueryFromClient (conn=<optimized out>) at /usr/src/valkey/src/networking.c:3351
#14 0x000059218e5ad375 in callHandler (handler=<optimized out>, conn=0x726049b26d80) at /usr/src/valkey/src/connhelpers.h:79
#15 connSocketEventHandler (el=<optimized out>, fd=<optimized out>, clientData=0x726049b26d80, mask=<optimized out>) at /usr/src/valkey/src/socket.c:300
#16 0x000059218e44e449 in aeProcessEvents (flags=27, eventLoop=0x72638b840780) at /usr/src/valkey/src/ae.c:486
#17 aeProcessEvents (flags=27, eventLoop=0x72638b840780) at /usr/src/valkey/src/ae.c:411
#18 aeMain (eventLoop=0x72638b840780) at /usr/src/valkey/src/ae.c:543
#19 0x000059218e442a4a in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/valkey/src/server.c:7219

Some more information regarding the specific key this is happening to:

127.0.0.1:38542> debug object urls:unretrievable
Value at:0x725d06994530 refcount:1 encoding:hashtable serializedlength:1668169541 lru:1044411 lru_seconds_idle:0 lfu_freq:0 lfu_access_time_minutes:4079
(20.73s)
127.0.0.1:38542> debug htstats-key urls:unretrievable full
Hash table 0 stats (main hash table):
 table size: 14680064
 number of entries: 1850367
 top-level buckets: 2097152
 child buckets: 200170
 max chain length: 4
 avg chain length: 0.10
 chain length distribution:
   0: 1958960 (93.41%)
   1: 81106 (3.87%)
   2: 52290 (2.49%)
   3: 4700 (0.22%)
   4: 96 (0.00%)
Hash table 1 stats (rehashing target):
 table size: 29360128
 number of entries: 14611037
 top-level buckets: 4194304
 child buckets: 699571
 max chain length: 3
 avg chain length: 0.17
 chain length distribution:
   0: 3507259 (83.62%)
   1: 674541 (16.08%)
   2: 12482 (0.30%)
   3: 22 (0.00%)

config (note that we've also tried with setting activedefrag no and activerehashing no during runtime with no effect to combat this issue, but we're trying to re-run or tests with starting up valkey using activedefrag no now):

valkey-server
--activedefrag yes
--aof-timestamp-enabled yes
--appendfsync everysec
--appendonly yes
--bind ::
--bind 0.0.0.0
--client-output-buffer-limit normal 1gb 16mb 15
--client-output-buffer-limit pubsub 64mb 8mb 15
--client-output-buffer-limit replica 16gb 1gb 300
--client-query-buffer-limit 512mb
--databases 1
--dual-channel-replication-enabled yes
--enable-debug-command local
--enable-module-command local
--enable-protected-configs local
--hide-user-data-from-log no
--io-threads 8
--log-format logfmt
--loglevel verbose
--log-timestamp-format iso8601
--maxmemory 128gb
--maxmemory-clients 1%
--maxmemory-policy noeviction
--max-new-connections-per-cycle 100
--no-appendfsync-on-rewrite yes
--oom-score-adj relative
--port ${PORT}
--proto-max-bulk-len 512mb
--repl-backlog-size 64mb
--repl-diskless-load swapdb
--repl-ping-replica-period 1
--repl-timeout 300
--requirepass ${PASS}
--save ''
--set-proc-title yes
--proc-title-template "{title} ${PROJECT} {listen-addr} {server-mode}"
--shutdown-on-sigint nosave
--shutdown-on-sigterm nosave
--shutdown-timeout 300
--supervised no
--tcp-backlog 1024
--tcp-keepalive 30

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions