Redis Core Dump #1072

@dlecocq

I've been doing some updates to a set of Lua scripts we use, and in the process I've been able to repeatably trigger malloc errors during my unit tests. This is on Mac OS X 10.7.5 using gcc 4.4.6. When I first encountered this I was using an older version, but the latest 2.6.12 release also exhibits the issue. The details of these versions:

Redis server v=2.5.7 sha=7c5d96d9:0 malloc=libc bits=64
Redis server v=2.6.12 sha=00000000:0 malloc=libc bits=64

I've sent a link to the core dump to Antirez, but I figured I should post the description here.

The backtrace:

#0  0x00007fff85080ce2 in __pthread_kill ()
#1  0x00007fff88d197d2 in pthread_kill ()
#2  0x00007fff88d0aa7a in abort ()
#3  0x00007fff88d2c4ac in szone_error ()
#4  0x00007fff88d2c4e8 in free_list_checksum_botch ()
#5  0x00007fff88d3353e in tiny_malloc_from_free_list ()
#6  0x00007fff88d3400e in szone_malloc_should_clear ()
#7  0x00007fff88d35972 in szone_realloc ()
#8  0x00007fff88d69243 in malloc_zone_realloc ()
#9  0x00007fff88d6a032 in realloc ()
#10 0x000000010000a08b in zrealloc (ptr=0x102800430, size=561) at zmalloc.c:159
#11 0x0000000100008fff in sdscatlen () at sds.c:107
#12 0x0000000100033016 in luaRedisGenericCommand () at scripting.c:299
#13 0x00000001000470f2 in luaD_precall ()
#14 0x000000010005278a in luaV_execute ()
#15 0x00000001000475ed in luaD_call ()
#16 0x0000000100046c57 in luaD_rawrunprotected ()
#17 0x0000000100046ccf in luaD_pcall ()
#18 0x00000001000421c4 in lua_pcall ()
#19 0x00000001000346db in evalGenericCommand () at scripting.c:872
#20 0x00000001000068ab in call () at redis.c:1589
#21 0x0000000100006dfb in processCommand () at redis.c:1764
#22 0x000000010001164c in processInputBuffer () at networking.c:1013
#23 0x000000010000f064 in readQueryFromClient (el=<value temporarily unavailable, due to optimizations>, fd=<value temporarily unavailable, due to optimizations>, privdata=<value temporarily unavailable, due to optimizations>, mask=<value temporarily unavailable, due to optimizations>) at networking.c:1076
#24 0x0000000100001845 in aeProcessEvents () at ae.c:382
#25 0x0000000100001b1b in aeMain (eventLoop=0x100082c98) at ae.c:425
#26 0x0000000100008b6e in main (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarily unavailable, due to optimizations>) at redis.c:2711

And the register info:

rax            0x0  0
rbx            0x6  6
rcx            0x7fff5fbfecf8   140734799801592
rdx            0x0  0
rsi            0x6  6
rdi            0x1307   4871
rbp            0x7fff5fbfed20   0x7fff5fbfed20
rsp            0x7fff5fbfecf8   0x7fff5fbfecf8
r8             0x7fff74bb6fb8   140735151828920
r9             0x0  0
r10            0x7fff85080d0a   140735425285386
r11            0xffffff80002dad60   -549752820384
r12            0x1000c3000  4295766016
r13            0x1000f6000  4295974912
r14            0x7fff74bb9960   140735151839584
r15            0x1000f60c0  4295975104
rip            0x7fff85080ce2   0x7fff85080ce2 <__pthread_kill+10>
eflags         0x246    582
cs             0x7  7
ss             0x0  0
ds             0x0  0
es             0x0  0
fs             0x0  0
gs             0x0  0

Like I said, I've been able to reproducibly trigger this behavior from unit tests, though mysteriously I've not been able to reproduce it any other way. For instance, I tried capturing the commands issued from the Lua script using monitor and then replaying them through redis-cli. That approach didn't work (I imagine either because the breaking command didn't make it through, or because the bug is specific to the Lua code).
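For anyone wanting to try the same capture-and-replay approach, here is a rough sketch of turning one line of monitor output into an argument list that could be passed to redis-cli. It assumes the 2.6-style line format (a timestamp, a bracketed db/client field, then double-quoted arguments); the sample line and its key names are made up for illustration, and arguments containing binary escapes would need more careful handling than shlex provides.

```python
import shlex


def parse_monitor_line(line):
    """Split one MONITOR output line into (timestamp, argv).

    Assumes the Redis 2.6 layout:
        <timestamp> [<db> <client>] "arg0" "arg1" ...
    """
    timestamp, _, rest = line.strip().partition(" ")
    # Drop the bracketed "[db client]" field if it is present.
    if rest.startswith("["):
        _, _, rest = rest.partition("] ")
    # shlex handles the double-quoted arguments, including \" escapes.
    return float(timestamp), shlex.split(rest)


# Hypothetical sample line; the key and field names are illustrative only.
sample = '1339518083.107412 [0 lua] "hset" "example:job" "state" "failed"'
ts, argv = parse_monitor_line(sample)
print(argv)
```

Each resulting argument list could then be appended to a `redis-cli` invocation, one command at a time.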

The next thing I tried was to replay the evalsha requests I'd recorded, and I was still unable to reproduce the issue. Very mysterious, indeed. Lastly, I even tried pulling out both the Lua code and its invocation into its own Lua script, but again this proved less than fruitful.

I've found three separate chunks of code in my scripts that I can comment out and prevent the malloc error, though they seem to be completely unrelated and I imagine they are not necessarily meaningful.

Until now, we've kept about a dozen discrete lua scripts to implement the core functionality of this particular library but we're trying to pull together related pieces of code into Lua classes so as to reduce repeated code across the scripts. It's been going well so far, until I began to move the code from put.lua into the unified script. It was at this point that problems arose.

The repos in question are:

The core scripts are a submodule of the python bindings, and the only dependency should be redis-py. With this checked out, I've been able to repeatably get this to occur with a fresh empty redis instance by invoking:

# From within the qless-py repo
./test.py TestFail.test_complete_failed

It doesn't always break on the first invocation, but it usually does within the first few runs, and 10 consecutive runs have thus far always reproduced this behavior.
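Since it can take several runs, a small helper like the following can repeat the test until it first fails. This is only a sketch: `run_until_failure` is a hypothetical name, and it assumes the test command exits with a non-zero status when the Redis server dies mid-test.

```python
import subprocess


def run_until_failure(cmd, max_runs=10):
    """Run cmd up to max_runs times.

    Returns the 1-based number of the first run that exits non-zero,
    or None if every run succeeded.
    """
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return attempt
    return None
```

From within the qless-py repo, calling `run_until_failure(["./test.py", "TestFail.test_complete_failed"])` would then report which run first failed, if any.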
