Skip to content

Conversation

@filipecosta90
Copy link
Contributor

@filipecosta90 filipecosta90 commented Jul 8, 2020

This PR enables tracking time of the background tasks and on replies, opening the door for properly tracking commands that rely on blocking/background work via the slowlog, latency history, and commandstats.

Some notes:

  • The time spent blocked waiting for key changes, or blocked on synchronous replication is not accounted for.

  • This PR does not affect latency tracking of commands that are non-blocking or do not have background work. ( meaning that it all stays the same with exception to BZPOPMIN,BZPOPMAX,BRPOP,BLPOP, etc... and module's commands that rely on background threads ).

  • Specifically for latency history command we've added a new event class named command-unblocking that will enable latency monitoring on commands that spawn background threads to do the work.

  • For blocking commands we're now considering the total time of a command as the time spent on call() + the time spent on replying when unblocked.

  • For Modules commands that rely on background threads we're now considering the total time of a command as the time spent on call (main thread) + the time spent on the background thread ( if marked within RedisModule_MeasureTimeStart() and RedisModule_MeasureTimeEnd() ) + the time spent on replying (main thread)

test

To test for this feature we've added a unit/moduleapi/blockonbackground test that relies on a module that blocks the client and sleeps on the background for a given time.

  • check blocked command that uses RedisModule_MeasureTimeStart() is tracking background time
  • check blocked command that uses RedisModule_MeasureTimeStart() is tracking background time even in timeout
  • check blocked command with multiple calls RedisModule_MeasureTimeStart() is tracking the total background time
  • check blocked command without calling RedisModule_MeasureTimeStart() is not reporting background time

Samples

current behaviour ( not tracking time spent on background ):

127.0.0.1:6379> config set slowlog-log-slower-than 100000
OK
127.0.0.1:6379> slowlog len
(integer) 0
127.0.0.1:6379> block.debug 0 10000
(integer) 343176910
127.0.0.1:6379> slowlog len
(integer) 0
127.0.0.1:6379> block.debug 2000 10000
(integer) 1207730440
(2.00s)
127.0.0.1:6379> slowlog len
(integer) 0

commandstats example

127.0.0.1:6379> block.debug 2000 10000
(integer) 1768881473
(2.00s)
127.0.0.1:6379> block.debug 4000 10000
(integer) 630830182
(4.00s)
127.0.0.1:6379> info commandstats
# Commandstats
cmdstat_block.debug:calls=2,usec=226,usec_per_call=113.00

new behaviour ( tracking time spend on background ):

127.0.0.1:6379> config set slowlog-log-slower-than 100000
OK
127.0.0.1:6379> slowlog len
(integer) 0
127.0.0.1:6379> block.debug 0 10000
(integer) 1471490433
127.0.0.1:6379> slowlog len
(integer) 0
127.0.0.1:6379> block.debug 2000 10000
(integer) 1127835047
(2.00s)
127.0.0.1:6379> slowlog len
(integer) 1
127.0.0.1:6379> slowlog get 1
1) 1) (integer) 0
   2) (integer) 1593712320
   3) (integer) 2053204
   4) 1) "block.debug"
      2) "2000"
      3) "10000"
   5) "127.0.0.1:51818"
   6) ""

commandstats example

127.0.0.1:6379> block.debug 2000 10000
(integer) 1052512224
(2.00s)
127.0.0.1:6379> block.debug 4000 10000
(integer) 215961991
(4.00s)
127.0.0.1:6379> info commandstats
# Commandstats
cmdstat_block.debug:calls=2,usec=6099877,usec_per_call=3049938.50

latency monitor example

127.0.0.1:6379> config set latency-monitor-threshold 200
OK
127.0.0.1:6379> block.debug 500 1000
(integer) 1887208743
(0.50s)
127.0.0.1:6379> latency history command-unblocking
1) 1) (integer) 1611265472
   2) (integer) 500

Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd like more feedback about this.
@guybe7 @yossigo @soloestoy at least one or two of you have a look.

@yossigo
Copy link
Collaborator

yossigo commented Jul 12, 2020

@oranagra @filipecosta90 I think there's an interesting use case discussion to make here, as this diverges a bit from the current SLOWLOG purpose.

For the sake of the argument, we could claim that if we want SLOWLOG to have more visibility on this, we also want to query socket-level information like RTT from the OS, try to measure end-to-end latency (i.e. ACK on last reply byte), etc.

@oranagra
Copy link
Member

the (maybe subjective) way i see it, the LATENCY command is about any delay in the main thread (be it fsync and such), and the SLOWLOG command is about the command execution time (and it's reply).

however, the shortcoming of LATENCY does luck any detail of the specific command that caused latency (just says command or fast-command).

the SLOWLOG output can indeed include details on various types of measurement (network time, main thread time, total time, etc), but then we need to decide which one we match to the threshold that decide if it is logged or not (i suppose the highest one).

…ods within reply time tracking. [add] tracking reply time on module clients blocked on keys
…ents background time. Reducede useless time on blocked test from 2secs to 100ms. tracking blocked time value on test also.
Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i discussed this PR with the core team this morning.
we feel it may not be right to log the entire time the command was blocked in the slowlog and and cmdstat.
i suppose modules block clients in order to execute the command in the background thread and reply later.
but that doesn't mean the entire blocked time is the processing time, some time can be lost on waiting for a worker thread to pick up the job.
i think we may need find a way the module can help us distinguish between the blocked time and the processing time, and we only want to include the processing time in the slowlog and cmdstats.
let's try to get @MeirShpilraien and @swilly22 input here.

@MeirShpilraien
Copy link

@oranagra I agree it should not be in the slow log or commands stat because those indicate the actual processing time, maybe we can introduce a new RedisModuleApi that takes the block client and add execution time to it? Something like:

RM_MeasureTimeStart(bc);
...
RM_MeasureTimeEnd(bc);

The module can call those multiple times (without acquiring the lock!!) and this allows adding the time measure to the command execution time.
Needless to say, those names are debatable :)
WDYT?

@oranagra
Copy link
Member

@MeirShpilraien since this is a voluntary call, i wonder if many modules will use it.
so maybe in addition to that we need to indeed add a field indicating the blocked time to be used for modules that didn't bother to use this new API.
@yossigo @madolson WDYT?

@filipecosta90
Copy link
Contributor Author

filipecosta90 commented Jan 19, 2021

@oranagra I agree it should not be in the slow log or commands stat because those indicate the actual processing time, maybe we can introduce a new RedisModuleApi that takes the block client and add execution time to it? Something like:

RM_MeasureTimeStart(bc);
...
RM_MeasureTimeEnd(bc);

The module can call those multiple times (without acquiring the lock!!) and this allows adding the time measure to the command execution time.
Needless to say, those names are debatable :)
WDYT?

@MeirShpilraien , @oranagra I tend to disagree with the extra RM API and not tracking the entire background time of that thread. When we block the client ( via RM_BlockClient ) by all means we're making the client wait while executing that command, meaning all work required to reply should be accounted for given that is is seen as execution time from the client perspective ( for me as a client I don't care if you put it into two queues on the BG, or wait 1sec and then start working, the effective wait time of that client should start counting when we block ).

If we do otherwise we will be making it hard to detect queue times and make it even harder for this feature to take effect given people will need to explictly call a new API. IMHO all background time belongs to the command time. WDYT?

@MeirShpilraien
Copy link

MeirShpilraien commented Jan 19, 2021

@filipecosta90 I agree we should track the entire time as well but not in the slow log because the slow log shows only the processing time. Imagine the module implement the background operation with a single background thread that takes the next job from a queue, one long operation might cause all the others to also be a delay and enter the slow log (although their actual execution time is very short). This is why I suggested the new API so we can measure the actual time it takes to process the command...

@filipecosta90
Copy link
Contributor Author

@filipecosta90 I agree we should track the entire time as well but not in the slow log because the slow log shows only the processing time. Imagine the module implement the background operation with a single background thread that takes the next job from a queue, one long operation might cause all the others to also be a delay and enter the slow log (although their actual execution time is very short). This is why I suggested the new API so we can measure the actual time it takes to process the command...

@MeirShpilraien I believe that by default all background time should be taken into account and only on opt-in usage of the new API you would be allowed to have more control on time tracking ( meaning the module developer will acknowledge that there are times he does not want to be present on the slowlog ) by using RM_MeasureTimeStart(bc); In that manner we would have out of the box latencies that at most are overzealous and not the opposite. WDYT?

@madolson
Copy link
Contributor

@filipecosta90 I don't think the default time should be total background processing time. If someone is using a queue mechanism like meir mentioned, the slowlog will report commands that are slow simply if they are blocked behind other commands, and that will probably be confusing to end users since that isn't normally what the slowlog shows. I think having an option or a command to opt in a command to total background time tracking, that would be fine though.

@soloestoy
Copy link
Contributor

command-unblocking latency makes sense to me, but not sure if it is appropriate to push the original block command to slowlog, now it means the unblocking time instead of execution time right?

@filipecosta90
Copy link
Contributor Author

filipecosta90 commented Jan 28, 2021

now it means the unblocking time instead of execution time right?

@soloestoy it still means overall execution time, but now includes not only the time until blocking but also the reply time ( for blocking commands that are not from modules ) and means time until blocking ( main thread) + thread execution time ( when using the new RM API ) + reply time (main thread).

https://github.com/redis/redis/pull/7491/files#diff-ddd6be3a7a9289fbf41890c34da16f1ea5b0a8d6d9206ae1ff5303e50bae1970R107

@soloestoy
Copy link
Contributor

@filipecosta90 OK, I got it, and it would be nice to document this change about slowlog https://github.com/redis/redis-doc/blob/master/commands/slowlog.md

@filipecosta90
Copy link
Contributor Author

@filipecosta90 OK, I got it, and it would be nice to document this change about slowlog https://github.com/redis/redis-doc/blob/master/commands/slowlog.md

will do so @soloestoy :)

@itamarhaber itamarhaber added the state:needs-doc-pr requires a PR to redis-doc repository label Jan 29, 2021
@oranagra oranagra merged commit f0c5052 into redis:unstable Jan 29, 2021
@oranagra oranagra mentioned this pull request Jan 31, 2021
@zuiderkwast
Copy link
Contributor

I compiled the modules test code on unstable just now and got the following warnings:

cc -I../../src  -W -Wall -fno-common -g -ggdb -std=c99 -O2 -fPIC -c blockonbackground.c -o blockonbackground.xo
blockonbackground.c: In function ‘BlockDebug_ThreadMain’:
blockonbackground.c:48:5: warning: implicit declaration of function ‘nanosleep’ [-Wimplicit-function-declaration]
   48 |     nanosleep(&ts, NULL);
      |     ^~~~~~~~~
blockonbackground.c: In function ‘HelloDoubleBlock_RedisCommand’:
blockonbackground.c:183:36: warning: ‘timeout’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  183 |     RedisModuleBlockedClient *bc = RedisModule_BlockClient(ctx,HelloBlock_Reply,HelloBlock_Timeout,HelloBlock_FreeData,timeout);
      |                                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@itamarhaber
Copy link
Member

@filipecosta90 care to look at it? I also remember some CI failing on blocking commands that may be related?!? (ubuntu latest IIRC)

@filipecosta90
Copy link
Contributor Author

@filipecosta90 care to look at it? I also remember some CI failing on blocking commands that may be related?!? (ubuntu latest IIRC)

will do so @itamarhaber @zuiderkwast :)

@filipecosta90
Copy link
Contributor Author

I compiled the modules test code on unstable just now and got the following warnings:

cc -I../../src  -W -Wall -fno-common -g -ggdb -std=c99 -O2 -fPIC -c blockonbackground.c -o blockonbackground.xo
blockonbackground.c: In function ‘BlockDebug_ThreadMain’:
blockonbackground.c:48:5: warning: implicit declaration of function ‘nanosleep’ [-Wimplicit-function-declaration]
   48 |     nanosleep(&ts, NULL);
      |     ^~~~~~~~~
blockonbackground.c: In function ‘HelloDoubleBlock_RedisCommand’:
blockonbackground.c:183:36: warning: ‘timeout’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  183 |     RedisModuleBlockedClient *bc = RedisModule_BlockClient(ctx,HelloBlock_Reply,HelloBlock_Timeout,HelloBlock_FreeData,timeout);
      |                                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@zuiderkwast #8454 should fix it :)

@zuiderkwast
Copy link
Contributor

Yes, perfect!

JackieXie168 pushed a commit to JackieXie168/redis that referenced this pull request Mar 2, 2021
… on background work clients (redis#7491)

This commit enables tracking time of the background tasks and on replies,
opening the door for properly tracking commands that rely on blocking / background
 work via the slowlog, latency history, and commandstats. 

Some notes:
- The time spent blocked waiting for key changes, or blocked on synchronous
  replication is not accounted for. 

- **This commit does not affect latency tracking of commands that are non-blocking
  or do not have background work.** ( meaning that it all stays the same with exception to
  `BZPOPMIN`,`BZPOPMAX`,`BRPOP`,`BLPOP`, etc... and module's commands that rely
  on background threads ). 

-  Specifically for latency history command we've added a new event class named
  `command-unblocking` that will enable latency monitoring on commands that spawn
  background threads to do the work.

- For blocking commands we're now considering the total time of a command as the
  time spent on call() + the time spent on replying when unblocked.

- For Modules commands that rely on background threads we're now considering the
  total time of a command as the time spent on call (main thread) + the time spent on
  the background thread ( if marked within `RedisModule_MeasureTimeStart()` and
  `RedisModule_MeasureTimeEnd()` ) + the time spent on replying (main thread)

To test for this feature we've added a `unit/moduleapi/blockonbackground` test that relies on
a module that blocks the client and sleeps on the background for a given time. 
- check blocked command that uses RedisModule_MeasureTimeStart() is tracking background time
- check blocked command that uses RedisModule_MeasureTimeStart() is tracking background time even in timeout
- check blocked command with multiple calls RedisModule_MeasureTimeStart()  is tracking the total background time
- check blocked command without calling RedisModule_MeasureTimeStart() is not reporting background time
oranagra added a commit that referenced this pull request Apr 20, 2022
…xecution time, to regain up to 4% execution time (#10502)

In #7491 (part of redis 6.2), we started using the monotonic timer instead of mstime to measure
command execution time for stats, apparently this meant sampling the clock 3 times per command
rather than two (wince we also need the wall-clock time).
In some cases this causes a significant overhead.

This PR fixes that by avoiding the use of monotonic timer, except for the cases were we know it
should be extremely fast.
This PR also adds a new INFO field called `monotonic_clock` that shows which clock redis is using.

Co-authored-by: Oran Agra <oran@redislabs.com>
oranagra added a commit that referenced this pull request Apr 27, 2022
…xecution time, to regain up to 4% execution time (#10502)

In #7491 (part of redis 6.2), we started using the monotonic timer instead of mstime to measure
command execution time for stats, apparently this meant sampling the clock 3 times per command
rather than two (wince we also need the wall-clock time).
In some cases this causes a significant overhead.

This PR fixes that by avoiding the use of monotonic timer, except for the cases were we know it
should be extremely fast.
This PR also adds a new INFO field called `monotonic_clock` that shows which clock redis is using.

Co-authored-by: Oran Agra <oran@redislabs.com>
(cherry picked from commit 3cd8baf)
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
…xecution time, to regain up to 4% execution time (redis#10502)

In redis#7491 (part of redis 6.2), we started using the monotonic timer instead of mstime to measure
command execution time for stats, apparently this meant sampling the clock 3 times per command
rather than two (wince we also need the wall-clock time).
In some cases this causes a significant overhead.

This PR fixes that by avoiding the use of monotonic timer, except for the cases were we know it
should be extremely fast.
This PR also adds a new INFO field called `monotonic_clock` that shows which clock redis is using.

Co-authored-by: Oran Agra <oran@redislabs.com>
oranagra added a commit that referenced this pull request Jan 19, 2024
…12817)

Fix #12785 and other race condition issues.
See the following isolated comments.

The following report was obtained using SANITIZER thread.
```sh
make SANITIZER=thread
./runtest-moduleapi --config io-threads 4 --config io-threads-do-reads yes --accurate
```

1. Fixed thread-safe issue in RM_UnblockClient()
Related discussion:
#12817 (comment)
* When blocking a client in a module using `RM_BlockClientOnKeys()` or
`RM_BlockClientOnKeysWithFlags()`
with a timeout_callback, calling RM_UnblockClient() in module threads
can lead to race conditions
     in `updateStatsOnUnblock()`.

     - Introduced: 
        Version: 6.2
        PR: #7491

     - Touch:
`server.stat_numcommands`, `cmd->latency_histogram`, `server.slowlog`,
and `server.latency_events`
     
     - Harm Level: High
Potentially corrupts the memory data of `cmd->latency_histogram`,
`server.slowlog`, and `server.latency_events`

     - Solution:
Differentiate whether the call to moduleBlockedClientTimedOut() comes
from the module or the main thread.
Since we can't know if RM_UnblockClient() comes from module threads, we
always assume it does and
let `updateStatsOnUnblock()` asynchronously update the unblock status.
     
* When error reply is called in timeout_callback(), ctx is not
thread-safe, eventually lead to race conditions in `afterErrorReply`.

     - Introduced: 
        Version: 6.2
        PR: #8217

     - Touch
       `server.stat_total_error_replies`, `server.errors`, 

     - Harm Level: High
       Potentially corrupts the memory data of `server.errors`
   
      - Solution: 
Make the ctx in `timeout_callback()` with `REDISMODULE_CTX_THREAD_SAFE`,
and asynchronously reply errors to the client.

2. Made RM_Reply*() family API thread-safe
Related discussion:
#12817 (comment)
Call chain: `RM_Reply*()` -> `_addReplyToBufferOrList()` -> touch
server.current_client

    - Introduced: 
       Version: 7.2.0
       PR: #12326

   - Harm Level: None
Since the module fake client won't have the `CLIENT_PUSHING` flag, even
if we touch server.current_client,
     we can still exit after `c->flags & CLIENT_PUSHING`.

   - Solution
      Checking `c->flags & CLIENT_PUSHING` earlier.

3. Made freeClient() thread-safe
    Fix #12785

    - Introduced: 
       Version: 4.0
Commit:
3fcf959

    - Harm Level: Moderate
       * Trigger assertion
It happens when the module thread calls freeClient while the io-thread
is in progress,
which just triggers an assertion, and doesn't make any race condiaions.

* Touch `server.current_client`, `server.stat_clients_type_memory`, and
`clientMemUsageBucket->clients`.
It happens between the main thread and the module threads, may cause
data corruption.
1. Error reset `server.current_client` to NULL, but theoretically this
won't happen,
because the module has already reset `server.current_client` to old
value before entering freeClient.
2. corrupts `clientMemUsageBucket->clients` in
updateClientMemUsageAndBucket().
3. Causes server.stat_clients_type_memory memory statistics to be
inaccurate.
    
    - Solution:
* No longer counts memory usage on fake clients, to avoid updating
`server.stat_clients_type_memory` in freeClient.
* No longer resetting `server.current_client` in unlinkClient, because
the fake client won't be evicted or disconnected in the mid of the
process.
* Judgment assertion `io_threads_op == IO_THREADS_OP_IDLE` only if c is
not a fake client.

4. Fixed free client args without GIL
Related discussion:
#12817 (comment)
When freeing retained strings in the module thread (refcount decr), or
using them in some way (refcount incr), we should do so while holding
the GIL,
otherwise, they might be simultaneously freed while the main thread is
processing the unblock client state.

    - Introduced: 
       Version: 6.2.0
       PR: #8141

   - Harm Level: Low
     Trigger assertion or double free or memory leak. 

   - Solution:
Documenting that module API users need to ensure any access to these
retained strings is done with the GIL locked

5. Fix adding fake client to server.clients_pending_write
    It will incorrectly log the memory usage for the fake client.
Related discussion:
#12817 (comment)

    - Introduced: 
       Version: 4.0
Commit:
9b01b64

    - Harm Level: None
      Only result in NOP

    - Solution:
       * Don't add fake client into server.clients_pending_write
* Add c->conn assertion for updateClientMemUsageAndBucket() and
updateClientMemoryUsage() to avoid same
         issue in the future.
So now it will be the responsibility of the caller of both of them to
avoid passing in fake client.

6. Fix calling RM_BlockedClientMeasureTimeStart() and
RM_BlockedClientMeasureTimeEnd() without GIL
    - Introduced: 
       Version: 6.2
       PR: #7491

   - Harm Level: Low
Causes inaccuracies in command latency histogram and slow logs, but does
not corrupt memory.

   - Solution:
Module API users, if know that non-thread-safe APIs will be used in
multi-threading, need to take responsibility for protecting them with
their own locks instead of the GIL, as using the GIL is too expensive.

### Other issue
1. RM_Yield is not thread-safe, fixed via #12905.

### Summarize
1. Fix thread-safe issues for `RM_UnblockClient()`, `freeClient()` and
`RM_Yield`, potentially preventing memory corruption, data disorder, or
assertion.
2. Updated docs and module test to clarify module API users'
responsibility for locking non-thread-safe APIs in multi-threading, such
as RM_BlockedClientMeasureTimeStart/End(), RM_FreeString(),
RM_RetainString(), and RM_HoldString().

### About backpot to 7.2
1. The implement of (1) is not too satisfying, would like to get more
eyes.
2. (2), (3) can be safely for backport
3. (4), (6) just modifying the module tests and updating the
documentation, no need for a backpot.
4. (5) is harmless, no need for a backpot.

---------

Co-authored-by: Oran Agra <oran@redislabs.com>
roggervalf pushed a commit to roggervalf/redis that referenced this pull request Feb 11, 2024
…edis#12817)

Fix redis#12785 and other race condition issues.
See the following isolated comments.

The following report was obtained using SANITIZER thread.
```sh
make SANITIZER=thread
./runtest-moduleapi --config io-threads 4 --config io-threads-do-reads yes --accurate
```

1. Fixed thread-safe issue in RM_UnblockClient()
Related discussion:
redis#12817 (comment)
* When blocking a client in a module using `RM_BlockClientOnKeys()` or
`RM_BlockClientOnKeysWithFlags()`
with a timeout_callback, calling RM_UnblockClient() in module threads
can lead to race conditions
     in `updateStatsOnUnblock()`.

     - Introduced: 
        Version: 6.2
        PR: redis#7491

     - Touch:
`server.stat_numcommands`, `cmd->latency_histogram`, `server.slowlog`,
and `server.latency_events`
     
     - Harm Level: High
Potentially corrupts the memory data of `cmd->latency_histogram`,
`server.slowlog`, and `server.latency_events`

     - Solution:
Differentiate whether the call to moduleBlockedClientTimedOut() comes
from the module or the main thread.
Since we can't know if RM_UnblockClient() comes from module threads, we
always assume it does and
let `updateStatsOnUnblock()` asynchronously update the unblock status.
     
* When error reply is called in timeout_callback(), ctx is not
thread-safe, eventually lead to race conditions in `afterErrorReply`.

     - Introduced: 
        Version: 6.2
        PR: redis#8217

     - Touch
       `server.stat_total_error_replies`, `server.errors`, 

     - Harm Level: High
       Potentially corrupts the memory data of `server.errors`
   
      - Solution: 
Make the ctx in `timeout_callback()` with `REDISMODULE_CTX_THREAD_SAFE`,
and asynchronously reply errors to the client.

2. Made RM_Reply*() family API thread-safe
Related discussion:
redis#12817 (comment)
Call chain: `RM_Reply*()` -> `_addReplyToBufferOrList()` -> touch
server.current_client

    - Introduced: 
       Version: 7.2.0
       PR: redis#12326

   - Harm Level: None
Since the module fake client won't have the `CLIENT_PUSHING` flag, even
if we touch server.current_client,
     we can still exit after `c->flags & CLIENT_PUSHING`.

   - Solution
      Checking `c->flags & CLIENT_PUSHING` earlier.

3. Made freeClient() thread-safe
    Fix redis#12785

    - Introduced: 
       Version: 4.0
Commit:
redis@3fcf959

    - Harm Level: Moderate
       * Trigger assertion
It happens when the module thread calls freeClient while the io-thread
is in progress,
which just triggers an assertion, and doesn't make any race condiaions.

* Touch `server.current_client`, `server.stat_clients_type_memory`, and
`clientMemUsageBucket->clients`.
It happens between the main thread and the module threads, may cause
data corruption.
1. Error reset `server.current_client` to NULL, but theoretically this
won't happen,
because the module has already reset `server.current_client` to old
value before entering freeClient.
2. corrupts `clientMemUsageBucket->clients` in
updateClientMemUsageAndBucket().
3. Causes server.stat_clients_type_memory memory statistics to be
inaccurate.
    
    - Solution:
* No longer counts memory usage on fake clients, to avoid updating
`server.stat_clients_type_memory` in freeClient.
* No longer resetting `server.current_client` in unlinkClient, because
the fake client won't be evicted or disconnected in the mid of the
process.
* Judgment assertion `io_threads_op == IO_THREADS_OP_IDLE` only if c is
not a fake client.

4. Fixed free client args without GIL
Related discussion:
redis#12817 (comment)
When freeing retained strings in the module thread (refcount decr), or
using them in some way (refcount incr), we should do so while holding
the GIL,
otherwise, they might be simultaneously freed while the main thread is
processing the unblock client state.

    - Introduced: 
       Version: 6.2.0
       PR: redis#8141

   - Harm Level: Low
     Trigger assertion or double free or memory leak. 

   - Solution:
Documenting that module API users need to ensure any access to these
retained strings is done with the GIL locked

5. Fix adding fake client to server.clients_pending_write
    It will incorrectly log the memory usage for the fake client.
Related discussion:
redis#12817 (comment)

    - Introduced: 
       Version: 4.0
Commit:
redis@9b01b64

    - Harm Level: None
      Only result in NOP

    - Solution:
       * Don't add fake client into server.clients_pending_write
* Add c->conn assertion for updateClientMemUsageAndBucket() and
updateClientMemoryUsage() to avoid same
         issue in the future.
So now it will be the responsibility of the caller of both of them to
avoid passing in fake client.

6. Fix calling RM_BlockedClientMeasureTimeStart() and
RM_BlockedClientMeasureTimeEnd() without GIL
    - Introduced: 
       Version: 6.2
       PR: redis#7491

   - Harm Level: Low
Causes inaccuracies in command latency histogram and slow logs, but does
not corrupt memory.

   - Solution:
Module API users, if know that non-thread-safe APIs will be used in
multi-threading, need to take responsibility for protecting them with
their own locks instead of the GIL, as using the GIL is too expensive.

### Other issue
1. RM_Yield is not thread-safe, fixed via redis#12905.

### Summarize
1. Fix thread-safe issues for `RM_UnblockClient()`, `freeClient()` and
`RM_Yield`, potentially preventing memory corruption, data disorder, or
assertion.
2. Updated docs and module test to clarify module API users'
responsibility for locking non-thread-safe APIs in multi-threading, such
as RM_BlockedClientMeasureTimeStart/End(), RM_FreeString(),
RM_RetainString(), and RM_HoldString().

### About backpot to 7.2
1. The implement of (1) is not too satisfying, would like to get more
eyes.
2. (2), (3) can be safely for backport
3. (4), (6) just modifying the module tests and updating the
documentation, no need for a backpot.
4. (5) is harmless, no need for a backpot.

---------

Co-authored-by: Oran Agra <oran@redislabs.com>
sundb added a commit to sundb/redis that referenced this pull request May 13, 2024
…edis#12817)

Fix redis#12785 and other race condition issues.
See the following isolated comments.

The following report was obtained using SANITIZER thread.
```sh
make SANITIZER=thread
./runtest-moduleapi --config io-threads 4 --config io-threads-do-reads yes --accurate
```

1. Fixed thread-safe issue in RM_UnblockClient()
Related discussion:
redis#12817 (comment)
* When blocking a client in a module using `RM_BlockClientOnKeys()` or
`RM_BlockClientOnKeysWithFlags()`
with a timeout_callback, calling RM_UnblockClient() in module threads
can lead to race conditions
     in `updateStatsOnUnblock()`.

     - Introduced:
        Version: 6.2
        PR: redis#7491

     - Touch:
`server.stat_numcommands`, `cmd->latency_histogram`, `server.slowlog`,
and `server.latency_events`

     - Harm Level: High
Potentially corrupts the memory data of `cmd->latency_histogram`,
`server.slowlog`, and `server.latency_events`

     - Solution:
Differentiate whether the call to moduleBlockedClientTimedOut() comes
from the module or the main thread.
Since we can't know if RM_UnblockClient() comes from module threads, we
always assume it does and
let `updateStatsOnUnblock()` asynchronously update the unblock status.

* When error reply is called in timeout_callback(), ctx is not
thread-safe, eventually lead to race conditions in `afterErrorReply`.

     - Introduced:
        Version: 6.2
        PR: redis#8217

     - Touch
       `server.stat_total_error_replies`, `server.errors`,

     - Harm Level: High
       Potentially corrupts the memory data of `server.errors`

      - Solution:
Make the ctx in `timeout_callback()` with `REDISMODULE_CTX_THREAD_SAFE`,
and asynchronously reply errors to the client.

2. Made RM_Reply*() family API thread-safe
Related discussion:
redis#12817 (comment)
Call chain: `RM_Reply*()` -> `_addReplyToBufferOrList()` -> touch
server.current_client

    - Introduced:
       Version: 7.2.0
       PR: redis#12326

   - Harm Level: None
Since the module fake client won't have the `CLIENT_PUSHING` flag, even
if we touch server.current_client,
     we can still exit after `c->flags & CLIENT_PUSHING`.

   - Solution
      Checking `c->flags & CLIENT_PUSHING` earlier.

3. Made freeClient() thread-safe
    Fix redis#12785

    - Introduced:
       Version: 4.0
Commit:
redis@3fcf959

    - Harm Level: Moderate
       * Trigger assertion
It happens when the module thread calls freeClient while the io-thread
is in progress,
which just triggers an assertion, and doesn't make any race condiaions.

* Touch `server.current_client`, `server.stat_clients_type_memory`, and
`clientMemUsageBucket->clients`.
It happens between the main thread and the module threads, may cause
data corruption.
1. Error reset `server.current_client` to NULL, but theoretically this
won't happen,
because the module has already reset `server.current_client` to old
value before entering freeClient.
2. corrupts `clientMemUsageBucket->clients` in
updateClientMemUsageAndBucket().
3. Causes server.stat_clients_type_memory memory statistics to be
inaccurate.

    - Solution:
* No longer counts memory usage on fake clients, to avoid updating
`server.stat_clients_type_memory` in freeClient.
* No longer resetting `server.current_client` in unlinkClient, because
the fake client won't be evicted or disconnected in the mid of the
process.
* Judgment assertion `io_threads_op == IO_THREADS_OP_IDLE` only if c is
not a fake client.

4. Fixed free client args without GIL
Related discussion:
redis#12817 (comment)
When freeing retained strings in the module thread (refcount decr), or
using them in some way (refcount incr), we should do so while holding
the GIL,
otherwise, they might be simultaneously freed while the main thread is
processing the unblock client state.

    - Introduced:
       Version: 6.2.0
       PR: redis#8141

   - Harm Level: Low
     Trigger assertion or double free or memory leak.

   - Solution:
Documenting that module API users need to ensure any access to these
retained strings is done with the GIL locked

5. Fix adding fake client to server.clients_pending_write
    It will incorrectly log the memory usage for the fake client.
Related discussion:
redis#12817 (comment)

    - Introduced:
       Version: 4.0
Commit:
redis@9b01b64

    - Harm Level: None
      Only result in NOP

    - Solution:
       * Don't add fake client into server.clients_pending_write
* Add c->conn assertion for updateClientMemUsageAndBucket() and
updateClientMemoryUsage() to avoid same
         issue in the future.
So now it will be the responsibility of the caller of both of them to
avoid passing in fake client.

6. Fix calling RM_BlockedClientMeasureTimeStart() and
RM_BlockedClientMeasureTimeEnd() without GIL
    - Introduced:
       Version: 6.2
       PR: redis#7491

   - Harm Level: Low
Causes inaccuracies in command latency histogram and slow logs, but does
not corrupt memory.

   - Solution:
Module API users, if know that non-thread-safe APIs will be used in
multi-threading, need to take responsibility for protecting them with
their own locks instead of the GIL, as using the GIL is too expensive.

1. RM_Yield is not thread-safe, fixed via redis#12905.

1. Fix thread-safe issues for `RM_UnblockClient()`, `freeClient()` and
`RM_Yield`, potentially preventing memory corruption, data disorder, or
assertion.
2. Updated docs and module test to clarify module API users'
responsibility for locking non-thread-safe APIs in multi-threading, such
as RM_BlockedClientMeasureTimeStart/End(), RM_FreeString(),
RM_RetainString(), and RM_HoldString().

1. The implement of (1) is not too satisfying, would like to get more
eyes.
2. (2), (3) can be safely for backport
3. (4), (6) just modifying the module tests and updating the
documentation, no need for a backpot.
4. (5) is harmless, no need for a backpot.

---------

Co-authored-by: Oran Agra <oran@redislabs.com>
YaacovHazan pushed a commit to YaacovHazan/redis that referenced this pull request Apr 22, 2025
…edis#12817)

Fix redis#12785 and other race condition issues.
See the following isolated comments.

The following report was obtained using SANITIZER thread.
```sh
make SANITIZER=thread
./runtest-moduleapi --config io-threads 4 --config io-threads-do-reads yes --accurate
```

1. Fixed thread-safe issue in RM_UnblockClient()
Related discussion:
redis#12817 (comment)
* When blocking a client in a module using `RM_BlockClientOnKeys()` or
`RM_BlockClientOnKeysWithFlags()`
with a timeout_callback, calling RM_UnblockClient() in module threads
can lead to race conditions
     in `updateStatsOnUnblock()`.

     - Introduced: 
        Version: 6.2
        PR: redis#7491

     - Touch:
`server.stat_numcommands`, `cmd->latency_histogram`, `server.slowlog`,
and `server.latency_events`
     
     - Harm Level: High
Potentially corrupts the memory data of `cmd->latency_histogram`,
`server.slowlog`, and `server.latency_events`

     - Solution:
Differentiate whether the call to moduleBlockedClientTimedOut() comes
from the module or the main thread.
Since we can't know if RM_UnblockClient() comes from module threads, we
always assume it does and
let `updateStatsOnUnblock()` asynchronously update the unblock status.
     
* When error reply is called in timeout_callback(), ctx is not
thread-safe, eventually lead to race conditions in `afterErrorReply`.

     - Introduced: 
        Version: 6.2
        PR: redis#8217

     - Touch
       `server.stat_total_error_replies`, `server.errors`, 

     - Harm Level: High
       Potentially corrupts the memory data of `server.errors`
   
      - Solution: 
Make the ctx in `timeout_callback()` with `REDISMODULE_CTX_THREAD_SAFE`,
and asynchronously reply errors to the client.

2. Made RM_Reply*() family API thread-safe
Related discussion:
redis#12817 (comment)
Call chain: `RM_Reply*()` -> `_addReplyToBufferOrList()` -> touch
server.current_client

    - Introduced: 
       Version: 7.2.0
       PR: redis#12326

   - Harm Level: None
Since the module fake client won't have the `CLIENT_PUSHING` flag, even
if we touch server.current_client,
     we can still exit after `c->flags & CLIENT_PUSHING`.

   - Solution
      Checking `c->flags & CLIENT_PUSHING` earlier.

3. Made freeClient() thread-safe
    Fix redis#12785

    - Introduced: 
       Version: 4.0
Commit:
redis@3fcf959

    - Harm Level: Moderate
       * Trigger assertion
It happens when the module thread calls freeClient while the io-thread
is in progress,
which just triggers an assertion, and doesn't make any race condiaions.

* Touch `server.current_client`, `server.stat_clients_type_memory`, and
`clientMemUsageBucket->clients`.
It happens between the main thread and the module threads, may cause
data corruption.
1. Error reset `server.current_client` to NULL, but theoretically this
won't happen,
because the module has already reset `server.current_client` to old
value before entering freeClient.
2. corrupts `clientMemUsageBucket->clients` in
updateClientMemUsageAndBucket().
3. Causes server.stat_clients_type_memory memory statistics to be
inaccurate.
    
    - Solution:
* No longer counts memory usage on fake clients, to avoid updating
`server.stat_clients_type_memory` in freeClient.
* No longer resetting `server.current_client` in unlinkClient, because
the fake client won't be evicted or disconnected in the mid of the
process.
* Judgment assertion `io_threads_op == IO_THREADS_OP_IDLE` only if c is
not a fake client.

4. Fixed free client args without GIL
Related discussion:
redis#12817 (comment)
When freeing retained strings in the module thread (refcount decr), or
using them in some way (refcount incr), we should do so while holding
the GIL,
otherwise, they might be simultaneously freed while the main thread is
processing the unblock client state.

    - Introduced: 
       Version: 6.2.0
       PR: redis#8141

   - Harm Level: Low
     Trigger assertion or double free or memory leak. 

   - Solution:
Documenting that module API users need to ensure any access to these
retained strings is done with the GIL locked

5. Fix adding fake client to server.clients_pending_write
    It will incorrectly log the memory usage for the fake client.
Related discussion:
redis#12817 (comment)

    - Introduced: 
       Version: 4.0
Commit:
redis@9b01b64

    - Harm Level: None
      Only result in NOP

    - Solution:
       * Don't add fake client into server.clients_pending_write
* Add c->conn assertion for updateClientMemUsageAndBucket() and
updateClientMemoryUsage() to avoid same
         issue in the future.
So now it will be the responsibility of the caller of both of them to
avoid passing in fake client.

6. Fix calling RM_BlockedClientMeasureTimeStart() and
RM_BlockedClientMeasureTimeEnd() without GIL
    - Introduced: 
       Version: 6.2
       PR: redis#7491

   - Harm Level: Low
Causes inaccuracies in command latency histogram and slow logs, but does
not corrupt memory.

   - Solution:
Module API users, if know that non-thread-safe APIs will be used in
multi-threading, need to take responsibility for protecting them with
their own locks instead of the GIL, as using the GIL is too expensive.

### Other issue
1. RM_Yield is not thread-safe, fixed via redis#12905.

### Summarize
1. Fix thread-safe issues for `RM_UnblockClient()`, `freeClient()` and
`RM_Yield`, potentially preventing memory corruption, data disorder, or
assertion.
2. Updated docs and module test to clarify module API users'
responsibility for locking non-thread-safe APIs in multi-threading, such
as RM_BlockedClientMeasureTimeStart/End(), RM_FreeString(),
RM_RetainString(), and RM_HoldString().

### About backpot to 7.2
1. The implement of (1) is not too satisfying, would like to get more
eyes.
2. (2), (3) can be safely for backport
3. (4), (6) just modifying the module tests and updating the
documentation, no need for a backpot.
4. (5) is harmless, no need for a backpot.

---------

Co-authored-by: Oran Agra <oran@redislabs.com>
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
…edis#12817)

Fix redis#12785 and other race condition issues.
See the following isolated comments.

The following report was obtained using SANITIZER thread.
```sh
make SANITIZER=thread
./runtest-moduleapi --config io-threads 4 --config io-threads-do-reads yes --accurate
```

1. Fixed thread-safe issue in RM_UnblockClient()
Related discussion:
redis#12817 (comment)
* When blocking a client in a module using `RM_BlockClientOnKeys()` or
`RM_BlockClientOnKeysWithFlags()`
with a timeout_callback, calling RM_UnblockClient() in module threads
can lead to race conditions
     in `updateStatsOnUnblock()`.

     - Introduced: 
        Version: 6.2
        PR: redis#7491

     - Touch:
`server.stat_numcommands`, `cmd->latency_histogram`, `server.slowlog`,
and `server.latency_events`
     
     - Harm Level: High
Potentially corrupts the memory data of `cmd->latency_histogram`,
`server.slowlog`, and `server.latency_events`

     - Solution:
Differentiate whether the call to moduleBlockedClientTimedOut() comes
from the module or the main thread.
Since we can't know if RM_UnblockClient() comes from module threads, we
always assume it does and
let `updateStatsOnUnblock()` asynchronously update the unblock status.
     
* When error reply is called in timeout_callback(), ctx is not
thread-safe, eventually lead to race conditions in `afterErrorReply`.

     - Introduced: 
        Version: 6.2
        PR: redis#8217

     - Touch
       `server.stat_total_error_replies`, `server.errors`, 

     - Harm Level: High
       Potentially corrupts the memory data of `server.errors`
   
      - Solution: 
Make the ctx in `timeout_callback()` with `REDISMODULE_CTX_THREAD_SAFE`,
and asynchronously reply errors to the client.

2. Made RM_Reply*() family API thread-safe
Related discussion:
redis#12817 (comment)
Call chain: `RM_Reply*()` -> `_addReplyToBufferOrList()` -> touch
server.current_client

    - Introduced: 
       Version: 7.2.0
       PR: redis#12326

   - Harm Level: None
Since the module fake client won't have the `CLIENT_PUSHING` flag, even
if we touch server.current_client,
     we can still exit after `c->flags & CLIENT_PUSHING`.

   - Solution
      Checking `c->flags & CLIENT_PUSHING` earlier.

3. Made freeClient() thread-safe
    Fix redis#12785

    - Introduced: 
       Version: 4.0
Commit:
redis@fa1b900

    - Harm Level: Moderate
       * Trigger assertion
It happens when the module thread calls freeClient while the io-thread
is in progress,
which just triggers an assertion, and doesn't make any race condiaions.

* Touch `server.current_client`, `server.stat_clients_type_memory`, and
`clientMemUsageBucket->clients`.
It happens between the main thread and the module threads, may cause
data corruption.
1. Error reset `server.current_client` to NULL, but theoretically this
won't happen,
because the module has already reset `server.current_client` to old
value before entering freeClient.
2. corrupts `clientMemUsageBucket->clients` in
updateClientMemUsageAndBucket().
3. Causes server.stat_clients_type_memory memory statistics to be
inaccurate.
    
    - Solution:
* No longer counts memory usage on fake clients, to avoid updating
`server.stat_clients_type_memory` in freeClient.
* No longer resetting `server.current_client` in unlinkClient, because
the fake client won't be evicted or disconnected in the mid of the
process.
* Judgment assertion `io_threads_op == IO_THREADS_OP_IDLE` only if c is
not a fake client.

4. Fixed free client args without GIL
Related discussion:
redis#12817 (comment)
When freeing retained strings in the module thread (refcount decr), or
using them in some way (refcount incr), we should do so while holding
the GIL,
otherwise, they might be simultaneously freed while the main thread is
processing the unblock client state.

    - Introduced: 
       Version: 6.2.0
       PR: redis#8141

   - Harm Level: Low
     Trigger assertion or double free or memory leak. 

   - Solution:
Documenting that module API users need to ensure any access to these
retained strings is done with the GIL locked

5. Fix adding fake client to server.clients_pending_write
    It will incorrectly log the memory usage for the fake client.
Related discussion:
redis#12817 (comment)

    - Introduced: 
       Version: 4.0
Commit:
redis@e96bac5

    - Harm Level: None
      Only result in NOP

    - Solution:
       * Don't add fake client into server.clients_pending_write
* Add c->conn assertion for updateClientMemUsageAndBucket() and
updateClientMemoryUsage() to avoid same
         issue in the future.
So now it will be the responsibility of the caller of both of them to
avoid passing in fake client.

6. Fix calling RM_BlockedClientMeasureTimeStart() and
RM_BlockedClientMeasureTimeEnd() without GIL
    - Introduced: 
       Version: 6.2
       PR: redis#7491

   - Harm Level: Low
Causes inaccuracies in command latency histogram and slow logs, but does
not corrupt memory.

   - Solution:
Module API users, if know that non-thread-safe APIs will be used in
multi-threading, need to take responsibility for protecting them with
their own locks instead of the GIL, as using the GIL is too expensive.

### Other issue
1. RM_Yield is not thread-safe, fixed via redis#12905.

### Summarize
1. Fix thread-safe issues for `RM_UnblockClient()`, `freeClient()` and
`RM_Yield`, potentially preventing memory corruption, data disorder, or
assertion.
2. Updated docs and module test to clarify module API users'
responsibility for locking non-thread-safe APIs in multi-threading, such
as RM_BlockedClientMeasureTimeStart/End(), RM_FreeString(),
RM_RetainString(), and RM_HoldString().

### About backpot to 7.2
1. The implement of (1) is not too satisfying, would like to get more
eyes.
2. (2), (3) can be safely for backport
3. (4), (6) just modifying the module tests and updating the
documentation, no need for a backpot.
4. (5) is harmless, no need for a backpot.

---------

Co-authored-by: Oran Agra <oran@redislabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approval-needed Waiting for core team approval to be merged release-notes indication that this issue needs to be mentioned in the release notes state:major-decision Requires core team consensus state:needs-doc-pr requires a PR to redis-doc repository

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants