Enabled background and reply time tracking on blocked on keys/blocked on background work clients #7491

filipecosta90 · 2020-07-08T17:08:54Z

This PR enables tracking time of the background tasks and on replies, opening the door for properly tracking commands that rely on blocking/background work via the slowlog, latency history, and commandstats.

Some notes:

The time spent blocked waiting for key changes, or blocked on synchronous replication is not accounted for.
This PR does not affect latency tracking of commands that are non-blocking or do not have background work. ( meaning that it all stays the same with exception to BZPOPMIN,BZPOPMAX,BRPOP,BLPOP, etc... and module's commands that rely on background threads ).
Specifically for latency history command we've added a new event class named command-unblocking that will enable latency monitoring on commands that spawn background threads to do the work.
For blocking commands we're now considering the total time of a command as the time spent on call() + the time spent on replying when unblocked.
For Modules commands that rely on background threads we're now considering the total time of a command as the time spent on call (main thread) + the time spent on the background thread ( if marked within RedisModule_MeasureTimeStart() and RedisModule_MeasureTimeEnd() ) + the time spent on replying (main thread)

test

To test for this feature we've added a unit/moduleapi/blockonbackground test that relies on a module that blocks the client and sleeps on the background for a given time.

check blocked command that uses RedisModule_MeasureTimeStart() is tracking background time
check blocked command that uses RedisModule_MeasureTimeStart() is tracking background time even in timeout
check blocked command with multiple calls RedisModule_MeasureTimeStart() is tracking the total background time
check blocked command without calling RedisModule_MeasureTimeStart() is not reporting background time

Samples

current behaviour ( not tracking time spent on background ):

127.0.0.1:6379> config set slowlog-log-slower-than 100000
OK
127.0.0.1:6379> slowlog len
(integer) 0
127.0.0.1:6379> block.debug 0 10000
(integer) 343176910
127.0.0.1:6379> slowlog len
(integer) 0
127.0.0.1:6379> block.debug 2000 10000
(integer) 1207730440
(2.00s)
127.0.0.1:6379> slowlog len
(integer) 0

commandstats example

127.0.0.1:6379> block.debug 2000 10000
(integer) 1768881473
(2.00s)
127.0.0.1:6379> block.debug 4000 10000
(integer) 630830182
(4.00s)
127.0.0.1:6379> info commandstats
# Commandstats
cmdstat_block.debug:calls=2,usec=226,usec_per_call=113.00

new behaviour ( tracking time spend on background ):

127.0.0.1:6379> config set slowlog-log-slower-than 100000
OK
127.0.0.1:6379> slowlog len
(integer) 0
127.0.0.1:6379> block.debug 0 10000
(integer) 1471490433
127.0.0.1:6379> slowlog len
(integer) 0
127.0.0.1:6379> block.debug 2000 10000
(integer) 1127835047
(2.00s)
127.0.0.1:6379> slowlog len
(integer) 1
127.0.0.1:6379> slowlog get 1
1) 1) (integer) 0
   2) (integer) 1593712320
   3) (integer) 2053204
   4) 1) "block.debug"
      2) "2000"
      3) "10000"
   5) "127.0.0.1:51818"
   6) ""

commandstats example

127.0.0.1:6379> block.debug 2000 10000
(integer) 1052512224
(2.00s)
127.0.0.1:6379> block.debug 4000 10000
(integer) 215961991
(4.00s)
127.0.0.1:6379> info commandstats
# Commandstats
cmdstat_block.debug:calls=2,usec=6099877,usec_per_call=3049938.50

latency monitor example

127.0.0.1:6379> config set latency-monitor-threshold 200
OK
127.0.0.1:6379> block.debug 500 1000
(integer) 1887208743
(0.50s)
127.0.0.1:6379> latency history command-unblocking
1) 1) (integer) 1611265472
   2) (integer) 500

…ith background work

…locked on keys reply time tracking.

oranagra

i'd like more feedback about this.
@guybe7 @yossigo @soloestoy at least one or two of you have a look.

src/blocked.c

src/module.c

yossigo · 2020-07-12T10:01:56Z

@oranagra @filipecosta90 I think there's an interesting use case discussion to make here, as this diverges a bit from the current SLOWLOG purpose.

For the sake of the argument, we could claim that if we want SLOWLOG to have more visibility on this, we also want to query socket-level information like RTT from the OS, try to measure end-to-end latency (i.e. ACK on last reply byte), etc.

oranagra · 2020-07-12T10:13:29Z

the (maybe subjective) way i see it, the LATENCY command is about any delay in the main thread (be it fsync and such), and the SLOWLOG command is about the command execution time (and it's reply).

however, the shortcoming of LATENCY does luck any detail of the specific command that caused latency (just says command or fast-command).

the SLOWLOG output can indeed include details on various types of measurement (network time, main thread time, total time, etc), but then we need to decide which one we match to the threshold that decide if it is logged or not (i suppose the highest one).

… scope given we might have a module without replycallback set and still doing blocking work

src/blocked.c

src/module.c

tests/unit/moduleapi/blockonbackground.tcl

…ods within reply time tracking. [add] tracking reply time on module clients blocked on keys

…ents background time. Reducede useless time on blocked test from 2secs to 100ms. tracking blocked time value on test also.

oranagra

i discussed this PR with the core team this morning.
we feel it may not be right to log the entire time the command was blocked in the slowlog and and cmdstat.
i suppose modules block clients in order to execute the command in the background thread and reply later.
but that doesn't mean the entire blocked time is the processing time, some time can be lost on waiting for a worker thread to pick up the job.
i think we may need find a way the module can help us distinguish between the blocked time and the processing time, and we only want to include the processing time in the slowlog and cmdstats.
let's try to get @MeirShpilraien and @swilly22 input here.

src/blocked.c

src/server.c

MeirShpilraien · 2021-01-19T12:24:58Z

@oranagra I agree it should not be in the slow log or commands stat because those indicate the actual processing time, maybe we can introduce a new RedisModuleApi that takes the block client and add execution time to it? Something like:

RM_MeasureTimeStart(bc);
...
RM_MeasureTimeEnd(bc);

The module can call those multiple times (without acquiring the lock!!) and this allows adding the time measure to the command execution time.
Needless to say, those names are debatable :)
WDYT?

oranagra · 2021-01-19T13:28:47Z

@MeirShpilraien since this is a voluntary call, i wonder if many modules will use it.
so maybe in addition to that we need to indeed add a field indicating the blocked time to be used for modules that didn't bother to use this new API.
@yossigo @madolson WDYT?

filipecosta90 · 2021-01-19T15:19:59Z

@oranagra I agree it should not be in the slow log or commands stat because those indicate the actual processing time, maybe we can introduce a new RedisModuleApi that takes the block client and add execution time to it? Something like:
RM_MeasureTimeStart(bc);
...
RM_MeasureTimeEnd(bc);
The module can call those multiple times (without acquiring the lock!!) and this allows adding the time measure to the command execution time.
Needless to say, those names are debatable :)
WDYT?

@MeirShpilraien , @oranagra I tend to disagree with the extra RM API and not tracking the entire background time of that thread. When we block the client ( via RM_BlockClient ) by all means we're making the client wait while executing that command, meaning all work required to reply should be accounted for given that is is seen as execution time from the client perspective ( for me as a client I don't care if you put it into two queues on the BG, or wait 1sec and then start working, the effective wait time of that client should start counting when we block ).

If we do otherwise we will be making it hard to detect queue times and make it even harder for this feature to take effect given people will need to explictly call a new API. IMHO all background time belongs to the command time. WDYT?

…Unblock() and us time tracking

MeirShpilraien · 2021-01-19T15:33:25Z

@filipecosta90 I agree we should track the entire time as well but not in the slow log because the slow log shows only the processing time. Imagine the module implement the background operation with a single background thread that takes the next job from a queue, one long operation might cause all the others to also be a delay and enter the slow log (although their actual execution time is very short). This is why I suggested the new API so we can measure the actual time it takes to process the command...

filipecosta90 · 2021-01-19T16:07:03Z

@filipecosta90 I agree we should track the entire time as well but not in the slow log because the slow log shows only the processing time. Imagine the module implement the background operation with a single background thread that takes the next job from a queue, one long operation might cause all the others to also be a delay and enter the slow log (although their actual execution time is very short). This is why I suggested the new API so we can measure the actual time it takes to process the command...

@MeirShpilraien I believe that by default all background time should be taken into account and only on opt-in usage of the new API you would be allowed to have more control on time tracking ( meaning the module developer will acknowledge that there are times he does not want to be present on the slowlog ) by using RM_MeasureTimeStart(bc); In that manner we would have out of the box latencies that at most are overzealous and not the opposite. WDYT?

src/server.c

madolson · 2021-01-19T17:24:58Z

@filipecosta90 I don't think the default time should be total background processing time. If someone is using a queue mechanism like meir mentioned, the slowlog will report commands that are slow simply if they are blocked behind other commands, and that will probably be confusing to end users since that isn't normally what the slowlog shows. I think having an option or a command to opt in a command to total background time tracking, that would be fine though.

soloestoy · 2021-01-28T11:10:29Z

command-unblocking latency makes sense to me, but not sure if it is appropriate to push the original block command to slowlog, now it means the unblocking time instead of execution time right?

filipecosta90 · 2021-01-28T11:17:01Z

now it means the unblocking time instead of execution time right?

@soloestoy it still means overall execution time, but now includes not only the time until blocking but also the reply time ( for blocking commands that are not from modules ) and means time until blocking ( main thread) + thread execution time ( when using the new RM API ) + reply time (main thread).

https://github.com/redis/redis/pull/7491/files#diff-ddd6be3a7a9289fbf41890c34da16f1ea5b0a8d6d9206ae1ff5303e50bae1970R107

soloestoy · 2021-01-29T08:37:12Z

@filipecosta90 OK, I got it, and it would be nice to document this change about slowlog https://github.com/redis/redis-doc/blob/master/commands/slowlog.md

filipecosta90 · 2021-01-29T11:22:54Z

@filipecosta90 OK, I got it, and it would be nice to document this change about slowlog https://github.com/redis/redis-doc/blob/master/commands/slowlog.md

will do so @soloestoy :)

zuiderkwast · 2021-02-01T21:08:29Z

I compiled the modules test code on unstable just now and got the following warnings:

cc -I../../src  -W -Wall -fno-common -g -ggdb -std=c99 -O2 -fPIC -c blockonbackground.c -o blockonbackground.xo
blockonbackground.c: In function ‘BlockDebug_ThreadMain’:
blockonbackground.c:48:5: warning: implicit declaration of function ‘nanosleep’ [-Wimplicit-function-declaration]
   48 |     nanosleep(&ts, NULL);
      |     ^~~~~~~~~
blockonbackground.c: In function ‘HelloDoubleBlock_RedisCommand’:
blockonbackground.c:183:36: warning: ‘timeout’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  183 |     RedisModuleBlockedClient *bc = RedisModule_BlockClient(ctx,HelloBlock_Reply,HelloBlock_Timeout,HelloBlock_FreeData,timeout);
      |                                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

itamarhaber · 2021-02-04T19:18:02Z

@filipecosta90 care to look at it? I also remember some CI failing on blocking commands that may be related?!? (ubuntu latest IIRC)

filipecosta90 · 2021-02-04T19:45:33Z

@filipecosta90 care to look at it? I also remember some CI failing on blocking commands that may be related?!? (ubuntu latest IIRC)

will do so @itamarhaber @zuiderkwast :)

filipecosta90 · 2021-02-05T18:03:32Z

I compiled the modules test code on unstable just now and got the following warnings:

cc -I../../src  -W -Wall -fno-common -g -ggdb -std=c99 -O2 -fPIC -c blockonbackground.c -o blockonbackground.xo
blockonbackground.c: In function ‘BlockDebug_ThreadMain’:
blockonbackground.c:48:5: warning: implicit declaration of function ‘nanosleep’ [-Wimplicit-function-declaration]
   48 |     nanosleep(&ts, NULL);
      |     ^~~~~~~~~
blockonbackground.c: In function ‘HelloDoubleBlock_RedisCommand’:
blockonbackground.c:183:36: warning: ‘timeout’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  183 |     RedisModuleBlockedClient *bc = RedisModule_BlockClient(ctx,HelloBlock_Reply,HelloBlock_Timeout,HelloBlock_FreeData,timeout);
      |                                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@zuiderkwast #8454 should fix it :)

zuiderkwast · 2021-02-05T18:40:58Z

Yes, perfect!

… on background work clients (redis#7491) This commit enables tracking time of the background tasks and on replies, opening the door for properly tracking commands that rely on blocking / background work via the slowlog, latency history, and commandstats. Some notes: - The time spent blocked waiting for key changes, or blocked on synchronous replication is not accounted for. - **This commit does not affect latency tracking of commands that are non-blocking or do not have background work.** ( meaning that it all stays the same with exception to `BZPOPMIN`,`BZPOPMAX`,`BRPOP`,`BLPOP`, etc... and module's commands that rely on background threads ). - Specifically for latency history command we've added a new event class named `command-unblocking` that will enable latency monitoring on commands that spawn background threads to do the work. - For blocking commands we're now considering the total time of a command as the time spent on call() + the time spent on replying when unblocked. - For Modules commands that rely on background threads we're now considering the total time of a command as the time spent on call (main thread) + the time spent on the background thread ( if marked within `RedisModule_MeasureTimeStart()` and `RedisModule_MeasureTimeEnd()` ) + the time spent on replying (main thread) To test for this feature we've added a `unit/moduleapi/blockonbackground` test that relies on a module that blocks the client and sleeps on the background for a given time. - check blocked command that uses RedisModule_MeasureTimeStart() is tracking background time - check blocked command that uses RedisModule_MeasureTimeStart() is tracking background time even in timeout - check blocked command with multiple calls RedisModule_MeasureTimeStart() is tracking the total background time - check blocked command without calling RedisModule_MeasureTimeStart() is not reporting background time

…xecution time, to regain up to 4% execution time (#10502) In #7491 (part of redis 6.2), we started using the monotonic timer instead of mstime to measure command execution time for stats, apparently this meant sampling the clock 3 times per command rather than two (wince we also need the wall-clock time). In some cases this causes a significant overhead. This PR fixes that by avoiding the use of monotonic timer, except for the cases were we know it should be extremely fast. This PR also adds a new INFO field called `monotonic_clock` that shows which clock redis is using. Co-authored-by: Oran Agra <oran@redislabs.com>

…xecution time, to regain up to 4% execution time (#10502) In #7491 (part of redis 6.2), we started using the monotonic timer instead of mstime to measure command execution time for stats, apparently this meant sampling the clock 3 times per command rather than two (wince we also need the wall-clock time). In some cases this causes a significant overhead. This PR fixes that by avoiding the use of monotonic timer, except for the cases were we know it should be extremely fast. This PR also adds a new INFO field called `monotonic_clock` that shows which clock redis is using. Co-authored-by: Oran Agra <oran@redislabs.com> (cherry picked from commit 3cd8baf)

…xecution time, to regain up to 4% execution time (redis#10502) In redis#7491 (part of redis 6.2), we started using the monotonic timer instead of mstime to measure command execution time for stats, apparently this meant sampling the clock 3 times per command rather than two (wince we also need the wall-clock time). In some cases this causes a significant overhead. This PR fixes that by avoiding the use of monotonic timer, except for the cases were we know it should be extremely fast. This PR also adds a new INFO field called `monotonic_clock` that shows which clock redis is using. Co-authored-by: Oran Agra <oran@redislabs.com>

…12817) Fix #12785 and other race condition issues. See the following isolated comments. The following report was obtained using SANITIZER thread. ```sh make SANITIZER=thread ./runtest-moduleapi --config io-threads 4 --config io-threads-do-reads yes --accurate ``` 1. Fixed thread-safe issue in RM_UnblockClient() Related discussion: #12817 (comment) * When blocking a client in a module using `RM_BlockClientOnKeys()` or `RM_BlockClientOnKeysWithFlags()` with a timeout_callback, calling RM_UnblockClient() in module threads can lead to race conditions in `updateStatsOnUnblock()`. - Introduced: Version: 6.2 PR: #7491 - Touch: `server.stat_numcommands`, `cmd->latency_histogram`, `server.slowlog`, and `server.latency_events` - Harm Level: High Potentially corrupts the memory data of `cmd->latency_histogram`, `server.slowlog`, and `server.latency_events` - Solution: Differentiate whether the call to moduleBlockedClientTimedOut() comes from the module or the main thread. Since we can't know if RM_UnblockClient() comes from module threads, we always assume it does and let `updateStatsOnUnblock()` asynchronously update the unblock status. * When error reply is called in timeout_callback(), ctx is not thread-safe, eventually lead to race conditions in `afterErrorReply`. - Introduced: Version: 6.2 PR: #8217 - Touch `server.stat_total_error_replies`, `server.errors`, - Harm Level: High Potentially corrupts the memory data of `server.errors` - Solution: Make the ctx in `timeout_callback()` with `REDISMODULE_CTX_THREAD_SAFE`, and asynchronously reply errors to the client. 2. Made RM_Reply*() family API thread-safe Related discussion: #12817 (comment) Call chain: `RM_Reply*()` -> `_addReplyToBufferOrList()` -> touch server.current_client - Introduced: Version: 7.2.0 PR: #12326 - Harm Level: None Since the module fake client won't have the `CLIENT_PUSHING` flag, even if we touch server.current_client, we can still exit after `c->flags & CLIENT_PUSHING`. - Solution Checking `c->flags & CLIENT_PUSHING` earlier. 3. Made freeClient() thread-safe Fix #12785 - Introduced: Version: 4.0 Commit: 3fcf959 - Harm Level: Moderate * Trigger assertion It happens when the module thread calls freeClient while the io-thread is in progress, which just triggers an assertion, and doesn't make any race condiaions. * Touch `server.current_client`, `server.stat_clients_type_memory`, and `clientMemUsageBucket->clients`. It happens between the main thread and the module threads, may cause data corruption. 1. Error reset `server.current_client` to NULL, but theoretically this won't happen, because the module has already reset `server.current_client` to old value before entering freeClient. 2. corrupts `clientMemUsageBucket->clients` in updateClientMemUsageAndBucket(). 3. Causes server.stat_clients_type_memory memory statistics to be inaccurate. - Solution: * No longer counts memory usage on fake clients, to avoid updating `server.stat_clients_type_memory` in freeClient. * No longer resetting `server.current_client` in unlinkClient, because the fake client won't be evicted or disconnected in the mid of the process. * Judgment assertion `io_threads_op == IO_THREADS_OP_IDLE` only if c is not a fake client. 4. Fixed free client args without GIL Related discussion: #12817 (comment) When freeing retained strings in the module thread (refcount decr), or using them in some way (refcount incr), we should do so while holding the GIL, otherwise, they might be simultaneously freed while the main thread is processing the unblock client state. - Introduced: Version: 6.2.0 PR: #8141 - Harm Level: Low Trigger assertion or double free or memory leak. - Solution: Documenting that module API users need to ensure any access to these retained strings is done with the GIL locked 5. Fix adding fake client to server.clients_pending_write It will incorrectly log the memory usage for the fake client. Related discussion: #12817 (comment) - Introduced: Version: 4.0 Commit: 9b01b64 - Harm Level: None Only result in NOP - Solution: * Don't add fake client into server.clients_pending_write * Add c->conn assertion for updateClientMemUsageAndBucket() and updateClientMemoryUsage() to avoid same issue in the future. So now it will be the responsibility of the caller of both of them to avoid passing in fake client. 6. Fix calling RM_BlockedClientMeasureTimeStart() and RM_BlockedClientMeasureTimeEnd() without GIL - Introduced: Version: 6.2 PR: #7491 - Harm Level: Low Causes inaccuracies in command latency histogram and slow logs, but does not corrupt memory. - Solution: Module API users, if know that non-thread-safe APIs will be used in multi-threading, need to take responsibility for protecting them with their own locks instead of the GIL, as using the GIL is too expensive. ### Other issue 1. RM_Yield is not thread-safe, fixed via #12905. ### Summarize 1. Fix thread-safe issues for `RM_UnblockClient()`, `freeClient()` and `RM_Yield`, potentially preventing memory corruption, data disorder, or assertion. 2. Updated docs and module test to clarify module API users' responsibility for locking non-thread-safe APIs in multi-threading, such as RM_BlockedClientMeasureTimeStart/End(), RM_FreeString(), RM_RetainString(), and RM_HoldString(). ### About backpot to 7.2 1. The implement of (1) is not too satisfying, would like to get more eyes. 2. (2), (3) can be safely for backport 3. (4), (6) just modifying the module tests and updating the documentation, no need for a backpot. 4. (5) is harmless, no need for a backpot. --------- Co-authored-by: Oran Agra <oran@redislabs.com>

…edis#12817) Fix redis#12785 and other race condition issues. See the following isolated comments. The following report was obtained using SANITIZER thread. ```sh make SANITIZER=thread ./runtest-moduleapi --config io-threads 4 --config io-threads-do-reads yes --accurate ``` 1. Fixed thread-safe issue in RM_UnblockClient() Related discussion: redis#12817 (comment) * When blocking a client in a module using `RM_BlockClientOnKeys()` or `RM_BlockClientOnKeysWithFlags()` with a timeout_callback, calling RM_UnblockClient() in module threads can lead to race conditions in `updateStatsOnUnblock()`. - Introduced: Version: 6.2 PR: redis#7491 - Touch: `server.stat_numcommands`, `cmd->latency_histogram`, `server.slowlog`, and `server.latency_events` - Harm Level: High Potentially corrupts the memory data of `cmd->latency_histogram`, `server.slowlog`, and `server.latency_events` - Solution: Differentiate whether the call to moduleBlockedClientTimedOut() comes from the module or the main thread. Since we can't know if RM_UnblockClient() comes from module threads, we always assume it does and let `updateStatsOnUnblock()` asynchronously update the unblock status. * When error reply is called in timeout_callback(), ctx is not thread-safe, eventually lead to race conditions in `afterErrorReply`. - Introduced: Version: 6.2 PR: redis#8217 - Touch `server.stat_total_error_replies`, `server.errors`, - Harm Level: High Potentially corrupts the memory data of `server.errors` - Solution: Make the ctx in `timeout_callback()` with `REDISMODULE_CTX_THREAD_SAFE`, and asynchronously reply errors to the client. 2. Made RM_Reply*() family API thread-safe Related discussion: redis#12817 (comment) Call chain: `RM_Reply*()` -> `_addReplyToBufferOrList()` -> touch server.current_client - Introduced: Version: 7.2.0 PR: redis#12326 - Harm Level: None Since the module fake client won't have the `CLIENT_PUSHING` flag, even if we touch server.current_client, we can still exit after `c->flags & CLIENT_PUSHING`. - Solution Checking `c->flags & CLIENT_PUSHING` earlier. 3. Made freeClient() thread-safe Fix redis#12785 - Introduced: Version: 4.0 Commit: redis@3fcf959 - Harm Level: Moderate * Trigger assertion It happens when the module thread calls freeClient while the io-thread is in progress, which just triggers an assertion, and doesn't make any race condiaions. * Touch `server.current_client`, `server.stat_clients_type_memory`, and `clientMemUsageBucket->clients`. It happens between the main thread and the module threads, may cause data corruption. 1. Error reset `server.current_client` to NULL, but theoretically this won't happen, because the module has already reset `server.current_client` to old value before entering freeClient. 2. corrupts `clientMemUsageBucket->clients` in updateClientMemUsageAndBucket(). 3. Causes server.stat_clients_type_memory memory statistics to be inaccurate. - Solution: * No longer counts memory usage on fake clients, to avoid updating `server.stat_clients_type_memory` in freeClient. * No longer resetting `server.current_client` in unlinkClient, because the fake client won't be evicted or disconnected in the mid of the process. * Judgment assertion `io_threads_op == IO_THREADS_OP_IDLE` only if c is not a fake client. 4. Fixed free client args without GIL Related discussion: redis#12817 (comment) When freeing retained strings in the module thread (refcount decr), or using them in some way (refcount incr), we should do so while holding the GIL, otherwise, they might be simultaneously freed while the main thread is processing the unblock client state. - Introduced: Version: 6.2.0 PR: redis#8141 - Harm Level: Low Trigger assertion or double free or memory leak. - Solution: Documenting that module API users need to ensure any access to these retained strings is done with the GIL locked 5. Fix adding fake client to server.clients_pending_write It will incorrectly log the memory usage for the fake client. Related discussion: redis#12817 (comment) - Introduced: Version: 4.0 Commit: redis@9b01b64 - Harm Level: None Only result in NOP - Solution: * Don't add fake client into server.clients_pending_write * Add c->conn assertion for updateClientMemUsageAndBucket() and updateClientMemoryUsage() to avoid same issue in the future. So now it will be the responsibility of the caller of both of them to avoid passing in fake client. 6. Fix calling RM_BlockedClientMeasureTimeStart() and RM_BlockedClientMeasureTimeEnd() without GIL - Introduced: Version: 6.2 PR: redis#7491 - Harm Level: Low Causes inaccuracies in command latency histogram and slow logs, but does not corrupt memory. - Solution: Module API users, if know that non-thread-safe APIs will be used in multi-threading, need to take responsibility for protecting them with their own locks instead of the GIL, as using the GIL is too expensive. ### Other issue 1. RM_Yield is not thread-safe, fixed via redis#12905. ### Summarize 1. Fix thread-safe issues for `RM_UnblockClient()`, `freeClient()` and `RM_Yield`, potentially preventing memory corruption, data disorder, or assertion. 2. Updated docs and module test to clarify module API users' responsibility for locking non-thread-safe APIs in multi-threading, such as RM_BlockedClientMeasureTimeStart/End(), RM_FreeString(), RM_RetainString(), and RM_HoldString(). ### About backpot to 7.2 1. The implement of (1) is not too satisfying, would like to get more eyes. 2. (2), (3) can be safely for backport 3. (4), (6) just modifying the module tests and updating the documentation, no need for a backpot. 4. (5) is harmless, no need for a backpot. --------- Co-authored-by: Oran Agra <oran@redislabs.com>

…edis#12817) Fix redis#12785 and other race condition issues. See the following isolated comments. The following report was obtained using SANITIZER thread. ```sh make SANITIZER=thread ./runtest-moduleapi --config io-threads 4 --config io-threads-do-reads yes --accurate ``` 1. Fixed thread-safe issue in RM_UnblockClient() Related discussion: redis#12817 (comment) * When blocking a client in a module using `RM_BlockClientOnKeys()` or `RM_BlockClientOnKeysWithFlags()` with a timeout_callback, calling RM_UnblockClient() in module threads can lead to race conditions in `updateStatsOnUnblock()`. - Introduced: Version: 6.2 PR: redis#7491 - Touch: `server.stat_numcommands`, `cmd->latency_histogram`, `server.slowlog`, and `server.latency_events` - Harm Level: High Potentially corrupts the memory data of `cmd->latency_histogram`, `server.slowlog`, and `server.latency_events` - Solution: Differentiate whether the call to moduleBlockedClientTimedOut() comes from the module or the main thread. Since we can't know if RM_UnblockClient() comes from module threads, we always assume it does and let `updateStatsOnUnblock()` asynchronously update the unblock status. * When error reply is called in timeout_callback(), ctx is not thread-safe, eventually lead to race conditions in `afterErrorReply`. - Introduced: Version: 6.2 PR: redis#8217 - Touch `server.stat_total_error_replies`, `server.errors`, - Harm Level: High Potentially corrupts the memory data of `server.errors` - Solution: Make the ctx in `timeout_callback()` with `REDISMODULE_CTX_THREAD_SAFE`, and asynchronously reply errors to the client. 2. Made RM_Reply*() family API thread-safe Related discussion: redis#12817 (comment) Call chain: `RM_Reply*()` -> `_addReplyToBufferOrList()` -> touch server.current_client - Introduced: Version: 7.2.0 PR: redis#12326 - Harm Level: None Since the module fake client won't have the `CLIENT_PUSHING` flag, even if we touch server.current_client, we can still exit after `c->flags & CLIENT_PUSHING`. - Solution Checking `c->flags & CLIENT_PUSHING` earlier. 3. Made freeClient() thread-safe Fix redis#12785 - Introduced: Version: 4.0 Commit: redis@3fcf959 - Harm Level: Moderate * Trigger assertion It happens when the module thread calls freeClient while the io-thread is in progress, which just triggers an assertion, and doesn't make any race condiaions. * Touch `server.current_client`, `server.stat_clients_type_memory`, and `clientMemUsageBucket->clients`. It happens between the main thread and the module threads, may cause data corruption. 1. Error reset `server.current_client` to NULL, but theoretically this won't happen, because the module has already reset `server.current_client` to old value before entering freeClient. 2. corrupts `clientMemUsageBucket->clients` in updateClientMemUsageAndBucket(). 3. Causes server.stat_clients_type_memory memory statistics to be inaccurate. - Solution: * No longer counts memory usage on fake clients, to avoid updating `server.stat_clients_type_memory` in freeClient. * No longer resetting `server.current_client` in unlinkClient, because the fake client won't be evicted or disconnected in the mid of the process. * Judgment assertion `io_threads_op == IO_THREADS_OP_IDLE` only if c is not a fake client. 4. Fixed free client args without GIL Related discussion: redis#12817 (comment) When freeing retained strings in the module thread (refcount decr), or using them in some way (refcount incr), we should do so while holding the GIL, otherwise, they might be simultaneously freed while the main thread is processing the unblock client state. - Introduced: Version: 6.2.0 PR: redis#8141 - Harm Level: Low Trigger assertion or double free or memory leak. - Solution: Documenting that module API users need to ensure any access to these retained strings is done with the GIL locked 5. Fix adding fake client to server.clients_pending_write It will incorrectly log the memory usage for the fake client. Related discussion: redis#12817 (comment) - Introduced: Version: 4.0 Commit: redis@9b01b64 - Harm Level: None Only result in NOP - Solution: * Don't add fake client into server.clients_pending_write * Add c->conn assertion for updateClientMemUsageAndBucket() and updateClientMemoryUsage() to avoid same issue in the future. So now it will be the responsibility of the caller of both of them to avoid passing in fake client. 6. Fix calling RM_BlockedClientMeasureTimeStart() and RM_BlockedClientMeasureTimeEnd() without GIL - Introduced: Version: 6.2 PR: redis#7491 - Harm Level: Low Causes inaccuracies in command latency histogram and slow logs, but does not corrupt memory. - Solution: Module API users, if know that non-thread-safe APIs will be used in multi-threading, need to take responsibility for protecting them with their own locks instead of the GIL, as using the GIL is too expensive. 1. RM_Yield is not thread-safe, fixed via redis#12905. 1. Fix thread-safe issues for `RM_UnblockClient()`, `freeClient()` and `RM_Yield`, potentially preventing memory corruption, data disorder, or assertion. 2. Updated docs and module test to clarify module API users' responsibility for locking non-thread-safe APIs in multi-threading, such as RM_BlockedClientMeasureTimeStart/End(), RM_FreeString(), RM_RetainString(), and RM_HoldString(). 1. The implement of (1) is not too satisfying, would like to get more eyes. 2. (2), (3) can be safely for backport 3. (4), (6) just modifying the module tests and updating the documentation, no need for a backpot. 4. (5) is harmless, no need for a backpot. --------- Co-authored-by: Oran Agra <oran@redislabs.com>

…edis#12817) Fix redis#12785 and other race condition issues. See the following isolated comments. The following report was obtained using SANITIZER thread. ```sh make SANITIZER=thread ./runtest-moduleapi --config io-threads 4 --config io-threads-do-reads yes --accurate ``` 1. Fixed thread-safe issue in RM_UnblockClient() Related discussion: redis#12817 (comment) * When blocking a client in a module using `RM_BlockClientOnKeys()` or `RM_BlockClientOnKeysWithFlags()` with a timeout_callback, calling RM_UnblockClient() in module threads can lead to race conditions in `updateStatsOnUnblock()`. - Introduced: Version: 6.2 PR: redis#7491 - Touch: `server.stat_numcommands`, `cmd->latency_histogram`, `server.slowlog`, and `server.latency_events` - Harm Level: High Potentially corrupts the memory data of `cmd->latency_histogram`, `server.slowlog`, and `server.latency_events` - Solution: Differentiate whether the call to moduleBlockedClientTimedOut() comes from the module or the main thread. Since we can't know if RM_UnblockClient() comes from module threads, we always assume it does and let `updateStatsOnUnblock()` asynchronously update the unblock status. * When error reply is called in timeout_callback(), ctx is not thread-safe, eventually lead to race conditions in `afterErrorReply`. - Introduced: Version: 6.2 PR: redis#8217 - Touch `server.stat_total_error_replies`, `server.errors`, - Harm Level: High Potentially corrupts the memory data of `server.errors` - Solution: Make the ctx in `timeout_callback()` with `REDISMODULE_CTX_THREAD_SAFE`, and asynchronously reply errors to the client. 2. Made RM_Reply*() family API thread-safe Related discussion: redis#12817 (comment) Call chain: `RM_Reply*()` -> `_addReplyToBufferOrList()` -> touch server.current_client - Introduced: Version: 7.2.0 PR: redis#12326 - Harm Level: None Since the module fake client won't have the `CLIENT_PUSHING` flag, even if we touch server.current_client, we can still exit after `c->flags & CLIENT_PUSHING`. - Solution Checking `c->flags & CLIENT_PUSHING` earlier. 3. Made freeClient() thread-safe Fix redis#12785 - Introduced: Version: 4.0 Commit: redis@3fcf959 - Harm Level: Moderate * Trigger assertion It happens when the module thread calls freeClient while the io-thread is in progress, which just triggers an assertion, and doesn't make any race condiaions. * Touch `server.current_client`, `server.stat_clients_type_memory`, and `clientMemUsageBucket->clients`. It happens between the main thread and the module threads, may cause data corruption. 1. Error reset `server.current_client` to NULL, but theoretically this won't happen, because the module has already reset `server.current_client` to old value before entering freeClient. 2. corrupts `clientMemUsageBucket->clients` in updateClientMemUsageAndBucket(). 3. Causes server.stat_clients_type_memory memory statistics to be inaccurate. - Solution: * No longer counts memory usage on fake clients, to avoid updating `server.stat_clients_type_memory` in freeClient. * No longer resetting `server.current_client` in unlinkClient, because the fake client won't be evicted or disconnected in the mid of the process. * Judgment assertion `io_threads_op == IO_THREADS_OP_IDLE` only if c is not a fake client. 4. Fixed free client args without GIL Related discussion: redis#12817 (comment) When freeing retained strings in the module thread (refcount decr), or using them in some way (refcount incr), we should do so while holding the GIL, otherwise, they might be simultaneously freed while the main thread is processing the unblock client state. - Introduced: Version: 6.2.0 PR: redis#8141 - Harm Level: Low Trigger assertion or double free or memory leak. - Solution: Documenting that module API users need to ensure any access to these retained strings is done with the GIL locked 5. Fix adding fake client to server.clients_pending_write It will incorrectly log the memory usage for the fake client. Related discussion: redis#12817 (comment) - Introduced: Version: 4.0 Commit: redis@9b01b64 - Harm Level: None Only result in NOP - Solution: * Don't add fake client into server.clients_pending_write * Add c->conn assertion for updateClientMemUsageAndBucket() and updateClientMemoryUsage() to avoid same issue in the future. So now it will be the responsibility of the caller of both of them to avoid passing in fake client. 6. Fix calling RM_BlockedClientMeasureTimeStart() and RM_BlockedClientMeasureTimeEnd() without GIL - Introduced: Version: 6.2 PR: redis#7491 - Harm Level: Low Causes inaccuracies in command latency histogram and slow logs, but does not corrupt memory. - Solution: Module API users, if know that non-thread-safe APIs will be used in multi-threading, need to take responsibility for protecting them with their own locks instead of the GIL, as using the GIL is too expensive. ### Other issue 1. RM_Yield is not thread-safe, fixed via redis#12905. ### Summarize 1. Fix thread-safe issues for `RM_UnblockClient()`, `freeClient()` and `RM_Yield`, potentially preventing memory corruption, data disorder, or assertion. 2. Updated docs and module test to clarify module API users' responsibility for locking non-thread-safe APIs in multi-threading, such as RM_BlockedClientMeasureTimeStart/End(), RM_FreeString(), RM_RetainString(), and RM_HoldString(). ### About backpot to 7.2 1. The implement of (1) is not too satisfying, would like to get more eyes. 2. (2), (3) can be safely for backport 3. (4), (6) just modifying the module tests and updating the documentation, no need for a backpot. 4. (5) is harmless, no need for a backpot. --------- Co-authored-by: Oran Agra <oran@redislabs.com>

…edis#12817) Fix redis#12785 and other race condition issues. See the following isolated comments. The following report was obtained using SANITIZER thread. ```sh make SANITIZER=thread ./runtest-moduleapi --config io-threads 4 --config io-threads-do-reads yes --accurate ``` 1. Fixed thread-safe issue in RM_UnblockClient() Related discussion: redis#12817 (comment) * When blocking a client in a module using `RM_BlockClientOnKeys()` or `RM_BlockClientOnKeysWithFlags()` with a timeout_callback, calling RM_UnblockClient() in module threads can lead to race conditions in `updateStatsOnUnblock()`. - Introduced: Version: 6.2 PR: redis#7491 - Touch: `server.stat_numcommands`, `cmd->latency_histogram`, `server.slowlog`, and `server.latency_events` - Harm Level: High Potentially corrupts the memory data of `cmd->latency_histogram`, `server.slowlog`, and `server.latency_events` - Solution: Differentiate whether the call to moduleBlockedClientTimedOut() comes from the module or the main thread. Since we can't know if RM_UnblockClient() comes from module threads, we always assume it does and let `updateStatsOnUnblock()` asynchronously update the unblock status. * When error reply is called in timeout_callback(), ctx is not thread-safe, eventually lead to race conditions in `afterErrorReply`. - Introduced: Version: 6.2 PR: redis#8217 - Touch `server.stat_total_error_replies`, `server.errors`, - Harm Level: High Potentially corrupts the memory data of `server.errors` - Solution: Make the ctx in `timeout_callback()` with `REDISMODULE_CTX_THREAD_SAFE`, and asynchronously reply errors to the client. 2. Made RM_Reply*() family API thread-safe Related discussion: redis#12817 (comment) Call chain: `RM_Reply*()` -> `_addReplyToBufferOrList()` -> touch server.current_client - Introduced: Version: 7.2.0 PR: redis#12326 - Harm Level: None Since the module fake client won't have the `CLIENT_PUSHING` flag, even if we touch server.current_client, we can still exit after `c->flags & CLIENT_PUSHING`. - Solution Checking `c->flags & CLIENT_PUSHING` earlier. 3. Made freeClient() thread-safe Fix redis#12785 - Introduced: Version: 4.0 Commit: redis@fa1b900 - Harm Level: Moderate * Trigger assertion It happens when the module thread calls freeClient while the io-thread is in progress, which just triggers an assertion, and doesn't make any race condiaions. * Touch `server.current_client`, `server.stat_clients_type_memory`, and `clientMemUsageBucket->clients`. It happens between the main thread and the module threads, may cause data corruption. 1. Error reset `server.current_client` to NULL, but theoretically this won't happen, because the module has already reset `server.current_client` to old value before entering freeClient. 2. corrupts `clientMemUsageBucket->clients` in updateClientMemUsageAndBucket(). 3. Causes server.stat_clients_type_memory memory statistics to be inaccurate. - Solution: * No longer counts memory usage on fake clients, to avoid updating `server.stat_clients_type_memory` in freeClient. * No longer resetting `server.current_client` in unlinkClient, because the fake client won't be evicted or disconnected in the mid of the process. * Judgment assertion `io_threads_op == IO_THREADS_OP_IDLE` only if c is not a fake client. 4. Fixed free client args without GIL Related discussion: redis#12817 (comment) When freeing retained strings in the module thread (refcount decr), or using them in some way (refcount incr), we should do so while holding the GIL, otherwise, they might be simultaneously freed while the main thread is processing the unblock client state. - Introduced: Version: 6.2.0 PR: redis#8141 - Harm Level: Low Trigger assertion or double free or memory leak. - Solution: Documenting that module API users need to ensure any access to these retained strings is done with the GIL locked 5. Fix adding fake client to server.clients_pending_write It will incorrectly log the memory usage for the fake client. Related discussion: redis#12817 (comment) - Introduced: Version: 4.0 Commit: redis@e96bac5 - Harm Level: None Only result in NOP - Solution: * Don't add fake client into server.clients_pending_write * Add c->conn assertion for updateClientMemUsageAndBucket() and updateClientMemoryUsage() to avoid same issue in the future. So now it will be the responsibility of the caller of both of them to avoid passing in fake client. 6. Fix calling RM_BlockedClientMeasureTimeStart() and RM_BlockedClientMeasureTimeEnd() without GIL - Introduced: Version: 6.2 PR: redis#7491 - Harm Level: Low Causes inaccuracies in command latency histogram and slow logs, but does not corrupt memory. - Solution: Module API users, if know that non-thread-safe APIs will be used in multi-threading, need to take responsibility for protecting them with their own locks instead of the GIL, as using the GIL is too expensive. ### Other issue 1. RM_Yield is not thread-safe, fixed via redis#12905. ### Summarize 1. Fix thread-safe issues for `RM_UnblockClient()`, `freeClient()` and `RM_Yield`, potentially preventing memory corruption, data disorder, or assertion. 2. Updated docs and module test to clarify module API users' responsibility for locking non-thread-safe APIs in multi-threading, such as RM_BlockedClientMeasureTimeStart/End(), RM_FreeString(), RM_RetainString(), and RM_HoldString(). ### About backpot to 7.2 1. The implement of (1) is not too satisfying, would like to get more eyes. 2. (2), (3) can be safely for backport 3. (4), (6) just modifying the module tests and updating the documentation, no need for a backpot. 4. (5) is harmless, no need for a backpot. --------- Co-authored-by: Oran Agra <oran@redislabs.com>

filipecosta90 added 2 commits July 3, 2020 01:51

[add] enabled background and reply time tracking on blocked clients w…

bf3607f

…ith background work

[add] updated towards peer review on latency histogram workflow and b…

ef5f268

…locked on keys reply time tracking.

oranagra reviewed Jul 10, 2020

View reviewed changes

src/blocked.c Outdated Show resolved Hide resolved

src/module.c Outdated Show resolved Hide resolved

src/module.c Outdated Show resolved Hide resolved

[fix] fixes per PR review

00492ae

filipecosta90 requested a review from oranagra July 17, 2020 10:36

filipecosta90 mentioned this pull request Jul 17, 2020

AI.TENSORSET takes more time to run than AI.MODELRUN on a full aibench Inference Flow RedisAI/redis-inference-optimization#226

Closed

filipecosta90 added 7 commits January 17, 2021 17:34

[wip] unified common code into updateStatsOnUnblock

e50dd24

Merge branch 'unstable' into slowlog.blocking.commands

12a7788

[add] removed spiruous unrelated changes

cd726c8

[add] removed spiruous unrelated changes

a82f627

[fix] Fixed incompatible pointer passing on updateStatsOnUnblock

dfc0d4f

[fix] Moved module's updateStatsOnUnblock() outside of reply callback…

bc11a38

… scope given we might have a module without replycallback set and still doing blocking work

[fix] Moved module's updateStatsOnUnblock() outside of reply callback…

1d8498f

… scope given we might have a module without replycallback set and still doing blocking work

oranagra reviewed Jan 18, 2021

View reviewed changes

src/blocked.c Outdated Show resolved Hide resolved

src/blocked.c Outdated Show resolved Hide resolved

src/module.c Show resolved Hide resolved

tests/unit/moduleapi/blockonbackground.tcl Outdated Show resolved Hide resolved

filipecosta90 added 3 commits January 18, 2021 23:15

[fix] Fixes per PR review: replacing ustime() usage for monotime meth…

1e0fc66

…ods within reply time tracking. [add] tracking reply time on module clients blocked on keys

[add] swapping ustime for monotonic clock on call()

a702ce0

[fix] Fixes per PR review: using monotonic clock to track blocked cli…

a1f4c6d

…ents background time. Reducede useless time on blocked test from 2secs to 100ms. tracking blocked time value on test also.

filipecosta90 requested a review from oranagra January 19, 2021 00:22

oranagra reviewed Jan 19, 2021

View reviewed changes

src/blocked.c Outdated Show resolved Hide resolved

src/server.c Outdated Show resolved Hide resolved

filipecosta90 requested a review from oranagra January 19, 2021 15:23

[fix] Fixes per PR review: change time_t to long within updateStatsOn…

83c4dd4

…Unblock() and us time tracking

madolson reviewed Jan 19, 2021

View reviewed changes

src/server.c Show resolved Hide resolved

oranagra approved these changes Jan 28, 2021

View reviewed changes

yossigo approved these changes Jan 28, 2021

View reviewed changes

soloestoy approved these changes Jan 29, 2021

View reviewed changes

itamarhaber approved these changes Jan 29, 2021

View reviewed changes

itamarhaber added the state:needs-doc-pr requires a PR to redis-doc repository label Jan 29, 2021

oranagra merged commit f0c5052 into redis:unstable Jan 29, 2021

oranagra mentioned this pull request Jan 31, 2021

6.2 RC3 #8427

Merged

filipecosta90 mentioned this pull request Feb 5, 2021

Fixes compiler warning on implicit declaration of function nanosleep #8454

Merged

filipecosta90 mentioned this pull request Mar 24, 2022

OSS Coordinator command internal latency tracking is not properly measuring the total command duration RediSearch/RediSearch#2686

Closed

filipecosta90 mentioned this pull request Apr 5, 2022

Optimization: Use either monotonic or wall-clock to measure command execution time, to regain up to 4% execution time #10502

Merged

filipecosta90 mentioned this pull request May 19, 2022

Enable background work latency tracking via exposed redis API ( i.e. RedisModule_BlockedClientMeasureTimeStart ) RedisGraph/RedisGraph#2365

Closed

oranagra mentioned this pull request Dec 4, 2023

Fix race condition issues between the main thread and module threads #12817

Merged

Enabled background and reply time tracking on blocked on keys/blocked on background work clients #7491

Enabled background and reply time tracking on blocked on keys/blocked on background work clients #7491

Uh oh!

Conversation

filipecosta90 commented Jul 8, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

test

Samples

current behaviour ( not tracking time spent on background ):

commandstats example

new behaviour ( tracking time spend on background ):

commandstats example

latency monitor example

Uh oh!

oranagra left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yossigo commented Jul 12, 2020

Uh oh!

oranagra commented Jul 12, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

oranagra left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

MeirShpilraien commented Jan 19, 2021

Uh oh!

oranagra commented Jan 19, 2021

Uh oh!

filipecosta90 commented Jan 19, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MeirShpilraien commented Jan 19, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

filipecosta90 commented Jan 19, 2021

Uh oh!

Uh oh!

madolson commented Jan 19, 2021

Uh oh!

soloestoy commented Jan 28, 2021

Uh oh!

filipecosta90 commented Jan 28, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

soloestoy commented Jan 29, 2021

Uh oh!

filipecosta90 commented Jan 29, 2021

Uh oh!

zuiderkwast commented Feb 1, 2021

Uh oh!

itamarhaber commented Feb 4, 2021

Uh oh!

filipecosta90 commented Feb 4, 2021

Uh oh!

filipecosta90 commented Feb 5, 2021

Uh oh!

zuiderkwast commented Feb 5, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

filipecosta90 commented Jul 8, 2020 •

edited

Loading

filipecosta90 commented Jan 19, 2021 •

edited

Loading

MeirShpilraien commented Jan 19, 2021 •

edited

Loading

filipecosta90 commented Jan 28, 2021 •

edited

Loading