Skip to content

Conversation

@madolson
Copy link
Contributor

@madolson madolson commented Mar 25, 2023

In #11012, we changed the way command durations were computed to handle the same command being executed multiple times. This commit fixes some misses from that commit.

  1. Wait commands were not correctly reporting their duration if the timeout was reached.
  2. Multi/scripts/and modules with RM_Call were not properly resetting the duration between inner calls, leading to them reporting cumulative duration.
  3. When a blocked client is freed, the call and duration are always discarded.

This commit also adds an assert if the duration is not properly reset, potentially indicating that a report to call statistics was missed. The assert potentially be removed in the future, as it's mainly intended to detect misses in tests.

Before:

(error) ERR EXEC without MULTI
127.0.0.1:6379> multi
OK
127.0.0.1:6379> debug sleep 0.1
QUEUED
127.0.0.1:6379> debug sleep 0.1
QUEUED
127.0.0.1:6379> info commandstats
127.0.0.1:6379> exec
1) OK
2) OK
3) "# Commandstats\r\ncmdstat_exec:calls=1,usec=4,usec_per_call=4.00,rejected_calls=0,failed_calls=1\r\ncmdstat_debug:calls=2,usec=300206,usec_per_call=150103.00,rejected_calls=0,failed_calls=0\r\ncmdstat_multi:calls=1,usec=5,usec_per_call=5.00,rejected_calls=0,failed_calls=0\r\n"

After:

127.0.0.1:6379> multi
OK
127.0.0.1:6379> debug sleep 0.1
QUEUED
127.0.0.1:6379> debug sleep 0.1
QUEUED
127.0.0.1:6379> exec
1) OK
2) OK
127.0.0.1:6379> info commandstats
# Commandstats
cmdstat_exec:calls=2,usec=200176,usec_per_call=100088.00,rejected_calls=0,failed_calls=1
cmdstat_info:calls=1,usec=35,usec_per_call=35.00,rejected_calls=0,failed_calls=0
cmdstat_debug:calls=2,usec=200132,usec_per_call=100066.00,rejected_calls=0,failed_calls=0
cmdstat_multi:calls=2,usec=4,usec_per_call=2.00,rejected_calls=0,failed_calls=0

@madolson madolson requested a review from ranshid March 25, 2023 00:52
@ranshid
Copy link
Contributor

ranshid commented Mar 25, 2023

good catch - I missed it (after struggling to adjust lua to start calling resetClient)
I think this fix makes sense. I just need to check the blocked timeout case though

@ranshid
Copy link
Contributor

ranshid commented Mar 25, 2023

maybe we can also zero the duration when we are updating stats?

───────────────────────────────────────────────────────────────────────────────
modified: src/blocked.c
───────────────────────────────────────────────────────────────────────────────
@ src/blocked.c:111 @ void updateStatsOnUnblock(client *c, long blocked_us, long reply_us, int had_err
    c->lastcmd->microseconds += total_cmd_duration;
    c->lastcmd->calls++;
    server.stat_numcommands++;
    c->duration = 0; <-- here
    if (had_errors)
        c->lastcmd->failed_calls++;
    if (server.latency_tracking_enabled)

@oranagra
Copy link
Member

i think i agree, with Ran, it'll look better if we zero it right after using it (explicitly in call() and in updateStatsOnUnblock()).
i think we can also keep the one in resetClient just for safety.

@madolson
Copy link
Contributor Author

i think we can also keep the one in resetClient just for safety.

I moved it to a serverAssert(), maybe it will catch a place where we miss. Missed one case related to WAIT where we weren't recording the wait time and one extra false positive related to module client freeing.

@ranshid
Copy link
Contributor

ranshid commented Mar 27, 2023

i think we can also keep the one in resetClient just for safety.

I moved it to a serverAssert(), maybe it will catch a place where we miss. Missed one case related to WAIT where we weren't recording the wait time and one extra false positive related to module client freeing.

classic case to place a "debugAssert" which we do not have :)
anyway I think that is fine, but I wonder if the code wouldn't have been cleaner if we just had the original fix+keep the zeroing inside resetClient?

Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think i'd prefer a simpler approach of clearing it in obvious places:

  • creteClient, resetClient
  • and after using the value (i.e. updateStatsOnUnblock and call)

not sure we need the assertion...

@oranagra oranagra added the release-notes indication that this issue needs to be mentioned in the release notes label Mar 27, 2023
@madolson
Copy link
Contributor Author

i think i'd prefer a simpler approach of clearing it in obvious places:

The problem is that the "obvious places" aren't really obvious anymore.

not sure we need the assertion...

I would have agreed if it didn't find another issue with wait. It wasn't critical, but still

@oranagra
Copy link
Member

You mean the condition that ended up in call?

Anyway. OK. Go ahead and merge it..


/* Deallocate structures used to block on blocking ops. */
/* If there is any in-flight command, we don't don't record their duration. */
c->duration = 0;
Copy link
Contributor Author

@madolson madolson Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started running into issues related to module APIs that had resetClient() being called when a client with a pending blocking RM_Call() was getting freed, which triggered the resetClient() call path (which crashed since it had an unrecorded command). There might be a more elegant way to fix it, but I ended up just zero'ing the duration when the client was being freed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't understand. are we calling resetClient from within / after freeClient?
same thing for moduleReleaseTempClient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, resetClient is being called from within freeClient, because freeClient calls unbockClient which calls resetClient here: https://github.com/redis/redis/blob/unstable/src/blocked.c#L211.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.
but now that we fixed the problem with WAIT, can we (at least in theory), replace that assert with a =0, and drop many other changes?
i'm ok with to keep the assert, just curious since i'm not looking at this PR with much care..

Copy link
Contributor

@ranshid ranshid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thank you for fixing this!

Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please mention the change about unblockClientWaitingReplicas in the top comment.

@madolson madolson changed the title Fixed tracking of command duration for multi/eval/module Fixed tracking of command duration for multi/eval/module/wait Mar 29, 2023
@madolson madolson merged commit 971b177 into redis:unstable Mar 30, 2023
@oranagra oranagra mentioned this pull request May 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-notes indication that this issue needs to be mentioned in the release notes

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants