Conversation

@filipecosta90
Contributor

Fixes #10648.
Following @oranagra's recommendation to move the client flags to a more cache-friendly position within the client struct, we indeed regain the ~2% of CPU cycles lost since v6.2 (from 630532.57 to 647449.80 ops/sec).
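
For illustration only, here is a minimal sketch of the layout idea (not the actual Redis client struct: the field set is heavily trimmed and sds/connection are stand-in types). Keeping flags next to the other per-command fields lets them share cache lines, instead of leaving flags far away among rarely accessed fields:

#include <stdint.h>
#include <stddef.h>

typedef char *sds;                 /* stand-in for Redis' sds string type */
typedef struct connection connection;

typedef struct client {
    uint64_t id;                   /* client incremental unique ID */
    uint64_t flags;                /* moved up: checked on nearly every request */
    connection *conn;              /* hot: used for every read and write */
    sds querybuf;                  /* hot: accumulated input buffer */
    size_t qb_pos;                 /* hot: parse position inside querybuf */
    /* ... colder, rarely touched fields (pubsub state, name, stats) follow ... */
} client;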

To reproduce:

# spin up a standalone redis-server pinned to core 0
taskset -c 0 `pwd`/src/redis-server --logfile redis-opt.log --save "" --daemonize yes

# populate with data
redis-cli "RPUSH" "list:10" "lysbgqqfqw" "mtccjerdon" "jekkafodvk" "nmgxcctxpn" "vyqqkuszzh" "pytrnqdhvs" "oguwnmniig" "gekntrykfh" "nhfnbxqgol" "cgoeihlnei"

# benchmark: 3 runs of 60 seconds each, pipeline 10, on cores 1-5
taskset -c 1-5 memtier_benchmark --command="LRANGE list:10 0 -1"  --hide-histogram --test-time 60 --pipeline 10 -x 3

unstable 2bcd890 (merge-base)

BEST RUN RESULTS
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Lranges    630593.68         3.16045         3.40700         4.31900         4.60700    134863.30 
Totals     630593.68         3.16045         3.40700         4.31900         4.60700    134863.30 


WORST RUN RESULTS
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Lranges    630444.65         3.16119         3.40700         4.31900         4.60700    134831.42 
Totals     630444.65         3.16119         3.40700         4.31900         4.60700    134831.42 


AGGREGATED AVERAGE RESULTS (3 runs)
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Lranges    630532.57         3.16075         3.40700         4.31900         4.60700    134850.23 
Totals     630532.57         3.16075         3.40700         4.31900         4.60700    134850.23 

this PR

BEST RUN RESULTS
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Lranges    647852.82         3.07598         3.31100         4.25500         4.51100    138554.46 
Totals     647852.82         3.07598         3.31100         4.25500         4.51100    138554.46 


WORST RUN RESULTS
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Lranges    646919.60         3.08041         3.31100         4.25500         4.54300    138354.88 
Totals     646919.60         3.08041         3.31100         4.25500         4.54300    138354.88 


AGGREGATED AVERAGE RESULTS (3 runs)
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Lranges    647449.80         3.07789         3.31100         4.25500         4.54300    138468.27 
Totals     647449.80         3.07789         3.31100         4.25500         4.54300    138468.27 

@oranagra
Member

@filipecosta90 maybe we wanna use this opportunity to also check what happens when we move the reply buffers and query buffers to the top?

@filipecosta90
Contributor Author

@filipecosta90 maybe we wanna use this opportunity to also check what happens when we move the reply buffers and query buffers to the top?

will check it. 👍

@uvletter
Contributor

uvletter commented May 9, 2022

How about setting the option -e cache-misses for perf? Maybe that would make the cache issues clearer to observe. I want to give it a try, but getting a Linux physical machine with root permission seems hard for me right now...

@ShooterIT
Member

@filipecosta90 maybe we wanna use this opportunity to also check what happens when we move the reply buffers and query buffers to the top?

good idea.

We can add !(c->flags & CLIENT_PENDING_WRITE) to the last check in prepareClientToWrite, i.e. like this:

if (!(c->flags & CLIENT_PENDING_WRITE) && !clientHasPendingReplies(c) && io_threads_op == IO_THREADS_OP_IDLE)

I think this can also effectively reduce the number of calls to clientHasPendingReplies.
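
A minimal, self-contained sketch of that short-circuit idea (the flag value, struct fields and clientHasPendingReplies body here are stand-ins, not the real Redis definitions): because && evaluates left to right, the cheap flag test runs first and the more expensive pending-replies check is skipped for clients already scheduled for a write.

#include <stdio.h>
#include <stdint.h>

#define CLIENT_PENDING_WRITE (1ULL << 21)   /* hypothetical bit value */

typedef struct { uint64_t flags; int reply_bytes; } client;

/* Stand-in for the more expensive check we want to avoid calling. */
static int clientHasPendingReplies(client *c) {
    return c->reply_bytes > 0;
}

int main(void) {
    client c = { .flags = CLIENT_PENDING_WRITE, .reply_bytes = 64 };

    /* Cheap bitmask test first: if the client is already pending, the
     * && short-circuits and clientHasPendingReplies() is never called. */
    if (!(c.flags & CLIENT_PENDING_WRITE) && !clientHasPendingReplies(&c))
        printf("schedule client for an async write\n");
    else
        printf("already pending (or has replies); nothing to schedule\n");
    return 0;
}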

@filipecosta90
Contributor Author

How about setting the option -e cache-misses for perf? Maybe that would make the cache issues clearer to observe. I want to give it a try, but getting a Linux physical machine with root permission seems hard for me right now...

@uvletter if we gather the following counters while running the benchmark

perf stat -e "cpu-clock,cpu-cycles,instructions,uops_executed.core,uops_executed.stall_cycles,cache-references,cache-misses,cycle_activity.stalls_total,cycle_activity.stalls_mem_any,cycle_activity.stalls_l3_miss,cycle_activity.stalls_l2_miss,cycle_activity.stalls_l1d_miss" --pid $(pgrep redis-server) -- sleep 60

we can observe that the improvement indeed comes from a reduction in stall cycles relative to the total uops executed:

  • unstable: 8569.031 M uops_executed/sec and 707.174 M stall_cycles/sec (stall cycles ≈ 8.25% of executed uops)
  • this PR : 8626.166 M uops_executed/sec and 700.658 M stall_cycles/sec (stall cycles ≈ 8.12% of executed uops)

As you can see below, the stall-cycle drop is where it matters most (stalls_mem_any). Furthermore, we now also execute more uops, as expected, given we're reducing the expensive waits on memory.
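
For clarity, those percentages are just the stall-cycle rate divided by the uop rate; a tiny standalone check using the counter rates from the perf output below (values copied from that output, nothing else assumed):

#include <stdio.h>

int main(void) {
    /* Rates in M/sec, taken from the perf stat output below. */
    double uops_unstable = 8569.031, stalls_unstable = 707.174;
    double uops_pr       = 8626.166, stalls_pr       = 700.658;

    printf("unstable: %.2f%% stall cycles per executed uop\n",
           100.0 * stalls_unstable / uops_unstable);   /* ~8.25% */
    printf("this PR : %.2f%% stall cycles per executed uop\n",
           100.0 * stalls_pr / uops_pr);               /* ~8.12% */
    return 0;
}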

Full details below:

unstable 2bcd890 (merge-base)

 Performance counter stats for process id '1243676':

          59989.54 msec cpu-clock                 #    1.000 CPUs utilized          
      213398172712      cpu-cycles                #    3.557 GHz                      (72.72%)
      494122045956      instructions              #    2.32  insn per cycle           (81.81%)
      514052168626      uops_executed.core        # 8569.031 M/sec                    (81.81%)
       42423046518      uops_executed.stall_cycles #  707.174 M/sec                    (81.81%)
        1068570145      cache-references          #   17.813 M/sec                    (81.81%)
            281576      cache-misses              #    0.026 % of all cache refs      (81.82%)
       42404809459      cycle_activity.stalls_total #  706.870 M/sec                    (81.82%)
       28518853797      cycle_activity.stalls_mem_any #  475.397 M/sec                    (81.83%)
          13530460      cycle_activity.stalls_l3_miss #    0.226 M/sec                    (81.83%)
       11168967183      cycle_activity.stalls_l2_miss #  186.182 M/sec                    (81.82%)
       13030424876      cycle_activity.stalls_l1d_miss #  217.212 M/sec                    (72.72%)

this PR:

 Performance counter stats for process id '1112762':

          59989.29 msec cpu-clock                 #    1.000 CPUs utilized          
      213397095725      cpu-cycles                #    3.557 GHz                      (72.72%)
      496671149840      instructions              #    2.33  insn per cycle           (81.81%)
      517477547014      uops_executed.core        # 8626.166 M/sec                    (81.81%)
       42031996232      uops_executed.stall_cycles #  700.658 M/sec                    (81.81%)
        1064018120      cache-references          #   17.737 M/sec                    (81.81%)
            298139      cache-misses              #    0.028 % of all cache refs      (81.81%)
       42036409619      cycle_activity.stalls_total #  700.732 M/sec                    (81.82%)
       28412615363      cycle_activity.stalls_mem_any #  473.628 M/sec                    (81.83%)
          14161990      cycle_activity.stalls_l3_miss #    0.236 M/sec                    (81.83%)
       11230344994      cycle_activity.stalls_l2_miss #  187.206 M/sec                    (81.83%)
       13064986598      cycle_activity.stalls_l1d_miss #  217.789 M/sec                    (72.73%)

      60.001278949 seconds time elapsed

@oranagra
Member

@filipecosta90 are you going to try some other combinations (like moving the reply/response buffer variables to the top too)?
Or shall I merge it as is, and we can consider further changes in the future?

@filipecosta90
Contributor Author

@filipecosta90 are you going to try some other combinations (like moving the reply/response buffer variables to the top too)? Or shall I merge it as is, and we can consider further changes in the future?

@oranagra I'll confirm with per-LOC perf data that this is the best scenario. Give me until tomorrow EOD to confirm whether this is already the ideal scenario.

@oranagra
Member

I didn't mean to move the flags somewhere else; I meant to move querybuf, qb_pos, querybuf_peak, reply, buf, bufpos, buf_usable_size, etc. all to the top.

@filipecosta90 filipecosta90 self-assigned this May 16, 2022
@filipecosta90 filipecosta90 added action:run-benchmark Triggers the benchmark suite for this Pull Request and removed action:run-benchmark Triggers the benchmark suite for this Pull Request labels May 16, 2022
@yoav-steinberg
Contributor

yoav-steinberg commented May 24, 2022

@oranagra Following your suggestion re moving things around in the client struct, I decided to run Cachegrind on the specific LRANGE benchmark above and see if there are any hotspots. It showed quite a few, mostly around access to other client struct fields, but also some other interesting findings. I'll try to summarize them later today/tomorrow...

@yoav-steinberg
Contributor

yoav-steinberg commented May 25, 2022

I used Cachegrind with the benchmark specified above and saw the following:

  • There are lots of L1 cache read and write misses, mostly when accessing the client struct.
  • flags is important, as stated in this PR, but so are almost all query- and reply-related fields. Also, the memory usage tracking looks at many other fields, like the lists of subscribed channels, so those too are accessed heavily in the main execution path.
  • Specifically, this benchmark does Redis List access and uses a list iterator. The profiling revealed that we needlessly dynamically allocate the iterator struct (and the internal quicklist iterator struct as well). When modifying the code to use an iterator on the stack, we not only reduce malloc/free overheads, we also reduce cache misses due to the stack being much more likely to be in the cache.

I did some quick-and-dirty modifications to the code, both to reorder the fields in the client struct and to use stack-based iterators, and ran the benchmark locally. Afterwards, I couldn't find much more to improve in Cachegrind. Running the benchmark again showed consistently better results after the optimization, although they are pretty mild: 3-4%. @filipecosta90 maybe you'd like to try the patch below and run the benchmark on your setup?

I fear keeping the client struct cache friendly will be a pain; whatever optimization we do now will just break when we change the struct. I'm not sure. @oranagra WDYT?

Regarding stack-based iterators, I think we should transition to them throughout the code; we can use what's in the patch below as a guideline. There are many places internally in quicklist.c that use dynamic allocations for the iterators and need to be changed.

https://gist.github.com/yoav-steinberg/c881868efa58098fa472a901cd28566c
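
As a rough, self-contained illustration of the stack-vs-heap iterator point (hypothetical iterator type and functions, not the actual listTypeIterator/quicklistIter API from the gist): the stack-based variant avoids the malloc/free round trip and keeps the iterator on the already-hot stack frame.

#include <stdlib.h>
#include <stdio.h>

/* Hypothetical iterator used only to contrast the two allocation styles. */
typedef struct myIter { const int *items; size_t len, pos; } myIter;

/* Heap-allocated style: every traversal pays a malloc/free, and the iterator
 * lands on a (possibly cold) heap cache line. */
static myIter *iterNew(const int *items, size_t len) {
    myIter *it = malloc(sizeof(*it));
    it->items = items; it->len = len; it->pos = 0;
    return it;
}

/* Stack-based style: the caller owns the storage, so there is no allocator
 * round trip and the iterator sits in the current stack frame. */
static void iterInit(myIter *it, const int *items, size_t len) {
    it->items = items; it->len = len; it->pos = 0;
}

static int iterNext(myIter *it, int *out) {
    if (it->pos >= it->len) return 0;
    *out = it->items[it->pos++];
    return 1;
}

int main(void) {
    int data[] = {1, 2, 3};
    int v;

    myIter it;                         /* stack-based: preferred */
    iterInit(&it, data, 3);
    while (iterNext(&it, &v)) printf("%d ", v);

    myIter *hp = iterNew(data, 3);     /* heap-based: extra malloc/free */
    while (iterNext(hp, &v)) printf("%d ", v);
    free(hp);
    printf("\n");
    return 0;
}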

@oranagra
Member

I think we have no choice but to periodically re-optimize the cache efficiency of the client struct.
I'd like to benchmark the two subjects separately (client struct reordering vs. iterator refactoring), so that we know what each of them does for us, and also so we don't risk a case where one did some damage but the other's improvement hides it.
@yoav-steinberg I think the easiest would be for you to push two PRs or two branches so that @filipecosta90 can trigger a benchmark on each.

@oranagra
Member

oranagra commented Jun 1, 2022

I'm merging this as a step forward, keeping #10460 open for now and posting a reference to the above comment by yoav.

@oranagra oranagra merged commit 6a6e911 into redis:unstable Jun 1, 2022
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
…truct (redis#10697)

Move the client flags to a more cache friendly position within the client struct.
We regain the ~2% of CPU cycles lost since v6.2 (from 630532.57 to 647449.80 ops/sec).
These were lost due to a higher rate of calls to getClientType caused by changes in redis#9166 and redis#10020.
@fcostaoliveira fcostaoliveira deleted the client.flags branch October 29, 2024 17:05