Conversation

@filipecosta90
Contributor

Fixes #10648.
Following @oranagra's recommendation to move the client flags to a more cache-friendly position within the client struct, we indeed regain the ~2% of CPU cycles lost since v6.2 (from 630532.57 to 647449.80 ops/sec).
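
For illustration only, here is a minimal sketch of the layout idea (not the actual Redis client struct: the field set is heavily trimmed and sds/connection are stand-in types). Keeping flags next to the other per-command fields lets them share cache lines, instead of leaving flags far away among rarely accessed fields:

#include <stdint.h>
#include <stddef.h>

typedef char *sds;                 /* stand-in for Redis' sds string type */
typedef struct connection connection;

typedef struct client {
    uint64_t id;                   /* client incremental unique ID */
    uint64_t flags;                /* moved up: checked on nearly every request */
    connection *conn;              /* hot: used for every read and write */
    sds querybuf;                  /* hot: accumulated input buffer */
    size_t qb_pos;                 /* hot: parse position inside querybuf */
    /* ... colder, rarely touched fields (pubsub state, name, stats) follow ... */
} client;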

To reproduce:

# spin up a standalone redis-server pinned to core 0
taskset -c 0 `pwd`/src/redis-server --logfile redis-opt.log --save "" --daemonize yes

# populate with data
redis-cli "RPUSH" "list:10" "lysbgqqfqw" "mtccjerdon" "jekkafodvk" "nmgxcctxpn" "vyqqkuszzh" "pytrnqdhvs" "oguwnmniig" "gekntrykfh" "nhfnbxqgol" "cgoeihlnei"

# benchmark: 3 runs of 60 seconds each, pipeline 10, on cores 1-5
taskset -c 1-5 memtier_benchmark --command="LRANGE list:10 0 -1"  --hide-histogram --test-time 60 --pipeline 10 -x 3

unstable 2bcd890 (merge-base)

BEST RUN RESULTS
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Lranges    630593.68         3.16045         3.40700         4.31900         4.60700    134863.30 
Totals     630593.68         3.16045         3.40700         4.31900         4.60700    134863.30 


WORST RUN RESULTS
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Lranges    630444.65         3.16119         3.40700         4.31900         4.60700    134831.42 
Totals     630444.65         3.16119         3.40700         4.31900         4.60700    134831.42 


AGGREGATED AVERAGE RESULTS (3 runs)
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Lranges    630532.57         3.16075         3.40700         4.31900         4.60700    134850.23 
Totals     630532.57         3.16075         3.40700         4.31900         4.60700    134850.23 

this PR

BEST RUN RESULTS
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Lranges    647852.82         3.07598         3.31100         4.25500         4.51100    138554.46 
Totals     647852.82         3.07598         3.31100         4.25500         4.51100    138554.46 


WORST RUN RESULTS
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Lranges    646919.60         3.08041         3.31100         4.25500         4.54300    138354.88 
Totals     646919.60         3.08041         3.31100         4.25500         4.54300    138354.88 


AGGREGATED AVERAGE RESULTS (3 runs)
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Lranges    647449.80         3.07789         3.31100         4.25500         4.54300    138468.27 
Totals     647449.80         3.07789         3.31100         4.25500         4.54300    138468.27 

@oranagra
Member

@filipecosta90 maybe we wanna use this opportunity to also check what happens when we move the reply buffers and query buffers to the top?

@filipecosta90
Contributor Author

@filipecosta90 maybe we wanna use this opportunity to also check what happens when we move the reply buffers and query buffers to the top?

will check it. 👍

@uvletter
Contributor

uvletter commented May 9, 2022

How about setting the option -e cache-misses for perf? Maybe that would make the cache issues clearer to observe. I want to give it a try, but getting a Linux physical machine with root permission seems hard for me right now...

@ShooterIT
Member

@filipecosta90 maybe we wanna use this opportunity to also check what happens when we move the reply buffers and query buffers to the top?

good idea.

We can add !(c->flags & CLIENT_PENDING_WRITE) to the last check in prepareClientToWrite, i.e. like this:

if (!(c->flags & CLIENT_PENDING_WRITE) && !clientHasPendingReplies(c) && io_threads_op == IO_THREADS_OP_IDLE)

I think this can also effectively reduce the number of calls to clientHasPendingReplies.
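
A minimal, self-contained sketch of that short-circuit idea (the flag value, struct fields and clientHasPendingReplies body here are stand-ins, not the real Redis definitions): because && evaluates left to right, the cheap flag test runs first and the more expensive pending-replies check is skipped for clients already scheduled for a write.

#include <stdio.h>
#include <stdint.h>

#define CLIENT_PENDING_WRITE (1ULL << 21)   /* hypothetical bit value */

typedef struct { uint64_t flags; int reply_bytes; } client;

/* Stand-in for the more expensive check we want to avoid calling. */
static int clientHasPendingReplies(client *c) {
    return c->reply_bytes > 0;
}

int main(void) {
    client c = { .flags = CLIENT_PENDING_WRITE, .reply_bytes = 64 };

    /* Cheap bitmask test first: if the client is already pending, the
     * && short-circuits and clientHasPendingReplies() is never called. */
    if (!(c.flags & CLIENT_PENDING_WRITE) && !clientHasPendingReplies(&c))
        printf("schedule client for an async write\n");
    else
        printf("already pending (or has replies); nothing to schedule\n");
    return 0;
}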

@filipecosta90
Contributor Author

How about setting the option -e cache-misses for perf? Maybe that would make the cache issues clearer to observe. I want to give it a try, but getting a Linux physical machine with root permission seems hard for me right now...

@uvletter if we gather the following counters while running the benchmark

perf stat -e "cpu-clock,cpu-cycles,instructions,uops_executed.core,uops_executed.stall_cycles,cache-references,cache-misses,cycle_activity.stalls_total,cycle_activity.stalls_mem_any,cycle_activity.stalls_l3_miss,cycle_activity.stalls_l2_miss,cycle_activity.stalls_l1d_miss" --pid $(pgrep redis-server) -- sleep 60

we can observe that the improvement indeed comes from a reduction in stall cycles relative to the total uops executed:

  • unstable: 8569.031 M uops_executed/sec and 707.174 M stall_cycles/sec (stall cycles ≈ 8.25% of executed uops)
  • this PR : 8626.166 M uops_executed/sec and 700.658 M stall_cycles/sec (stall cycles ≈ 8.12% of executed uops)

As you can see below, the stall-cycle drop is where it matters most (stalls_mem_any). Furthermore, we now also execute more uops, as expected, given we're reducing the expensive waits on memory.
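
For clarity, those percentages are just the stall-cycle rate divided by the uop rate; a tiny standalone check using the counter rates from the perf output below (values copied from that output, nothing else assumed):

#include <stdio.h>

int main(void) {
    /* Rates in M/sec, taken from the perf stat output below. */
    double uops_unstable = 8569.031, stalls_unstable = 707.174;
    double uops_pr       = 8626.166, stalls_pr       = 700.658;

    printf("unstable: %.2f%% stall cycles per executed uop\n",
           100.0 * stalls_unstable / uops_unstable);   /* ~8.25% */
    printf("this PR : %.2f%% stall cycles per executed uop\n",
           100.0 * stalls_pr / uops_pr);               /* ~8.12% */
    return 0;
}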

Full details below:

unstable 2bcd890 (merge-base)

 Performance counter stats for process id '1243676':

          59989.54 msec cpu-clock                 #    1.000 CPUs utilized          
      213398172712      cpu-cycles                #    3.557 GHz                      (72.72%)
      494122045956      instructions              #    2.32  insn per cycle           (81.81%)
      514052168626      uops_executed.core        # 8569.031 M/sec                    (81.81%)
       42423046518      uops_executed.stall_cycles #  707.174 M/sec                    (81.81%)
        1068570145      cache-references          #   17.813 M/sec                    (81.81%)
            281576      cache-misses              #    0.026 % of all cache refs      (81.82%)
       42404809459      cycle_activity.stalls_total #  706.870 M/sec                    (81.82%)
       28518853797      cycle_activity.stalls_mem_any #  475.397 M/sec                    (81.83%)
          13530460      cycle_activity.stalls_l3_miss #    0.226 M/sec                    (81.83%)
       11168967183      cycle_activity.stalls_l2_miss #  186.182 M/sec                    (81.82%)
       13030424876      cycle_activity.stalls_l1d_miss #  217.212 M/sec                    (72.72%)

this PR:

 Performance counter stats for process id '1112762':

          59989.29 msec cpu-clock                 #    1.000 CPUs utilized          
      213397095725      cpu-cycles                #    3.557 GHz                      (72.72%)
      496671149840      instructions              #    2.33  insn per cycle           (81.81%)
      517477547014      uops_executed.core        # 8626.166 M/sec                    (81.81%)
       42031996232      uops_executed.stall_cycles #  700.658 M/sec                    (81.81%)
        1064018120      cache-references          #   17.737 M/sec                    (81.81%)
            298139      cache-misses              #    0.028 % of all cache refs      (81.81%)
       42036409619      cycle_activity.stalls_total #  700.732 M/sec                    (81.82%)
       28412615363      cycle_activity.stalls_mem_any #  473.628 M/sec                    (81.83%)
          14161990      cycle_activity.stalls_l3_miss #    0.236 M/sec                    (81.83%)
       11230344994      cycle_activity.stalls_l2_miss #  187.206 M/sec                    (81.83%)
       13064986598      cycle_activity.stalls_l1d_miss #  217.789 M/sec                    (72.73%)

      60.001278949 seconds time elapsed

@oranagra
Member

@filipecosta90 are you going to try some other combinations (like moving the reply/response buffer variables to the top too)?
Or shall I merge it as is, and we can consider further changes in the future?

@filipecosta90
Contributor Author

@filipecosta90 are you going to try some other combinations (like moving the reply/response buffer variables to the top too)? Or shall I merge it as is, and we can consider further changes in the future?

@oranagra I'll confirm with per-LOC perf data that this is the best scenario. Give me until tomorrow EOD to confirm whether this is already the ideal scenario.

@oranagra
Member

I didn't mean to move the flags somewhere else; I meant to move querybuf, qb_pos, querybuf_peak, reply, buf, bufpos, buf_usable_size, etc. all to the top.

@filipecosta90 filipecosta90 self-assigned this May 16, 2022
@filipecosta90 filipecosta90 added action:run-benchmark Triggers the benchmark suite for this Pull Request and removed action:run-benchmark Triggers the benchmark suite for this Pull Request labels May 16, 2022
@yoav-steinberg
Contributor

yoav-steinberg commented May 24, 2022

@oranagra Following your suggestion re moving things around in the client struct, I decided to run Cachegrind on the specific LRANGE benchmark above and see if there are any hotspots. It showed quite a few, mostly around access to other client struct fields, but also some other interesting findings. I'll try to summarize them later today/tomorrow...

@yoav-steinberg
Contributor

yoav-steinberg commented May 25, 2022

I used Cachegrind with the benchmark specified above and saw the following:

  • There are lots of L1 cache read and write misses, mostly when accessing the client struct.
  • flags is important, as stated in this PR, but so are almost all query- and reply-related fields. Also, the memory usage tracking looks at many other fields, like the lists of subscribed channels, so those too are accessed heavily in the main execution path.
  • Specifically, this benchmark does Redis List access and uses a list iterator. The profiling revealed that we needlessly dynamically allocate the iterator struct (and the internal quicklist iterator struct as well). When modifying the code to use an iterator on the stack, we not only reduce malloc/free overheads, we also reduce cache misses due to the stack being much more likely to be in the cache.

I did some quick-and-dirty modifications to the code, both to reorder the fields in the client struct and to use stack-based iterators, and ran the benchmark locally. Afterwards, I couldn't find much more to improve in Cachegrind. Running the benchmark again showed consistently better results after the optimization, although they are pretty mild: 3-4%. @filipecosta90 maybe you'd like to try the patch below and run the benchmark on your setup?

I fear keeping the client struct cache friendly will be a pain; whatever optimization we do now will just break when we change the struct. I'm not sure. @oranagra WDYT?

Regarding stack-based iterators, I think we should transition to them throughout the code; we can use what's in the patch below as a guideline. There are many places internally in quicklist.c that use dynamic allocations for the iterators and need to be changed.

https://gist.github.com/yoav-steinberg/c881868efa58098fa472a901cd28566c
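
As a rough, self-contained illustration of the stack-vs-heap iterator point (hypothetical iterator type and functions, not the actual listTypeIterator/quicklistIter API from the gist): the stack-based variant avoids the malloc/free round trip and keeps the iterator on the already-hot stack frame.

#include <stdlib.h>
#include <stdio.h>

/* Hypothetical iterator used only to contrast the two allocation styles. */
typedef struct myIter { const int *items; size_t len, pos; } myIter;

/* Heap-allocated style: every traversal pays a malloc/free, and the iterator
 * lands on a (possibly cold) heap cache line. */
static myIter *iterNew(const int *items, size_t len) {
    myIter *it = malloc(sizeof(*it));
    it->items = items; it->len = len; it->pos = 0;
    return it;
}

/* Stack-based style: the caller owns the storage, so there is no allocator
 * round trip and the iterator sits in the current stack frame. */
static void iterInit(myIter *it, const int *items, size_t len) {
    it->items = items; it->len = len; it->pos = 0;
}

static int iterNext(myIter *it, int *out) {
    if (it->pos >= it->len) return 0;
    *out = it->items[it->pos++];
    return 1;
}

int main(void) {
    int data[] = {1, 2, 3};
    int v;

    myIter it;                         /* stack-based: preferred */
    iterInit(&it, data, 3);
    while (iterNext(&it, &v)) printf("%d ", v);

    myIter *hp = iterNew(data, 3);     /* heap-based: extra malloc/free */
    while (iterNext(hp, &v)) printf("%d ", v);
    free(hp);
    printf("\n");
    return 0;
}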

@oranagra
Member

I think we have no choice but to periodically re-optimize the cache efficiency of the client struct.
I'd like to benchmark the two subjects separately (client struct reordering vs. iterator refactoring), so that we know what each of them does for us, and also so we don't risk a case where one did some damage but the other's improvement hides it.
@yoav-steinberg I think the easiest would be for you to push two PRs or two branches so that @filipecosta90 can trigger a benchmark on each.

@oranagra
Member

oranagra commented Jun 1, 2022

I'm merging this as a step forward, keeping #10460 open for now and posting a reference to the above comment by yoav.

@oranagra oranagra merged commit 6a6e911 into redis:unstable Jun 1, 2022
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
…truct (redis#10697)

Move the client flags to a more cache friendly position within the client struct.
We regain the ~2% of CPU cycles lost since v6.2 (from 630532.57 to 647449.80 ops/sec).
These were lost due to a higher rate of calls to getClientType caused by changes in redis#9166 and redis#10020.
@fcostaoliveira fcostaoliveira deleted the client.flags branch October 29, 2024 17:05