@yoav-steinberg
Contributor
In a benchmark done by @filipecosta90 he noticed we spend a relatively long time updating the client memory usage leading to performance degradation.
Before #8687 this was performed in the client's cron and didn't affect performance. But since introducing client eviction we need to perform this after filling the input buffers and after processing commands. This also led me to write this code to be thread safe and perform it in the I/O threads.

It turns out that the main performance issue here is related to atomic operations being performed while updating the total clients memory usage stats used for client eviction (server.stat_clients_type_memory[]). This update needed to be atomic because updateClientMemUsage() was called from the IO threads.
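To illustrate the cost being discussed, here is a minimal sketch (not the actual Redis code; the array name mirrors the real stat but the types and helpers are simplified assumptions): updating a shared per-type memory stat from IO threads forces an atomic read-modify-write on every call, whereas a main-thread-only update is a plain add.

```c
#include <stdatomic.h>
#include <stddef.h>

#define CLIENT_TYPE_COUNT 4

/* Simplified stand-ins for server.stat_clients_type_memory[]. */
static _Atomic size_t stat_atomic[CLIENT_TYPE_COUNT]; /* needed if IO threads write */
static size_t stat_plain[CLIENT_TYPE_COUNT];          /* enough for main-thread-only writes */

/* Pre-PR shape: called from IO threads, so every update is a locked
 * read-modify-write (the operations that showed up in the profile). */
static void update_stat_atomic(int type, ptrdiff_t delta) {
    /* Unsigned wraparound makes a negative delta act as a subtraction. */
    atomic_fetch_add_explicit(&stat_atomic[type], (size_t)delta,
                              memory_order_relaxed);
}

/* Post-PR shape: called only from the main thread, so a plain add suffices. */
static void update_stat_plain(int type, ptrdiff_t delta) {
    stat_plain[type] += (size_t)delta;
}
```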

In this PR I make sure to call updateClientMemUsage() only from the main thread. In case of threaded IO I call it for each client during the "fan-in" phase of the read/write operation. This also means I could drop the separate updateClientMemUsageBucket() function, which was called during this phase, and embed its logic into updateClientMemUsage().
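A minimal sketch of the fan-in idea (hypothetical types and bucket math; the real implementation lives in Redis's networking/server code): after the IO threads hand their clients back, the main thread walks them and, for each, recomputes the memory usage, adjusts the global stat with a plain add, and places the client in its eviction bucket in the same call.

```c
#include <stddef.h>

#define MEM_USAGE_BUCKETS 8

typedef struct client {
    size_t mem_usage;   /* last recorded memory usage */
    int bucket;         /* eviction bucket index */
} client;

/* Main-thread-only global: no atomics required after this change. */
static size_t stat_clients_memory = 0;

/* Hypothetical log2-style bucketing of a client's memory usage. */
static int memUsageToBucket(size_t mem) {
    int b = 0;
    while (mem >= 1024 && b < MEM_USAGE_BUCKETS - 1) { mem >>= 1; b++; }
    return b;
}

/* The merged update: bucket placement is embedded here instead of living
 * in a separate updateClientMemUsageBucket(). */
static void updateClientMemUsage(client *c, size_t new_usage) {
    /* Plain add; unsigned wraparound handles a shrinking client. */
    stat_clients_memory += new_usage - c->mem_usage;
    c->mem_usage = new_usage;
    c->bucket = memUsageToBucket(new_usage);
}

/* Fan-in: the main thread updates each client whose IO just completed. */
static void fanInUpdate(client *clients, const size_t *usages, size_t n) {
    for (size_t i = 0; i < n; i++)
        updateClientMemUsage(&clients[i], usages[i]);
}
```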

Profiling shows this makes updateClientMemUsage() roughly 4x faster (on my x86_64 Linux).

This is still a WIP since I need to clean up the code, and it requires some attention during review to make sure all threading issues are resolved and client eviction isn't broken in any way, including when running with read/write IO threads.

Attached are my profiling flame graphs.

@yoav-steinberg
Contributor Author

yoav-steinberg commented Mar 9, 2022
Before this PR: flame graph (unstable)

After this PR: flame graph (ucmu-opt)

@oranagra
Member

oranagra left a comment

LGTM

@yoav-steinberg yoav-steinberg marked this pull request as ready for review March 14, 2022 08:16
@yoav-steinberg
Contributor Author

@oranagra Please re-review.
@filipecosta90 Please re-benchmark.

@filipecosta90 filipecosta90 added the action:run-benchmark Triggers the benchmark suite for this Pull Request label Mar 14, 2022
@filipecosta90
Contributor

@yoav-steinberg we can notice a reduction in the CPU cycles spent in updateClientMemUsage from 2.2% -> 1.5%, but the increase in ops/sec is not very significant (as expected, since the throughput and latency measurements are done without the profiler). Nonetheless, if we look at the impact on latency (a 0.5% drop), we can associate it with the drop in CPU cycles spent in updateClientMemUsage.

WRT the measured changes in throughput and latency:

  • throughput: from 558591 to 559258 ops/sec (0.12% change)
  • latency (p50 including RTT): from 2.863 ms to 2.847 ms (0.5% change)
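For reference, the quoted percentages follow from the raw numbers in the runs below via a trivial relative-change helper (a sketch; `pct_change` is not a Redis or memtier function):

```c
#include <math.h>

/* Relative change in percent between two measurements;
 * negative means the value went down. */
static double pct_change(double before, double after) {
    return (after - before) / before * 100.0;
}
```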

This is a small improvement, but nonetheless a reduction that we should merge IMHO. Agree?

Confirmation that the function is indeed less heavy on CPU now:

# to record data
perf record -g --pid `pgrep redis-server` --call-graph dwarf -o unstable -- sleep 30

# report (search for updateClientMemUsage with `/` when in perf report mode)
perf report -g "graph,0.01,caller" -i unstable -d redis-server --inline

unstable (af6d5c5), total CPU time of updateClientMemUsage + children: ~2.2% of CPU cycles

Samples: 119K of event 'cycles:ppp', Event count (approx.): 107447366785
  Children      Self  Command       Symbol
+    2.17%     0.78%  redis-server  [.] updateClientMemUsage
+    0.23%     0.23%  redis-server  [.] updateClientMemUsageBucket

wip_optimize_updateClientMemUsage (406aea4):

Samples: 119K of event 'cycles:ppp', Event count (approx.): 107509389716
  Children      Self  Command       Symbol
+    1.46%     0.31%  redis-server  [.] updateClientMemUsage

benchmark

steps:

# spin redis
taskset -c 0 ./src/redis-server --save "" --daemonize yes

# populate data
memtier_benchmark --ratio 1:0 --key-maximum 1000000 --key-minimum 1 --key-pattern P:P -d 1000 --hide-histogram

# benchmark
taskset -c 1,2 memtier_benchmark -d 1000 --ratio 0:1 --test-time 60 --pipeline 15 --key-pattern=P:P -t 2 --hide-histogram --key-maximum=1000000 --key-minimum 1 -x 3

results for unstable (af6d5c5):

BEST RUN RESULTS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------------------------------
Sets            0.00          ---          ---             ---             ---             ---             ---         0.00 
Gets       558591.33    558591.33         0.00         2.68485         2.86300         3.82300         4.12700    568894.72 
Waits           0.00          ---          ---             ---             ---             ---             ---          --- 
Totals     558591.33    558591.33         0.00         2.68485         2.86300         3.82300         4.12700    568894.72 


WORST RUN RESULTS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------------------------------
Sets            0.00          ---          ---             ---             ---             ---             ---         0.00 
Gets       558335.05    558335.05         0.00         2.68610         2.84700         3.82300         4.12700    568634.38 
Waits           0.00          ---          ---             ---             ---             ---             ---          --- 
Totals     558335.05    558335.05         0.00         2.68610         2.84700         3.82300         4.12700    568634.38 

results for wip_optimize_updateClientMemUsage (406aea4):

BEST RUN RESULTS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------------------------------
Sets            0.00          ---          ---             ---             ---             ---             ---         0.00 
Gets       559258.27    559258.27         0.00         2.68165         2.84700         3.82300         4.12700    569574.49 
Waits           0.00          ---          ---             ---             ---             ---             ---          --- 
Totals     559258.27    559258.27         0.00         2.68165         2.84700         3.82300         4.12700    569574.49 


WORST RUN RESULTS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------------------------------
Sets            0.00          ---          ---             ---             ---             ---             ---         0.00 
Gets       558619.75    558619.75         0.00         2.68470         2.84700         3.82300         4.12700    568923.76 
Waits           0.00          ---          ---             ---             ---             ---             ---          --- 
Totals     558619.75    558619.75         0.00         2.68470         2.84700         3.82300         4.12700    568923.76 

@yoav-steinberg
Contributor Author

@filipecosta90 Thanks for the analysis. Please note that this fix merges updateClientMemUsage() and updateClientMemUsageBucket() into one function, so the actual change in your benchmark is from ~2.4% (2.17% + 0.23%) to ~1.5%.

@oranagra oranagra merged commit cf6dcb7 into redis:unstable Mar 15, 2022
@oranagra oranagra mentioned this pull request Apr 5, 2022
warrick1016 pushed a commit to ctripcorp/Redis-On-Rocks that referenced this pull request Aug 29, 2025
