@yoav-steinberg
Contributor
In a benchmark done by @filipecosta90 he noticed we spend a relatively long time updating the client memory usage leading to performance degradation.
Before #8687 this was performed in the client's cron and didn't affect performance. But since introducing client eviction we need to perform this after filling the input buffers and after processing commands. This also led me to write this code to be thread safe and perform it in the I/O threads.

It turns out that the main performance issue here is related to atomic operations being performed while updating the total clients memory usage stats used for client eviction (server.stat_clients_type_memory[]). This update needed to be atomic because updateClientMemUsage() was called from the IO threads.
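To illustrate the cost being discussed, here is a minimal sketch (not the actual Redis code; the array name mirrors the real stat but the types and helpers are simplified assumptions): updating a shared per-type memory stat from IO threads forces an atomic read-modify-write on every call, whereas a main-thread-only update is a plain add.

```c
#include <stdatomic.h>
#include <stddef.h>

#define CLIENT_TYPE_COUNT 4

/* Simplified stand-ins for server.stat_clients_type_memory[]. */
static _Atomic size_t stat_atomic[CLIENT_TYPE_COUNT]; /* needed if IO threads write */
static size_t stat_plain[CLIENT_TYPE_COUNT];          /* enough for main-thread-only writes */

/* Pre-PR shape: called from IO threads, so every update is a locked
 * read-modify-write (the operations that showed up in the profile). */
static void update_stat_atomic(int type, ptrdiff_t delta) {
    /* Unsigned wraparound makes a negative delta act as a subtraction. */
    atomic_fetch_add_explicit(&stat_atomic[type], (size_t)delta,
                              memory_order_relaxed);
}

/* Post-PR shape: called only from the main thread, so a plain add suffices. */
static void update_stat_plain(int type, ptrdiff_t delta) {
    stat_plain[type] += (size_t)delta;
}
```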

In this PR I make sure to call updateClientMemUsage() only from the main thread. In case of threaded IO I call it for each client during the "fan-in" phase of the read/write operation. This also means I could drop the separate updateClientMemUsageBucket() function, which was called during this phase, and embed its logic into updateClientMemUsage().
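A minimal sketch of the fan-in idea (hypothetical types and bucket math; the real implementation lives in Redis's networking/server code): after the IO threads hand their clients back, the main thread walks them and, for each, recomputes the memory usage, adjusts the global stat with a plain add, and places the client in its eviction bucket in the same call.

```c
#include <stddef.h>

#define MEM_USAGE_BUCKETS 8

typedef struct client {
    size_t mem_usage;   /* last recorded memory usage */
    int bucket;         /* eviction bucket index */
} client;

/* Main-thread-only global: no atomics required after this change. */
static size_t stat_clients_memory = 0;

/* Hypothetical log2-style bucketing of a client's memory usage. */
static int memUsageToBucket(size_t mem) {
    int b = 0;
    while (mem >= 1024 && b < MEM_USAGE_BUCKETS - 1) { mem >>= 1; b++; }
    return b;
}

/* The merged update: bucket placement is embedded here instead of living
 * in a separate updateClientMemUsageBucket(). */
static void updateClientMemUsage(client *c, size_t new_usage) {
    /* Plain add; unsigned wraparound handles a shrinking client. */
    stat_clients_memory += new_usage - c->mem_usage;
    c->mem_usage = new_usage;
    c->bucket = memUsageToBucket(new_usage);
}

/* Fan-in: the main thread updates each client whose IO just completed. */
static void fanInUpdate(client *clients, const size_t *usages, size_t n) {
    for (size_t i = 0; i < n; i++)
        updateClientMemUsage(&clients[i], usages[i]);
}
```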

Profiling shows this makes updateClientMemUsage() roughly 4x faster (on my x86_64 Linux).

This is still a WIP since I need to clean up the code, and it requires some attention during review to make sure all threading issues are resolved and client eviction isn't broken in any way, including when running with read/write IO threads.

Attached are my profiling flame graphs.

@yoav-steinberg
Contributor Author

yoav-steinberg commented Mar 9, 2022
Before this PR: flame graph (unstable)

After this PR: flame graph (ucmu-opt)

@oranagra
Member

oranagra left a comment

LGTM

@yoav-steinberg yoav-steinberg marked this pull request as ready for review March 14, 2022 08:16
@yoav-steinberg
Contributor Author

@oranagra Please re-review.
@filipecosta90 Please re-benchmark.

@filipecosta90 filipecosta90 added the action:run-benchmark Triggers the benchmark suite for this Pull Request label Mar 14, 2022
@filipecosta90
Contributor

@yoav-steinberg we can notice a reduction in the CPU cycles spent in updateClientMemUsage from 2.2% -> 1.5%, but the increase in ops/sec is not very significant (as expected, since the throughput and latency measurements are done without the profiler). Nonetheless, if we look at the impact on latency (a 0.5% drop), we can associate it with the drop in CPU cycles spent in updateClientMemUsage.

WRT the measured changes in throughput and latency:

  • throughput: from 558591 to 559258 ops/sec (0.12% change)
  • latency (p50 including RTT): from 2.863 ms to 2.847 ms (0.5% change)
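For reference, the quoted percentages follow from the raw numbers in the runs below via a trivial relative-change helper (a sketch; `pct_change` is not a Redis or memtier function):

```c
#include <math.h>

/* Relative change in percent between two measurements;
 * negative means the value went down. */
static double pct_change(double before, double after) {
    return (after - before) / before * 100.0;
}
```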

This is a small improvement, but nonetheless a reduction that we should merge IMHO. Agree?

Confirmation that the function is indeed less heavy on CPU now:

# to record data
perf record -g --pid `pgrep redis-server` --call-graph dwarf -o unstable -- sleep 30

# report (search for updateClientMemUsage with `/` when in perf report mode)
perf report -g "graph,0.01,caller" -i unstable -d redis-server --inline

unstable (af6d5c5), total CPU time of updateClientMemUsage + children: ~2.2% of CPU cycles

Samples: 119K of event 'cycles:ppp', Event count (approx.): 107447366785
  Children      Self  Command       Symbol
+    2.17%     0.78%  redis-server  [.] updateClientMemUsage
+    0.23%     0.23%  redis-server  [.] updateClientMemUsageBucket

wip_optimize_updateClientMemUsage (406aea4):

Samples: 119K of event 'cycles:ppp', Event count (approx.): 107509389716
  Children      Self  Command       Symbol
+    1.46%     0.31%  redis-server  [.] updateClientMemUsage

benchmark

steps:

# spin redis
taskset -c 0 ./src/redis-server --save "" --daemonize yes

# populate data
memtier_benchmark --ratio 1:0 --key-maximum 1000000 --key-minimum 1 --key-pattern P:P -d 1000 --hide-histogram

# benchmark
taskset -c 1,2 memtier_benchmark -d 1000 --ratio 0:1 --test-time 60 --pipeline 15 --key-pattern=P:P -t 2 --hide-histogram --key-maximum=1000000 --key-minimum 1 -x 3

results for unstable (af6d5c5):

BEST RUN RESULTS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------------------------------
Sets            0.00          ---          ---             ---             ---             ---             ---         0.00 
Gets       558591.33    558591.33         0.00         2.68485         2.86300         3.82300         4.12700    568894.72 
Waits           0.00          ---          ---             ---             ---             ---             ---          --- 
Totals     558591.33    558591.33         0.00         2.68485         2.86300         3.82300         4.12700    568894.72 


WORST RUN RESULTS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------------------------------
Sets            0.00          ---          ---             ---             ---             ---             ---         0.00 
Gets       558335.05    558335.05         0.00         2.68610         2.84700         3.82300         4.12700    568634.38 
Waits           0.00          ---          ---             ---             ---             ---             ---          --- 
Totals     558335.05    558335.05         0.00         2.68610         2.84700         3.82300         4.12700    568634.38 

results for wip_optimize_updateClientMemUsage (406aea4):

BEST RUN RESULTS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------------------------------
Sets            0.00          ---          ---             ---             ---             ---             ---         0.00 
Gets       559258.27    559258.27         0.00         2.68165         2.84700         3.82300         4.12700    569574.49 
Waits           0.00          ---          ---             ---             ---             ---             ---          --- 
Totals     559258.27    559258.27         0.00         2.68165         2.84700         3.82300         4.12700    569574.49 


WORST RUN RESULTS
============================================================================================================================
Type         Ops/sec     Hits/sec   Misses/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
----------------------------------------------------------------------------------------------------------------------------
Sets            0.00          ---          ---             ---             ---             ---             ---         0.00 
Gets       558619.75    558619.75         0.00         2.68470         2.84700         3.82300         4.12700    568923.76 
Waits           0.00          ---          ---             ---             ---             ---             ---          --- 
Totals     558619.75    558619.75         0.00         2.68470         2.84700         3.82300         4.12700    568923.76 

@yoav-steinberg
Contributor Author

@filipecosta90 Thanks for the analysis. Please note that this fix merges updateClientMemUsage() and updateClientMemUsageBucket() into one function, so the actual change in your benchmark is from ~2.4% (2.17% + 0.23%) to ~1.5%.

@oranagra oranagra merged commit cf6dcb7 into redis:unstable Mar 15, 2022
@oranagra oranagra mentioned this pull request Apr 5, 2022
warrick1016 pushed a commit to ctripcorp/Redis-On-Rocks that referenced this pull request Aug 29, 2025
