
Conversation

@enjoy-binbin
Contributor

@enjoy-binbin enjoy-binbin commented Mar 4, 2024

In some cases, users will abuse lua eval. Each EVAL call generates
a new lua script, which is added to the lua interpreter and cached
to redis-server, consuming a large amount of memory over time.

Since EVAL is mostly the one that abuses the lua cache, and these
won't have pipeline issues (i.e. the script won't disappear unexpectedly,
and cause errors like it would with SCRIPT LOAD and EVALSHA),
we implement a plain FIFO LRU eviction only for these (not for
scripts loaded with SCRIPT LOAD).

Implementation notes:

When not abused we'll probably have fewer than 100 scripts, and when
abused we'll have many thousands, so we use a hard-coded limit of 500
scripts. And since we don't have many scripts, unlike keys we don't
need to worry about the memory usage of keeping a true sorted LRU
linked list. We compute the SHA of each script anyway and put the
script in a dict, so we can store a listNode there and use it for
quick removal and re-insertion into an LRU list each time the script
is used.
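
Roughly, the idea looks like the sketch below (a hedged illustration, not the
exact code in this PR: the struct and function names are made up, and the real
code wires this into eval.c; it leans on Redis' internal dict/adlist helpers):

```c
#include "server.h"   /* sketch against Redis' internal dict/adlist APIs */

#define EVAL_SCRIPTS_LRU_MAX 500   /* hard-coded cap from the description above */

/* Illustrative shape of a cached-script entry: the dict value keeps a
 * back-pointer to its node in the LRU list (NULL for SCRIPT LOAD scripts). */
typedef struct luaScriptEntry {
    robj *body;
    listNode *node;
} luaScriptEntry;

/* Each time an EVAL script is created or called, move its node to the tail,
 * so the head of the list is always the least recently used EVAL script. */
static void evalScriptsLRUTouch(list *lru_list, luaScriptEntry *l) {
    if (l->node == NULL) return;      /* SCRIPT LOAD scripts are never evicted */
    listUnlinkNode(lru_list, l->node);
    listLinkNodeTail(lru_list, l->node);
}

/* Before caching a new EVAL script, evict from the head if the cap is hit.
 * The node's value is the script's SHA1, which is also the dict key. */
static void evalScriptsLRUMaybeEvict(dict *lua_scripts, list *lru_list,
                                     long long *evicted_scripts) {
    while (listLength(lru_list) >= EVAL_SCRIPTS_LRU_MAX) {
        listNode *ln = listFirst(lru_list);
        sds sha = listNodeValue(ln);
        /* ...also unload the function from the Lua interpreter here... */
        dictDelete(lua_scripts, sha);   /* drops the server-side cache entry */
        listDelNode(lru_list, ln);
        (*evicted_scripts)++;           /* surfaced as evicted_scripts in INFO */
    }
}
```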

New interfaces:

At the same time, a new evicted_scripts field is added to
INFO, which represents the number of evicted eval scripts. Users
can check it to see if they are abusing EVAL.

benchmark:

./src/redis-benchmark -P 10 -n 1000000 -r 10000000000 eval "return __rand_int__" 0

This simple EVAL-abuse benchmark creates 1 million EVAL scripts.
Performance has improved by 50%, and the max latency has dropped from
500ms to 13ms (the old worst case was likely caused by table expansion
inside Lua when the number of scripts is large). And in INFO memory,
it used to consume 120MB (server cache) + 310MB (lua engine), but now
it only consumes 70KB (server cache) + 210KB (lua engine) because of
the script eviction.

For a non-abusive case of about 100 EVAL scripts, there's no noticeable
change in performance or memory usage.

unlikely potentially breaking change:

in theory, a user can load a script with EVAL and then use EVALSHA to
call it (by calculating the SHA1 value on the client side). If we read
the docs carefully we'll realize it's a valid scenario, but we suppose
it's extremely rare. So it may happen that EVALSHA acts on a script
created by EVAL, the script is evicted, and EVALSHA returns a NOSCRIPT
error; that is, if more than 500 scripts are used in the same
transaction / pipeline.
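
To make the scenario concrete, here is a hedged client-side sketch (not part
of this PR; it assumes hiredis and OpenSSL are available on the client):

```c
#include <stdio.h>
#include <string.h>
#include <hiredis/hiredis.h>
#include <openssl/sha.h>

int main(void) {
    const char *script = "return 1";
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) return 1;

    /* EVAL caches the script server-side as a side effect. */
    redisReply *r = redisCommand(c, "EVAL %s 0", script);
    freeReplyObject(r);

    /* Compute the SHA1 locally instead of using SCRIPT LOAD. */
    unsigned char md[SHA_DIGEST_LENGTH];
    char sha[2 * SHA_DIGEST_LENGTH + 1];
    SHA1((const unsigned char *)script, strlen(script), md);
    for (int i = 0; i < SHA_DIGEST_LENGTH; i++) sprintf(sha + 2 * i, "%02x", md[i]);

    /* With this PR, if 500+ other EVAL scripts ran in between, the script may
     * have been evicted and this call now fails with a NOSCRIPT error. */
    r = redisCommand(c, "EVALSHA %s 0", sha);
    if (r->type == REDIS_REPLY_ERROR) printf("error: %s\n", r->str);
    else printf("result: %lld\n", r->integer);
    freeReplyObject(r);
    redisFree(c);
    return 0;
}
```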

This solves the second point in #13102.

In some cases, users will abuse lua eval. Each eval call generates
a new lua script, which is added to the lua interpreter and cached
to redis-server, consuming a large amount of memory over time.

Since EVAL is mostly the one that abuses the lua cache, and these
won't have pipeline issues, we implement approximate LRU script
eviction only for these (not for ones loaded with SCRIPT LOAD).
Lua scripts are divided into eval scripts and load scripts, and
we will only evict eval scripts.

We call this mechanism "eval scripts eviction".

A new configuration of `maxmemory-eval-scripts` is introduced, which
represents the memory limit used by eval scripts in redis-server
(part of used_memory). The default is 0, which means no limit.

At the same time, a new `evicted_eval_scripts` field is added to INFO,
which represents the number of evicted eval scripts due to
`maxmemory-eval-scripts` limit.

The general design is as follows:
- Save eval scripts and load scripts in separate dictionaries.
- Update the lua body's lru when created or called.
- Use the approximate LRU pool in evict.c to populate the data.
- In order to reduce code complexity and not affect users, we will
  only check the limit in serverCron (which also naturally limits how
  often we check).

This solves the second point in redis#13102.
@enjoy-binbin
Contributor Author

I try to keep the diff small and simple. If we click "Hide whitespace" during the review, the diff in evict will look much better. Posting it first for some review, and then I will see how to add some test cases.

@enjoy-binbin enjoy-binbin requested a review from oranagra March 4, 2024 11:41
@sundb
Collaborator

sundb commented Mar 4, 2024

first impression, too much repetitive code.🙃

Member

@oranagra oranagra left a comment


i didn't review it in depth, but here are some random thoughts:

  1. i don't think we need a config for this, when not abused we'll probably have less than 100 scripts, and when abused we'll have many thousands. i think we can use a hard coded value of 500 scripts, and i don't think we need to monitor their memory usage.
  2. considering that we don't have many scripts, then unlike keys, we don't need to worry about the memory usage of keeping a true sorted LRU linked list. IIRC we compute the SHA of each script anyway, and put the script in a dict, we can store a listNode there, and use it for quick removal and re-insertion into an LRU list each time the script is used.

WDYT?

@enjoy-binbin
Contributor Author

yes, this sounds good to me, this would be much simpler. I also thought this way at first, but I was worried that it would be too "simple". I will re-code this part.

@enjoy-binbin enjoy-binbin marked this pull request as ready for review March 6, 2024 09:34
@enjoy-binbin enjoy-binbin changed the title from "Lua eval scripts eviction" to "Lua eval scripts first in first out LRU eviction" Mar 6, 2024
@oranagra oranagra added the state:major-decision (Requires core team consensus) and release-notes (indication that this issue needs to be mentioned in the release notes) labels Mar 7, 2024
Member

@oranagra oranagra left a comment


can you post some benchmarks and memory usage comparisons of before and after.

@oranagra
Member

oranagra commented Mar 7, 2024

while we're waiting for the benchmarks and memory stats.
@redis/core-team please approve the concept and new info field.

@oranagra oranagra added the approval-needed (Waiting for core team approval to be merged) label Mar 7, 2024
@enjoy-binbin
Contributor Author

enjoy-binbin commented Mar 9, 2024

benchmark:
./src/redis-benchmark -P 10 -n 1000000 -r 10000000000 eval "return __rand_int__" 0

This simple EVAL-abuse benchmark creates 1 million eval scripts. Performance has improved by 50%, and the max latency has dropped from 500ms to 13ms (this may be caused by table expansion inside Lua when the number of scripts is large). And in INFO memory, it used to consume 120MB (server cache) + 310MB (lua engine), but now it only consumes 70KB (server cache) + 210KB (lua engine) because of the script eviction.

unstable:

round 1:
Summary:
  throughput summary: 48346.55 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
       10.243     1.176     7.871    22.511    45.727   503.295

used_memory_vm_eval:327702528
used_memory_scripts_eval:120342128
number_of_cached_scripts:999585


round 2:
Summary:
  throughput summary: 48787.63 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
       10.155     1.608     7.919    22.335    44.543   477.951

used_memory_vm_eval:327690240
used_memory_scripts_eval:120347952
number_of_cached_scripts:999637

round 3:
Summary:
  throughput summary: 48564.91 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
       10.199     0.744     7.927    22.527    46.367   482.047

used_memory_vm_eval:327727104
used_memory_scripts_eval:120344704
number_of_cached_scripts:999608

this branch:

round 1:
Summary:
  throughput summary: 73719.13 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        6.689     0.784     7.223     7.719    11.471    13.295

used_memory_vm_eval:214016
used_memory_scripts_eval:76096
number_of_cached_scripts:500
evicted_scripts:999500

round 2:
Summary:
  throughput summary: 75677.31 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        6.507     0.752     7.079     7.567     7.815    13.151

used_memory_vm_eval:214016
used_memory_scripts_eval:76096
number_of_cached_scripts:500
evicted_scripts:999498

round 3:
Summary:
  throughput summary: 75261.54 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        6.544     0.744     7.111     7.575     7.791    11.439

used_memory_vm_eval:214016
used_memory_scripts_eval:76096
number_of_cached_scripts:500
evicted_scripts:999500

@oranagra
Member

thanks. i'm more interested in the impact on non-abusive cases.
e.g. one with heavy use of 100 scripts.

@enjoy-binbin
Contributor Author

enjoy-binbin commented Mar 10, 2024

ohh, right, i have thought about testing non-abusive cases before, but did not come up with a simple/good way.

using -r 100, it will only create 100 scripts; it seems like there is a slight loss in performance
./src/redis-benchmark -P 10 -n 1000000 -r 100 eval "return __rand_int__" 0

unstable, five consecutive records after restarting:

Summary:
  throughput summary: 244558.56 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.964     0.736     2.079     2.319     2.511     4.471

Summary:
  throughput summary: 246548.31 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.945     0.664     2.063     2.295     2.383     4.223

Summary:
  throughput summary: 244917.94 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.959     0.688     2.087     2.327     2.431     3.255

Summary:
  throughput summary: 243546.03 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.972     0.696     2.103     2.351     2.455     3.415

Summary:
  throughput summary: 246548.31 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.946     0.664     2.055     2.287     2.367     4.383

this PR:

Summary:
  throughput summary: 241021.92 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.993     0.744     2.111     2.351     2.487     6.143

Summary:
  throughput summary: 243486.73 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.969     0.672     2.087     2.319     2.383     2.863

Summary:
  throughput summary: 239578.36 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        2.004     0.936     2.111     2.367     2.783     4.335

Summary:
  throughput summary: 241779.48 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.984     0.648     2.111     2.351     2.447     3.759

Summary:
  throughput summary: 240963.84 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.993     0.664     2.119     2.359     2.495     4.151

memory usage:

unstable:
used_memory_lua:63488
used_memory_scripts_eval:12224

this branch:
used_memory_lua:63488
used_memory_scripts_eval:15424

@oranagra
Member

that's a 0.13% change (insignificant).
just to cover everything, let's also measure the memory usage impact.

@enjoy-binbin
Contributor Author

right, I updated it in the comments. Basically a script will take 32 more bytes than before (a listNode * in luaScript and a listNode in lua_scripts_lru_list)

unstable:
used_memory_lua:63488
used_memory_scripts_eval:12224

this branch:
used_memory_lua:63488
used_memory_scripts_eval:15424
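
For reference, a minimal sketch of where the 32 bytes per script come from on
a 64-bit build (struct layout simplified for the accounting only):

```c
/* Simplified layout, just for the memory accounting (64-bit pointers). */
typedef struct listNode {      /* the node placed in lua_scripts_lru_list */
    struct listNode *prev;
    struct listNode *next;
    void *value;               /* points at the script's SHA1 dict key */
} listNode;                    /* 3 pointers = 24 bytes */

typedef struct luaScript {
    void *body;                /* existing script body pointer */
    listNode *node;            /* new back-pointer into the LRU list = 8 bytes */
} luaScript;

/* 24 + 8 = 32 extra bytes per EVAL script, matching the numbers above:
 * (15424 - 12224) / 100 scripts = 32 bytes. */
```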

@zuiderkwast
Contributor

it could be that if we read the docs
carefully we'll realize it's a valid scenario, but we suppose it's
extremely rare. So it may happen that EVALSHA acts on a script created
by EVAL, and the script is evicted and EVALSHA returns a NOSCRIPT error.

We should update these docs to make this clear.

@enjoy-binbin enjoy-binbin added the state:needs-doc-pr (requires a PR to redis-doc repository) label Mar 11, 2024
Contributor

@madolson madolson left a comment


Directional approval, with some minor comments.

@oranagra
Member

approved in a core-team meeting. decided to tag it as a breaking change.
@enjoy-binbin can you make a redis-doc PR with some adjustments (the new info field, and maybe a word about script eviction if you find a place for it)?

@oranagra oranagra merged commit ad28d22 into redis:unstable Mar 13, 2024
@oranagra oranagra added the breaking-change (This change can potentially break existing application) label Mar 13, 2024
@enjoy-binbin enjoy-binbin deleted the lua_eval_scripts_eviction branch March 13, 2024 06:29
sundb added a commit that referenced this pull request Sep 2, 2025
This PR fixes two crashes due to the defragmentation of the Lua script,
which were introduced by #13108.

1. During long-running Lua script execution, active defragmentation may
be triggered, causing the luaScript structure to be reallocated to a new
memory location; then we access `l->node` (which may have been reallocated)
after script execution to update the Lua LRU list.
In this PR, we don't defrag during blocked scripts, so we don't mess up
the LRU update when the script ends.
   Note that defrag is now only permitted during loading.
This PR also reverts the changes made by
#14274.

2. We forgot to update the Lua LRU list node's value.
Since a `lua_scripts_lru_list` node stores a pointer to the `lua_script`'s
key, we also need to update `node->value` when the key is reallocated.
In this PR, after performing defragmentation on a Lua script, if the
script is in the LRU list, its reference in the LRU list will be
unconditionally updated.
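
A hedged sketch of that unconditional update (it assumes Redis' `activeDefragSds`,
`dictSetKey` and `dictGetVal` helpers; the function name and wiring here are
illustrative, not the exact patch):

```c
/* Illustrative only: called from the activedefrag scan over the lua scripts
 * dict after the entry's sds key has potentially been reallocated. */
static void defragLuaScriptLRURef(dict *lua_scripts, dictEntry *de, sds sha) {
    sds new_sha = activeDefragSds(sha);
    if (new_sha == NULL) return;              /* key was not moved */
    dictSetKey(lua_scripts, de, new_sha);     /* dict entry now owns the moved key */
    luaScript *script = dictGetVal(de);
    if (script->node)
        script->node->value = new_sha;        /* keep lua_scripts_lru_list in sync */
}
```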
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
YaacovHazan pushed a commit to YaacovHazan/redis that referenced this pull request Sep 28, 2025
YaacovHazan pushed a commit to YaacovHazan/redis that referenced this pull request Sep 29, 2025
sundb added a commit to YaacovHazan/redis that referenced this pull request Sep 30, 2025