Skip to content

Conversation

@filipecosta90
Copy link
Contributor

@filipecosta90 filipecosta90 commented Dec 21, 2020

this PR pushes forward the observability on overall error statistics and command statistics within redis-server:

  • It extends INFO COMMANDSTATS to have

    • failed_calls in - so we can keep track of errors that happen from the command itself, broken by command.
    • rejected_calls - so we can keep track of errors that were triggered outside the commmand processing per se ( example of wrong arity, oom, etc... )
      Sample ( notice rejected_calls ):
    127.0.0.1:6379> set k 
    (error) ERR wrong number of arguments for 'set' command
    127.0.0.1:6379> info commandstats
    # Commandstats
    cmdstat_set:calls=0,usec=0,usec_per_call=0.00,rejected_calls=1,failed_calls=0
    cmdstat_info:calls=8,usec=1077,usec_per_call=134.62,rejected_calls=0,failed_calls=0
    
  • Adds a new section to INFO, named ERRORSTATS that enables keeping track of the different errors that occur within redis ( within processCommand and call ) based uppong the reply Error Prefix ( The first word after the "-", up to the first space ).

    Sample:

    127.0.0.1:6379> auth a
    (error) ERR AUTH <password> called without any password configured for the default user. Are you sure your configuration is correct?
    127.0.0.1:6379> set k
    (error) ERR wrong number of arguments for 'set' command
    127.0.0.1:6379> ACL SETUSER alice on >p1pp0 ~cached:* +get +info
    OK
    127.0.0.1:6379> auth alice p1pp0
    OK
    127.0.0.1:6379> set a b
    (error) NOPERM this user has no permissions to run the 'set' command or its subcommand
    127.0.0.1:6379> info errorstats
    # Errorstats
    errorstat_ERR:count=2
    errorstat_NOPERM:count=1
    
  • This PR also fixes RM_ReplyWithError so that it can be correctly identified as an error reply.


To test the added features:

tclsh tests/test_helper.tcl --verbose --single unit/info
tclsh tests/cluster/run.tcl --single tests/18-info.tcl

The added tests are divided as follow:

  • failed call authentication error
  • failed call NOGROUP error
  • rejected call due to wrong arity
  • rejected call by OOM error
  • rejected call by authorization error
  • rejected call rejected due to MOVED Redirection

Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@filipecosta90 thanks for this PR.
i added a few comments about the implementation, but maybe we should first discuss the key features here.

as we've established, there are generally two groups of errors, ones that are handled in processCommand and ones that happen inside the command code itself.
do we intend to cover only the first group? or also the second?
If we intend to extend this feature to match other types of errors, it's probably wrong to do an if-else chain of calls to strncmp, and then a lookup in a dict (or an enum lookup into an array).
maybe we should have a rax in which we can search by prefix, and implicitly support all type of errors.

i think it would be nice to examine this feature after adding a few more errors as an example (like OOM, WRONGTYPE, NOSCRIPT, LOADING, BUSY, BUSYKEY.
one thing we'll soon discover is that many of these are being used with a simple addReply (so we don't even have them on our radar).

@madolson
Copy link
Contributor

Also related to Oran's top level comment, we don't always use "addReplyError*" for all of the errors, we need to go through and audit them.

@filipecosta90
Copy link
Contributor Author

do we intend to cover only the first group? or also the second?
I believe we can see value out of both groups right? meaning the more errors we can keep track without paying a penalty the better.

If we intend to extend this feature to match other types of errors, it's probably wrong to do an if-else chain of calls to strncmp, and then a lookup in a dict (or an enum lookup into an array).
maybe we should have a rax in which we can search by prefix, and implicitly support all type of errors.
i think it would be nice to examine this feature after adding a few more errors as an example (like OOM, WRONGTYPE, NOSCRIPT, LOADING, BUSY, BUSYKEY.

@oranagra @madolson, following what you've mentioned I've updated the code to use the Error Prefix ( The first word after the "-", up to the first space ) as the rax key (lowercased). In that manner we have the desired implicit support for all errors. Can you guys check the new commit.

Sample output of errorstats now:

127.0.0.1:6379> auth a
(error) ERR AUTH <password> called without any password configured for the default user. Are you sure your configuration is correct?
127.0.0.1:6379> set k
(error) ERR wrong number of arguments for 'set' command
127.0.0.1:6379> info errorstats
# Errorstats
errorstat_auth:count=1
errorstat_wrong:count=1

@oranagra
Copy link
Member

@filipecosta90 looking at the example in your last post (didn't read the code yet), i think these two cases should fall into the standard err since the error code is -ERR ("wrong" and "AUTH" are just part of the text).
i.e.

errorstat_err:count=2

not

errorstat_auth:count=1
errorstat_wrong:count=1

if we think it is important for redis to distinguish between them, then it's also important for client libraries, and we should change the code and promote them to have their own error code.

@filipecosta90 filipecosta90 force-pushed the info-errorstats branch 4 times, most recently from d479b61 to 36e1317 Compare December 22, 2020 23:19
@filipecosta90
Copy link
Contributor Author

we do not want to optionally remove the spurious dash, this is in fact an indication that this method (afterErrorReply) is called in the wrong place, and given the wrong input (incomplete error message).
if we do that, the dash will always be present (and we can remove it, and even assert that it's there).

@oranagra agree. In the latest code push I assume that if no - is present than the error is the default ERR (same as done in addReplyErrorLength), otherwise we use the first after the - until the first space.

I've also removed the error name from the error struct and I'm using the error prefix as is ( meaning no lowercase ).

I believe all comments have been address. wdyt?

@filipecosta90 filipecosta90 changed the title Added Errorstats info section containing command and security errors; Added failed_calls to commandstats Added Errorstats info section to enable keeping track of the different errors that occur within Redis; Added failed_calls to commandstats Dec 22, 2020
src/server.c Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just noting that this section is not included by default. i guess that's good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oranagra given this is so "cheap" to compute and potentially a very usefull info I've added it as default. wdyt?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see the only section that's currently not part of the default is the commandstats, i don't suppose it's a matter of being cheap to generate (other parts are possibly more demanding), i think it's also about the verbosity and usefulness.
i suppose that in practice we'll never see here more than some 4 errors, but i'm not sure i'd say it's more useful than the commands stats.

let's leave this question open for later discussion.

@oranagra
Copy link
Member

@filipecosta90 it looks like you added a bunch of commits, and a rebase/merge from unstable, and then squashed it all, and i can't seem to be able to review the changes easily.
please avoid it in the future.

i'm ok with amending a commit and force pushing it (github let's me view the diffs), or to add more and more commits which will then be squashed when the PR is merged.
rebase from unstable or squashing your commits should be done when no one is still looking at the PR, and if you have to grab fresh code from unstable, do it with merge and no squashing, so i can skip that commit.

@madolson
Copy link
Contributor

@oranagra Did you look at the code where we are temporarily attaching a flag to a command? That doesn't seem like a very maintainable assumption to make.

@oranagra
Copy link
Member

@madolson i agree. i'm still editing my CR post when i'm suggesting a better alternative.

@oranagra oranagra merged commit 90b9f08 into redis:unstable Dec 31, 2020
itamarhaber pushed a commit to redis/redis-doc that referenced this pull request Jan 3, 2021
@oranagra oranagra mentioned this pull request Jan 10, 2021
JackieXie168 pushed a commit to JackieXie168/redis that referenced this pull request Mar 2, 2021
Properly throw errors for invalid replication stream and support redis#8217
JackieXie168 pushed a commit to JackieXie168/redis that referenced this pull request Mar 2, 2021
…ommandstats (redis#8217)

This Commit pushes forward the observability on overall error statistics and command statistics within redis-server:

It extends INFO COMMANDSTATS to have
- failed_calls in - so we can keep track of errors that happen from the command itself, broken by command.
- rejected_calls - so we can keep track of errors that were triggered outside the commmand processing per se

Adds a new section to INFO, named ERRORSTATS that enables keeping track of the different errors that
occur within redis ( within processCommand and call ) based on the reply Error Prefix ( The first word
after the "-", up to the first space ).

This commit also fixes RM_ReplyWithError so that it can be correctly identified as an error reply.
mmkal added a commit to mmkal/handy-redis that referenced this pull request Sep 17, 2021
dddae7721 Temp workaround for redis/redis-io#251 (#1646)
2fd52e72b Add ASKING command (#1642)
b9155866a add bitfield_ro command (#1645)
de7eeb2cc Update MONITOR command (#1641)
df6e3889a Update Lua RESP3 reply format (#1619)
be38fe6f1 Add INFO metrics {total,current}_active_defrag_time (#1643)
8ec59f994 Fix grammar in pipelining doc for easier consumption (#1024)
80cd34aff Format Ruby examples (#1136)
12f577959 Add new LMPOP and BLMPOP commands. (#1636)
660170158 Merge pull request #1637 from madolson/numpat-change
d9410a6a1 Update clients.json (#1639)
8dbd962e6 update readme: small typo fix (#1634)
1f5754db4 Escape markdown in my_random_value (#1635)
e94992395 Clarify wording around number of pubsub numpat
845a28d3e Notes on manual failover to a new (just upgraded) replica (#1633)
9a53d7038 Sentinel: rephrase replica-priority paragraph. (#1630)
94a0e4846 Add missing hyperlinks to different sections of the sentinel doc
3554566ce Add SORT_RO documentation (#1621)
46db91367 Update active Java clients (#1623)
dfa8671d3 Sponsor name change (#1624)
b7d1ecfae Avoid confusing use of term "pure function"
87ecbfca7 Add current_cow_peak field in INFO command (#1617)
f9a75a4b0 Add doc for new options of expire commands (#1613)
2d26a5a9c adds new SINTERCARD and ZINTERCARD (#1610)
af282ed4a Fix spelling of acknowledgment (#1618)
1edd7b68a Render lists properly in CLIENT PAUSE doc (#1601)
e3d11f00b Fix inconsistent EVALSHA error output (#1616)
a25c27b0f Add RedisIMS (If-Modified-Since) module (#1599)
228f8c692 Fix typo (#1606)
10a013653 Clarifies that timeout is in seconds. (#1604)
6ca156269 INFO details about fragmentation (#1608)
9877b6740 Fix line feed in acl cats
5b46fada0 add description of the ACL categories (#1609)
a7a331b9d Add 'systemd' to wordlist (#1614)
2320aaeaf Better phrasing and fix typos in streams (#1607)
b00eac4cf Add `process_supervised` description at INFO (#1612)
788189c65 Add info metrics total_eviction_exceeded_time and current_eviction_exceeded_time, redis#9031
5ddb3b26b update stars (#1611)
6a9c6310b RPOPLPUSH minor phrasing updates
bfc5e2cb7 Add unit to TTL summary (be consistent with PTTL) (#1600)
e722a583b Document the INFO field "sentinel_tilt_since_seconds" for Sentinel (#1594)
7282fadc4 GEOSEARCH, GEOSEARCHSTORE more detailed examples (#1597)
5739d443a Change args of BITPOS from [start] [end] to [start [end]]
f0a260346 Merge pull request #1223 from redis/bit-group
9924218f0 Kicked off performance analysis guide - on-CPU profiling and tracing (#1437)
a5ce7cec5 Cluster: reference redis-cli instead of redis-trib + minor fix (#1592)
593ef6677 SCAN: Address documentation typo on parallel iterations (#1595)
632c46aa5 Add client: redis-async (ocaml) #1593
c7143d636 Update the doc about maxmemory on replicas (#1457)
d48c19234 Update time complexity of EXISTS, SPOP (#612)
706170b19 Fix the return of LPOP and RPOP with count + examples (#1591)
7b270c467 Improve wording around expire + transact (#1590)
fe2ff16b9 Cleanup erlang clients that are not maintained
28e7d37d4 Add Bash client https://github.com/SomajitDey/redis-client (#1581)
7f7b61ca9 Add redis-interval-module
8cbe3cfc0 Adding ecredis - Erlang Redis Cluster client (#1586)
157443c74 Add Jedis.jl to clients.json as a new Julia client #1584
5d8afc49f Use 'nil' in LPOS command docs (#1571)
f851b3593 Add note re TTL getting reset by getset
92b7099cf Rephrase description of XX and NX options of ZADD
16d577371 GEOSEARCH command: Changed wording in a few places. (#1568)
d68271df1 Removed wrong space (#1567)
ef1709469 Added SWI-Prolog client (#1407)
618cca8f0 Update SET documentation of the NX and GET flags (#1565)
13a022958 Document optionality of PFADD, EVAL, EVALSHA args
cb3c9b5fb Merge pull request #1572 from sazzad16/patch-3
19db26df5 Merge pull request #1579 from huangzhw/cluster
be82a6f22 Merge pull request #1560 from nmvk/evalro
64d5225df Update PGP key. (#1589)
586b10591 API documentation of EXPIRETIME and PEXPIRETIME commands (#1582)
0146a9e6d slowlog get command supports passing in -1 to get all logs. (#1585)
874d646ee Add missing flag nofailover in "cluster nodes" command output.
9fe1f36dd Add TairHash and TairString modules (#1577)
13bda277d Update client-side-caching.md (#1575)
2865b8e96 ZREMRANGEBYLEX : Replace "return" with "remove" (#1576)
d4b50349f Fix typo (#1574)
23a9ab1b3 MEMORY USAGE command may return 'nil'
af4e30177 add godis client (#1566)
6fca48493 Little tweak on INCR rate limiter 1 (#1516)
29fe06d83 Specifies an error is returned when xadd fails (#1495)
df4eda7ce Add Red to tools.json (#1558)
0e39ef9ec Escapes Redis command names in bitfield's syntax
b2fc5c53c Update bitfield.md (#1553)
3f34db95a SPOP with COUNT returns empty array (not nil) (#1559)
a4fa6557a GEOSEARCHSTORE: Remove WITHCOORD, WITHDIST and WITHHASH (#1561)
eea6e9d1c Apply suggestions from code review
a24651db2 Update to Commands.json
328df19b3 Address PR comments
e42581151 EVAL_RO and EVALSHA_RO documentation
42ccc962f update module api ref again (#1557)
71490b714 Update notifications documentation with module key type event notifications (#1548)
437b16e63 Update module api reference (#1556)
6e749f1fb Updated complexity of BRPOP and BLPOP commands. (#1554)
189ecd19e Fix sentinel failover delay to 2 * failover-timeout (#1555)
3ca7ad000 Add more details about configuration format (#1549)
c79195847 Remove cli tag from XACK (#1552)
ecb0ab336 Update persistence.md (#1550)
29f18a45d update module api doc for 6.2.2 (#1546)
4e2bdf592 Minor spelling / phrasing update. (#1544)
4953e008b Fix LIMIT index error in ZRANGEBYLEX command of indexes.md. (#1542)
48bb84e86 Tiny grammatical change. (#1543)
136bb9ea3 Add documentation for Lua behavior under low memory conditions. (#1533)
5295c398f Fix wrong order of geospatial tuple (#1541)
18e360252 Fix GEOPOS time complexity (#1540)
4514130e0 Update set.md
dbd2b8a0f Stream command cleanups (#1537)
1f28de1ea Update modules.json (#1538)
7f42ec7a6 Adding a link to redicrypt module (#1536)
3493e18c3 Fix RESET command misleading typo and add CLIENT KILL LADDR. (#1535)
ed0b3f694 Added Jedis as an example of client side partitioning (#1534)
400b13f9a Fix incorrect references to SYNC command. (#1531)
18ae06bb2 Adding Redis Interval Sets to the redis.io modules (#1532)
89b88883e Update sponsors.md
8792b6250 Specify STRALGO LCS syntax more accurately. (#1529)
a61380b17 Client and tools cleanups (#1530)
64a0d67c3 Update xpending.md
f1d54a691 module api changes for 6.2.1 (#1526)
7870ef7bb Persistence: Define acronyms RDB and AOF (#1524)
539dd6c06 Fix minor phrasing / grammar issue (#1525)
e0528232f update module api ref (#1523)
8aa3e63c5 grammar changes (#1520)
2cfe75876 update INFO doc for current_fork_perc, current_save_keys_processed, c… (#1518)
08330f388 Fix typo (#1517)
b8434ef82 Add a warning about Sentinel and hostnames. (#1513)
260ee0b73 small typo (#1515)
b702dd26d admin.md: madvise for transparent_hugepage (#841)
eda70b134 Updates tools.json (#1514)
78e0f7ac5 Cleans clients.json (#1512)
4aa0c7c1a Adds aspell dependency to PR action
252a588e4 clients.json: marked node-redis as inactive (#1511)
ac7fcd41e Updates clients.json
ecbe17343 Update clients.json (#1510)
e46f14628 add ltoddy's rust redis client into doc (#1509)
89970b305 Add C# client FreeRedis (#1434)
a2fefb86a Revamps release process description (#1491)
77d54696e improve GETDEL description (#1508)
cac4c937c Update getex.md
3c81ab6ef Update hrandfield.md
99cf22d4c module up update for 6.2 RC3 (#1507)
956e1dd9e Updates for Sentinel in 6.2. (#1506)
f1db48664 Documents the FAILOVER command (#1502)
0923d9e7e Update lazyfree-lazy-user-flush doc (#1493)
47c4105d6 Adds HRANDMEMBER and ZRANDMEMBER, revamps SRANDMEMBER (#1504)
90fccf9f6 Adding redis-py-cluster to the list of clients. (#1505)
3672323bd GETEX, GETDEL and SET PXAT/EXAT Documentation (#1501)
ff1f4bf9a Fix: Removed tonumber from the Lua script (#1496)
0536b897b Apply module command cleanup from redis/8344 (#1499)
3a73cc2d6 Apply module command cleanup from redis/8344 (#1498)
5a336ba5e apply module command cleanup from redis/8344 (#1497)
7b40968c7 apply module command cleanup from redis/8344 (#1497)
bd829f2e8 update module api formatting and missing functions (#1494)
6acf74c4e Update zrangestore.md
7b9874cee Adds complexity to SWAPDB (#1484)
da92d05e9 update module api for 6.2 RC2 (#1492)
926bc64f1 Update set.md (#1490)
01af47f41 GEOSEARCH ANY argument (#1480)
70760fc30 docs for new MINID trim strategy (#1488)
1e8355fe0 Fixed consistency issue with history versions (#1489)
78a3753fd Documention for client pause (#1487)
70c16cffe Updated documentation about BCAST (#1486)
53ead8b58 update INFO doc for current_cow_size field (#1478)
361b855a2 New ZRANGESTORE and ZRANGE new args, deprecate others (#1482)
f256d308b XAUTOCLAIM docs (#1475)
60b5f4bcb new styling guides (#1485)
fd2f0fa52 XREADGROUP example and reply (#1469)
10b306f4c Update clients.json (#1477)
63966e54f hedis now supports cluster (#1483)
31c5e3e6c Rename Presto SQL to Trino (#1476)
40933ff8f Adds CH/NX/XX options to GEOADD (#1474)
090a5fe57 Added documentation regarding error statistics (redis/redis#8217) (#1472)
d1a28e7b2 add command deprecatoin note for GEORADIUS (#1481)
a55223bc7 Documents noargs HELLO variant (#1471)
9f43155fe add reauthenticate and deauthenticate to workdlist
54a440872 Update reset.md
a8278c441 Update reset.md
d60438a09 Documents CLIENT TRACKINGINFO (#1459)
2042e052c Adds count to L/RPOP (#1464)
33bc5ddcf Abide current HELLO subcommand (#1381)
bc5c60596 Documents '--cluster-yes' and a couple of env vars (#1435)
0629c3a5c Clarify the usage of maxlen with a value of 0 (#1436)
616eab455 change port 7002 after redirection (#1468)
7dc7474bb Revert "Use a static example for client info"
056315147 Use a static example for client info
43f18c80b update module docs for 6.2 RC1 (#1466)
0ee00bbc9 Add GEOSEARCH/GEOSEARCHSTORE (#1465)
ed8571b22 Documented redis#8132: Redis main thread cpu time, and scrape system time (#1455)
248dce42a Add description for total_forks in INFO STATS (#1461)
f3caf86cf Copy edit intro (#1463)
dbd555ab4 fix twitter account to g_korland (#1460)
c92e61613 Documents XPENDING's exclusive ranges (#1453)
b6251621d Add Pottery Redlock implementation (#1458)
7ec457eca Typo fix (#1456)
05247b6d0 Adds CLIENT INFO and CLIENT LIST ID (#1451)
940ce878d Clients: Add forks of C and Erlang clients and mark old as inactive (#1452)
a65ca12ac Adds exclusive ranges to XRANGE/XREVRANGE (#1444)
0521a4dbf Documents ACL Pub/Sub (#1448)
871c84235 Adds names to blocks (#1449)
899df8d8a Adds nested arguments (#1443)
4772fa991 Add asynchronous Redlock PHP implementation (#1445)
9220d0235 Jedis is moved under Redis (#1447)
e18dcb675 Fix XPENDING IDLE (#1446)
e382212fa Add handy-redis to list of clients (#1416)
07a0cee32 Adds the COPY command (#1439)
7d664e59d Docs for XPENDING IDLE (#1442)
ca01ef14d Add documentation for the new zdiff/zdiffstore commands (#1428)
76e46bff4 Stream intro doc cleanups (#1433)
7b6de45a4 add missing user information in client list (#1441)
edbe192ba update client list command for including bcast and redir information (#1440)
751fe6501 Adds empty authors to Presto
895548224 Add Presto Redis connector (#1430)
c62a515eb Adds NewLife.Redis C# client (#1431)
4787687dc Typo fix: entires -> entries (#1432)
e3daa2f95 Updates INFO fields about diskless loading progress and maxclients (#1427)
08904d273 Adds the RESET command (#1426)
1b594c6cf Adds clarifications about Pub/Sub replies in a Redis Cluster (#1425)
7a03ea3ed Update latency-reset.md
8a0d21a22 Update cluster-flushslots.md
a0a330b73 Updated CLIENT LIST and KILL doc to include laddr (#1424)
8d4bf9bc4 Update module api before releasing 6.0.9 (#1423)
8bb78b4c2 Some cleanups to streams doc (#1422)
c387a8f0c Updates to Sentinel doc (#1417)
e0fb54518 Renames bit group to bitmap to conform to internals
207e09f20 Moves some commands to a new 'bit' category

git-subtree-dir: docs/redis-doc
git-subtree-split: dddae77217aa8d0e0d1d00cd14993032156aa705
mmkal added a commit to mmkal/handy-redis that referenced this pull request Sep 17, 2021
93a2afb33 Escapes Redis commands
099aae476 Escapes keywords that are also commands
dddae7721 Temp workaround for redis/redis-io#251 (#1646)
2fd52e72b Add ASKING command (#1642)
b9155866a add bitfield_ro command (#1645)
de7eeb2cc Update MONITOR command (#1641)
df6e3889a Update Lua RESP3 reply format (#1619)
be38fe6f1 Add INFO metrics {total,current}_active_defrag_time (#1643)
8ec59f994 Fix grammar in pipelining doc for easier consumption (#1024)
80cd34aff Format Ruby examples (#1136)
12f577959 Add new LMPOP and BLMPOP commands. (#1636)
660170158 Merge pull request #1637 from madolson/numpat-change
d9410a6a1 Update clients.json (#1639)
8dbd962e6 update readme: small typo fix (#1634)
1f5754db4 Escape markdown in my_random_value (#1635)
e94992395 Clarify wording around number of pubsub numpat
845a28d3e Notes on manual failover to a new (just upgraded) replica (#1633)
9a53d7038 Sentinel: rephrase replica-priority paragraph. (#1630)
94a0e4846 Add missing hyperlinks to different sections of the sentinel doc
3554566ce Add SORT_RO documentation (#1621)
46db91367 Update active Java clients (#1623)
dfa8671d3 Sponsor name change (#1624)
b7d1ecfae Avoid confusing use of term "pure function"
87ecbfca7 Add current_cow_peak field in INFO command (#1617)
f9a75a4b0 Add doc for new options of expire commands (#1613)
2d26a5a9c adds new SINTERCARD and ZINTERCARD (#1610)
af282ed4a Fix spelling of acknowledgment (#1618)
1edd7b68a Render lists properly in CLIENT PAUSE doc (#1601)
e3d11f00b Fix inconsistent EVALSHA error output (#1616)
a25c27b0f Add RedisIMS (If-Modified-Since) module (#1599)
228f8c692 Fix typo (#1606)
10a013653 Clarifies that timeout is in seconds. (#1604)
6ca156269 INFO details about fragmentation (#1608)
9877b6740 Fix line feed in acl cats
5b46fada0 add description of the ACL categories (#1609)
a7a331b9d Add 'systemd' to wordlist (#1614)
2320aaeaf Better phrasing and fix typos in streams (#1607)
b00eac4cf Add `process_supervised` description at INFO (#1612)
788189c65 Add info metrics total_eviction_exceeded_time and current_eviction_exceeded_time, redis#9031
5ddb3b26b update stars (#1611)
6a9c6310b RPOPLPUSH minor phrasing updates
bfc5e2cb7 Add unit to TTL summary (be consistent with PTTL) (#1600)
e722a583b Document the INFO field "sentinel_tilt_since_seconds" for Sentinel (#1594)
7282fadc4 GEOSEARCH, GEOSEARCHSTORE more detailed examples (#1597)
5739d443a Change args of BITPOS from [start] [end] to [start [end]]
f0a260346 Merge pull request #1223 from redis/bit-group
9924218f0 Kicked off performance analysis guide - on-CPU profiling and tracing (#1437)
a5ce7cec5 Cluster: reference redis-cli instead of redis-trib + minor fix (#1592)
593ef6677 SCAN: Address documentation typo on parallel iterations (#1595)
632c46aa5 Add client: redis-async (ocaml) #1593
c7143d636 Update the doc about maxmemory on replicas (#1457)
d48c19234 Update time complexity of EXISTS, SPOP (#612)
706170b19 Fix the return of LPOP and RPOP with count + examples (#1591)
7b270c467 Improve wording around expire + transact (#1590)
fe2ff16b9 Cleanup erlang clients that are not maintained
28e7d37d4 Add Bash client https://github.com/SomajitDey/redis-client (#1581)
7f7b61ca9 Add redis-interval-module
8cbe3cfc0 Adding ecredis - Erlang Redis Cluster client (#1586)
157443c74 Add Jedis.jl to clients.json as a new Julia client #1584
5d8afc49f Use 'nil' in LPOS command docs (#1571)
f851b3593 Add note re TTL getting reset by getset
92b7099cf Rephrase description of XX and NX options of ZADD
16d577371 GEOSEARCH command: Changed wording in a few places. (#1568)
d68271df1 Removed wrong space (#1567)
ef1709469 Added SWI-Prolog client (#1407)
618cca8f0 Update SET documentation of the NX and GET flags (#1565)
13a022958 Document optionality of PFADD, EVAL, EVALSHA args
cb3c9b5fb Merge pull request #1572 from sazzad16/patch-3
19db26df5 Merge pull request #1579 from huangzhw/cluster
be82a6f22 Merge pull request #1560 from nmvk/evalro
64d5225df Update PGP key. (#1589)
586b10591 API documentation of EXPIRETIME and PEXPIRETIME commands (#1582)
0146a9e6d slowlog get command supports passing in -1 to get all logs. (#1585)
874d646ee Add missing flag nofailover in "cluster nodes" command output.
9fe1f36dd Add TairHash and TairString modules (#1577)
13bda277d Update client-side-caching.md (#1575)
2865b8e96 ZREMRANGEBYLEX : Replace "return" with "remove" (#1576)
d4b50349f Fix typo (#1574)
23a9ab1b3 MEMORY USAGE command may return 'nil'
af4e30177 add godis client (#1566)
6fca48493 Little tweak on INCR rate limiter 1 (#1516)
29fe06d83 Specifies an error is returned when xadd fails (#1495)
df4eda7ce Add Red to tools.json (#1558)
0e39ef9ec Escapes Redis command names in bitfield's syntax
b2fc5c53c Update bitfield.md (#1553)
3f34db95a SPOP with COUNT returns empty array (not nil) (#1559)
a4fa6557a GEOSEARCHSTORE: Remove WITHCOORD, WITHDIST and WITHHASH (#1561)
eea6e9d1c Apply suggestions from code review
a24651db2 Update to Commands.json
328df19b3 Address PR comments
e42581151 EVAL_RO and EVALSHA_RO documentation
42ccc962f update module api ref again (#1557)
71490b714 Update notifications documentation with module key type event notifications (#1548)
437b16e63 Update module api reference (#1556)
6e749f1fb Updated complexity of BRPOP and BLPOP commands. (#1554)
189ecd19e Fix sentinel failover delay to 2 * failover-timeout (#1555)
3ca7ad000 Add more details about configuration format (#1549)
c79195847 Remove cli tag from XACK (#1552)
ecb0ab336 Update persistence.md (#1550)
29f18a45d update module api doc for 6.2.2 (#1546)
4e2bdf592 Minor spelling / phrasing update. (#1544)
4953e008b Fix LIMIT index error in ZRANGEBYLEX command of indexes.md. (#1542)
48bb84e86 Tiny grammatical change. (#1543)
136bb9ea3 Add documentation for Lua behavior under low memory conditions. (#1533)
5295c398f Fix wrong order of geospatial tuple (#1541)
18e360252 Fix GEOPOS time complexity (#1540)
4514130e0 Update set.md
dbd2b8a0f Stream command cleanups (#1537)
1f28de1ea Update modules.json (#1538)
7f42ec7a6 Adding a link to redicrypt module (#1536)
3493e18c3 Fix RESET command misleading typo and add CLIENT KILL LADDR. (#1535)
ed0b3f694 Added Jedis as an example of client side partitioning (#1534)
400b13f9a Fix incorrect references to SYNC command. (#1531)
18ae06bb2 Adding Redis Interval Sets to the redis.io modules (#1532)
89b88883e Update sponsors.md
8792b6250 Specify STRALGO LCS syntax more accurately. (#1529)
a61380b17 Client and tools cleanups (#1530)
64a0d67c3 Update xpending.md
f1d54a691 module api changes for 6.2.1 (#1526)
7870ef7bb Persistence: Define acronyms RDB and AOF (#1524)
539dd6c06 Fix minor phrasing / grammar issue (#1525)
e0528232f update module api ref (#1523)
8aa3e63c5 grammar changes (#1520)
2cfe75876 update INFO doc for current_fork_perc, current_save_keys_processed, c… (#1518)
08330f388 Fix typo (#1517)
b8434ef82 Add a warning about Sentinel and hostnames. (#1513)
260ee0b73 small typo (#1515)
b702dd26d admin.md: madvise for transparent_hugepage (#841)
eda70b134 Updates tools.json (#1514)
78e0f7ac5 Cleans clients.json (#1512)
4aa0c7c1a Adds aspell dependency to PR action
252a588e4 clients.json: marked node-redis as inactive (#1511)
ac7fcd41e Updates clients.json
ecbe17343 Update clients.json (#1510)
e46f14628 add ltoddy's rust redis client into doc (#1509)
89970b305 Add C# client FreeRedis (#1434)
a2fefb86a Revamps release process description (#1491)
77d54696e improve GETDEL description (#1508)
cac4c937c Update getex.md
3c81ab6ef Update hrandfield.md
99cf22d4c module up update for 6.2 RC3 (#1507)
956e1dd9e Updates for Sentinel in 6.2. (#1506)
f1db48664 Documents the FAILOVER command (#1502)
0923d9e7e Update lazyfree-lazy-user-flush doc (#1493)
47c4105d6 Adds HRANDMEMBER and ZRANDMEMBER, revamps SRANDMEMBER (#1504)
90fccf9f6 Adding redis-py-cluster to the list of clients. (#1505)
3672323bd GETEX, GETDEL and SET PXAT/EXAT Documentation (#1501)
ff1f4bf9a Fix: Removed tonumber from the Lua script (#1496)
0536b897b Apply module command cleanup from redis/8344 (#1499)
3a73cc2d6 Apply module command cleanup from redis/8344 (#1498)
5a336ba5e apply module command cleanup from redis/8344 (#1497)
7b40968c7 apply module command cleanup from redis/8344 (#1497)
bd829f2e8 update module api formatting and missing functions (#1494)
6acf74c4e Update zrangestore.md
7b9874cee Adds complexity to SWAPDB (#1484)
da92d05e9 update module api for 6.2 RC2 (#1492)
926bc64f1 Update set.md (#1490)
01af47f41 GEOSEARCH ANY argument (#1480)
70760fc30 docs for new MINID trim strategy (#1488)
1e8355fe0 Fixed consistency issue with history versions (#1489)
78a3753fd Documention for client pause (#1487)
70c16cffe Updated documentation about BCAST (#1486)
53ead8b58 update INFO doc for current_cow_size field (#1478)
361b855a2 New ZRANGESTORE and ZRANGE new args, deprecate others (#1482)
f256d308b XAUTOCLAIM docs (#1475)
60b5f4bcb new styling guides (#1485)
fd2f0fa52 XREADGROUP example and reply (#1469)
10b306f4c Update clients.json (#1477)
63966e54f hedis now supports cluster (#1483)
31c5e3e6c Rename Presto SQL to Trino (#1476)
40933ff8f Adds CH/NX/XX options to GEOADD (#1474)
090a5fe57 Added documentation regarding error statistics (redis/redis#8217) (#1472)
d1a28e7b2 add command deprecatoin note for GEORADIUS (#1481)
a55223bc7 Documents noargs HELLO variant (#1471)
9f43155fe add reauthenticate and deauthenticate to workdlist
54a440872 Update reset.md
a8278c441 Update reset.md
d60438a09 Documents CLIENT TRACKINGINFO (#1459)
2042e052c Adds count to L/RPOP (#1464)
33bc5ddcf Abide current HELLO subcommand (#1381)
bc5c60596 Documents '--cluster-yes' and a couple of env vars (#1435)
0629c3a5c Clarify the usage of maxlen with a value of 0 (#1436)
616eab455 change port 7002 after redirection (#1468)
7dc7474bb Revert "Use a static example for client info"
056315147 Use a static example for client info
43f18c80b update module docs for 6.2 RC1 (#1466)
0ee00bbc9 Add GEOSEARCH/GEOSEARCHSTORE (#1465)
ed8571b22 Documented redis#8132: Redis main thread cpu time, and scrape system time (#1455)
248dce42a Add description for total_forks in INFO STATS (#1461)
f3caf86cf Copy edit intro (#1463)
dbd555ab4 fix twitter account to g_korland (#1460)
c92e61613 Documents XPENDING's exclusive ranges (#1453)
b6251621d Add Pottery Redlock implementation (#1458)
7ec457eca Typo fix (#1456)
05247b6d0 Adds CLIENT INFO and CLIENT LIST ID (#1451)
940ce878d Clients: Add forks of C and Erlang clients and mark old as inactive (#1452)
a65ca12ac Adds exclusive ranges to XRANGE/XREVRANGE (#1444)
0521a4dbf Documents ACL Pub/Sub (#1448)
871c84235 Adds names to blocks (#1449)
899df8d8a Adds nested arguments (#1443)
4772fa991 Add asynchronous Redlock PHP implementation (#1445)
9220d0235 Jedis is moved under Redis (#1447)
e18dcb675 Fix XPENDING IDLE (#1446)
e382212fa Add handy-redis to list of clients (#1416)
07a0cee32 Adds the COPY command (#1439)
7d664e59d Docs for XPENDING IDLE (#1442)
ca01ef14d Add documentation for the new zdiff/zdiffstore commands (#1428)
76e46bff4 Stream intro doc cleanups (#1433)
7b6de45a4 add missing user information in client list (#1441)
edbe192ba update client list command for including bcast and redir information (#1440)
751fe6501 Adds empty authors to Presto
895548224 Add Presto Redis connector (#1430)
c62a515eb Adds NewLife.Redis C# client (#1431)
4787687dc Typo fix: entires -> entries (#1432)
e3daa2f95 Updates INFO fields about diskless loading progress and maxclients (#1427)
08904d273 Adds the RESET command (#1426)
1b594c6cf Adds clarifications about Pub/Sub replies in a Redis Cluster (#1425)
7a03ea3ed Update latency-reset.md
8a0d21a22 Update cluster-flushslots.md
a0a330b73 Updated CLIENT LIST and KILL doc to include laddr (#1424)
8d4bf9bc4 Update module api before releasing 6.0.9 (#1423)
8bb78b4c2 Some cleanups to streams doc (#1422)
c387a8f0c Updates to Sentinel doc (#1417)
e0fb54518 Renames bit group to bitmap to conform to internals
207e09f20 Moves some commands to a new 'bit' category

git-subtree-dir: docs/redis-doc
git-subtree-split: 93a2afb336b3456699d755a7c3d7b2ab27298a9a
oranagra added a commit that referenced this pull request Jan 19, 2024
…12817)

Fix #12785 and other race condition issues.
See the following isolated comments.

The following report was obtained using SANITIZER thread.
```sh
make SANITIZER=thread
./runtest-moduleapi --config io-threads 4 --config io-threads-do-reads yes --accurate
```

1. Fixed thread-safe issue in RM_UnblockClient()
Related discussion:
#12817 (comment)
* When blocking a client in a module using `RM_BlockClientOnKeys()` or
`RM_BlockClientOnKeysWithFlags()`
with a timeout_callback, calling RM_UnblockClient() in module threads
can lead to race conditions
     in `updateStatsOnUnblock()`.

     - Introduced: 
        Version: 6.2
        PR: #7491

     - Touch:
`server.stat_numcommands`, `cmd->latency_histogram`, `server.slowlog`,
and `server.latency_events`
     
     - Harm Level: High
Potentially corrupts the memory data of `cmd->latency_histogram`,
`server.slowlog`, and `server.latency_events`

     - Solution:
Differentiate whether the call to moduleBlockedClientTimedOut() comes
from the module or the main thread.
Since we can't know if RM_UnblockClient() comes from module threads, we
always assume it does and
let `updateStatsOnUnblock()` asynchronously update the unblock status.
     
* When error reply is called in timeout_callback(), ctx is not
thread-safe, eventually lead to race conditions in `afterErrorReply`.

     - Introduced: 
        Version: 6.2
        PR: #8217

     - Touch
       `server.stat_total_error_replies`, `server.errors`, 

     - Harm Level: High
       Potentially corrupts the memory data of `server.errors`
   
      - Solution: 
Make the ctx in `timeout_callback()` with `REDISMODULE_CTX_THREAD_SAFE`,
and asynchronously reply errors to the client.

2. Made RM_Reply*() family API thread-safe
Related discussion:
#12817 (comment)
Call chain: `RM_Reply*()` -> `_addReplyToBufferOrList()` -> touch
server.current_client

    - Introduced: 
       Version: 7.2.0
       PR: #12326

   - Harm Level: None
Since the module fake client won't have the `CLIENT_PUSHING` flag, even
if we touch server.current_client,
     we can still exit after `c->flags & CLIENT_PUSHING`.

   - Solution
      Checking `c->flags & CLIENT_PUSHING` earlier.

3. Made freeClient() thread-safe
    Fix #12785

    - Introduced: 
       Version: 4.0
Commit:
3fcf959

    - Harm Level: Moderate
       * Trigger assertion
It happens when the module thread calls freeClient while the io-thread
is in progress,
which just triggers an assertion, and doesn't make any race condiaions.

* Touch `server.current_client`, `server.stat_clients_type_memory`, and
`clientMemUsageBucket->clients`.
It happens between the main thread and the module threads, may cause
data corruption.
1. Error reset `server.current_client` to NULL, but theoretically this
won't happen,
because the module has already reset `server.current_client` to old
value before entering freeClient.
2. corrupts `clientMemUsageBucket->clients` in
updateClientMemUsageAndBucket().
3. Causes server.stat_clients_type_memory memory statistics to be
inaccurate.
    
    - Solution:
* No longer counts memory usage on fake clients, to avoid updating
`server.stat_clients_type_memory` in freeClient.
* No longer resetting `server.current_client` in unlinkClient, because
the fake client won't be evicted or disconnected in the mid of the
process.
* Judgment assertion `io_threads_op == IO_THREADS_OP_IDLE` only if c is
not a fake client.

4. Fixed free client args without GIL
Related discussion:
#12817 (comment)
When freeing retained strings in the module thread (refcount decr), or
using them in some way (refcount incr), we should do so while holding
the GIL,
otherwise, they might be simultaneously freed while the main thread is
processing the unblock client state.

    - Introduced: 
       Version: 6.2.0
       PR: #8141

   - Harm Level: Low
     Trigger assertion or double free or memory leak. 

   - Solution:
Documenting that module API users need to ensure any access to these
retained strings is done with the GIL locked

5. Fix adding fake client to server.clients_pending_write
    It will incorrectly log the memory usage for the fake client.
Related discussion:
#12817 (comment)

    - Introduced: 
       Version: 4.0
Commit:
9b01b64

    - Harm Level: None
      Only result in NOP

    - Solution:
       * Don't add fake client into server.clients_pending_write
* Add c->conn assertion for updateClientMemUsageAndBucket() and
updateClientMemoryUsage() to avoid same
         issue in the future.
So now it will be the responsibility of the caller of both of them to
avoid passing in fake client.

6. Fix calling RM_BlockedClientMeasureTimeStart() and
RM_BlockedClientMeasureTimeEnd() without GIL
    - Introduced: 
       Version: 6.2
       PR: #7491

   - Harm Level: Low
Causes inaccuracies in command latency histogram and slow logs, but does
not corrupt memory.

   - Solution:
Module API users, if know that non-thread-safe APIs will be used in
multi-threading, need to take responsibility for protecting them with
their own locks instead of the GIL, as using the GIL is too expensive.

### Other issue
1. RM_Yield is not thread-safe, fixed via #12905.

### Summarize
1. Fix thread-safe issues for `RM_UnblockClient()`, `freeClient()` and
`RM_Yield`, potentially preventing memory corruption, data disorder, or
assertion.
2. Updated docs and module test to clarify module API users'
responsibility for locking non-thread-safe APIs in multi-threading, such
as RM_BlockedClientMeasureTimeStart/End(), RM_FreeString(),
RM_RetainString(), and RM_HoldString().

### About backpot to 7.2
1. The implement of (1) is not too satisfying, would like to get more
eyes.
2. (2), (3) can be safely for backport
3. (4), (6) just modifying the module tests and updating the
documentation, no need for a backpot.
4. (5) is harmless, no need for a backpot.

---------

Co-authored-by: Oran Agra <oran@redislabs.com>
roggervalf pushed a commit to roggervalf/redis that referenced this pull request Feb 11, 2024
…edis#12817)

Fix redis#12785 and other race condition issues.
See the following isolated comments.

The following report was obtained using SANITIZER thread.
```sh
make SANITIZER=thread
./runtest-moduleapi --config io-threads 4 --config io-threads-do-reads yes --accurate
```

1. Fixed thread-safe issue in RM_UnblockClient()
Related discussion:
redis#12817 (comment)
* When blocking a client in a module using `RM_BlockClientOnKeys()` or
`RM_BlockClientOnKeysWithFlags()`
with a timeout_callback, calling RM_UnblockClient() in module threads
can lead to race conditions
     in `updateStatsOnUnblock()`.

     - Introduced: 
        Version: 6.2
        PR: redis#7491

     - Touch:
`server.stat_numcommands`, `cmd->latency_histogram`, `server.slowlog`,
and `server.latency_events`
     
     - Harm Level: High
Potentially corrupts the memory data of `cmd->latency_histogram`,
`server.slowlog`, and `server.latency_events`

     - Solution:
Differentiate whether the call to moduleBlockedClientTimedOut() comes
from the module or the main thread.
Since we can't know if RM_UnblockClient() comes from module threads, we
always assume it does and
let `updateStatsOnUnblock()` asynchronously update the unblock status.
     
* When error reply is called in timeout_callback(), ctx is not
thread-safe, eventually lead to race conditions in `afterErrorReply`.

     - Introduced: 
        Version: 6.2
        PR: redis#8217

     - Touch
       `server.stat_total_error_replies`, `server.errors`, 

     - Harm Level: High
       Potentially corrupts the memory data of `server.errors`
   
      - Solution: 
Make the ctx in `timeout_callback()` with `REDISMODULE_CTX_THREAD_SAFE`,
and asynchronously reply errors to the client.

2. Made RM_Reply*() family API thread-safe
Related discussion:
redis#12817 (comment)
Call chain: `RM_Reply*()` -> `_addReplyToBufferOrList()` -> touch
server.current_client

    - Introduced: 
       Version: 7.2.0
       PR: redis#12326

   - Harm Level: None
Since the module fake client won't have the `CLIENT_PUSHING` flag, even
if we touch server.current_client,
     we can still exit after `c->flags & CLIENT_PUSHING`.

   - Solution
      Checking `c->flags & CLIENT_PUSHING` earlier.

3. Made freeClient() thread-safe
    Fix redis#12785

    - Introduced: 
       Version: 4.0
Commit:
redis@3fcf959

    - Harm Level: Moderate
       * Trigger assertion
It happens when the module thread calls freeClient while the io-thread
is in progress,
which just triggers an assertion, and doesn't make any race condiaions.

* Touch `server.current_client`, `server.stat_clients_type_memory`, and
`clientMemUsageBucket->clients`.
It happens between the main thread and the module threads, may cause
data corruption.
1. Error reset `server.current_client` to NULL, but theoretically this
won't happen,
because the module has already reset `server.current_client` to old
value before entering freeClient.
2. corrupts `clientMemUsageBucket->clients` in
updateClientMemUsageAndBucket().
3. Causes server.stat_clients_type_memory memory statistics to be
inaccurate.
    
    - Solution:
* No longer counts memory usage on fake clients, to avoid updating
`server.stat_clients_type_memory` in freeClient.
* No longer resetting `server.current_client` in unlinkClient, because
the fake client won't be evicted or disconnected in the mid of the
process.
* Judgment assertion `io_threads_op == IO_THREADS_OP_IDLE` only if c is
not a fake client.

4. Fixed free client args without GIL
Related discussion:
redis#12817 (comment)
When freeing retained strings in the module thread (refcount decr), or
using them in some way (refcount incr), we should do so while holding
the GIL,
otherwise, they might be simultaneously freed while the main thread is
processing the unblock client state.

    - Introduced: 
       Version: 6.2.0
       PR: redis#8141

   - Harm Level: Low
     Trigger assertion or double free or memory leak. 

   - Solution:
Documenting that module API users need to ensure any access to these
retained strings is done with the GIL locked

5. Fix adding fake client to server.clients_pending_write
    It will incorrectly log the memory usage for the fake client.
Related discussion:
redis#12817 (comment)

    - Introduced: 
       Version: 4.0
Commit:
redis@9b01b64

    - Harm Level: None
      Only result in NOP

    - Solution:
       * Don't add fake client into server.clients_pending_write
* Add c->conn assertion for updateClientMemUsageAndBucket() and
updateClientMemoryUsage() to avoid same
         issue in the future.
So now it will be the responsibility of the caller of both of them to
avoid passing in fake client.

6. Fix calling RM_BlockedClientMeasureTimeStart() and
RM_BlockedClientMeasureTimeEnd() without GIL
    - Introduced: 
       Version: 6.2
       PR: redis#7491

   - Harm Level: Low
Causes inaccuracies in command latency histogram and slow logs, but does
not corrupt memory.

   - Solution:
Module API users, if know that non-thread-safe APIs will be used in
multi-threading, need to take responsibility for protecting them with
their own locks instead of the GIL, as using the GIL is too expensive.

### Other issue
1. RM_Yield is not thread-safe, fixed via redis#12905.

### Summarize
1. Fix thread-safe issues for `RM_UnblockClient()`, `freeClient()` and
`RM_Yield`, potentially preventing memory corruption, data disorder, or
assertion.
2. Updated docs and module test to clarify module API users'
responsibility for locking non-thread-safe APIs in multi-threading, such
as RM_BlockedClientMeasureTimeStart/End(), RM_FreeString(),
RM_RetainString(), and RM_HoldString().

### About backpot to 7.2
1. The implement of (1) is not too satisfying, would like to get more
eyes.
2. (2), (3) can be safely for backport
3. (4), (6) just modifying the module tests and updating the
documentation, no need for a backpot.
4. (5) is harmless, no need for a backpot.

---------

Co-authored-by: Oran Agra <oran@redislabs.com>
@enjoy-binbin
Copy link
Contributor

We have encountered such a problem here. The user used such lua code:

return redis.error_reply(string.format('last:%s, now:%s....)

We can see last:%s, which will format the parameters. In errorstats, it will be used as an error code, causing more and more data saved in info Errorstats, and then when using INFO, the server will block since there are too many errorstats.

#Errorstats
errorstat_ERR:count=27
errorstat_last_1709544349,:count=1
errorstat_last_1709544354,:count=1
errorstat_last_1709544359,:count=1
errorstat_last_1709544360,:count=1
errorstat_last_1709544371,:count=1
errorstat_last_1709544385,:count=1
errorstat_last_1709544390,:count=1
errorstat_last_1709544395,:count=1
errorstat_last_1709544396,:count=1
errorstat_last_1709544408,:count=1
errorstat_last_1709544413,:count=2
errorstat_last_1709544414,:count=2
errorstat_last_1709544432,:count=1
errorstat_last_1709544442,:count=1
errorstat_last_1709544450,:count=1

Is there any way we can control this incorrect usage?

@oranagra
Copy link
Member

oranagra commented Mar 4, 2024

i'm not sure there's anything we can do.
the API is clearly documented that the first word is the error code, and i don't think we have any way to distinguish between a misuse and a valid one.
the only thing we can do is maybe put some maximum cap on the extent of damage it does, and disable the error stats mechanism completely if there are more than say 1000 different errors.
i.e. print them to the log file, empty the dict, and set a flag to avoid populating it from now on.

@enjoy-binbin
Copy link
Contributor

the only thing we can do is maybe put some maximum cap on the extent of damage it does, and disable the error stats mechanism completely if there are more than say 1000 different errors.

yean, the only way i thought of is the same. in info errorstats, use raxSize and only part of the data is output to break out of the loop. Note that it may also potentially consume some memory (if it has been accumulated)

Or set a parameter to limit the number of Errorstats we save

@oranagra
Copy link
Member

oranagra commented Mar 4, 2024

if we do that, i'd rather stop collecting them, release the memory and permanently disable error stats. not just filter them at reporting time.

@enjoy-binbin
Copy link
Contributor

yes, i think so too, just.. i can’t describe it very well.

ok, what i think is similar to only saving up to 1000 errorstats, and deleting some of them when it exceeds, similar to our earliest repl_scriptcache_fifo, something like that

but just some random ideas, just to throw the discussion out there first

@oranagra
Copy link
Member

oranagra commented Mar 4, 2024

i don't think we'd want to delete some. i think we can detect a misuse and completely disable this feature preventing the damage it can cause, and losing the functionality.
to help users figure out what happened, i'd print the existing values to the log file, add a single entry to replace them that mentioned it's disabled due to misuse, and prevent any further accumulation.

enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Mar 14, 2024
Users who abuse lua error_reply will generate a new error object on each
error call, which can make server.errors get bigger and bigger. This will
cause the server to block when calling INFO (we also return errorstats by
default).

To prevent the damage it can cause, when a misuse is detected, we will
print a warning log and disable the errorstats to avoid adding more new
errors. It can be re-enabled via CONFIG RESETSTAT.

Because server.errors may be very large (it may be better now since we
have the limit), config resetstat may block for a while. So in
resetErrorTableStats, we will try to lazyfree server.errors.

See the related discussion at the end of redis#8217.
oranagra pushed a commit that referenced this pull request Mar 19, 2024
#13141)

Users who abuse lua error_reply will generate a new error object on each
error call, which can make server.errors get bigger and bigger. This
will
cause the server to block when calling INFO (we also return errorstats
by
default).

To prevent the damage it can cause, when a misuse is detected, we will
print a warning log and disable the errorstats to avoid adding more new
errors. It can be re-enabled via CONFIG RESETSTAT.

Because server.errors may be very large (it may be better now since we
have the limit), config resetstat may block for a while. So in
resetErrorTableStats, we will try to lazyfree server.errors.

See the related discussion at the end of #8217.
sundb added a commit to sundb/redis that referenced this pull request May 13, 2024
…edis#12817)

Fix redis#12785 and other race condition issues.
See the following isolated comments.

The following report was obtained using SANITIZER thread.
```sh
make SANITIZER=thread
./runtest-moduleapi --config io-threads 4 --config io-threads-do-reads yes --accurate
```

1. Fixed thread-safe issue in RM_UnblockClient()
Related discussion:
redis#12817 (comment)
* When blocking a client in a module using `RM_BlockClientOnKeys()` or
`RM_BlockClientOnKeysWithFlags()`
with a timeout_callback, calling RM_UnblockClient() in module threads
can lead to race conditions
     in `updateStatsOnUnblock()`.

     - Introduced:
        Version: 6.2
        PR: redis#7491

     - Touch:
`server.stat_numcommands`, `cmd->latency_histogram`, `server.slowlog`,
and `server.latency_events`

     - Harm Level: High
Potentially corrupts the memory data of `cmd->latency_histogram`,
`server.slowlog`, and `server.latency_events`

     - Solution:
Differentiate whether the call to moduleBlockedClientTimedOut() comes
from the module or the main thread.
Since we can't know if RM_UnblockClient() comes from module threads, we
always assume it does and
let `updateStatsOnUnblock()` asynchronously update the unblock status.

* When error reply is called in timeout_callback(), ctx is not
thread-safe, eventually lead to race conditions in `afterErrorReply`.

     - Introduced:
        Version: 6.2
        PR: redis#8217

     - Touch
       `server.stat_total_error_replies`, `server.errors`,

     - Harm Level: High
       Potentially corrupts the memory data of `server.errors`

      - Solution:
Make the ctx in `timeout_callback()` with `REDISMODULE_CTX_THREAD_SAFE`,
and asynchronously reply errors to the client.

2. Made RM_Reply*() family API thread-safe
Related discussion:
redis#12817 (comment)
Call chain: `RM_Reply*()` -> `_addReplyToBufferOrList()` -> touch
server.current_client

    - Introduced:
       Version: 7.2.0
       PR: redis#12326

   - Harm Level: None
Since the module fake client won't have the `CLIENT_PUSHING` flag, even
if we touch server.current_client,
     we can still exit after `c->flags & CLIENT_PUSHING`.

   - Solution
      Checking `c->flags & CLIENT_PUSHING` earlier.

3. Made freeClient() thread-safe
    Fix redis#12785

    - Introduced:
       Version: 4.0
Commit:
redis@3fcf959

    - Harm Level: Moderate
       * Trigger assertion
It happens when the module thread calls freeClient while the io-thread
is in progress,
which just triggers an assertion, and doesn't make any race condiaions.

* Touch `server.current_client`, `server.stat_clients_type_memory`, and
`clientMemUsageBucket->clients`.
It happens between the main thread and the module threads, may cause
data corruption.
1. Error reset `server.current_client` to NULL, but theoretically this
won't happen,
because the module has already reset `server.current_client` to old
value before entering freeClient.
2. corrupts `clientMemUsageBucket->clients` in
updateClientMemUsageAndBucket().
3. Causes server.stat_clients_type_memory memory statistics to be
inaccurate.

    - Solution:
* No longer counts memory usage on fake clients, to avoid updating
`server.stat_clients_type_memory` in freeClient.
* No longer resetting `server.current_client` in unlinkClient, because
the fake client won't be evicted or disconnected in the mid of the
process.
* Judgment assertion `io_threads_op == IO_THREADS_OP_IDLE` only if c is
not a fake client.

4. Fixed free client args without GIL
Related discussion:
redis#12817 (comment)
When freeing retained strings in the module thread (refcount decr), or
using them in some way (refcount incr), we should do so while holding
the GIL,
otherwise, they might be simultaneously freed while the main thread is
processing the unblock client state.

    - Introduced:
       Version: 6.2.0
       PR: redis#8141

   - Harm Level: Low
     Trigger assertion or double free or memory leak.

   - Solution:
Documenting that module API users need to ensure any access to these
retained strings is done with the GIL locked

5. Fix adding fake client to server.clients_pending_write
    It will incorrectly log the memory usage for the fake client.
Related discussion:
redis#12817 (comment)

    - Introduced:
       Version: 4.0
Commit:
redis@9b01b64

    - Harm Level: None
      Only result in NOP

    - Solution:
       * Don't add fake client into server.clients_pending_write
* Add c->conn assertion for updateClientMemUsageAndBucket() and
updateClientMemoryUsage() to avoid same
         issue in the future.
So now it will be the responsibility of the caller of both of them to
avoid passing in fake client.

6. Fix calling RM_BlockedClientMeasureTimeStart() and
RM_BlockedClientMeasureTimeEnd() without GIL
    - Introduced:
       Version: 6.2
       PR: redis#7491

   - Harm Level: Low
Causes inaccuracies in command latency histogram and slow logs, but does
not corrupt memory.

   - Solution:
Module API users, if know that non-thread-safe APIs will be used in
multi-threading, need to take responsibility for protecting them with
their own locks instead of the GIL, as using the GIL is too expensive.

1. RM_Yield is not thread-safe, fixed via redis#12905.

1. Fix thread-safe issues for `RM_UnblockClient()`, `freeClient()` and
`RM_Yield`, potentially preventing memory corruption, data disorder, or
assertion.
2. Updated docs and module test to clarify module API users'
responsibility for locking non-thread-safe APIs in multi-threading, such
as RM_BlockedClientMeasureTimeStart/End(), RM_FreeString(),
RM_RetainString(), and RM_HoldString().

1. The implement of (1) is not too satisfying, would like to get more
eyes.
2. (2), (3) can be safely for backport
3. (4), (6) just modifying the module tests and updating the
documentation, no need for a backpot.
4. (5) is harmless, no need for a backpot.

---------

Co-authored-by: Oran Agra <oran@redislabs.com>
YaacovHazan pushed a commit to YaacovHazan/redis that referenced this pull request Apr 22, 2025
…edis#12817)

Fix redis#12785 and other race condition issues.
See the following isolated comments.

The following report was obtained using SANITIZER thread.
```sh
make SANITIZER=thread
./runtest-moduleapi --config io-threads 4 --config io-threads-do-reads yes --accurate
```

1. Fixed thread-safe issue in RM_UnblockClient()
Related discussion:
redis#12817 (comment)
* When blocking a client in a module using `RM_BlockClientOnKeys()` or
`RM_BlockClientOnKeysWithFlags()`
with a timeout_callback, calling RM_UnblockClient() in module threads
can lead to race conditions
     in `updateStatsOnUnblock()`.

     - Introduced: 
        Version: 6.2
        PR: redis#7491

     - Touch:
`server.stat_numcommands`, `cmd->latency_histogram`, `server.slowlog`,
and `server.latency_events`
     
     - Harm Level: High
Potentially corrupts the memory data of `cmd->latency_histogram`,
`server.slowlog`, and `server.latency_events`

     - Solution:
Differentiate whether the call to moduleBlockedClientTimedOut() comes
from the module or the main thread.
Since we can't know if RM_UnblockClient() comes from module threads, we
always assume it does and
let `updateStatsOnUnblock()` asynchronously update the unblock status.
     
* When error reply is called in timeout_callback(), ctx is not
thread-safe, eventually lead to race conditions in `afterErrorReply`.

     - Introduced: 
        Version: 6.2
        PR: redis#8217

     - Touch
       `server.stat_total_error_replies`, `server.errors`, 

     - Harm Level: High
       Potentially corrupts the memory data of `server.errors`
   
      - Solution: 
Make the ctx in `timeout_callback()` with `REDISMODULE_CTX_THREAD_SAFE`,
and asynchronously reply errors to the client.

2. Made RM_Reply*() family API thread-safe
Related discussion:
redis#12817 (comment)
Call chain: `RM_Reply*()` -> `_addReplyToBufferOrList()` -> touch
server.current_client

    - Introduced: 
       Version: 7.2.0
       PR: redis#12326

   - Harm Level: None
Since the module fake client won't have the `CLIENT_PUSHING` flag, even
if we touch server.current_client,
     we can still exit after `c->flags & CLIENT_PUSHING`.

   - Solution
      Checking `c->flags & CLIENT_PUSHING` earlier.

3. Made freeClient() thread-safe
    Fix redis#12785

    - Introduced: 
       Version: 4.0
Commit:
redis@3fcf959

    - Harm Level: Moderate
       * Trigger assertion
It happens when the module thread calls freeClient while the io-thread
is in progress,
which just triggers an assertion, and doesn't make any race condiaions.

* Touch `server.current_client`, `server.stat_clients_type_memory`, and
`clientMemUsageBucket->clients`.
It happens between the main thread and the module threads, may cause
data corruption.
1. Error reset `server.current_client` to NULL, but theoretically this
won't happen,
because the module has already reset `server.current_client` to old
value before entering freeClient.
2. corrupts `clientMemUsageBucket->clients` in
updateClientMemUsageAndBucket().
3. Causes server.stat_clients_type_memory memory statistics to be
inaccurate.
    
    - Solution:
* No longer counts memory usage on fake clients, to avoid updating
`server.stat_clients_type_memory` in freeClient.
* No longer resetting `server.current_client` in unlinkClient, because
the fake client won't be evicted or disconnected in the mid of the
process.
* Judgment assertion `io_threads_op == IO_THREADS_OP_IDLE` only if c is
not a fake client.

4. Fixed free client args without GIL
Related discussion:
redis#12817 (comment)
When freeing retained strings in the module thread (refcount decr), or
using them in some way (refcount incr), we should do so while holding
the GIL,
otherwise, they might be simultaneously freed while the main thread is
processing the unblock client state.

    - Introduced: 
       Version: 6.2.0
       PR: redis#8141

   - Harm Level: Low
     Trigger assertion or double free or memory leak. 

   - Solution:
Documenting that module API users need to ensure any access to these
retained strings is done with the GIL locked

5. Fix adding fake client to server.clients_pending_write
    It will incorrectly log the memory usage for the fake client.
Related discussion:
redis#12817 (comment)

    - Introduced: 
       Version: 4.0
Commit:
redis@9b01b64

    - Harm Level: None
      Only result in NOP

    - Solution:
       * Don't add fake client into server.clients_pending_write
* Add c->conn assertion for updateClientMemUsageAndBucket() and
updateClientMemoryUsage() to avoid same
         issue in the future.
So now it will be the responsibility of the caller of both of them to
avoid passing in fake client.

6. Fix calling RM_BlockedClientMeasureTimeStart() and
RM_BlockedClientMeasureTimeEnd() without GIL
    - Introduced: 
       Version: 6.2
       PR: redis#7491

   - Harm Level: Low
Causes inaccuracies in command latency histogram and slow logs, but does
not corrupt memory.

   - Solution:
Module API users, if know that non-thread-safe APIs will be used in
multi-threading, need to take responsibility for protecting them with
their own locks instead of the GIL, as using the GIL is too expensive.

### Other issue
1. RM_Yield is not thread-safe, fixed via redis#12905.

### Summarize
1. Fix thread-safe issues for `RM_UnblockClient()`, `freeClient()` and
`RM_Yield`, potentially preventing memory corruption, data disorder, or
assertion.
2. Updated docs and module test to clarify module API users'
responsibility for locking non-thread-safe APIs in multi-threading, such
as RM_BlockedClientMeasureTimeStart/End(), RM_FreeString(),
RM_RetainString(), and RM_HoldString().

### About backpot to 7.2
1. The implement of (1) is not too satisfying, would like to get more
eyes.
2. (2), (3) can be safely for backport
3. (4), (6) just modifying the module tests and updating the
documentation, no need for a backpot.
4. (5) is harmless, no need for a backpot.

---------

Co-authored-by: Oran Agra <oran@redislabs.com>
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
…edis#12817)

Fix redis#12785 and other race condition issues.
See the following isolated comments.

The following report was obtained using SANITIZER thread.
```sh
make SANITIZER=thread
./runtest-moduleapi --config io-threads 4 --config io-threads-do-reads yes --accurate
```

1. Fixed thread-safe issue in RM_UnblockClient()
Related discussion:
redis#12817 (comment)
* When blocking a client in a module using `RM_BlockClientOnKeys()` or
`RM_BlockClientOnKeysWithFlags()`
with a timeout_callback, calling RM_UnblockClient() in module threads
can lead to race conditions
     in `updateStatsOnUnblock()`.

     - Introduced: 
        Version: 6.2
        PR: redis#7491

     - Touch:
`server.stat_numcommands`, `cmd->latency_histogram`, `server.slowlog`,
and `server.latency_events`
     
     - Harm Level: High
Potentially corrupts the memory data of `cmd->latency_histogram`,
`server.slowlog`, and `server.latency_events`

     - Solution:
Differentiate whether the call to moduleBlockedClientTimedOut() comes
from the module or the main thread.
Since we can't know if RM_UnblockClient() comes from module threads, we
always assume it does and
let `updateStatsOnUnblock()` asynchronously update the unblock status.
     
* When error reply is called in timeout_callback(), ctx is not
thread-safe, eventually lead to race conditions in `afterErrorReply`.

     - Introduced: 
        Version: 6.2
        PR: redis#8217

     - Touch
       `server.stat_total_error_replies`, `server.errors`, 

     - Harm Level: High
       Potentially corrupts the memory data of `server.errors`
   
      - Solution: 
Make the ctx in `timeout_callback()` with `REDISMODULE_CTX_THREAD_SAFE`,
and asynchronously reply errors to the client.

2. Made RM_Reply*() family API thread-safe
Related discussion:
redis#12817 (comment)
Call chain: `RM_Reply*()` -> `_addReplyToBufferOrList()` -> touch
server.current_client

    - Introduced: 
       Version: 7.2.0
       PR: redis#12326

   - Harm Level: None
Since the module fake client won't have the `CLIENT_PUSHING` flag, even
if we touch server.current_client,
     we can still exit after `c->flags & CLIENT_PUSHING`.

   - Solution
      Checking `c->flags & CLIENT_PUSHING` earlier.

3. Made freeClient() thread-safe
    Fix redis#12785

    - Introduced: 
       Version: 4.0
Commit:
redis@fa1b900

    - Harm Level: Moderate
       * Trigger assertion
It happens when the module thread calls freeClient while the io-thread
is in progress,
which just triggers an assertion, and doesn't make any race condiaions.

* Touch `server.current_client`, `server.stat_clients_type_memory`, and
`clientMemUsageBucket->clients`.
It happens between the main thread and the module threads, may cause
data corruption.
1. Error reset `server.current_client` to NULL, but theoretically this
won't happen,
because the module has already reset `server.current_client` to old
value before entering freeClient.
2. corrupts `clientMemUsageBucket->clients` in
updateClientMemUsageAndBucket().
3. Causes server.stat_clients_type_memory memory statistics to be
inaccurate.
    
    - Solution:
* No longer counts memory usage on fake clients, to avoid updating
`server.stat_clients_type_memory` in freeClient.
* No longer resetting `server.current_client` in unlinkClient, because
the fake client won't be evicted or disconnected in the mid of the
process.
* Judgment assertion `io_threads_op == IO_THREADS_OP_IDLE` only if c is
not a fake client.

4. Fixed free client args without GIL
Related discussion:
redis#12817 (comment)
When freeing retained strings in the module thread (refcount decr), or
using them in some way (refcount incr), we should do so while holding
the GIL,
otherwise, they might be simultaneously freed while the main thread is
processing the unblock client state.

    - Introduced: 
       Version: 6.2.0
       PR: redis#8141

   - Harm Level: Low
     Trigger assertion or double free or memory leak. 

   - Solution:
Documenting that module API users need to ensure any access to these
retained strings is done with the GIL locked

5. Fix adding fake client to server.clients_pending_write
    It will incorrectly log the memory usage for the fake client.
Related discussion:
redis#12817 (comment)

    - Introduced: 
       Version: 4.0
Commit:
redis@e96bac5

    - Harm Level: None
      Only result in NOP

    - Solution:
       * Don't add fake client into server.clients_pending_write
* Add c->conn assertion for updateClientMemUsageAndBucket() and
updateClientMemoryUsage() to avoid same
         issue in the future.
So now it will be the responsibility of the caller of both of them to
avoid passing in fake client.

6. Fix calling RM_BlockedClientMeasureTimeStart() and
RM_BlockedClientMeasureTimeEnd() without GIL
    - Introduced: 
       Version: 6.2
       PR: redis#7491

   - Harm Level: Low
Causes inaccuracies in command latency histogram and slow logs, but does
not corrupt memory.

   - Solution:
Module API users, if know that non-thread-safe APIs will be used in
multi-threading, need to take responsibility for protecting them with
their own locks instead of the GIL, as using the GIL is too expensive.

### Other issue
1. RM_Yield is not thread-safe, fixed via redis#12905.

### Summarize
1. Fix thread-safe issues for `RM_UnblockClient()`, `freeClient()` and
`RM_Yield`, potentially preventing memory corruption, data disorder, or
assertion.
2. Updated docs and module test to clarify module API users'
responsibility for locking non-thread-safe APIs in multi-threading, such
as RM_BlockedClientMeasureTimeStart/End(), RM_FreeString(),
RM_RetainString(), and RM_HoldString().

### About backpot to 7.2
1. The implement of (1) is not too satisfying, would like to get more
eyes.
2. (2), (3) can be safely for backport
3. (4), (6) just modifying the module tests and updating the
documentation, no need for a backpot.
4. (5) is harmless, no need for a backpot.

---------

Co-authored-by: Oran Agra <oran@redislabs.com>
funny-dog pushed a commit to funny-dog/redis that referenced this pull request Sep 17, 2025
redis#13141)

Users who abuse lua error_reply will generate a new error object on each
error call, which can make server.errors get bigger and bigger. This
will
cause the server to block when calling INFO (we also return errorstats
by
default).

To prevent the damage it can cause, when a misuse is detected, we will
print a warning log and disable the errorstats to avoid adding more new
errors. It can be re-enabled via CONFIG RESETSTAT.

Because server.errors may be very large (it may be better now since we
have the limit), config resetstat may block for a while. So in
resetErrorTableStats, we will try to lazyfree server.errors.

See the related discussion at the end of redis#8217.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approval-needed Waiting for core team approval to be merged release-notes indication that this issue needs to be mentioned in the release notes state:major-decision Requires core team consensus state:to-be-merged The PR should be merged soon, even if not yet ready, this is used so that it won't be forgotten

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants