
Fix rate-limit induced hanging at test completion #340

Merged
fcostaoliveira merged 9 commits into redis:master from tschneiter-figma:fix/rate-limit-hang on Feb 19, 2026

Conversation

@tschneiter-figma (Contributor) commented Feb 18, 2026

Fix Rate-Limited Benchmark Hang on Completion

Problem

When running benchmarks with --rate-limiting, the program hangs indefinitely after reaching 100% completion. This is particularly prevalent in cluster mode with TLS connections. Ctrl+C also fails to terminate the hung process.

Root Cause

Rate-limited mode uses persistent timers (EV_PERSIST) to control request pacing. When a benchmark completes:

  • finished() returns true
  • The code calls event_del() on the timer and bufferevent_disable() on the connections
  • However, event_del() only removes the event from the pending set; it does not free the underlying resources
  • The event loop therefore stays alive, still holding these never-freed events
  • In cluster mode with multiple shard connections the problem is compounded, because every connection must be fully cleaned up before the event loop can exit (see the sketch after this list)
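
A minimal, self-contained sketch of the libevent calls the bullets above hinge on (illustrative only, not memtier_benchmark source): event_del() merely unschedules a persistent timer, while event_free() actually releases it; bufferevent_disable() and bufferevent_free() play the same roles for connections.

```cpp
// Minimal illustrative sketch (not memtier_benchmark source), assuming libevent 2.x.
#include <event2/event.h>
#include <cstdio>

static void pacing_tick(evutil_socket_t, short, void *) { std::puts("pacing tick"); }

int main() {
    struct event_base *base = event_base_new();

    // A persistent pacing timer, like the EV_PERSIST timer used for --rate-limiting.
    struct event *timer = event_new(base, -1, EV_PERSIST, pacing_tick, nullptr);
    struct timeval interval = {0, 100000}; // 100 ms
    event_add(timer, &interval);

    event_base_loop(base, EVLOOP_ONCE); // run until the first tick fires

    event_del(timer);  // only unschedules the event; the struct event still exists
    event_free(timer); // actually releases it (the call the pre-fix cleanup path lacked)

    event_base_free(base);
    return 0;
}
```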

Solution

  • Added a disconnect_all() method to properly disconnect all shard connections when a client finishes
  • Added all_connections_idle() to check that no connection still has pending responses (required for cluster mode)
  • When the timer fires and finished() && all_connections_idle() holds, the code now calls disconnect_all(), which frees every bufferevent (bufferevent_free()) and timer (event_free()); a sketch of this completion path follows the list
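
The sketch below models that completion path under stated assumptions: BenchState, pacing_cb(), requests_sent, and pending_responses are invented stand-ins, while the finished()/all_connections_idle() check mirrors the description above. The real implementation lives in handle_timer_event()/fill_pipeline() and disconnect_all(), and additionally frees each shard connection's bufferevent with bufferevent_free().

```cpp
// Illustrative sketch of the completion check, assuming libevent 2.x.
// BenchState, pacing_cb, requests_sent and pending_responses are invented names.
#include <event2/event.h>
#include <cstdio>

struct BenchState {
    struct event *pacing_timer = nullptr;
    int requests_sent = 0;
    int pending_responses = 0; // would track in-flight responses per connection
    int target_requests = 5;

    bool finished() const { return requests_sent >= target_requests; }
    bool all_connections_idle() const { return pending_responses == 0; }
};

static void pacing_cb(evutil_socket_t, short, void *arg) {
    auto *st = static_cast<BenchState *>(arg);

    if (st->finished() && st->all_connections_idle()) {
        // Completion: unschedule the persistent timer (the real fix also frees it
        // and every connection's bufferevent inside disconnect_all()).
        event_del(st->pacing_timer);
        std::puts("benchmark finished, tearing down");
        return;
    }

    if (!st->finished()) {
        ++st->requests_sent; // issue one paced request; pretend it completes at once
        std::printf("request %d sent\n", st->requests_sent);
    }
}

int main() {
    struct event_base *base = event_base_new();
    BenchState st;

    st.pacing_timer = event_new(base, -1, EV_PERSIST, pacing_cb, &st);
    struct timeval interval = {0, 10000}; // 10 ms pacing interval
    event_add(st.pacing_timer, &interval);

    // Returns once no events remain registered, i.e. after the timer is removed
    // in pacing_cb. In the pre-fix code this is where the benchmark hung.
    event_base_dispatch(base);

    event_free(st.pacing_timer);
    event_base_free(base);
    return 0;
}
```

With this shape, event_base_dispatch() returns on its own once the last event is released, which is exactly what the pre-fix cleanup never allowed.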

Testing

  • Added regression test test_rate_limited_completion_no_hang that runs multiple iterations with multiple threads/connections to catch race conditions
  • Manually verified the fix works with TLS cluster mode

Files Changed

  • shard_connection.cpp
    Cleanup logic in handle_timer_event() and fill_pipeline()
  • client.cpp
    Added disconnect_all() and all_connections_idle() implementations
  • client.h
    Added declarations
  • connections_manager.h
    Interface additions
  • tests_oss_simple_flow.py
    Added regression test

Note

Medium Risk
Touches connection lifecycle and event-loop shutdown behavior across clustered shards, so a mistake could cause premature disconnects or missed responses under load. The risk is mitigated by a targeted regression test for the hang and by dedicated CI coverage of the previously problematic cluster+TLS setup.

Overview
Fixes a hang at the end of rate-limited runs by changing the completion/cleanup logic to wait until all shard connections are idle and then tear down every connection: new all_connections_idle() and disconnect_all() methods are added to the connections_manager/client API and invoked from the shard_connection timer and pipeline paths.

Adds a regression test, test_rate_limited_completion_no_hang, and extends the test harness/CI to run it in an OSS Cluster TLS configuration with a high shard count, plus new RLTest tuning knobs such as --cluster_node_timeout, a configurable log level, and a longer per-test timeout.

Written by Cursor Bugbot for commit 716cca4. This will update automatically on new commits.

jit-ci bot commented Feb 18, 2026

Hi, I’m Jit, a friendly security platform designed to help developers build secure applications from day zero with an MVS (Minimal viable security) mindset.

In case there are security findings, they will be communicated to you as a comment inside the PR.

Hope you’ll enjoy using Jit.

Questions? Comments? Want to learn more? Get in touch with us.

cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

@fcostaoliveira (Collaborator) commented

@tschneiter-figma I've confirmed on master with your test we get the expected failure:

VERBOSE=1 OSS_CLUSTER=1 OSS_STANDALONE=0 TLS=1 SHARDS=99 TEST=tests_oss_simple_flow.py:test_rate_limited_completion_no_hang ./tests/run_tests.sh 
(...)
	❌  (FAIL):	False == True	tests_oss_simple_flow.py:770 [Benchmark hung on iteration 1 (timeout 25s)]

and after the fix we're good to go:

VERBOSE=1 OSS_CLUSTER=1 OSS_STANDALONE=0 TLS=1 SHARDS=99 TEST=tests_oss_simple_flow.py:test_rate_limited_completion_no_hang ./tests/run_tests.sh 
(...)
tests_oss_simple_flow:test_rate_limited_completion_no_hang:
	[PASS]

I'm adding this 99-shard scenario to CI, and if everything passes we're good to merge.

fcostaoliveira self-assigned this Feb 19, 2026
@fcostaoliveira (Collaborator) left a comment

Thank you Tristan!

fcostaoliveira merged commit 754ee6b into redis:master on Feb 19, 2026
40 checks passed
filipecosta90 pushed a commit that referenced this pull request Feb 26, 2026
* Fix rate-limit induced hanging at test completion

* include a null check for conns in all_connections_idle

* Remove crufty ctx reference

* Extend CI with higher shard count scenario.

* Extra logging on 99 shards scenario

* Verbose was already defined on CI. Using 49 shards to expedite CI

* Using RLTEST_DEBUG to avoid overriding old behaviour

* Shard count 99 in rate-limiting test

---------

Co-authored-by: fcostaoliveira <filipe@redis.com>
fcostaoliveira mentioned this pull request Feb 26, 2026
fcostaoliveira added a commit that referenced this pull request Feb 26, 2026
* Add clang-format for code style enforcement (#336)

Cherry-picked and re-applied clang-format configuration, CI workflow,
Makefile targets, and DEVELOPMENT.md docs from master. Source files
reformatted against the 2.2.x branch codebase.

* Concurrent ubuntu test jobs for faster CI (#337)

* Concurrent ubuntu test jobs for faster CI

* Prevent duplicate workflow runs on push+PR by filtering branches

* Fixed coverage workflow

* Concurrent ASAN, TSAN, and UBSAN test jobs using matrix strategy

* Updated org from redislabs to redis (#339)

* Fix rate-limit induced hanging at test completion (#340)

* Fix rate-limit induced hanging at test completion

* include a null check for conns in all_connections_idle

* Remove crufty ctx reference

* Extend CI with higher shard count scenario.

* Extra logging on 99 shards scenario

* Verbose was already defined on CI. Using 49 shards to expedite CI

* Using RLTEST_DEBUG to avoid overriding old behaviour

* Shard count 99 in rate-limiting test

---------

Co-authored-by: fcostaoliveira <filipe@redis.com>

* configure: Respect user-supplied CXXFLAGS (#342)

* Add AGENTS.md and CLAUDE.md for AI assistant guidelines (#343)

Add documentation to help AI assistants work effectively with the
memtier_benchmark codebase, following the https://agents.md/ conventions.

AGENTS.md includes:
- Project overview and repository structure
- Build system and commands (autotools)
- Code style (clang-format)
- Testing with RLTest (standalone, cluster, TLS, sanitizers)
- Key technical details
- Common development tasks
- Debugging guide (GDB, crash handler, core dumps, sanitizers)
- License header requirements

CLAUDE.md points to AGENTS.md for shared guidelines.

References:
- https://agents.md/ - Standard for AI agent documentation
- https://docs.anthropic.com/en/docs/agents - Anthropic agent guidelines

* Use latest rltest and set cluster-start-timeout to accommodate large shard count (#345)

* CI: trigger workflows on semver release branches

Add branch pattern '[0-9]+.[0-9]+' to push/pull_request triggers for
ci, code-style, asan, tsan, and ubsan workflows so CI runs on PRs
targeting release branches like 2.2.

* Fix hang when using --reconnect-interval with --rate-limiting (#348)

The rate-limiting timer was only created on the first connection
(when get_reqs_processed() == 0). After a reconnect triggered by
--reconnect-interval, disconnect() properly freed the timer, but
handle_event() never recreated it because requests had already been
processed. This left m_request_per_cur_interval permanently at 0,
causing fill_pipeline() to return immediately on every call.

Move timer creation outside the first-connection guard so it is
recreated on every successful connect/reconnect when m_event_timer
is NULL.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* Bumping version to 2.2.2

---------

Co-authored-by: Tristan Schneiter <tschneiter@figma.com>
Co-authored-by: LINKIWI <LINKIWI@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
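
The --reconnect-interval fix referenced above (#348) follows the same timer-lifecycle theme as this PR. Below is a hedged sketch of the restructured guard it describes; the member and method names are taken from that commit message, while the surrounding class and main() are invented for illustration.

```cpp
// Illustrative sketch of the reworked guard, assuming libevent 2.x. The member and
// method names mirror the commit message; the class and main() are invented.
#include <event2/event.h>

class ShardConnectionSketch {
public:
    explicit ShardConnectionSketch(struct event_base *base) : m_base(base) {}

    // Called after every successful connect, including reconnects triggered by
    // --reconnect-interval.
    void handle_connect() {
        // Pre-fix shape: the timer was created only when get_reqs_processed() == 0,
        // i.e. on the very first connection. After disconnect() freed it on a
        // reconnect it was never recreated, so m_request_per_cur_interval stayed
        // at 0 and fill_pipeline() returned immediately on every call.
        //
        // Post-fix shape: recreate the pacing timer whenever it is missing.
        if (m_event_timer == nullptr) {
            m_event_timer = event_new(m_base, -1, EV_PERSIST, timer_cb, this);
            struct timeval interval = {1, 0}; // refill the per-interval budget each second
            event_add(m_event_timer, &interval);
        }
    }

    void disconnect() {
        if (m_event_timer != nullptr) {
            event_free(m_event_timer);
            m_event_timer = nullptr;
        }
    }

private:
    static void timer_cb(evutil_socket_t, short, void *arg) {
        auto *self = static_cast<ShardConnectionSketch *>(arg);
        self->m_request_per_cur_interval = self->m_rate_limit; // refill request budget
    }

    struct event_base *m_base;
    struct event *m_event_timer = nullptr;
    unsigned int m_request_per_cur_interval = 0;
    unsigned int m_rate_limit = 100;
};

int main() {
    struct event_base *base = event_base_new();
    ShardConnectionSketch conn(base);

    conn.handle_connect(); // first connect: pacing timer created
    conn.disconnect();     // --reconnect-interval fires: timer freed
    conn.handle_connect(); // reconnect: timer recreated, so pacing keeps working

    conn.disconnect();
    event_base_free(base);
    return 0;
}
```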