
reverse_tunnel: add per connection timer#43724

Merged
agrawroh merged 4 commits into envoyproxy:main from ivpr:fix-rping-ping-timer
Mar 5, 2026

Conversation

@ivpr (Contributor) commented Mar 3, 2026

Commit Message: use a per-connection rping timer with jitter, similar to HTTP/2 keepalives
Additional Description: we have observed higher p90 latencies for requests from upstream to downstream when there is a high number (>10k) of active reverse connections, because the rping for-loop initiates rpings for all connections without yielding to request processing. Per-connection rping timers keep request events from being starved.
Risk Level: Low
Testing: Unit tests
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
@agrawroh agrawroh self-assigned this Mar 3, 2026
}

// Create per-connection send timer with initial jitter spread across the full interval
// to distribute pings from burst-added connections across time.
Do we need the initial jitter to be the full interval? Could we set it to 15% like done by the HTTP/2 Codec, and elsewhere in UpstreamSocketManager?

/**
* Ping all active reverse connections for health checks.
* Send a ping keepalive for a single reverse connection.
* Each connection has its own send timer, avoiding O(N) global iteration.
Nit: remove the "avoiding O(N) global iteration" bit

absl::flat_hash_map<int, Event::TimerPtr> fd_to_timer_map_;

// Per-connection send timers that schedule individual ping sends with jitter,
// replacing the global O(N) ping_timer_.
Nit: remove the "replacing ..." bit

Event::TimerPtr ping_timer_;
std::chrono::seconds ping_interval_{0};

// O(1) counter for total active FDs per node, replacing the O(N) scan in hasAnySocketsForNode().
Nit: Same as above, remove the "replacing the O(N) scan in hasAnySocketsForNode()." bit

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
@agrawroh (Member) commented Mar 4, 2026

/gemini review

@gemini-code-assist bot left a comment
Code Review

This pull request refactors the reverse tunnel ping mechanism from a single global timer to per-connection timers. This change aims to reduce p90 latencies when there are many active connections by avoiding a long-running loop in the event dispatcher. The new implementation introduces jitter to the timers, which is a good practice to prevent thundering herd problems. The changes are logical and well-tested. I've added a couple of suggestions to further improve performance by optimizing a data lookup and to enhance code clarity.

Comment on lines +281 to +288
if (count_it != node_to_active_fd_count_.end()) {
if (count_it->second > 0) {
count_it->second--;
}
if (count_it->second == 0) {
node_to_active_fd_count_.erase(count_it);
}
}
medium

This logic can be slightly simplified. Since markSocketDead should only be called for active FDs, we can assume count_it->second is greater than 0. You could combine the decrement and check, and add an ASSERT to enforce this assumption.

  if (count_it != node_to_active_fd_count_.end()) {
    ASSERT(count_it->second > 0);
    if (--count_it->second == 0) {
      node_to_active_fd_count_.erase(count_it);
    }
  }

Comment on lines +433 to +440
auto& sockets = accepted_reverse_connections_[node_id];
Network::ConnectionSocket* socket_ptr = nullptr;
for (auto& s : sockets) {
if (s->ioHandle().fdDoNotUse() == fd) {
socket_ptr = s.get();
break;
}
}
medium

While this PR correctly moves to a per-connection timer, this loop to find the socket by fd introduces a linear scan over all idle connections for a given node. This could be a performance concern if a node has many idle connections. A similar linear scan also exists in markSocketDead.

To optimize this, you could introduce a map to store an iterator to the socket in the accepted_reverse_connections_ list, allowing for O(1) lookups. For example:

  1. In upstream_socket_manager.h, add a new member:
    absl::flat_hash_map<int, std::list<Network::ConnectionSocketPtr>::iterator> fd_to_socket_it_map_;
  2. In addConnectionSocket, populate this map when a new socket is added:
    accepted_reverse_connections_[node_id].push_back(std::move(socket));
    fd_to_socket_it_map_[fd] = std::prev(accepted_reverse_connections_[node_id].end());
  3. In sendPingForConnection (and markSocketDead), use this map for a direct O(1) lookup and access/erasure.
  4. Remember to erase from fd_to_socket_it_map_ in getConnectionSocket and markSocketDead when a socket is removed from the idle list.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
@agrawroh agrawroh merged commit ba75e8b into envoyproxy:main Mar 5, 2026
29 checks passed
bmjask pushed a commit to bmjask/envoy that referenced this pull request Mar 14, 2026
bvandewalle pushed a commit to bvandewalle/envoy that referenced this pull request Mar 17, 2026
fishcakez pushed a commit to fishcakez/envoy that referenced this pull request Mar 25, 2026