
reverse_tunnel: add per connection timer#43724

Merged
agrawroh merged 4 commits into envoyproxy:main from ivpr:fix-rping-ping-timer
Mar 5, 2026

Conversation

@ivpr (Contributor) commented Mar 3, 2026

Commit Message: use a per-connection rping timer with jitter, similar to HTTP/2 keepalives
Additional Description: we have observed higher p90 latencies for requests from upstream to downstream when there is a high number (>10k) of active reverse connections, because the rping for-loop initiates rpings for all connections without yielding to request processing. Per-connection rping timers keep request events from being starved.
Risk Level: Low
Testing: Unit tests
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
@agrawroh agrawroh self-assigned this Mar 3, 2026
}

// Create per-connection send timer with initial jitter spread across the full interval
// to distribute pings from burst-added connections across time.
Do we need the initial jitter to be the full interval? Could we set it to 15% like done by the HTTP/2 Codec, and elsewhere in UpstreamSocketManager?

/**
* Ping all active reverse connections for health checks.
* Send a ping keepalive for a single reverse connection.
* Each connection has its own send timer, avoiding O(N) global iteration.
Nit: remove the "avoiding O(N) global iteration" bit

absl::flat_hash_map<int, Event::TimerPtr> fd_to_timer_map_;

// Per-connection send timers that schedule individual ping sends with jitter,
// replacing the global O(N) ping_timer_.
Nit: remove the "replacing ..." bit

Event::TimerPtr ping_timer_;
std::chrono::seconds ping_interval_{0};

// O(1) counter for total active FDs per node, replacing the O(N) scan in hasAnySocketsForNode().
Nit: Same as above, remove the "replacing the O(N) scan in hasAnySocketsForNode()." bit

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
@agrawroh (Member) commented Mar 4, 2026

/gemini review

@gemini-code-assist bot left a comment
Code Review

This pull request refactors the reverse tunnel ping mechanism from a single global timer to per-connection timers. This change aims to reduce p90 latencies when there are many active connections by avoiding a long-running loop in the event dispatcher. The new implementation introduces jitter to the timers, which is a good practice to prevent thundering herd problems. The changes are logical and well-tested. I've added a couple of suggestions to further improve performance by optimizing a data lookup and to enhance code clarity.

Comment on lines +281 to +288
if (count_it != node_to_active_fd_count_.end()) {
if (count_it->second > 0) {
count_it->second--;
}
if (count_it->second == 0) {
node_to_active_fd_count_.erase(count_it);
}
}
medium

This logic can be slightly simplified. Since markSocketDead should only be called for active FDs, we can assume count_it->second is greater than 0. You could combine the decrement and check, and add an ASSERT to enforce this assumption.

  if (count_it != node_to_active_fd_count_.end()) {
    ASSERT(count_it->second > 0);
    if (--count_it->second == 0) {
      node_to_active_fd_count_.erase(count_it);
    }
  }

Comment on lines +433 to +440
auto& sockets = accepted_reverse_connections_[node_id];
Network::ConnectionSocket* socket_ptr = nullptr;
for (auto& s : sockets) {
if (s->ioHandle().fdDoNotUse() == fd) {
socket_ptr = s.get();
break;
}
}
medium

While this PR correctly moves to a per-connection timer, this loop to find the socket by fd introduces a linear scan over all idle connections for a given node. This could be a performance concern if a node has many idle connections. A similar linear scan also exists in markSocketDead.

To optimize this, you could introduce a map to store an iterator to the socket in the accepted_reverse_connections_ list, allowing for O(1) lookups. For example:

  1. In upstream_socket_manager.h, add a new member:
    absl::flat_hash_map<int, std::list<Network::ConnectionSocketPtr>::iterator> fd_to_socket_it_map_;
  2. In addConnectionSocket, populate this map when a new socket is added:
    accepted_reverse_connections_[node_id].push_back(std::move(socket));
    fd_to_socket_it_map_[fd] = std::prev(accepted_reverse_connections_[node_id].end());
  3. In sendPingForConnection (and markSocketDead), use this map for a direct O(1) lookup and access/erasure.
  4. Remember to erase from fd_to_socket_it_map_ in getConnectionSocket and markSocketDead when a socket is removed from the idle list.

Signed-off-by: Prasad I V <prasad.iv@databricks.com>
@agrawroh agrawroh merged commit ba75e8b into envoyproxy:main Mar 5, 2026
29 checks passed
bmjask pushed a commit to bmjask/envoy that referenced this pull request Mar 14, 2026
bvandewalle pushed a commit to bvandewalle/envoy that referenced this pull request Mar 17, 2026
fishcakez pushed a commit to fishcakez/envoy that referenced this pull request Mar 25, 2026