reverse_tunnel: add per connection timer #43724
Conversation
Signed-off-by: Prasad I V <prasad.iv@databricks.com>
```cpp
}

// Create per-connection send timer with initial jitter spread across the full interval
// to distribute pings from burst-added connections across time.
```
Do we need the initial jitter to be the full interval? Could we set it to 15%, as done by the HTTP/2 codec and elsewhere in UpstreamSocketManager?
```diff
 /**
- * Ping all active reverse connections for health checks.
+ * Send a ping keepalive for a single reverse connection.
+ * Each connection has its own send timer, avoiding O(N) global iteration.
```
Nit: remove the "avoiding O(N) global iteration" bit
```cpp
// Per-connection send timers that schedule individual ping sends with jitter,
// replacing the global O(N) ping_timer_.
absl::flat_hash_map<int, Event::TimerPtr> fd_to_timer_map_;
```
Nit: remove the "replacing ..." bit
```cpp
Event::TimerPtr ping_timer_;
std::chrono::seconds ping_interval_{0};

// O(1) counter for total active FDs per node, replacing the O(N) scan in hasAnySocketsForNode().
```
Nit: Same as above, remove the "replacing the O(N) scan in hasAnySocketsForNode()." bit
Signed-off-by: Prasad I V <prasad.iv@databricks.com>
/gemini review
Code Review
This pull request refactors the reverse tunnel ping mechanism from a single global timer to per-connection timers. This change aims to reduce p90 latencies when there are many active connections by avoiding a long-running loop in the event dispatcher. The new implementation introduces jitter to the timers, which is a good practice to prevent thundering herd problems. The changes are logical and well-tested. I've added a couple of suggestions to further improve performance by optimizing a data lookup and to enhance code clarity.
```cpp
if (count_it != node_to_active_fd_count_.end()) {
  if (count_it->second > 0) {
    count_it->second--;
  }
  if (count_it->second == 0) {
    node_to_active_fd_count_.erase(count_it);
  }
}
```
This logic can be slightly simplified. Since `markSocketDead` should only be called for active FDs, we can assume `count_it->second` is greater than 0. You could combine the decrement and the check, and add an `ASSERT` to enforce this assumption:
```cpp
if (count_it != node_to_active_fd_count_.end()) {
  ASSERT(count_it->second > 0);
  if (--count_it->second == 0) {
    node_to_active_fd_count_.erase(count_it);
  }
}
```

```cpp
auto& sockets = accepted_reverse_connections_[node_id];
Network::ConnectionSocket* socket_ptr = nullptr;
for (auto& s : sockets) {
  if (s->ioHandle().fdDoNotUse() == fd) {
    socket_ptr = s.get();
    break;
  }
}
```
While this PR correctly moves to a per-connection timer, this loop to find the socket by `fd` introduces a linear scan over all idle connections for a given node. This could be a performance concern if a node has many idle connections. A similar linear scan also exists in `markSocketDead`.
To optimize this, you could introduce a map storing an iterator to the socket's position in the `accepted_reverse_connections_` list, allowing for O(1) lookups. For example:
- In `upstream_socket_manager.h`, add a new member:

  ```cpp
  absl::flat_hash_map<int, std::list<Network::ConnectionSocketPtr>::iterator> fd_to_socket_it_map_;
  ```

- In `addConnectionSocket`, populate this map when a new socket is added:

  ```cpp
  accepted_reverse_connections_[node_id].push_back(std::move(socket));
  fd_to_socket_it_map_[fd] = std::prev(accepted_reverse_connections_[node_id].end());
  ```

- In `sendPingForConnection` (and `markSocketDead`), use this map for a direct O(1) lookup and access/erasure.
- Remember to erase from `fd_to_socket_it_map_` in `getConnectionSocket` and `markSocketDead` when a socket is removed from the idle list.
Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Co-authored-by: Prasad I V <prasad.iv@databricks.com>
Signed-off-by: bjmask <11672696+bjmask@users.noreply.github.com>
Commit Message: use per-connection rping timer with jitter, similar to HTTP/2 keepalives
Additional Description: We have observed higher p90 latencies for requests from upstream to downstream when there are a high number (>10k) of active reverse connections, because the rping loop initiates rpings for all connections without yielding to request processing. Per-connection rping timers keep request events from being starved.
Risk Level: Low
Testing: Unit tests
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]