Skip to content

Keepalive Throttling#23313

Merged
yashykt merged 10 commits intogrpc:masterfrom
yashykt:keepalivethrottling
Aug 14, 2020
Merged

Keepalive Throttling#23313
yashykt merged 10 commits intogrpc:masterfrom
yashykt:keepalivethrottling

Conversation

@yashykt
Copy link
Copy Markdown
Member

@yashykt yashykt commented Jun 25, 2020

Fix #16210

Changes -

  1. Add a key for keepalive_throttling named "grpc.internal.keepalive_throttling" in the absl::Status object when we notify a state change due to a GOAWAY received with "too_many_pings". The value associated with this key is the new keepalive time period that the channel/subchannel should use.
  2. When the channel sees the notification, it updates its keepalive_time_ value and propagates the value to all its subchannels so that all future transports will be created with the newer value. (The updated value is the max of its current value and the new update, so that we can handle the cases of receiving GOAWAYs on two separate transports for the same channel and that of subchannel sharing between separate channels.)

This change is Reviewable

@yashykt yashykt force-pushed the keepalivethrottling branch from ef90643 to e8d247a Compare July 6, 2020 10:10
@yashykt yashykt marked this pull request as ready for review July 6, 2020 10:10
@yashykt yashykt requested a review from markdroth as a code owner July 6, 2020 10:10
@yashykt yashykt added release notes: yes Indicates if PR needs to be in release notes lang/core labels Jul 6, 2020
Copy link
Copy Markdown
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! I've reviewed everything except for the keepalive tests.

I suggest splitting this PR into two parts. I think that the change for adding the absl::Status parameter to the connectivity state APIs is sufficiently invasive that it should be in its own PR. We can then have a second PR that adds the keepalive functionality.

I think that we should also add the absl::Status argument to LoadBalancingPolicy::ChannelControlHelper::UpdateState(), and we should plumb the value through the LB policies. In effect, the change to add the absl::Status parameter to the connectivity state APIs should be a logical revert of the change I made last year to remove the grpc_error parameter in #18628. You can look at that PR to see what will need to be changed.

Please let me know if you have any questions about any of this. Thanks!

Reviewed 31 of 34 files at r1.
Reviewable status: 31 of 34 files reviewed, 20 unresolved discussions (waiting on @markdroth and @yashykt)


src/core/ext/filters/client_channel/client_channel.cc, line 1101 at r1 (raw file):

      ConnectivityStateChange state_change = PopConnectivityStateChange();
      absl::optional<absl::Cord> keepalive_throttling =
          state_change.status.GetPayload("grpc.internal.keepalive_throttling");

Let's define this key name as a constant somewhere. Maybe in transport.h?


src/core/ext/filters/client_channel/client_channel.cc, line 1103 at r1 (raw file):

          state_change.status.GetPayload("grpc.internal.keepalive_throttling");
      if (keepalive_throttling.has_value()) {
        parent_->chand_->keepalive_time_ =

In addition to setting this, we also need to pass the new value to all existing subchannels in use by this channel.


src/core/ext/filters/client_channel/client_channel.cc, line 1105 at r1 (raw file):

        parent_->chand_->keepalive_time_ =
            std::max(parent_->chand_->keepalive_time_,
                     atoi(std::string(keepalive_throttling.value()).c_str()));

Please use absl::SimpleAtoi().


src/core/ext/filters/client_channel/client_channel.cc, line 1108 at r1 (raw file):

        if (GRPC_TRACE_FLAG_ENABLED(grpc_client_channel_routing_trace)) {
          gpr_log(GPR_INFO,
                  "Throttling keepalive time chand=%p keepalive_time=%d",

Please format this the same way as all of the other trace messages in this file:

"chand=%p: throttling keepalive time to %d"


src/core/ext/filters/client_channel/client_channel.cc, line 1660 at r1 (raw file):

  target_uri_.reset(proxy_name != nullptr ? proxy_name
                                          : gpr_strdup(server_uri));
  keepalive_time_ = grpc_channel_args_find_integer(

We're jumping through a lot of hoops here to remove the keepalive time from channel args, and I don't see why that's necessary. I think all we need here is the following:

channel_args_ = new_args != nullptr
                    ? new_args
                    : grpc_channel_args_copy(args->channel_args);
keepalive_time_ = grpc_channel_args_find_integer(
    channel_args_, GRPC_ARG_KEEPALIVE_TIME_MS,
    {-1 /* default value, unset */, 1, INT_MAX});

src/core/ext/filters/client_channel/subchannel.h, line 229 at r1 (raw file):

  // Creates a subchannel given \a connector and \a args.
  static Subchannel* Create(OrphanablePtr<SubchannelConnector> connector,
                            const grpc_channel_args* args, int keepalive_time);

One down-side of passing the keepalive time as a parameter to the Subchannel ctor is that it changes the ClientChannelFactory API, which seems a bit viral (e.g., it will require a bunch of internal changes). It would be nice to avoid this.

I also observe that, independent of whether this parameter is here or not, we still need a new method on Subchannel to be able to set the keepalive time after a subchannel is created. The reason for this is that, as per gRFC A8: Client-side Keepalive (see the last paragraph of the "Basic Keepalive" section), when we receive a GOAWAY on one subchannel, we should double the keepalive time used "for new connections on that channel". This means that even if an existing subchannel becomes disconnected and then attempts to reconnect, that reconnection should use the new keepalive time. So we need a way to update subchannels after they are created.

If we are going to add a method for that anyway, let's just use that to set the initial keepalive time as well (i.e., call that method on the subchannel immediately after creating the subchannel). That way, we won't need this new parameter in the ctor, nor will we need to make changes to the ClientChannelFactory API.


src/core/ext/filters/client_channel/subchannel.h, line 311 at r1 (raw file):

    // Notifies all watchers in the list about a change to state.
    void NotifyLocked(Subchannel* subchannel, grpc_connectivity_state state,
                      const absl::Status& status);

General comment for the entire PR:

Instead of all of these APIs taking the Status as a const reference, it should be passed by value, and we should use std::move() as appropriate to avoid making unnecessary copies.


src/core/ext/filters/client_channel/subchannel.cc, line 349 at r1 (raw file):

            c->channelz_node()->SetChildSocket(nullptr);
          }
          absl::optional<absl::Cord> keepalive_throttling =

We can probably do the work of extracting the keepalive time from the Status object at the top of the function, before we acquire the mutex. That will keep the critical section small, because the only part we'll have to do here is to swap out the channel args.


src/core/ext/filters/client_channel/subchannel.cc, line 353 at r1 (raw file):

          if (keepalive_throttling.has_value()) {
            int new_keepalive_time =
                atoi(std::string(keepalive_throttling.value()).c_str());

Instead of atoi(), please use absl::SimpleAtoi().


src/core/ext/filters/client_channel/subchannel.cc, line 480 at r1 (raw file):

    if (state_ != initial_state) {
      new AsyncWatcherNotifierLocked(watcher, subchannel_, state_,
                                     absl::Status());

If the Status was set before the watcher was added, we should return the previously set Status instead of an empty one. This means that we'll need to record the Status when it is set.

See my similar comment in connectivity_state.h.


src/core/ext/filters/client_channel/subchannel.cc, line 868 at r1 (raw file):

  if (health_check_service_name == nullptr) {
    if (state_ != initial_state) {
      new AsyncWatcherNotifierLocked(watcher, this, state_, absl::Status());

If the Status was set before the watcher was added, we should return the previously set Status instead of an empty one. This means that we'll need to record the Status when it is set.

See my similar comment in connectivity_state.h.


src/core/ext/filters/client_channel/health/health_check_client.cc, line 94 at r1 (raw file):

            ConnectivityStateName(state), reason);
  }
  if (watcher_ != nullptr) watcher_->Notify(state, absl::Status());

If state is TRANSIENT_FAILURE, we should pass a Status of UNAVAILABLE with message set to reason.


src/core/ext/filters/client_channel/lb_policy/grpclb/grpclb.cc, line 327 at r1 (raw file):

        // fallback mode immediately.
        gpr_log(GPR_INFO,
                "[grpclb %p] balancer channel in state TRANSIENT_FAILURE; "

Please log the Status in this trace message.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 410 at r1 (raw file):

      // In TRANSIENT_FAILURE.  Notify all watchers of error.
      gpr_log(GPR_INFO,
              "[xds_client %p] xds channel in state TRANSIENT_FAILURE",

Please log the Status in this trace message.


src/core/ext/transport/chttp2/transport/chttp2_transport.cc, line 132 at r1 (raw file):

                                   grpc_connectivity_state state,
                                   const char* reason,
                                   const absl::Status& status);

Please put the status arg before the reason arg, so that the two "real" args are first and the one that's just for trace logging is at the end.


src/core/ext/transport/chttp2/transport/chttp2_transport.cc, line 569 at r1 (raw file):

    connectivity_state_set(
        t, GRPC_CHANNEL_SHUTDOWN, "close_transport",
        absl::Status(absl::StatusCode::kUnavailable, "Transport closed"));

Please add grpc_error_string(error) to the message in this Status object.


src/core/lib/transport/connectivity_state.h, line 119 at r1 (raw file):

  // Not thread safe; access must be serialized with an external lock.
  void SetState(grpc_connectivity_state state, const char* reason,
                const absl::Status& status);

Please put the status argument before the reason argument. That way, the two "real" parameters are first, and the one that's just for logging is at the end.


src/core/lib/transport/connectivity_state.h, line 121 at r1 (raw file):

                const absl::Status& status);

  // Deprecated. Prefer the version above.

I don't think there are that many callers of this. Let's just change all the callers and remove this deprecated variant.


src/core/lib/transport/connectivity_state.cc, line 132 at r1 (raw file):

              ConnectivityStateName(current_state));
    }
    watcher->Notify(current_state, absl::Status());

If SetState() was called to set the Status and then AddWatcher() is called to add a new watcher, we should deliver the Status set via SetState() to the new watcher.

This means that when SetState() is called, we should store the Status in the state tracker object for later use.


test/core/transport/connectivity_state_test.cc, line 54 at r1 (raw file):

  void Notify(grpc_connectivity_state new_state,
              const absl::Status& /* status */) override {

This test should be changed to actually verify the Status value.

Copy link
Copy Markdown
Member Author

@yashykt yashykt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 31 of 34 files reviewed, 20 unresolved discussions (waiting on @markdroth and @yashykt)


src/core/ext/filters/client_channel/subchannel.h, line 311 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

General comment for the entire PR:

Instead of all of these APIs taking the Status as a const reference, it should be passed by value, and we should use std::move() as appropriate to avoid making unnecessary copies.

As discussed in chat, I don't think we know for sure that std::move is going to be better here.


src/core/ext/filters/client_channel/subchannel.cc, line 480 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

If the Status was set before the watcher was added, we should return the previously set Status instead of an empty one. This means that we'll need to record the Status when it is set.

See my similar comment in connectivity_state.h.

Done in #23480


src/core/ext/filters/client_channel/subchannel.cc, line 868 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

If the Status was set before the watcher was added, we should return the previously set Status instead of an empty one. This means that we'll need to record the Status when it is set.

See my similar comment in connectivity_state.h.

Done in #23480


src/core/ext/filters/client_channel/health/health_check_client.cc, line 94 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

If state is TRANSIENT_FAILURE, we should pass a Status of UNAVAILABLE with message set to reason.

Done in #23480


src/core/ext/filters/client_channel/lb_policy/grpclb/grpclb.cc, line 327 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please log the Status in this trace message.

Done in #23480


src/core/ext/filters/client_channel/xds/xds_client.cc, line 410 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please log the Status in this trace message.

Done in #23480


src/core/ext/transport/chttp2/transport/chttp2_transport.cc, line 132 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please put the status arg before the reason arg, so that the two "real" args are first and the one that's just for trace logging is at the end.

Done in #23480


src/core/ext/transport/chttp2/transport/chttp2_transport.cc, line 569 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please add grpc_error_string(error) to the message in this Status object.

Done in #23480


src/core/lib/transport/connectivity_state.h, line 119 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please put the status argument before the reason argument. That way, the two "real" parameters are first, and the one that's just for logging is at the end.

Done in #23480


src/core/lib/transport/connectivity_state.h, line 121 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I don't think there are that many callers of this. Let's just change all the callers and remove this deprecated variant.

Done in #23480


src/core/lib/transport/connectivity_state.cc, line 132 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

If SetState() was called to set the Status and then AddWatcher() is called to add a new watcher, we should deliver the Status set via SetState() to the new watcher.

This means that when SetState() is called, we should store the Status in the state tracker object for later use.

Done in #23480


test/core/transport/connectivity_state_test.cc, line 54 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This test should be changed to actually verify the Status value.

Done in #23480

@yashykt
Copy link
Copy Markdown
Member Author

yashykt commented Jul 18, 2020

I overwrote the files here with the changes from #23480 instead of rebasing this pull request, so that previous commits and comments can be preserved.
The keepalive throttling changes have been recreated on top.

Copy link
Copy Markdown
Member Author

@yashykt yashykt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 45 files reviewed, 20 unresolved discussions (waiting on @markdroth and @yashykt)


src/core/ext/filters/client_channel/client_channel.cc, line 1101 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Let's define this key name as a constant somewhere. Maybe in transport.h?

Done.


src/core/ext/filters/client_channel/client_channel.cc, line 1103 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

In addition to setting this, we also need to pass the new value to all existing subchannels in use by this channel.

Done.


src/core/ext/filters/client_channel/client_channel.cc, line 1105 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please use absl::SimpleAtoi().

Done.


src/core/ext/filters/client_channel/client_channel.cc, line 1108 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please format this the same way as all of the other trace messages in this file:

"chand=%p: throttling keepalive time to %d"

Done.


src/core/ext/filters/client_channel/client_channel.cc, line 1660 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

We're jumping through a lot of hoops here to remove the keepalive time from channel args, and I don't see why that's necessary. I think all we need here is the following:

channel_args_ = new_args != nullptr
                    ? new_args
                    : grpc_channel_args_copy(args->channel_args);
keepalive_time_ = grpc_channel_args_find_integer(
    channel_args_, GRPC_ARG_KEEPALIVE_TIME_MS,
    {-1 /* default value, unset */, 1, INT_MAX});

Initially, i was removing the keepalive arg from here and adding it on subchannel creation time. I think I had the notion that these channel args had an affect on determining whether the subchannel will be shared or not. Changed it


src/core/ext/filters/client_channel/subchannel.h, line 229 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

One down-side of passing the keepalive time as a parameter to the Subchannel ctor is that it changes the ClientChannelFactory API, which seems a bit viral (e.g., it will require a bunch of internal changes). It would be nice to avoid this.

I also observe that, independent of whether this parameter is here or not, we still need a new method on Subchannel to be able to set the keepalive time after a subchannel is created. The reason for this is that, as per gRFC A8: Client-side Keepalive (see the last paragraph of the "Basic Keepalive" section), when we receive a GOAWAY on one subchannel, we should double the keepalive time used "for new connections on that channel". This means that even if an existing subchannel becomes disconnected and then attempts to reconnect, that reconnection should use the new keepalive time. So we need a way to update subchannels after they are created.

If we are going to add a method for that anyway, let's just use that to set the initial keepalive time as well (i.e., call that method on the subchannel immediately after creating the subchannel). That way, we won't need this new parameter in the ctor, nor will we need to make changes to the ClientChannelFactory API.

Done. (Removed keepalive_time from creation and instead added a ThrottleKeepaliveTime method to Subchannel and SubchannelWrapper


src/core/ext/filters/client_channel/subchannel.cc, line 353 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Instead of atoi(), please use absl::SimpleAtoi().

Done.

Copy link
Copy Markdown
Member Author

@yashykt yashykt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 45 files reviewed, 20 unresolved discussions (waiting on @markdroth)


src/core/ext/filters/client_channel/subchannel.cc, line 349 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

We can probably do the work of extracting the keepalive time from the Status object at the top of the function, before we acquire the mutex. That will keep the critical section small, because the only part we'll have to do here is to swap out the channel args.

Done.

@yashykt yashykt force-pushed the keepalivethrottling branch from c9be127 to d51601d Compare July 23, 2020 03:02
@yashykt
Copy link
Copy Markdown
Member Author

yashykt commented Jul 23, 2020

@markdroth The merge with master after the merge of the absl::Status PR had a lot of conflicts, so I ended up cherry-picking the keepalive commits on top of master.

Copy link
Copy Markdown
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The keepalive code looks very nice! I like how minimally invasive it turned out to be,

It looks like there's some stuff to clean up after the merge. But otherwise, the only substantive comment is that we need a better test for the multiple-subchannel case.

Please let me know if you have any questions. Thanks!

Reviewed 2 of 34 files at r1, 9 of 45 files at r2, 1 of 2 files at r3, 1 of 26 files at r4, 1 of 1 files at r5.
Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @yashykt)


src/core/ext/filters/client_channel/subchannel.h, line 231 at r5 (raw file):

                            const grpc_channel_args* args);

  // Throttle keepalive time

Please document the exact semantics of this method.


src/core/ext/filters/client_channel/subchannel.cc, line 29 at r5 (raw file):

#include <cstring>

#include "absl/strings/numbers.h"

This doesn't appear to be needed anymore.


src/core/ext/filters/client_channel/health/health_check_client.cc, line 94 at r3 (raw file):

            ConnectivityStateName(state), reason);
  }
  if (watcher_ != nullptr)

It looks like this is still showing some of the changes from #23480. Maybe something happened when you merged?

That having been said, I noticed after #23480 was merged that we should have braces around the body of this if statement, so let's go ahead and fix that in this PR.


src/core/ext/filters/client_channel/lb_policy/grpclb/grpclb.cc, line 307 at r3 (raw file):

    RefCountedPtr<SubchannelInterface> CreateSubchannel(
        const grpc_channel_args& args) override;
    void UpdateState(grpc_connectivity_state state, const absl::Status& status,

The changes in this file all look like they come from #23480.


src/core/ext/filters/client_channel/xds/xds_client.cc, line 402 at r3 (raw file):

 private:
  void OnConnectivityStateChange(grpc_connectivity_state new_state,
                                 const absl::Status& status) override {

The changes in this file all look like they come from #23480.


src/core/lib/transport/connectivity_state.h, line 24 at r3 (raw file):

#include <grpc/support/port_platform.h>

#include "absl/status/status.h"

The changes in this file all look like they come from #23480.


src/core/lib/transport/connectivity_state.cc, line 61 at r3 (raw file):

 public:
  Notifier(RefCountedPtr<AsyncConnectivityStateWatcherInterface> watcher,
           grpc_connectivity_state state, const absl::Status& status,

The changes in this file all look like they come from #23480.


src/core/lib/transport/transport.h, line 465 at r5 (raw file):

// This is the key to be used for loading/storing keepalive_throttling in the
// absl::Status object.
constexpr const char* keepalive_throttling_key =

kKeepaliveThrottlingKey


test/core/end2end/cq_verifier.h, line 42 at r5 (raw file):

/* ensure all expected events (and only those events) are present on the
   bound completion queue withing \a timeout_sec */
void cq_verify_custom_timeout(cq_verifier* v, int timeout_sec);

Instead of adding a second function here, how about simply adding an optional parameter to the existing cq_verify() function?


test/core/transport/connectivity_state_test.cc, line 45 at r3 (raw file):

class Watcher : public ConnectivityStateWatcherInterface {
 public:
  Watcher(int* count, grpc_connectivity_state* output, absl::Status* status,

The changes in this file all look like they come from #23480.


test/core/transport/chttp2/too_many_pings_test.cc, line 180 at r5 (raw file):

grpc_status_code PerformWaitingCall(grpc_channel* channel, grpc_server* server,
                                    grpc_completion_queue* cq,
                                    int expected_keepalive_time) {

This parameter doesn't appear to be used.


test/core/transport/chttp2/too_many_pings_test.cc, line 245 at r5 (raw file):

  return status;
}
TEST(TooManyPings, KeepaliveThrottling) {

Please add a blank line before this.


test/core/transport/chttp2/too_many_pings_test.cc, line 249 at r5 (raw file):

  // create the server with a ping interval of 10 seconds and a single ping
  // strike.
  grpc_arg server_args[2];

Suggest writing this as:

grpc_arg server_args[] = {
  grpc_channel_arg_integer_create(
      const_cast<char*>(GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS),
      10 * 1000),
  grpc_channel_arg_integer_create(
      const_cast<char*>(GRPC_ARG_HTTP2_MAX_PING_STRIKES), 1),
};

test/core/transport/chttp2/too_many_pings_test.cc, line 265 at r5 (raw file):

  grpc_server_start(server);
  // create the channel with a keepalive ping interval of 1 second.
  grpc_arg client_args[4];

Similar change here.


test/core/transport/chttp2/too_many_pings_test.cc, line 277 at r5 (raw file):

  grpc_channel_args client_channel_args = {GPR_ARRAY_SIZE(client_args),
                                           client_args};
  grpc_channel* channel = grpc_insecure_channel_create(

This test always uses the same channel, and the channel always has exactly one subchannel. Can we also have a test that verifies that we're setting the keepalive time properly when there are multiple subchannels, and maybe multiple channels sharing those subchannels?

Copy link
Copy Markdown
Member Author

@yashykt yashykt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @markdroth and @yashykt)


src/core/ext/filters/client_channel/health/health_check_client.cc, line 94 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

It looks like this is still showing some of the changes from #23480. Maybe something happened when you merged?

That having been said, I noticed after #23480 was merged that we should have braces around the body of this if statement, so let's go ahead and fix that in this PR.

This PR doesn't make any change in this file anymore. I believe it is an artifact from the previous commits which were removed. Looks like reviewable.io wasn't able to keep up with the merges and the rebasing


src/core/ext/filters/client_channel/lb_policy/grpclb/grpclb.cc, line 307 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

The changes in this file all look like they come from #23480.

This PR doesn't make any change in this file anymore. I believe it is an artifact from the previous commits which were removed. Looks like reviewable.io wasn't able to keep up with the merges and the rebasing


src/core/ext/filters/client_channel/xds/xds_client.cc, line 402 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

The changes in this file all look like they come from #23480.

This PR doesn't ma


src/core/lib/transport/connectivity_state.h, line 24 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

The changes in this file all look like they come from #23480.

This PR doesn't make any change in this file anymore. I believe it is an artifact from the previous commits which were removed. Looks like reviewable.io wasn't able to keep up with the merges and the rebasing


src/core/lib/transport/connectivity_state.cc, line 61 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

The changes in this file all look like they come from #23480.

This PR doesn't make any change in this file anymore. I believe it is an artifact from the previous commits which were removed. Looks like reviewable.io wasn't able to keep up with the merges and the rebasing


test/core/transport/connectivity_state_test.cc, line 45 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

The changes in this file all look like they come from #23480.

This PR doesn't make any change in this file anymore. I believe it is an artifact from the previous commits which were removed. Looks like reviewable.io wasn't able to keep up with the merges and the rebasing

Copy link
Copy Markdown
Member Author

@yashykt yashykt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 15 unresolved discussions (waiting on @markdroth and @yashykt)


src/core/ext/filters/client_channel/xds/xds_client.cc, line 402 at r3 (raw file):

Previously, yashykt (Yash Tibrewal) wrote…

This PR doesn't ma

This PR doesn't make any change in this file anymore. I believe it is an artifact from the previous commits which were removed. Looks like reviewable.io wasn't able to keep up with the merges and the rebasing

Copy link
Copy Markdown
Member Author

@yashykt yashykt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markdroth PTAL I've added a few tests. Getting the timing is kind of tricky with round_robin but is more manageable with pick_first.

Reviewable status: 0 of 8 files reviewed, 15 unresolved discussions (waiting on @markdroth)


src/core/ext/filters/client_channel/subchannel.h, line 231 at r5 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please document the exact semantics of this method.

Done.


src/core/ext/filters/client_channel/subchannel.cc, line 29 at r5 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This doesn't appear to be needed anymore.

Done.


src/core/lib/transport/transport.h, line 465 at r5 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

kKeepaliveThrottlingKey

Done.


test/core/end2end/cq_verifier.h, line 42 at r5 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Instead of adding a second function here, how about simply adding an optional parameter to the existing cq_verify() function?

Done.


test/core/transport/chttp2/too_many_pings_test.cc, line 180 at r5 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This parameter doesn't appear to be used.

Removed.


test/core/transport/chttp2/too_many_pings_test.cc, line 245 at r5 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please add a blank line before this.

Done.


test/core/transport/chttp2/too_many_pings_test.cc, line 249 at r5 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest writing this as:

grpc_arg server_args[] = {
  grpc_channel_arg_integer_create(
      const_cast<char*>(GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS),
      10 * 1000),
  grpc_channel_arg_integer_create(
      const_cast<char*>(GRPC_ARG_HTTP2_MAX_PING_STRIKES), 1),
};

Done.


test/core/transport/chttp2/too_many_pings_test.cc, line 265 at r5 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Similar change here.

Done.


test/core/transport/chttp2/too_many_pings_test.cc, line 277 at r5 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This test always uses the same channel, and the channel always has exactly one subchannel. Can we also have a test that verifies that we're setting the keepalive time properly when there are multiple subchannels, and maybe multiple channels sharing those subchannels?

Done. Getting the timing right is slightly tricky especially with round_robin but I got it working with pick_first since it is easier to control when the subchannel gets connected

Copy link
Copy Markdown
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really good! Just one substantive comment about the test.

Please let me know if you have any questions.

Reviewed 1 of 45 files at r2, 8 of 8 files at r6.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @yashykt)


src/core/ext/filters/client_channel/subchannel.h, line 233 at r6 (raw file):

  // Throttles keepalive time to \a new_keepalive_time iff \a new_keepalive_time
  // is larger than the subchannel's current keepalive time. The updated value
  // would have an affect when the subchannel creates a new ConnectedSubchannel.

s/would/will/


src/core/ext/filters/client_channel/health/health_check_client.cc, line 94 at r3 (raw file):

Previously, yashykt (Yash Tibrewal) wrote…

This PR doesn't make any change in this file anymore. I believe it is an artifact from the previous commits which were removed. Looks like reviewable.io wasn't able to keep up with the merges and the rebasing

Please add the braces here as I suggested.


test/core/transport/chttp2/too_many_pings_test.cc, line 265 at r5 (raw file):

Previously, yashykt (Yash Tibrewal) wrote…

Done.

Need an = here.


test/core/transport/chttp2/too_many_pings_test.cc, line 277 at r5 (raw file):

Previously, yashykt (Yash Tibrewal) wrote…

Done. Getting the timing right is slightly tricky especially with round_robin but I got it working with pick_first since it is easier to control when the subchannel gets connected

I think round_robin would actually be easier, because you wouldn't need to mess with the fake resolver. Instead, you can force the subchannels to reconnect by stopping and restarting the servers. You can also make one RPC for each backend, and you can check the peer address of each call to ensure that each one is covering a different backend.

Copy link
Copy Markdown
Member Author

@yashykt yashykt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 6 of 9 files reviewed, 4 unresolved discussions (waiting on @markdroth)


src/core/ext/filters/client_channel/subchannel.h, line 233 at r6 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

s/would/will/

Done.


src/core/ext/filters/client_channel/health/health_check_client.cc, line 94 at r3 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please add the braces here as I suggested.

Done.


test/core/transport/chttp2/too_many_pings_test.cc, line 265 at r5 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Need an = here.

Done


test/core/transport/chttp2/too_many_pings_test.cc, line 277 at r5 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think round_robin would actually be easier, because you wouldn't need to mess with the fake resolver. Instead, you can force the subchannels to reconnect by stopping and restarting the servers. You can also make one RPC for each backend, and you can check the peer address of each call to ensure that each one is covering a different backend.

Done. (Didn't need to check the address though)

Copy link
Copy Markdown
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! Feel free to merge after addressing the remaining minor comments.

Thanks!

Reviewed 3 of 3 files at r7, 1 of 1 files at r8.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @yashykt)


test/core/transport/chttp2/too_many_pings_test.cc, line 277 at r5 (raw file):

Previously, yashykt (Yash Tibrewal) wrote…

Done. (Didn't need to check the address though)

The new test looks good! Thanks for doing that.


test/core/transport/chttp2/too_many_pings_test.cc, line 385 at r8 (raw file):

// Tests that when new subchannels are created due to a change in resolved
// addresses, the new subchannels use the updated keepalive time.
TEST_F(KeepaliveThrottlingTest, KeepaliveThrottlingMultipleSubchannels1) {

Suggest calling this NewSubchannelsUseUpdatedKeepaliveTime.


test/core/transport/chttp2/too_many_pings_test.cc, line 463 at r8 (raw file):

// "too_many_pings" on one of them, all subchannels start any new transports
// with an updated keepalive time.
TEST_F(KeepaliveThrottlingTest, KeepaliveThrottlingMultipleSubchannels2) {

Suggest calling this ExistingSubchannelsUseNewKeepaliveTimeWhenReconnecting.

Copy link
Copy Markdown
Member Author

@yashykt yashykt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 8 of 9 files reviewed, 2 unresolved discussions (waiting on @markdroth)


test/core/transport/chttp2/too_many_pings_test.cc, line 385 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest calling this NewSubchannelsUseUpdatedKeepaliveTime.

Done.


test/core/transport/chttp2/too_many_pings_test.cc, line 463 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest calling this ExistingSubchannelsUseNewKeepaliveTimeWhenReconnecting.

Done.

@yashykt
Copy link
Copy Markdown
Member Author

yashykt commented Aug 14, 2020

Thanks for reviewing!

@yashykt yashykt merged commit d8d6114 into grpc:master Aug 14, 2020
@yashykt yashykt deleted the keepalivethrottling branch May 18, 2023 20:32
copybara-service bot pushed a commit that referenced this pull request Nov 7, 2025
)

Currently, the client channel maintains two different data structures for tracking subchannels:
- `subchannel_refcount_map_`, which tracks the number of subchannel wrappers per subchannel.  This is used to determine when to create and remove channelz node linkage.
- `subchannel_wrappers_`, which tracks the set of all subchannel wrappers across all subchannels.  This was originally introduced back in #20039 as part of propagating the health check service name to subchannels, but we switched to using a different approach for the health check service name in #26441.  However, in between those two, we started using this data structure for handling keepalive time in #23313 -- which is actually somewhat inefficient, since we may wind up setting the keepalive time on the underlying subchannel more than once in the case where there is more than one subchannel wrapper per subchannel (which happens frequently during updates).

This PR combines these two data structures into one: a map from subchannel to a set of subchannel wrappers for that subchannel.  This is used both for channelz node updates and keepalive propagation -- and it's more efficient for the latter, because we can now update each subchannel exactly once.

This also paves the way for a subsequent change that will be needed as part of the MAX_CONCURRENT_STREAMS design.

Closes #40880

COPYBARA_INTEGRATE_REVIEW=#40880 from markdroth:client_channel_combine_subchannel_maps f289c20
PiperOrigin-RevId: 829602334
yuanweiz pushed a commit to yuanweiz/grpc that referenced this pull request Nov 12, 2025
…c#40880)

Currently, the client channel maintains two different data structures for tracking subchannels:
- `subchannel_refcount_map_`, which tracks the number of subchannel wrappers per subchannel.  This is used to determine when to create and remove channelz node linkage.
- `subchannel_wrappers_`, which tracks the set of all subchannel wrappers across all subchannels.  This was originally introduced back in grpc#20039 as part of propagating the health check service name to subchannels, but we switched to using a different approach for the health check service name in grpc#26441.  However, in between those two, we started using this data structure for handling keepalive time in grpc#23313 -- which is actually somewhat inefficient, since we may wind up setting the keepalive time on the underlying subchannel more than once in the case where there is more than one subchannel wrapper per subchannel (which happens frequently during updates).

This PR combines these two data structures into one: a map from subchannel to a set of subchannel wrappers for that subchannel.  This is used both for channelz node updates and keepalive propagation -- and it's more efficient for the latter, because we can now update each subchannel exactly once.

This also paves the way for a subsequent change that will be needed as part of the MAX_CONCURRENT_STREAMS design.

Closes grpc#40880

COPYBARA_INTEGRATE_REVIEW=grpc#40880 from markdroth:client_channel_combine_subchannel_maps f289c20
PiperOrigin-RevId: 829602334
sreenithi pushed a commit to sreenithi/grpc that referenced this pull request Nov 26, 2025
…c#40880)

Currently, the client channel maintains two different data structures for tracking subchannels:
- `subchannel_refcount_map_`, which tracks the number of subchannel wrappers per subchannel.  This is used to determine when to create and remove channelz node linkage.
- `subchannel_wrappers_`, which tracks the set of all subchannel wrappers across all subchannels.  This was originally introduced back in grpc#20039 as part of propagating the health check service name to subchannels, but we switched to using a different approach for the health check service name in grpc#26441.  However, in between those two, we started using this data structure for handling keepalive time in grpc#23313 -- which is actually somewhat inefficient, since we may wind up setting the keepalive time on the underlying subchannel more than once in the case where there is more than one subchannel wrapper per subchannel (which happens frequently during updates).

This PR combines these two data structures into one: a map from subchannel to a set of subchannel wrappers for that subchannel.  This is used both for channelz node updates and keepalive propagation -- and it's more efficient for the latter, because we can now update each subchannel exactly once.

This also paves the way for a subsequent change that will be needed as part of the MAX_CONCURRENT_STREAMS design.

Closes grpc#40880

COPYBARA_INTEGRATE_REVIEW=grpc#40880 from markdroth:client_channel_combine_subchannel_maps f289c20
PiperOrigin-RevId: 829602334
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lang/core release notes: yes Indicates if PR needs to be in release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

gRPC core keepalive doesn't throttle when receiving GOAWAY with "too many pings"

2 participants