Treating a bind failure as a connection failure rather than crashing Envoy by alyssawilk · Pull Request #1564 · envoyproxy/envoy

alyssawilk · 2017-08-29T16:52:30Z

htuch

Implementation looks good, some questions on test.

htuch · 2017-08-29T22:06:30Z

test/common/network/connection_impl_test.cc

 }

+TEST_P(ConnectionImplTest, BindFailureTest) {
+  std::string address_string = TestUtility::getIpv4Loopback();


Tiny nit: prefer this initialization in the Network::Address::IpVersion::v6 clause below.

htuch · 2017-08-29T22:06:50Z

test/common/network/connection_impl_test.cc

+        new Network::Address::Ipv6Instance(address_string, 0)};
+  }
+
+  if (dispatcher_.get() == nullptr) {


Why do we need to condition this?

htuch · 2017-08-29T22:07:47Z

test/integration/integration_test.cc

       [&]() -> void { fake_upstream_connection_->waitForDisconnect(); }});
 }

+TEST_P(BindIntegrationTest, DISABLED_TestFailedBind) {


htuch · 2017-08-29T22:07:52Z

test/integration/integration_test.cc

 class BindIntegrationTest : public IntegrationTest {
 public:
-  void SetUp() override {
+  // Delay base class SetUp until initialize().


Why is this needed?

To allow the bind-fail test to override the address before set-up.

I'm fairly convinced that in the long run we should ditch SetUp for Initialize() (to allow common config manipulation), but I figured on doing that in one of the PRs for simplifying config set-up.

Sure. Maybe just explain in the comment why the delay is needed.
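For readers unfamiliar with the pattern discussed in this thread, here is a minimal sketch of delaying base-class set-up until an explicit initialize() call. It does not use gtest or the real IntegrationTest; `BaseFixture`, `BindFixture`, `address_`, and `bound_address_` are hypothetical stand-ins chosen for illustration.

```cpp
#include <string>

// Stand-in for a fixture base class; not the real IntegrationTest.
class BaseFixture {
public:
  virtual ~BaseFixture() = default;
  // Normally invoked by the test framework before each test body runs.
  virtual void SetUp() { bound_address_ = address_; }

  std::string address_ = "127.0.0.1";
  std::string bound_address_;
};

// Delays base class SetUp until initialize(), so that a test can override
// address_ before any set-up work consumes it.
class BindFixture : public BaseFixture {
public:
  void SetUp() override {} // Intentionally empty; see initialize().
  void initialize() { BaseFixture::SetUp(); }
};
```

A bind-failure test would change `address_` first and only then call `initialize()`, which is the ordering the comment in the diff is meant to explain.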

mattklein123 · 2017-08-30T04:30:21Z

source/common/network/connection_impl.cc

+  if (bind_to_address != nullptr) {
+    int rc = bind_to_address->bind(fd);
+    if (rc < 0) {
+      ENVOY_LOG_MISC(warn, "Bind failure. Failed to bind to {}: {}", bind_to_address->asString(),


If something goes wrong here, this is going to mega spew the logs in a bad way. I'm guessing internally you have rate limiting on individual log messages. We don't have that in public code. I feel like we should have some kind of stat here for a bind error on connect? Perhaps this could be done like the buffer stats (which would be installed by the time you do the deferred close). At minimum we should probably have some kind of TODO here to think a bit more about how we would want to handle this operationally if we are not going to crash.

Maybe a counter and a debug level log would make sense.

Man, if only we had some utility for logging important error messages infrequently. Like a macro which logged with exponential back-off with a super cheap (if somewhat lossy) atomic counter to minimize CPU usage and cross thread contention. And then eventually maybe even tie it into the stats system with the macro taking a custom label or line number for easy monitoring and trace-back of unexpected but not fatal occurrences... That seems like it'd be super useful for debugging production issues for a large scale proxy.
I'll add porting this feature to my non-critical TODO list :-)
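As a rough illustration of the kind of utility described above, the sketch below gates a log statement so that it fires only on the 1st, 2nd, 4th, 8th, ... occurrence. This is not an Envoy API: `BackoffLogGate` and `logBindFailure` are hypothetical names, and the sketch uses an exact `fetch_add` counter rather than the lossy counter the comment mentions.

```cpp
#include <atomic>
#include <cstdint>
#include <iostream>
#include <string>

// Hypothetical helper: returns true on occurrence 1, 2, 4, 8, ... so a hot
// error path emits O(log n) log lines instead of n.
class BackoffLogGate {
public:
  bool shouldLog() {
    // Relaxed ordering is sufficient; only the count itself matters.
    const uint64_t n = count_.fetch_add(1, std::memory_order_relaxed) + 1;
    return (n & (n - 1)) == 0; // True exactly when n is a power of two.
  }
  uint64_t count() const { return count_.load(std::memory_order_relaxed); }

private:
  std::atomic<uint64_t> count_{0};
};

// Hypothetical call site: one gate per log statement.
inline void logBindFailure(BackoffLogGate& gate, const std::string& address,
                           std::ostream& out) {
  if (gate.shouldLog()) {
    out << "Bind failure. Failed to bind to " << address << " (seen "
        << gate.count() << " times)\n";
  }
}
```

The gate's count could also be exported as a stat, which is roughly the tie-in to the stats system suggested in the comment.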

FWIW, this is going to be a very infrequently used feature (you are probably the only one who will use it for now), so if you have this covered in some other way and you want to put in a TODO that's fine.

alyssawilk · 2017-08-30T15:50:54Z

Oh, I'm downgrading it to debug and adding a stat per Harvey's suggestion. I think it's a reasonable interim step, just cleaning up the tests now :-)
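The interim approach agreed on here (a counter plus a debug-level log, with the connection marked closed instead of crashing) might look roughly like the sketch below. All types are stand-ins rather than Envoy's real classes, and `bind_fn` is a hypothetical substitute for the `bind_to_address->bind(fd)` call shown in the diff.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>

// Stand-ins for illustration only; these are not Envoy's real classes.
struct Counter {
  uint64_t value = 0;
  void inc() { ++value; }
};
enum class State { Open, Closed };

class SketchConnection {
public:
  // bind_fn stands in for bind_to_address->bind(fd): it returns < 0 on failure.
  SketchConnection(const std::function<int()>& bind_fn, Counter& bind_errors) {
    if (bind_fn && bind_fn() < 0) {
      bind_errors.inc();                    // Cheap and safe to hit repeatedly.
      std::clog << "debug: bind failure\n"; // Debug level, not warn.
      state_ = State::Closed;               // Treated as a connection failure.
    }
  }
  State state() const { return state_; }

private:
  State state_ = State::Open;
};
```

Callers then see an ordinary closed connection, and operators watch the counter rather than the logs.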


dnoe · 2017-08-30T16:33:45Z

include/envoy/upstream/upstream.h

  COUNTER(upstream_flow_control_resumed_reading_total)                                             \
  COUNTER(upstream_flow_control_backed_up_total)                                                   \
  COUNTER(upstream_flow_control_drained_total)                                                     \
+  COUNTER(bind_errors)                                                                             \


Should we call the stat upstream_bind_errors? It won't count bind errors on the listening socket.

I agree upstream_bind_errors is confusing then. I think bind_errors is fine.

alyssawilk · 2017-08-30T17:10:46Z

Wasn't sure. It's a bind error on the socket used to communicate with upstream, but upstream protocol errors are protocol errors *from* upstream, so I thought there would be a case for confusion.


mattklein123 · 2017-08-30T18:08:01Z

source/common/http/http1/conn_pool.cc

  parent_.conn_connect_ms_ =
      parent_.host_->cluster().stats().upstream_cx_connect_ms_.allocateSpan();
  Upstream::Host::CreateConnectionData data = parent_.host_->createConnection(parent_.dispatcher_);
+  if (data.connection_->state() != Network::Connection::State::Open) {


IMO this approach is a bit fragile in that we don't really know that this is a bind error. Perhaps something else might change in the future? It also has the downside of not covering other cases such as tcp_proxy, redis, etc.

I see two options here:

1. Rename/extend setBufferStats (https://github.com/lyft/envoy/blob/master/include/envoy/network/connection.h#L147) to include other stats as well. A caller could then pass in a bind failure stat if it wants.

2. Make createConnection() take a struct with optional stats in it, which will force people to plumb things through as needed (though this would potentially require changes to more call sites).

Thoughts?

We could avoid the first concern by just adding an ASSERT in the connection that if the connection isn't connected there is a bind failure, and adding a comment pointer to anyone changing the code. Agree, the second, that this misses tcp_proxy, is more of a problem though.

Setting buffer stats on creation looks super duper messy. For example, in the case of the tcp proxy filter, the connection is created before the read callbacks, which is where buffer stats are set.

I'd be inclined to go with renaming setBufferStats to be more general, and then somewhat unintuitively incrementing the bind error where we raise the error in onFileEvent, with a comment that we defer until the caller has had a chance to set [buffer] stats. If that doesn't sound too hacky to you, I'll take a whack at that tomorrow.

I think I'm fine with that unless you can think of a better option. I would rename to something like setConnectionStats or setStats or something.
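A minimal sketch of the deferred approach settled on above: the connection remembers the bind failure, the caller installs stats afterwards, and the counter is only incremented when the failure is raised from the event callback. The names `SketchConnection`, `ConnectionStats`, and the `bool` return value are assumptions made for illustration and do not reflect the merged code.

```cpp
#include <cstdint>

struct Counter {
  uint64_t value = 0;
  void inc() { ++value; }
};

// Hypothetical generalized stats struct (the renamed "buffer stats").
struct ConnectionStats {
  // Pointer rather than reference: callers with no bind stat leave it null.
  Counter* bind_errors_ = nullptr;
};

class SketchConnection {
public:
  explicit SketchConnection(bool bind_failed) : bind_failed_(bind_failed) {}

  // Typically called after construction, e.g. once read callbacks exist.
  void setConnectionStats(const ConnectionStats& stats) { stats_ = stats; }

  // Stand-in for the event callback. Returns false if the connection is
  // closed because of the earlier bind failure.
  bool onFileEvent() {
    if (!bind_failed_) {
      return true;
    }
    // The increment is deferred to here so that the caller has had a chance
    // to install stats; it happens at most once per connection.
    if (!reported_) {
      reported_ = true;
      if (stats_.bind_errors_ != nullptr) {
        stats_.bind_errors_->inc();
      }
    }
    return false;
  }

private:
  const bool bind_failed_;
  bool reported_ = false;
  ConnectionStats stats_;
};
```

The trade-off is the one noted in the thread: the increment site is less obvious, but every caller that sets stats gets the bind error counted without extra plumbing at connection creation.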

htuch

LGTM modulo minor comment.

htuch · 2017-08-31T22:12:24Z

include/envoy/network/connection.h

    Stats::Gauge& read_current_;
    Stats::Counter& write_total_;
    Stats::Gauge& write_current_;
+    Stats::Counter* bind_errors_;


Can you add a comment here on why this is a pointer rather than ref (upstream/downstream..)?

mattklein123

lgtm just needs master merge

alyssawilk · 2017-09-01T00:11:40Z

Haha, *just* beat me to it :-P



treating a bind failure as a connection error rather than crashing Envoy

59e0414

htuch reviewed Aug 29, 2017

mattklein123 reviewed Aug 30, 2017

actually integration testing, now with less logspam and more stats

78b7614

dnoe reviewed Aug 30, 2017

mattklein123 reviewed Aug 30, 2017

reworking bind_fail stats

2f53130

htuch reviewed Aug 31, 2017

updating comment

b7a89be

mattklein123 reviewed Sep 1, 2017

Merge branch 'refs/heads/master' into bind_soft_fail

e12813f

mattklein123 approved these changes Sep 1, 2017

alyssawilk merged commit 79f4c6a into envoyproxy:master Sep 5, 2017

alyssawilk deleted the bind_soft_fail branch September 7, 2017 19:38
