Treating a bind failure as a connection failure rather than crashing Envoy#1564
Treating a bind failure as a connection failure rather than crashing Envoy#1564alyssawilk merged 5 commits intoenvoyproxy:masterfrom
Conversation
htuch
left a comment
There was a problem hiding this comment.
Implementation looks good, some questions on test.
| } | ||
|
|
||
| TEST_P(ConnectionImplTest, BindFailureTest) { | ||
| std::string address_string = TestUtility::getIpv4Loopback(); |
There was a problem hiding this comment.
Tiny nit: prefer this initialization in the Network::Address::IpVersion::v6 clause below.
| new Network::Address::Ipv6Instance(address_string, 0)}; | ||
| } | ||
|
|
||
| if (dispatcher_.get() == nullptr) { |
There was a problem hiding this comment.
Why do we need to condition this?
test/integration/integration_test.cc
Outdated
| [&]() -> void { fake_upstream_connection_->waitForDisconnect(); }}); | ||
| } | ||
|
|
||
| TEST_P(BindIntegrationTest, DISABLED_TestFailedBind) { |
test/integration/integration_test.cc
Outdated
| class BindIntegrationTest : public IntegrationTest { | ||
| public: | ||
| void SetUp() override { | ||
| // Delay base class SetUp until initialize(). |
There was a problem hiding this comment.
To allow the bind-fail test to override the address before set-up.
I'm fairly convinced that in the long run we should ditch SetUp for Initialize() (to allow common config manipulation) but I figured on doing that in one of the PRs for simplifying config set-up
There was a problem hiding this comment.
Sure. Maybe just add an explanation of why the delay then in the comment.
| if (bind_to_address != nullptr) { | ||
| int rc = bind_to_address->bind(fd); | ||
| if (rc < 0) { | ||
| ENVOY_LOG_MISC(warn, "Bind failure. Failed to bind to {}: {}", bind_to_address->asString(), |
There was a problem hiding this comment.
If something goes wrong here, this is going to mega spew the logs in a bad way. I'm guessing internally you have rate limiting on individual log messages. We don't have that in public code. I feel like we should have some kind of stat here for a bind error on connect? Perhaps this could be done like the buffer stats (which would be installed by the time you do the deferred close). At minimum we should probably have some kind of TODO here to think a bit more about how we would want to handle this operationally if we are not going to crash.
There was a problem hiding this comment.
Maybe a counter and a debug level log would make sense.
There was a problem hiding this comment.
Man, if only we had some utility for logging important error messages infrequently. Like a macro which logged with exponential back-off with a super cheap (if somewhat lossy) atomic counter to minimize CPU usage and cross thread contention. And then eventually maybe even tie it into the stats system with the macro taking a custom label or line number for easy monitoring and trace-back of unexpected but not fatal occurrences... That seems like it'd be super useful for debugging production issues for a large scale proxy.
I'll add porting this feature to my non-critical TODO list :-)
There was a problem hiding this comment.
FWIW, this is going to be a very infrequently used feature (you are probably the only one who will use it for now), so if you have this covered in some other way and you want to put in a TODO that's fine.
|
Oh, I'm downgrading it to debug and adding a stat per harvey's suggestion.
I think it's a reasonable interim step, just cleaning up the tests now :-)
…On Wed, Aug 30, 2017 at 11:49 AM, Matt Klein ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In source/common/network/connection_impl.cc
<#1564 (comment)>:
> @@ -94,6 +80,18 @@ ConnectionImpl::ConnectionImpl(Event::DispatcherImpl& dispatcher, int fd,
file_event_ = dispatcher_.createFileEvent(
fd_, [this](uint32_t events) -> void { onFileEvent(events); }, Event::FileTriggerType::Edge,
Event::FileReadyType::Read | Event::FileReadyType::Write);
+
+ if (bind_to_address != nullptr) {
+ int rc = bind_to_address->bind(fd);
+ if (rc < 0) {
+ ENVOY_LOG_MISC(warn, "Bind failure. Failed to bind to {}: {}", bind_to_address->asString(),
FWIW, this is going to be a very infrequently used feature (you are
probably the only one who will use it for now), so if you have this covered
in some other way and you want to put in a TODO that's fine.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#1564 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ARYFvfsLpeVQ2jKwvuWYuLQni0H99HYZks5sdYSMgaJpZM4PGP1B>
.
|
| COUNTER(upstream_flow_control_resumed_reading_total) \ | ||
| COUNTER(upstream_flow_control_backed_up_total) \ | ||
| COUNTER(upstream_flow_control_drained_total) \ | ||
| COUNTER(bind_errors) \ |
There was a problem hiding this comment.
Should we call the stat upstream_bind_errors ? It won't count bind errors on the listening socket.
There was a problem hiding this comment.
I agree upstream_bind_errors is confusing then. I think bind_errors is fine.
|
Wasn't sure - it's a bind error on the socket to communicate with upstream
but upstream protocol errors is protocol errors *from* upstream and so I
thought there would be a case for confusion.
…On Wed, Aug 30, 2017 at 1:06 PM, Dan Noé ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In include/envoy/upstream/upstream.h
<#1564 (comment)>:
> @@ -217,6 +217,7 @@ class HostSet {
COUNTER(upstream_flow_control_resumed_reading_total) \
COUNTER(upstream_flow_control_backed_up_total) \
COUNTER(upstream_flow_control_drained_total) \
+ COUNTER(bind_errors) \
Should we call the stat upstream_bind_errors ? It won't count bind errors
on the listening socket.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#1564 (review)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ARYFvUwuNrAeFwRGJ4hKku8lu0iexeSxks5sdZaKgaJpZM4PGP1B>
.
|
| parent_.conn_connect_ms_ = | ||
| parent_.host_->cluster().stats().upstream_cx_connect_ms_.allocateSpan(); | ||
| Upstream::Host::CreateConnectionData data = parent_.host_->createConnection(parent_.dispatcher_); | ||
| if (data.connection_->state() != Network::Connection::State::Open) { |
There was a problem hiding this comment.
IMO this approach is a bit fragile in that we don't really know that this is a bind error. Perhaps something else might change in the future? It also has the downside of not covering other cases such as tcp_proxy, redis, etc.
I see two options here:
- Rename/extend
setBufferStats(https://github.com/lyft/envoy/blob/master/include/envoy/network/connection.h#L147) to include other stats also. Then a call could pass in a bind failure stat if they want. - Make
createConnection()take a struct with optional stats in it, which will force people to plumb things through as needed (though this would potentially require changes to more callsites).
Thoughts?
There was a problem hiding this comment.
We could avoid the first concern by just adding an ASSERT in the connection that if the connection isn't connected there is a bind failure, and adding a comment pointer to anyone changing the code. Agree, the second, that this misses tcp_proxy, is more of a problem though.
setting buffer stats on creation looks super duper messy since for example in the case of the tcp proxy filter, the connection is created prior to the read callbacks which is where buffer stats are set
I'd be inclined to go with renaming setBufferStats to be more general, and then somewhat unintuitively incrementing the bind error where we raise the error onFileEvent with a comment that we defer until the caller has a chance to set [buffer] stats. If that doesn't sound too hacky to you I'll take a whack at that tomorrow
There was a problem hiding this comment.
I think I'm fine with that unless you can think of a better option. I would rename to something like setConnectionStats or setStats or something.
| Stats::Gauge& read_current_; | ||
| Stats::Counter& write_total_; | ||
| Stats::Gauge& write_current_; | ||
| Stats::Counter* bind_errors_; |
There was a problem hiding this comment.
Can you add a comment here on why this is a pointer rather than ref (upstream/downstream..)?
mattklein123
left a comment
There was a problem hiding this comment.
lgtm just needs master merge
|
Haha, *just* beat me to it :-P
…On Thu, Aug 31, 2017 at 8:04 PM, Matt Klein ***@***.***> wrote:
***@***.**** commented on this pull request.
lgtm just needs master merge
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#1564 (review)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ARYFva7VnmmO_-QFkaWhY4wa97VWZV4Mks5sd0nzgaJpZM4PGP1B>
.
|
Description: Add explicit calls to be used when releasing core types, as a step towards formalizing management of bridge memory. Risk Level: Low Testing: CI Signed-off-by: Mike Schore <mike.schore@gmail.com> Signed-off-by: JP Simard <jp@jpsim.com>
Description: Add explicit calls to be used when releasing core types, as a step towards formalizing management of bridge memory. Risk Level: Low Testing: CI Signed-off-by: Mike Schore <mike.schore@gmail.com> Signed-off-by: JP Simard <jp@jpsim.com>
**Description** This refactors tests/extproc integration tests: 1. Renamed as tests/data-plane to be usable for Dynamic Module too #1564 2. Extract tests/extproc/mcp into tests/data-plane-mcp to have a separate concern in each integration tests **Related Issues/PRs (if applicable)** Preparation for #90 --------- Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Fixes #1411