Skip to content

Treating a bind failure as a connection failure rather than crashing Envoy#1564

Merged
alyssawilk merged 5 commits intoenvoyproxy:masterfrom
alyssawilk:bind_soft_fail
Sep 5, 2017
Merged

Treating a bind failure as a connection failure rather than crashing Envoy#1564
alyssawilk merged 5 commits intoenvoyproxy:masterfrom
alyssawilk:bind_soft_fail

Conversation

@alyssawilk
Copy link
Copy Markdown
Contributor

Fixes #1411

Copy link
Copy Markdown
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation looks good, some questions on test.

}

TEST_P(ConnectionImplTest, BindFailureTest) {
std::string address_string = TestUtility::getIpv4Loopback();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny nit: prefer this initialization in the Network::Address::IpVersion::v6 clause below.

new Network::Address::Ipv6Instance(address_string, 0)};
}

if (dispatcher_.get() == nullptr) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to condition this?

[&]() -> void { fake_upstream_connection_->waitForDisconnect(); }});
}

TEST_P(BindIntegrationTest, DISABLED_TestFailedBind) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disabled?

class BindIntegrationTest : public IntegrationTest {
public:
void SetUp() override {
// Delay base class SetUp until initialize().
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To allow the bind-fail test to override the address before set-up.

I'm fairly convinced that in the long run we should ditch SetUp for Initialize() (to allow common config manipulation) but I figured on doing that in one of the PRs for simplifying config set-up

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Maybe just add an explanation of why the delay then in the comment.

if (bind_to_address != nullptr) {
int rc = bind_to_address->bind(fd);
if (rc < 0) {
ENVOY_LOG_MISC(warn, "Bind failure. Failed to bind to {}: {}", bind_to_address->asString(),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If something goes wrong here, this is going to mega spew the logs in a bad way. I'm guessing internally you have rate limiting on individual log messages. We don't have that in public code. I feel like we should have some kind of stat here for a bind error on connect? Perhaps this could be done like the buffer stats (which would be installed by the time you do the deferred close). At minimum we should probably have some kind of TODO here to think a bit more about how we would want to handle this operationally if we are not going to crash.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a counter and a debug level log would make sense.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Man, if only we had some utility for logging important error messages infrequently. Like a macro which logged with exponential back-off with a super cheap (if somewhat lossy) atomic counter to minimize CPU usage and cross thread contention. And then eventually maybe even tie it into the stats system with the macro taking a custom label or line number for easy monitoring and trace-back of unexpected but not fatal occurrences... That seems like it'd be super useful for debugging production issues for a large scale proxy.
I'll add porting this feature to my non-critical TODO list :-)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, this is going to be a very infrequently used feature (you are probably the only one who will use it for now), so if you have this covered in some other way and you want to put in a TODO that's fine.

@alyssawilk
Copy link
Copy Markdown
Contributor Author

alyssawilk commented Aug 30, 2017 via email

COUNTER(upstream_flow_control_resumed_reading_total) \
COUNTER(upstream_flow_control_backed_up_total) \
COUNTER(upstream_flow_control_drained_total) \
COUNTER(bind_errors) \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call the stat upstream_bind_errors ? It won't count bind errors on the listening socket.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree upstream_bind_errors is confusing then. I think bind_errors is fine.

@alyssawilk
Copy link
Copy Markdown
Contributor Author

alyssawilk commented Aug 30, 2017 via email

parent_.conn_connect_ms_ =
parent_.host_->cluster().stats().upstream_cx_connect_ms_.allocateSpan();
Upstream::Host::CreateConnectionData data = parent_.host_->createConnection(parent_.dispatcher_);
if (data.connection_->state() != Network::Connection::State::Open) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this approach is a bit fragile in that we don't really know that this is a bind error. Perhaps something else might change in the future? It also has the downside of not covering other cases such as tcp_proxy, redis, etc.

I see two options here:

  1. Rename/extend setBufferStats (https://github.com/lyft/envoy/blob/master/include/envoy/network/connection.h#L147) to include other stats also. Then a call could pass in a bind failure stat if they want.
  2. Make createConnection() take a struct with optional stats in it, which will force people to plumb things through as needed (though this would potentially require changes to more callsites).

Thoughts?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could avoid the first concern by just adding an ASSERT in the connection that if the connection isn't connected there is a bind failure, and adding a comment pointer to anyone changing the code. Agree, the second, that this misses tcp_proxy, is more of a problem though.

setting buffer stats on creation looks super duper messy since for example in the case of the tcp proxy filter, the connection is created prior to the read callbacks which is where buffer stats are set

I'd be inclined to go with renaming setBufferStats to be more general, and then somewhat unintuitively incrementing the bind error where we raise the error onFileEvent with a comment that we defer until the caller has a chance to set [buffer] stats. If that doesn't sound too hacky to you I'll take a whack at that tomorrow

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm fine with that unless you can think of a better option. I would rename to something like setConnectionStats or setStats or something.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Copy link
Copy Markdown
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo minor comment.

Stats::Gauge& read_current_;
Stats::Counter& write_total_;
Stats::Gauge& write_current_;
Stats::Counter* bind_errors_;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment here on why this is a pointer rather than ref (upstream/downstream..)?

Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm just needs master merge

@alyssawilk
Copy link
Copy Markdown
Contributor Author

alyssawilk commented Sep 1, 2017 via email

@alyssawilk alyssawilk merged commit 79f4c6a into envoyproxy:master Sep 5, 2017
@alyssawilk alyssawilk deleted the bind_soft_fail branch September 7, 2017 19:38
jpsim pushed a commit that referenced this pull request Nov 28, 2022
Description: Add explicit calls to be used when releasing core types, as a step towards formalizing management of bridge memory.
Risk Level: Low
Testing: CI

Signed-off-by: Mike Schore <mike.schore@gmail.com>
Signed-off-by: JP Simard <jp@jpsim.com>
jpsim pushed a commit that referenced this pull request Nov 29, 2022
Description: Add explicit calls to be used when releasing core types, as a step towards formalizing management of bridge memory.
Risk Level: Low
Testing: CI

Signed-off-by: Mike Schore <mike.schore@gmail.com>
Signed-off-by: JP Simard <jp@jpsim.com>
mathetake added a commit that referenced this pull request Mar 3, 2026
**Description**

This refactors tests/extproc integration tests:
1. Renamed as tests/data-plane to be usable for Dynamic Module too #1564
2. Extract tests/extproc/mcp into tests/data-plane-mcp to have a
separate concern in each integration tests

**Related Issues/PRs (if applicable)**

Preparation for #90

---------

Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
mathetake added a commit that referenced this pull request Mar 3, 2026
**Description**

This upgrades Envoy version used in data-plane tests. This is mostly
decouple unrelated change from #1564

**Related Issues/PRs (if applicable)**

Follow up on #1660

Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants