Skip to content

TCP listener connection rebalancing#8422

Merged
mattklein123 merged 14 commits intomasterfrom
cx_rebalance
Oct 11, 2019
Merged

TCP listener connection rebalancing#8422
mattklein123 merged 14 commits intomasterfrom
cx_rebalance

Conversation

@mattklein123
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 commented Sep 28, 2019

This commit introduces optional connection rebalancing
for TCP listeners, targeted as cases where there are a
small number of long lived connections such as service
mesh HTTP2/gRPC egress.

Part of this change involved tracking connection counts
at the per-listener level, which made it clear that we
have quite a bit of tech debt in some of our interfaces
in this area. I did various cleanups in service of this
change which leave the connection handler / accept path
in a cleaner state.

Fixes #4602

Risk Level: Medium
Testing: Unit/integration/manual
Docs Changes: Added
Release Notes: Added

@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.

🐱

Caused by: #8422 was opened by mattklein123.

see: more, trace.

This commit introduces optional connection rebalancing
for TCP listeners, targeted as cases where there are a
small number of long lived connections such as service
mesh HTTP2/gRPC egress.

Part of this change involved tracking connection counts
at the per-listener level, which made it clear that we
have quite a bit of tech debt in some of our interfaces
in this area. I did various cleanups in service of this
change which leave the connection handler / accept path
in a cleaner state.

Fixes #4602

Signed-off-by: Matt Klein <mklein@lyft.com>
@mattklein123 mattklein123 changed the title [WIP] TCP listener connection rebalancing TCP listener connection rebalancing Sep 29, 2019
@mattklein123 mattklein123 marked this pull request as ready for review September 29, 2019 20:35
@mattklein123 mattklein123 requested a review from lizan as a code owner September 29, 2019 20:35
@mattklein123
Copy link
Copy Markdown
Member Author

This is ready for review. @danzh2010 @jmarantz do you mind taking a first pass?

@mattklein123
Copy link
Copy Markdown
Member Author

/azp run envoy-macos

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mattklein123
Copy link
Copy Markdown
Member Author

cc @oschaaf also if you want to take a look.

Signed-off-by: Matt Klein <mklein@lyft.com>
Copy link
Copy Markdown
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flushing comments. looks really good so far.

Copy link
Copy Markdown
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I have been through all the code and left some superficial comments. I have a very high level understanding of the change you are making here -- which sounds awesome -- and a very low level understanding of the changes in lines of code you've made. But I don't have a deep enough understanding of how this all works together.

We can to 3 ways here:

  • I'm happy to learn the connection architecture more deeply but that will take some time.
  • We can bring in another reviewer who has more familiarity the subsystems you are modifying to help as
    well
  • We can take it on faith that these are good modifications to an already-robust system and go from there.

One medium-level comment: I see changes to the integration test infrastructure, and to unit tests, but I was thinking there might be a new integration test demonstrating the new dynamics.

Copy link
Copy Markdown
Contributor

@danzh2010 danzh2010 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. It took me a while to understand the whole workflow. But overall it's a very clear change.

@oschaaf
Copy link
Copy Markdown
Member

oschaaf commented Oct 2, 2019

So I wanted to share some thoughts. As I understand it, currently a connection/transaction is fully handled on a single thread. If we migrate sockets because of rebalancing, will we see transaction handling for client <-> Envoy and Envoy <-> Origin split over different threads?
I wonder, how would that impact performance? And also, will plugins/extensions get caught off-guard?

@mattklein123
Copy link
Copy Markdown
Member Author

We can bring in another reviewer who has more familiarity the subsystems you are modifying to help as well

I think @danzh2010 is an expert in this code so she can cover.

One medium-level comment: I see changes to the integration test infrastructure, and to unit tests, but I was thinking there might be a new integration test demonstrating the new dynamics.

See integration_test.cc. I modified the stats test to actually do balancing. I also verified that without balancing that test does not work reliably over 1000 runs.

@mattklein123
Copy link
Copy Markdown
Member Author

So I wanted to share some thoughts. As I understand it, currently a connection/transaction is fully handled on a single thread. If we migrate sockets because of rebalancing, will we see transaction handling for client <-> Envoy and Envoy <-> Origin split over different threads?
I wonder, how would that impact performance? And also, will plugins/extensions get caught off-guard?

We only do rebalancing at listener accept, so once a connection is accepted, it will stay on that worker forever and have similar characteristics as today. I don't foresee any performance implications beyond the (hopefully positive) ones that this change is targeted for in particular use cases. I also don't think this will have any effect on extensions as this happens before any extensions are run.

Signed-off-by: Matt Klein <mklein@lyft.com>
@mattklein123 mattklein123 added the no stalebot Disables stalebot from closing an issue label Oct 4, 2019
Signed-off-by: Matt Klein <mklein@lyft.com>
jmarantz
jmarantz previously approved these changes Oct 8, 2019
Copy link
Copy Markdown
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

basically looks good modulo tiny code nits and requests for more comments,

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Matt Klein <mklein@lyft.com>
@mattklein123
Copy link
Copy Markdown
Member Author

@jmarantz @danzh2010 updated per comments. @jmarantz I tried to address your comment about the assert and lack of test. It's a little ugly but I think covers it. Let me know what you think.

jmarantz
jmarantz previously approved these changes Oct 9, 2019
Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Matt Klein <mklein@lyft.com>
@mattklein123
Copy link
Copy Markdown
Member Author

@danzh2010 @jmarantz updated per comments to date.

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Matt Klein <mklein@lyft.com>
@mattklein123
Copy link
Copy Markdown
Member Author

/azp run envoy-macos

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

danzh2010
danzh2010 previously approved these changes Oct 10, 2019
jmarantz
jmarantz previously approved these changes Oct 10, 2019
Copy link
Copy Markdown
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a couple of nits; up to you if you want to apply fixes.

Signed-off-by: Matt Klein <mklein@lyft.com>
Signed-off-by: Matt Klein <mklein@lyft.com>
@mattklein123 mattklein123 dismissed stale reviews from jmarantz and danzh2010 via 323a87e October 11, 2019 14:48
@mattklein123 mattklein123 merged commit 587e079 into master Oct 11, 2019
@mattklein123 mattklein123 deleted the cx_rebalance branch October 11, 2019 17:10
nandu-vinodan pushed a commit to nandu-vinodan/envoy that referenced this pull request Oct 17, 2019
This commit introduces optional connection rebalancing
for TCP listeners, targeted as cases where there are a
small number of long lived connections such as service
mesh HTTP2/gRPC egress.

Part of this change involved tracking connection counts
at the per-listener level, which made it clear that we
have quite a bit of tech debt in some of our interfaces
in this area. I did various cleanups in service of this
change which leave the connection handler / accept path
in a cleaner state.

Fixes envoyproxy#4602

Signed-off-by: Matt Klein <mklein@lyft.com>
@hobbytp
Copy link
Copy Markdown

hobbytp commented Jun 19, 2020

@mattklein123 at the end of istio/istio#18152, I add a test result based on envoy proxy 1.14.1(istio1.6.1), but seems the balancing does not work, could you please help to check there is any configuration error or else I missed? thanks,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no stalebot Disables stalebot from closing an issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Investigate worker connection accept balance

5 participants