Skip to content

subset lb: avoid partitioning host lists on worker threads#6302

Merged
mattklein123 merged 4 commits intoenvoyproxy:masterfrom
snowp:subset-optimize
Apr 5, 2019
Merged

subset lb: avoid partitioning host lists on worker threads#6302
mattklein123 merged 4 commits intoenvoyproxy:masterfrom
snowp:subset-optimize

Conversation

@snowp
Copy link
Copy Markdown
Contributor

@snowp snowp commented Mar 16, 2019

This changes the subset LB to not read host.health() on the worker
thread, instead relying on the partitioning made on the main thread.
This ensures consistency as host.health() might change due to health
checks, config updates, etc.

Also ensures consistent handling of metadata reads by caching the result
of the predicate. This is likely to come with a perf improvement, as
we're now making fewer metadata reads.

Fixes #6301

Signed-off-by: Snow Pettersen snowp@squareup.com

Risk Level: Medium, changes the algorithm for processing subset host updates.
Testing: Existing unit test coverage, there should be no change in behavior.
Docs Changes: n/a
Release Notes: n/a

This changes the subset LB to not read host.health() on the worker
thread, instead relying on the partitioning made on the main thread.
This ensures consistency as host.health() might change due to health
checks, config updates, etc.

Also ensures consistent handling of metadata reads by caching the result
of the predicate. This is likely to come with a perf improvement, as
we're now making fewer metadata reads.

Addresses envoyproxy#6301

Signed-off-by: Snow Pettersen <snowp@squareup.com>
@snowp
Copy link
Copy Markdown
Contributor Author

snowp commented Mar 16, 2019

Didn't add any tests since there were no behavior changes, but it might make sense to add a regression test that verifies that health() is not called during the update

@zuercher
Copy link
Copy Markdown
Member

/retest (error was no space left on device)

@repokitteh-read-only
Copy link
Copy Markdown

🔨 rebuilding ci/circleci: asan (failed build)

🐱

Caused by: a #6302 (comment) was created by @zuercher.

see: more, trace.

Copy link
Copy Markdown
Member

@zuercher zuercher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Would calling health() just be a performance regression? If so, I think you can leave the extra test off.

@snowp
Copy link
Copy Markdown
Contributor Author

snowp commented Mar 18, 2019

The regression I was thinking about is the fact that calling health() multiple times for the same host. Since the value might change between calls, it would give rise to subtle bugs. Not sure if it's worth a test, wdyt? Maybe a comment is good enough?

@snowp
Copy link
Copy Markdown
Contributor Author

snowp commented Mar 18, 2019

/retest

(timeout in mysql test)

@repokitteh-read-only
Copy link
Copy Markdown

🔨 rebuilding ci/circleci: asan (failed build)

🐱

Caused by: a #6302 (comment) was created by @snowp.

see: more, trace.

@zuercher
Copy link
Copy Markdown
Member

I see. Seems worth it if it's not too complicated to write. You might be able to piggy back on an existing subset lb test where the same host appears in multiple sets and just change a mock to error on multiple calls.

@stale
Copy link
Copy Markdown

stale bot commented Mar 25, 2019

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Mar 25, 2019
std::unordered_set<HostSharedPtr> predicate_added;
HostVectorSharedPtr hosts(new HostVector());
HostVectorSharedPtr healthy_hosts(new HostVector());
HostVectorSharedPtr degraded_hosts(new HostVector());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: isn't std::make_shared() preferred here?

@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Mar 25, 2019

// TODO(snowp): If we had a unhealthyHosts() function we could avoid potentially traversing
// the list of hosts twice.
for (const auto host : original_host_set_.hosts()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't const auto& host cheaper here? [avoid copying the shared_ptr, which implies increasing refcount, etc]

for (const auto host : hosts_removed) {
if (predicate(*host)) {
filtered_removed.emplace_back(host);
for (const auto host : original_host_set_.healthyHosts()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto?

Snow Pettersen added 2 commits March 25, 2019 18:29
Signed-off-by: Snow Pettersen <snowp@squareup.com>
Signed-off-by: Snow Pettersen <snowp@squareup.com>
@stale
Copy link
Copy Markdown

stale bot commented Apr 2, 2019

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Apr 2, 2019
Signed-off-by: Snow Pettersen <snowp@squareup.com>
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Apr 2, 2019
@snowp
Copy link
Copy Markdown
Contributor Author

snowp commented Apr 3, 2019

@zuercher I added the regression test finally, PTAL

Copy link
Copy Markdown
Member

@zuercher zuercher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@mattklein123 mattklein123 merged commit dcbe3fe into envoyproxy:master Apr 5, 2019
mpuncel added a commit to mpuncel/envoy that referenced this pull request Apr 8, 2019
* master: (137 commits)
  test: router upstream log to v2 config stubs (envoyproxy#6499)
  remove idle timeout validation (envoyproxy#6500)
  build: Change namespace of chromium_url. (envoyproxy#6506)
  coverage: exclude chromium_url (envoyproxy#6498)
  fix(tracing): allow 256 chars in path tag (envoyproxy#6492)
  Common: Introduce StopAllIteration filter status for decoding and encoding filters (envoyproxy#5954)
  build: update PGV url (envoyproxy#6495)
  subset lb: avoid partitioning host lists on worker threads (envoyproxy#6302)
  ci: Make envoy_select_quiche no-op. (envoyproxy#6393)
  watcher: notify when watched files are modified (envoyproxy#6215)
  stat: Add counterFromStatName(), gaugeFromStatName(), and histogramFromStatName() (envoyproxy#6475)
  bump to 1.11.0-dev (envoyproxy#6490)
  release: bump to 1.10.0 (envoyproxy#6489)
  hcm: path normalization. (#1)
  build: import manually minified Chrome URL lib. (envoyproxy#3)
  codec: reject embedded NUL in headers. (envoyproxy#2)
  Added veryfication if path contains query params and add them to path header (envoyproxy#6466)
  redis: basic integration test for redis_proxy (envoyproxy#6450)
  stats: report sample count as an integer to prevent loss of precision (envoyproxy#6274)
  Added VHDS protobuf message and updated RouteConfig to include it. (envoyproxy#6418)
  ...

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants