Skip to content

Start slow start window on first successful HC, make slow start re-enterable#23946

Merged
mattklein123 merged 18 commits intoenvoyproxy:mainfrom
nezdolik:slow-start-first-hc
Dec 2, 2022
Merged

Start slow start window on first successful HC, make slow start re-enterable#23946
mattklein123 merged 18 commits intoenvoyproxy:mainfrom
nezdolik:slow-start-first-hc

Conversation

@nezdolik
Copy link
Copy Markdown
Member

Signed-off-by: Kateryna Nezdolii nezdolik@spotify.com

New changes to "slow start" logic:

  • When active HC is enabled per cluster, start "slow start" calculation after first passing HC (drop cluster membership duration condition).
  • When no active HC is enabled per cluster, enter "slow start" immediately (drop cluster membership duration condition).
  • Endpoint can re-enter "slow start" if active HC is configured per cluster, on each "unhealthy -> healthy" state transition.

Those changes make "slow start" in Envoy to align more with ELB slow start feature.

Commit Message:
Additional Description:
Risk Level: Medium
Testing: Done
Docs Changes: Done
Release Notes: Not Done
Platform Specific Features:
Fixes #18351

Kateryna Nezdolii added 2 commits November 10, 2022 23:12
Signed-off-by: Kateryna Nezdolii <nezdolik@spotify.com>
Signed-off-by: Kateryna Nezdolii <nezdolik@spotify.com>
@phlax
Copy link
Copy Markdown
Member

phlax commented Nov 11, 2022

/retest

@repokitteh-read-only
Copy link
Copy Markdown

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #23946 (comment) was created by @phlax.

see: more, trace.

Signed-off-by: Kateryna Nezdolii <nezdolik@spotify.com>
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. I think this makes a lot of sense. Some small comments and also it looks like tidy is still broken. Thank you.

/wait

* Set the timestamp of when the host has transitioned from unhealthy to healthy state via an
* active healchecking.
*/
virtual void setLastHcPassTime(MonotonicTime&& last_hc_pass_time) PURE;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: realistically it will likely be faster to pass this by value as it's just an integer

Comment on lines +278 to +280
if (first_check_ || host_->healthFlagGet(Host::HealthFlag::FAILED_ACTIVE_HC)) {
host_->setLastHcPassTime(time_source_.monotonicTime());
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we should actually do this in the block below when we reach the healthy threshold? It seems like we should start slow start when the host actually becomes healthy? wdyt?

Copy link
Copy Markdown
Member Author

@nezdolik nezdolik Nov 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, did not realise that hosts have Host::HealthFlag::FAILED_ACTIVE_HC flag set by default during initialisation

Co-authored-by: Matt Klein <mattklein123@gmail.com>
Signed-off-by: Kateryna Nezdolii <nezdolik@spotify.com>
Kateryna Nezdolii added 5 commits November 25, 2022 13:55
Signed-off-by: Kateryna Nezdolii <nezdolik@spotify.com>
Signed-off-by: Kateryna Nezdolii <nezdolik@spotify.com>
Signed-off-by: Kateryna Nezdolii <nezdolik@spotify.com>
Signed-off-by: Kateryna Nezdolii <nezdolik@spotify.com>
Signed-off-by: Kateryna Nezdolii <nezdolik@spotify.com>
@nezdolik
Copy link
Copy Markdown
Member Author

need to increase test coverage

Signed-off-by: Kateryna Nezdolii <nezdolik@spotify.com>
Signed-off-by: Kateryna Nezdolii <nezdolik@spotify.com>
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks makes sense to me.

/wait

Comment on lines +22 to +27
moved the strict_dns, original_dst, and logical_dns clusters to extensions. If you use these clusters and override extensions_build_config.bzl you will now need to include it explicitly.
moved the static, strict_dns and original_dst clusters to extensions. If you use these clusters and override extensions_build_config.bzl you will now need to include it explicitly.
- area: loadbalancing
change: |
When active HC is enabled per cluster, slow start calculation now starts after first passing HC. Cluster membership duration condition is dropped from slow start calculation.
Endpoint can now re-enter slow start if active HC is configured per cluster, on each "unhealthy -> healthy" state transition.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you have some kind of merge issue here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeap, fixed.

Signed-off-by: Kateryna Nezdolii <nezdolik@spotify.com>
Signed-off-by: Kateryna Nezdolii <nezdolik@spotify.com>
@nezdolik
Copy link
Copy Markdown
Member Author

There was a failure in unrelated test, merged latest main. Let's see if it helps

Kateryna Nezdolii added 2 commits November 30, 2022 13:07
Signed-off-by: Kateryna Nezdolii <nezdolik@spotify.com>
Signed-off-by: Kateryna Nezdolii <nezdolik@spotify.com>
Kateryna Nezdolii added 2 commits November 30, 2022 15:17
Signed-off-by: Kateryna Nezdolii <nezdolik@spotify.com>
Signed-off-by: Kateryna Nezdolii <nezdolik@spotify.com>
@nezdolik
Copy link
Copy Markdown
Member Author

@mattklein123 so far the CI has been complaining about overall low test coverage for class RealHostDescription in logical_host.h. That class is not used anywhere in code and in tests. I have added test for RealHostDescription, but had to change visibility for the library that included it (now format check if complaining because of changed visibility) Wonder if better approach would have been to remove that class altogether?

@mattklein123
Copy link
Copy Markdown
Member

@mattklein123 so far the CI has been complaining about overall low test coverage for class RealHostDescription in logical_host.h. That class is not used anywhere in code and in tests. I have added test for RealHostDescription, but had to change visibility for the library that included it (now format check if complaining because of changed visibility) Wonder if better approach would have been to remove that class altogether?

Yes please remove if it's dead code! Thank you.

/wait

Signed-off-by: Kateryna Nezdolii <nezdolik@spotify.com>
EXPECT_EQ("", data.host_description_->locality().zone());
EXPECT_EQ("", data.host_description_->locality().sub_zone());
EXPECT_EQ("foo.bar.com", data.host_description_->hostname());
EXPECT_FALSE(data.host_description_->lastHcPassTime());
Copy link
Copy Markdown
Member Author

@nezdolik nezdolik Dec 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i was wrong, class RealHostDescription is used in one function. So i reverted previous commit that changes visibility level and instead increased coverage in existing test. Those functions had 0% coverage. There were few more (mostly setters) with 0%, but this should be good enough to pass coverage threshold.

Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mattklein123 mattklein123 merged commit 008c6d7 into envoyproxy:main Dec 2, 2022
jpsim added a commit to alyssawilk/envoy that referenced this pull request Dec 2, 2022
* origin/main:
  Start slow start window on first successful HC, make slow start re-enterable (envoyproxy#23946)
  contrib-sipproxy: rework sipproxy stats (envoyproxy#24165)
  build(deps): bump github/codeql-action from 2.1.32 to 2.1.35 (envoyproxy#24292)
  build(deps): bump node from `c59fb39` to `80844b6` in /examples/ext_authz/auth/http-service (envoyproxy#24274)
  build(deps): bump actions/upload-artifact from 2 to 3 (envoyproxy#24121)
  build(deps): bump mysql from `96439dd` to `66efaaa` in /examples/mysql (envoyproxy#24273)
  build(deps): bump grpcio from 1.50.0 to 1.51.1 in /examples/grpc-bridge/client (envoyproxy#24272)
  Allow mobile/library/common/jni/ to be built on non-Android platforms. (envoyproxy#24299)
  generic proxy: added drain support to generic proxy to doing graceful closes on connections when possible (envoyproxy#24220)
  Reduce Route/VirtualHost memory utilization by avoiding RateLimitPolicy instances when not needed (envoyproxy#24243)
  Reduce VirtualHost memory utilization by avoiding CatchAllVirtualCluster when not needed (envoyproxy#24182)
  Fix mobile/tools/check_format.sh to re-apply envoyproxy#2698 (envoyproxy#24300)
  upstream: don't require `source_address` in `upstream_bind_config` (envoyproxy#24250)
  Quiche roll 20221201150327 (envoyproxy#24290)

Signed-off-by: JP Simard <jp@jpsim.com>
jpsim added a commit that referenced this pull request Dec 2, 2022
…/docs/sphinx-5.3.0

* origin/main: (50 commits)
  ci: Bump mobile build images (#24317)
  ci: update llvm on bazel CI (#24296)
  ci: Fix flaky coverage limit (#24320)
  ci: Fix change detection (part 2) (#24264)
  Start slow start window on first successful HC, make slow start re-enterable (#23946)
  contrib-sipproxy: rework sipproxy stats (#24165)
  build(deps): bump github/codeql-action from 2.1.32 to 2.1.35 (#24292)
  build(deps): bump node from `c59fb39` to `80844b6` in /examples/ext_authz/auth/http-service (#24274)
  build(deps): bump actions/upload-artifact from 2 to 3 (#24121)
  build(deps): bump mysql from `96439dd` to `66efaaa` in /examples/mysql (#24273)
  build(deps): bump grpcio from 1.50.0 to 1.51.1 in /examples/grpc-bridge/client (#24272)
  Allow mobile/library/common/jni/ to be built on non-Android platforms. (#24299)
  generic proxy: added drain support to generic proxy to doing graceful closes on connections when possible (#24220)
  Reduce Route/VirtualHost memory utilization by avoiding RateLimitPolicy instances when not needed (#24243)
  Reduce VirtualHost memory utilization by avoiding CatchAllVirtualCluster when not needed (#24182)
  Fix mobile/tools/check_format.sh to re-apply #2698 (#24300)
  upstream: don't require `source_address` in `upstream_bind_config` (#24250)
  Quiche roll 20221201150327 (#24290)
  Cleanup threading/ownership semantics of PlatformBridgeCertValidator (#2713)
  Add unit tests of the PlatformBridgeCertValidator (#2704)
  ...

Signed-off-by: JP Simard <jp@jpsim.com>
jpsim added a commit that referenced this pull request Dec 2, 2022
…/docs/pygments-2.13.0

* origin/main: (25 commits)
  ci: Bump mobile build images (#24317)
  ci: update llvm on bazel CI (#24296)
  ci: Fix flaky coverage limit (#24320)
  ci: Fix change detection (part 2) (#24264)
  Start slow start window on first successful HC, make slow start re-enterable (#23946)
  contrib-sipproxy: rework sipproxy stats (#24165)
  build(deps): bump github/codeql-action from 2.1.32 to 2.1.35 (#24292)
  build(deps): bump node from `c59fb39` to `80844b6` in /examples/ext_authz/auth/http-service (#24274)
  build(deps): bump actions/upload-artifact from 2 to 3 (#24121)
  build(deps): bump mysql from `96439dd` to `66efaaa` in /examples/mysql (#24273)
  build(deps): bump grpcio from 1.50.0 to 1.51.1 in /examples/grpc-bridge/client (#24272)
  Allow mobile/library/common/jni/ to be built on non-Android platforms. (#24299)
  generic proxy: added drain support to generic proxy to doing graceful closes on connections when possible (#24220)
  Reduce Route/VirtualHost memory utilization by avoiding RateLimitPolicy instances when not needed (#24243)
  Reduce VirtualHost memory utilization by avoiding CatchAllVirtualCluster when not needed (#24182)
  Fix mobile/tools/check_format.sh to re-apply #2698 (#24300)
  upstream: don't require `source_address` in `upstream_bind_config` (#24250)
  Quiche roll 20221201150327 (#24290)
  Cleanup threading/ownership semantics of PlatformBridgeCertValidator (#2713)
  Add unit tests of the PlatformBridgeCertValidator (#2704)
  ...

Signed-off-by: JP Simard <jp@jpsim.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Start slow start window on first successful HC

3 participants