Skip to content

server: add support for range trigger overload actions#11697

Closed
akonradi wants to merge 22 commits intoenvoyproxy:masterfrom
akonradi:overload-scaled-timeouts
Closed

server: add support for range trigger overload actions#11697
akonradi wants to merge 22 commits intoenvoyproxy:masterfrom
akonradi:overload-scaled-timeouts

Conversation

@akonradi
Copy link
Copy Markdown
Contributor

@akonradi akonradi commented Jun 22, 2020

Commit Message:
Change the Overload Manager API to extend the overload action state with non-binary values. This will allow future overload actions to take effect in response to increasing load, instead of to the existing inactive/active binary values. This PR also adds a range field to the overload trigger config, though no actions currently trigger on the 'scaling' state.

Additional Description:
The existing behavior is preserved by replacing all places where OverloadActionState::Active was being used with OverloadActionState::saturated().

Risk Level: medium
Testing: ran unit and integration tests
Docs Changes: none
Release Notes: add "range" trigger for OverloadManager actions
#11427

/cc @antoniovicente
/cc @eziskind

akonradi added 6 commits June 22, 2020 13:51
Make ThreadLocalOverloadState an interface that the Overload Manager can
return an implementation of. This decouples the behavior specified by
the header from the details of the implementation.

Signed-off-by: Alex Konradi <akonradi@google.com>
Add a hook to allow the Overload Manager to be inserted in the timer
creation process. The returned function is currently pass-through, but
can be used in the future to create timers that are responsive to
overload conditions.

Signed-off-by: Alex Konradi <akonradi@google.com>
The only user of a null OverloadManager was the admin console. Switch
that to using a no-op OverloadManager subclass so that the OM can be
provided via reference in all cases. This makes the HCM initialization
simpler since it doesn't need to worry about the null pointer.

Signed-off-by: Alex Konradi <akonradi@google.com>
Use the newly-added timer factory on the overload manager to construct
timers. The labels are currently unused, but will allow the overload
manager to apply different modifications per-timer type.

Signed-off-by: Alex Konradi <akonradi@google.com>
In order to implement linearly-decreasing connection timeouts in
response to load, the Overload Manager needs a way to apply actions that
take load into account. This commit extends the OM OverloadActionState
so that instead of a binary 'Inactive'/'Active' state, it records one of
'Inactive'/'Scaling'/'Saturated' states, where 'Scaling' comes with a
scaled value between 0 and 1 for the level of saturation.

Signed-off-by: Alex Konradi <akonradi@google.com>
Add a config option and implementation for the Overload Manager to
enable range triggers, which do not saturate immediately, but can
provide a scaled value based on the resources being monitored.

Signed-off-by: Alex Konradi <akonradi@google.com>
@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.
CC @envoyproxy/api-watchers: FYI only for changes made to api/.

🐱

Caused by: #11697 was opened by akonradi.

see: more, trace.

Signed-off-by: Alex Konradi <akonradi@google.com>
@akonradi akonradi force-pushed the overload-scaled-timeouts branch from 4d08910 to ee9338d Compare June 22, 2020 18:57
@antoniovicente
Copy link
Copy Markdown
Contributor

/assign antoniovicente

@akonradi
Copy link
Copy Markdown
Contributor Author

I just realized that the overload action state work is probably separable from the timer indirection. Let me know if I should revert those commits here and put them on a separate PR.

akonradi added 2 commits June 22, 2020 15:13
Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
Copy link
Copy Markdown
Contributor

@eziskind eziskind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

if (trigger_new_state == OverloadActionState::inactive()) {
active_gauge_.set(0);
} else if (trigger_new_state == OverloadActionState::saturated()) {
active_gauge_.set(1);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be useful to also have a gauge stat which can capture the "scaling" state. Feel to leave as a TODO for now.

double value = 1 [(validate.rules).double = {lte: 1.0 gte: 0.0}];
}

message RangeTrigger {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you enforce min/max value while also use the existing range types?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what we actually have here is closer to https://github.com/envoyproxy/envoy/blob/master/api/envoy/type/v3/percent.proto. Maybe add a range there?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems inconsistent with the existing ThresholdTrigger definition, which uses a double in the range [0, 1].

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

htuch: we need a low percentage and a high percentage at which behavior should transition to linear scaling. We need to define a new message type anyway.

}

message RangeTrigger {
// If the resource pressure is greater than this value, the trigger will enter the scaling state.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the first time the reader encounters "scaling state". It should either be explain in detail or a link to the architecture docs added with :ref: added.

Copy link
Copy Markdown
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking forward to extensions to overload manager capabilities. Thanks for taking this on.

// Drop new requests when overloaded as soon as we have decoded the headers.
if (connection_manager_.overload_stop_accepting_requests_ref_ ==
Server::OverloadActionState::Active) {
Server::OverloadActionState::saturated()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defining saturation/inactive via equality doesn't seem like the most clear possible signal. How about adding a bool isSaturated() const accessor?

if (connection_manager_.overload_stop_accepting_requests_ref_.isSaturated()) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't seem like it scales very well; we'd end up with isSaturated(), isInactive(), and isScaling() to cover the three possible ranges. The options I explored were

  1. using a variant<inactive{}, scaling{double}, saturated{}>, but that makes call sites messier and wastes space with the tag
  2. adding a state() method that returns an enum that is one of {Inactive, Scaling, Saturated}
  3. adding a value() method that returns the scale value for the scaling case - but then what does it return for the other cases?

Here's another option: I could get rid of the == and != operators and force callers to rely on <, <=, >, and >=. It makes it clear that there's an ordering for the values and doesn't expose double equality comparison.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see us wanting methods that tell us if the resource is saturated or if it is operating in the scaled linear region.

isInactive() seems like an anti-pattern. You want !isSaturated() or scaleFactor() == 0 depending on what kind of resource you're operating on.

Are the comparison operators really needed?

CMP_OPERATOR(<=);
CMP_OPERATOR(>=);
CMP_OPERATOR(>);
#undef CMP_OPERATOR
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there precedent for implementing all of these operators?

Are operators other than == and != used?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, only == and != are used currently, but we'll want the rest of these for comparing state objects in the scaling regime.

}

OverloadTimerFactory getTimerFactory() override {
return [this](absl::string_view /*unused*/, std::function<void()> callback) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use of an enum type that we can cast to to int would provide more efficient lookup than string_view.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want this to be available as an extension point? I could see implementers of custom filters and such wanting to register their own timers for scaling in response to load.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it matter what kind of timer is being created? The actual timeout value is an input to the enable method.

What if Timer had a method enableTimerRange(const std::chrono::milliseconds& min, const std::chrono::milliseconds& max) that scales timeout based on overload manager state?

connection_idle_timer_ = read_callbacks_->connection().dispatcher().createTimer(
[this]() -> void { onIdleTimeout(); });
connection_idle_timer_ =
overload_timer_factory_(kConnectionIdleTimerLabel, [this]() -> void { onIdleTimeout(); });
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider some changes to how timers are created and enabled?

The thing to realize here is the pattern of creating a timer and immediately enabling it. Should the factory create and enable the timer internally? Doing that would provide us with some freedom regarding representation of timers that have similar deadline deltas. The current Timer class that accepts a delta on enableTimer, we may want a class that takes the timeout delta as a constructor argument and provides methods to register/unregister for that timeout.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not against changing the timer interface, but I think that should be addressed in another PR. That would also allow us to get rid of the redundancies of the enableTimer and enableHRTimer methods - if you construct a timer with a high-resolution value, it is a HR timer, otherwise a regular timer.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can call enableTimer with different values on the same timer object.

The issue of standard vs HR timers is separate. I think that based on how timers are usually created we only need enableHRTimer. I have a pending change that may obsolete the need for enableHRTimer.

})) {
if (overload_manager.registerForAction(
action_name, dispatcher, [this](Server::OverloadActionState state) {
active_ = (state == Server::OverloadActionState::saturated());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

variable naming consistency: active_ is true when state is saturated(). Should the variable be called saturated_?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to touch this code as little as possible, hence the minimal change. I'd suggest that since the heap shrinker can't be partially active, it doesn't make sense to talk about saturation here, and to leave this as a boolean active_.

double value = 1 [(validate.rules).double = {lte: 1.0 gte: 0.0}];
}

message RangeTrigger {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you enforce min/max value while also use the existing range types?

return false;
}
const auto trigger_new_state = it->second->actionState();
if (trigger_new_state == OverloadActionState::inactive()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparing to OverloadActionState::inactive() seems like an anti-pattern.

/**
* Factory function exposed by the overload manager for creating timers.
*/
using OverloadTimerFactory = std::function<Event::TimerPtr(absl::string_view, Event::TimerCb)>;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you document the meaning of the first argument to this factory method function signature?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Signed-off-by: Alex Konradi <akonradi@google.com>
@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #11697 was synchronize by akonradi.

see: more, trace.

akonradi added 4 commits June 25, 2020 11:31
Signed-off-by: Alex Konradi <akonradi@google.com>
Log only when it changes region, not when it changes value within the
scaling region.

Signed-off-by: Alex Konradi <akonradi@google.com>
Use the "active" stat only for threshold triggers, and "scale_value"
stat for range triggers.

Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
@akonradi
Copy link
Copy Markdown
Contributor Author

Followed @eziskind's suggestion to have different stats for different triggers and added more documentation for the overload manager.

akonradi added 3 commits June 25, 2020 15:29
This should not be used by consumers of the ThreadLocalOverloadState
API, only by the OverloadManagerImpl.

Signed-off-by: Alex Konradi <akonradi@google.com>
…imeouts

Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Alex Konradi <akonradi@google.com>
@akonradi
Copy link
Copy Markdown
Contributor Author

I'm realizing that inserting the Overload Manager in front of every HCM timer creation is premature. I'm going to revert that here and leave that for another PR, so this one can focus on the range trigger addition.

akonradi added 5 commits June 26, 2020 14:06
This reverts commit 8833112.

Signed-off-by: Alex Konradi <akonradi@google.com>
This reverts commit 2b4ea3a.

Signed-off-by: Alex Konradi <akonradi@google.com>
@stale
Copy link
Copy Markdown

stale bot commented Jul 3, 2020

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Jul 3, 2020
@stale
Copy link
Copy Markdown

stale bot commented Jul 11, 2020

This pull request has been automatically closed because it has not had activity in the last 14 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot closed this Jul 11, 2020
htuch pushed a commit that referenced this pull request Aug 12, 2020
Change the Overload Manager API to extend the overload action state with non-binary values. This will allow future overload actions to take effect in response to increasing load, instead of to the existing inactive/active binary values. This PR also adds a range field to the overload trigger config, though no actions currently trigger on the 'scaling' state.

The existing behavior is preserved by replacing all places where OverloadActionState::Active was being used with OverloadActionState::saturated(). This is a minimal re-hashing of #11697 for ##11427.

Risk Level: medium
Testing: ran unit and integration tests
Docs Changes: none
Release Notes: add "scaling" trigger for OverloadManager actions

Signed-off-by: Alex Konradi <akonradi@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api stale stalebot believes this issue/PR has not been touched recently

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants