outlier: tooling for success rate ejection#618
Conversation
include/envoy/upstream/upstream.h
Outdated
| * -1 means that the host did not have enough request volume to calculate success rate | ||
| * or the cluster did not have enough hosts to run through success rate outlier ejection. | ||
| */ | ||
| virtual double successRate() const PURE; |
There was a problem hiding this comment.
Both of these should come directly out of the outlier detector sink via here: https://github.com/lyft/envoy/blob/ce8eb8634f0c6aa24485e17fbd563ecc18d73982/include/envoy/upstream/host_description.h#L51
Don't add to host object.
docs/operations/admin.rst
Outdated
| weight, Integer, Load balancing weight (1-100) | ||
| zone, String, Service zone | ||
| canary, Boolean, Whether the host is a canary | ||
| success_rate, Double, Request success rate (0-100). -1 if there was not enough request volume to calculate it |
There was a problem hiding this comment.
can you link to the info about "not enough request volume"?
| * Log an ejection event. | ||
| * @param host supplies the host that generated the event. | ||
| * @param type supplies the type of the event. | ||
| * @param enforced is true if the ejection took place, false if only logging took place. |
| variance /= valid_success_rate_hosts.size(); | ||
| double stdev = std::sqrt(variance); | ||
|
|
||
| // Set the Detector's member variable |
| virtual const Optional<SystemTime>& lastUnejectionTime() PURE; | ||
|
|
||
| /** | ||
| * @return the success rate of the host in the last calculated interval, in the range -1-100. |
There was a problem hiding this comment.
-1-100 is a bit misleading, as it's [0;100] with -1 as a separate value. I like the way you cover this in docs, maybe do the same here.
docs/operations/admin.rst
Outdated
| weight, Integer, Load balancing weight (1-100) | ||
| zone, String, Service zone | ||
| canary, Boolean, Whether the host is a canary | ||
| success_rate, Double, "Request success rate (0-100). -1 if there was not enough |
There was a problem hiding this comment.
Link to some docs here (or write it here) that describe what interval this is over.
| * Set the success rate of the host. | ||
| * @param new_success_rate the new success rate calculated for the host in the last interval. | ||
| */ | ||
| virtual void successRate(double new_success_rate) PURE; |
There was a problem hiding this comment.
This doesn't need to be in public interface.
| double Utility::successRateEjectionThreshold( | ||
| double success_rate_sum, const std::vector<HostSuccessRatePair>& valid_success_rate_hosts) { | ||
| double success_rate_sum, const std::vector<HostSuccessRatePair>& valid_success_rate_hosts, | ||
| double& success_rate_average) { |
There was a problem hiding this comment.
Instead of passing an out param, can you return a struct. It's easier to grok what is going on.
| } | ||
|
|
||
| void EventLoggerImpl::logEject(HostDescriptionConstSharedPtr host, EjectionType type) { | ||
| void EventLoggerImpl::logEject(HostDescriptionConstSharedPtr host, EjectionType type, |
There was a problem hiding this comment.
Wouldn't it be useful to log some of the SR stats here also to see the SR, the cutoff, etc.?
include/envoy/upstream/upstream.h
Outdated
| */ | ||
| virtual ClusterInfoConstSharedPtr info() const PURE; | ||
|
|
||
| virtual Outlier::DetectorSharedPtr outlierDetector() const PURE; |
There was a problem hiding this comment.
comment, this should return a pointer to a const detector.
| static double | ||
| successRateEjectionThreshold(double success_rate_sum, | ||
| const std::vector<HostSuccessRatePair>& valid_success_rate_hosts); | ||
| const std::vector<HostSuccessRatePair>& valid_success_rate_hosts, |
| Upstream::Outlier::DetectorSharedPtr outlier_detector, | ||
| Buffer::Instance& response) { | ||
| if (outlier_detector) { | ||
| response.add(fmt::format("{}::outlier::success_rate_average::{}", cluster_name, |
There was a problem hiding this comment.
put all these into docs?
docs/intro/arch_overview/outlier.rst
Outdated
| If ``action`` is ``eject``, species the type of ejection that took place. Currently this can | ||
| only be ``5xx``. | ||
| If ``action`` is ``eject``, specifies the type of ejection that took place. Currently this can be: | ||
| ``5xx``, and ``SuccessRate``. |
There was a problem hiding this comment.
I thought you were a supporter of the oxford comma ;)
There was a problem hiding this comment.
Yes, when there are 3 or more things, not 2.
There was a problem hiding this comment.
(Comma still needs to be deleted)
docs/intro/arch_overview/outlier.rst
Outdated
|
|
||
| host_success_rate | ||
| If ``action`` is ``eject``, and ``type`` is ``SuccessRate``, specifies the host's success rate | ||
| at the time of the ejection event on a ``0-100`` range. A ``-1`` value means that a success |
There was a problem hiding this comment.
-1 can only happen if type is 5xx, right? This is kind of confusing. Can we just not write out host_succes_rate in the case of 5xx? Same for all the ones below.
There was a problem hiding this comment.
-1 can also happen if the type is success rate, i.e if there is no enough request volume. However, you are right it would never happen in the logs. Because, if a host is ejected it is because there was enough data to do that. -1 would happen in the admin /clusters endpoint, not here.
| zone, String, Service zone | ||
| canary, Boolean, Whether the host is a canary | ||
| success_rate, Double, "Request success rate (0-100). -1 if there was not enough | ||
| :ref:`request volume<config_cluster_manager_cluster_outlier_detection_success_rate_request_volume>` |
There was a problem hiding this comment.
Did you actually look at these docs rendered? This is almost definitely not correct and will look broken.
There was a problem hiding this comment.
Yeah, I rendered them and they look correct
| /** | ||
| * @return a shared pointer to the cluster's outlier detector. If an outlier detector has not been | ||
| * installed, returns a nullptr. | ||
| */ |
There was a problem hiding this comment.
return const shared pointer has no meaning. You need to return a pointer to a const Detector.
There was a problem hiding this comment.
Ah I see this is a const pointer not a pointer to a const object. What we want is to return a pointer to a const object that way we can't change the object through the pointer.
| host->cluster().name(), host->address()->asString(), | ||
| typeToString(type), host->outlierDetector().numEjections(), enforced, | ||
| host->outlierDetector().successRate(), detector.successRateAverage(), | ||
| detector.successRateEjectionThreshold())); |
There was a problem hiding this comment.
add break here for next person that comes along.
| class Utility { | ||
| public: | ||
| struct EjectionPair { | ||
| EjectionPair(double success_rate_average, double ejection_threshold) |
There was a problem hiding this comment.
You don't need constructor here. You can just do aggregate init ( {...} )
| * @param success_rate_data is the vector containing the individual success rate data points. | ||
| * @param valid_success_rate_hosts is the vector containing the individual success rate data | ||
| * points. | ||
| * @return the success rate threshold. |
|
Looks good other than comma still in docs. Do you have 100% coverage? |
363fc10 to
3825f62
Compare
docs/intro/arch_overview/outlier.rst
Outdated
| "host_success_rate": "...", | ||
| "cluster_success_rate_average": "...", | ||
| "cluster_success_rate_ejection_threshold": "..." | ||
|
|
docs/intro/arch_overview/outlier.rst
Outdated
| If ``action`` is ``eject``, species the type of ejection that took place. Currently this can | ||
| only be ``5xx``. | ||
| If ``action`` is ``eject``, specifies the type of ejection that took place. Currently this can be: | ||
| ``5xx`` and ``SuccessRate``. |
There was a problem hiding this comment.
The second sentence doesn't make sense. Can you rewrite it? Also colon after be isn't needed.
docs/intro/arch_overview/outlier.rst
Outdated
| secs_since_last_action | ||
| The time in seconds since the last action (either an ejection or unejection) | ||
| took place. This time will be -1 for the first ejection given there is no | ||
| took place. This time will be ``-1`` for the first ejection given there is no |
There was a problem hiding this comment.
change this time will be to the value will be
docs/intro/arch_overview/outlier.rst
Outdated
| reason and then re-added). | ||
|
|
||
| enforced | ||
| If ``action`` is ``eject``, specifies if the ejection was actually performed (``true``), or |
There was a problem hiding this comment.
this sentence is too long and confusing. Please break it up.
| : detector_(detector), host_(host) { | ||
| // Point the success_rate_accumulator_bucket_ pointer to a bucket. | ||
| updateCurrentSuccessRateBucket(); | ||
| success_rate_ = -1; |
| /** | ||
| * This function returns the success rate threshold for success rate outlier detection. If a | ||
| * host's success rate is under this threshold the host is an outlier. | ||
| * This function returns the an EjectionPair for success rate outlier detection. The pair contains |
| * host's success rate is under this threshold the host is an outlier. | ||
| * This function returns the an EjectionPair for success rate outlier detection. The pair contains | ||
| * the average success rate of all valid hosts in the cluster, and the ejection threshold. | ||
| * If a host's success rate is under this threshold the host is an outlier |
There was a problem hiding this comment.
nit full stop at the end.
There was a problem hiding this comment.
comma after the first sentence
| * @param success_rate_data is the vector containing the individual success rate data points. | ||
| * @return the success rate threshold. | ||
| * @param valid_success_rate_hosts is the vector containing the individual success rate data | ||
| * points. |
|
|
||
| // Need to update the writer bucket to keep the data valid. | ||
| host.second->updateCurrentSuccessRateBucket(); | ||
| // Refresh host success rate stat for the /clusters endpoint. If there is a new valid value it |
There was a problem hiding this comment.
comma after the first sentence in the if sentence.
| * This function returns the success rate threshold for success rate outlier detection. If a | ||
| * host's success rate is under this threshold the host is an outlier. | ||
| * This function returns the an EjectionPair for success rate outlier detection. The pair contains | ||
| * the average success rate of all valid hosts in the cluster, and the ejection threshold. |
There was a problem hiding this comment.
no comma needed before the and
| std::vector<HostSuccessRatePair> valid_success_rate_hosts; | ||
| double success_rate_sum = 0; | ||
|
|
||
| // Reset the Detector's success rate mean and stdev |
| std::list<ChangeStateCb> callbacks_; | ||
| std::unordered_map<HostSharedPtr, DetectorHostSinkImpl*> host_sinks_; | ||
| EventLoggerSharedPtr event_logger_; | ||
| double success_rate_average_; |
There was a problem hiding this comment.
do we need default values for these?
There was a problem hiding this comment.
including success_rate_ejection_threshold_
| virtual void logUneject(HostDescriptionConstSharedPtr host) PURE; | ||
| }; | ||
|
|
||
| typedef std::shared_ptr<EventLogger> EventLoggerSharedPtr; |
There was a problem hiding this comment.
nit: new line after type def
ccaraman
left a comment
There was a problem hiding this comment.
Looks good after few new line nits
* Integrate with new mixer control library. * Update to use splitted control lib. * remove quota config files. * Update per comment
Signed-off-by: Mandar U Jog <mjog@google.com>
Fixes envoyproxy/envoy-mobile#617, which resulted in duplicate symbols from protobuf being present in the final Envoy Mobile binary. The [root cause](bazelbuild/bazel#10484) is being [patched in Bazel](bazelbuild/bazel#10493). In the meantime, we can revert these two commits in protobuf ([first](protocolbuffers/protobuf@7b28278), [second](protocolbuffers/protobuf@a03d332)) and apply the diff as a patch until we're able to update Bazel with the fix. More context on the protobuf issue is available [here](protocolbuffers/protobuf#7046). Notes: - This should be reverted after updating Bazel with the fix - The version of protobuf specified in the `http_archive` should be in lockstep with upstream Envoy's version - Upstream Envoy's `@envoy//bazel:protobuf.patch` patch is also included here Signed-off-by: Michael Rebello <me@michaelrebello.com> Signed-off-by: JP Simard <jp@jpsim.com>
Fixes envoyproxy/envoy-mobile#617, which resulted in duplicate symbols from protobuf being present in the final Envoy Mobile binary. The [root cause](bazelbuild/bazel#10484) is being [patched in Bazel](bazelbuild/bazel#10493). In the meantime, we can revert these two commits in protobuf ([first](protocolbuffers/protobuf@7b28278), [second](protocolbuffers/protobuf@a03d332)) and apply the diff as a patch until we're able to update Bazel with the fix. More context on the protobuf issue is available [here](protocolbuffers/protobuf#7046). Notes: - This should be reverted after updating Bazel with the fix - The version of protobuf specified in the `http_archive` should be in lockstep with upstream Envoy's version - Upstream Envoy's `@envoy//bazel:protobuf.patch` patch is also included here Signed-off-by: Michael Rebello <me@michaelrebello.com> Signed-off-by: JP Simard <jp@jpsim.com>
**Commit Message** Separate envoy ai gateway crds to its own chart so helm upgrade can update the crd when there are changes. **Related Issues/PRs (if applicable)** Fixes #513 --------- Signed-off-by: Dan Sun <dsun20@bloomberg.net>
**Commit Message** This switches `ai-gateway-crd-helm` to `ai-gateway-crds-helm` for consistency with Envoy Gateway project. The chart will be pushed to https://hub.docker.com/r/envoyproxy/ai-gateway-crds-helm **Related Issues/PRs (if applicable)** Follow up on #618 --------- Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
No description provided.