Add ingress fault detection proposal by sgreene570 · Pull Request #438 · openshift/enhancements

sgreene570 · 2020-08-17T15:54:27Z

This enhancement proposal outlines an extension to the OpenShift Ingress Operator that gives the Ingress Operator the ability to alert OpenShift users when cluster Routes appear to be malfunctioning. This enhancement proposal also outlines the details of an Ingress "debug" container that can be invoked on an OpenShift cluster to pinpoint potential Ingress issues.

Problems with OpenShift routes can be difficult to diagnose for both OpenShift cluster administrators as well as OpenShift developers. The main idea of this EP is to make issues with Cluster Ingress more apparent to cluster administrators, as well to help minimize time spent debugging cluster Ingress issues by OpenShift developers.

Related to https://issues.redhat.com/browse/NE-199 (Red Hat internal).

wking · 2020-08-18T03:58:12Z

+## Proposal
+
+Once the Ingress Operator has successfully deployed an [Ingress Controller](https://github.com/openshift/cluster-ingress-operator/blob/master/README.md)
+within a cluster, the Ingress Operator will spawn a simple HTTP echo container, such as [Hello Openshift](https://hub.docker.com/r/openshift/hello-openshift/).


Before moving in this direction, we probably want to get hello-openshift images for different architectures (rhbz#1766287). Although I guess we'll be adding this image to the core images referenced from the release image, so it gets mirrored into disconnected environments, and in that case ART's current arch-specific images feeding arch-specific release images should cover the arch issue.

@wking thanks for bringing this concern to my attention. I feel that hello-openshift should become a part of the release image if and when we go forward with this proposal, given that hello-openshift is only ~ 6 mb in size, and that it is useful as a testing application for OCP users.

Alternatively, the Ingress fault detection setup could use a base image already included within the release image, and create a simple socat echo server at runtime on top of the base image, as seen in the router's CI. I would prefer to have hello-openshift available, but I am mentioning this method of creating an echo container if getting hello-openshift into the release image turns out to be problematic.

I wish I'd caught this before (alas), but hello-openshift probably shouldn't have been chosen. I would have expected the ingress operator image to contain an alternate command that exposed a specific HTTP server, and the operator image to start that variant.

I'd like to see us remove hello-openshift in a future release and switch to using a command off the operator (ingress-operator serve-healthcheck) with the appropriate inside the operator binary. Teams should use their operator images and operator commands as swiss army knives for this sort of internal testing.

I see. @wking correct me if im wrong, but there was some other motivation besides this enhancement for adding hello-openshift to the relase image, correct?
That being said, the ingress-operator could certainly switch over to an alternative command baked into the ingress-operators image. Just wondering if other components (cluster-samples?) would also be affected by the removal of hello-openshift from future release images.

Trevor may have been referring to the docker problem - that was fixed in a different way (e2e tests verifying http should be using the agnhost image provided by kube), and hello-openshift was purged from there.

I don't know of any reason in product it should remain and I'd like to see it excised. Operators doing testing should test using a known quantity (a subcommand they check), not a general purpose example image that can't be changed to exercise test behavior.

hello-openshift can be summarily removed from the samples operator - if we haven't shipped it in 4.6 we definitely should not ship it in 4.7. If we did ship it, we should mark it deprecated so no customer depends on it.

The hello-openshift imagestream fixes the remaining use cases of "openshift/hello-openshift", both wrt pulling from Docker Hub and wrt multi-arch. At least console and serverless currently still reference this in code and or tests.

@yselkowitz would it be possible for the components you mentioned to switch over to an alternative image (via a sub-command, etc). for 4.8? Are we definitely removing hello-openshift in 4.8?

as per @smarterclayton, hello-openshift will be removed in 4.8. I have updated this enhancement to explain that the ingress operator will pivot off of hello-openshift in future releases.

wking · 2020-08-18T04:06:07Z

+
+### Version Skew Strategy
+
+N/A


Will the debug container depend on particular cluster-version-linked details? E.g. what happens if you launch a 4.7 debug container into a 4.1 cluster? Or a 4.20 cluster where some aspects of the ingress/router system have changed beyond recognition? Or if a cloud provider has grown a new LB flavor and the 4.20 clusters have pivoted to using that new flavor to back routes? I expect in most of these situations, the behavior we want is "debug tool gracefully errors out and says 'you might want to use a debug tool extracted from the cluster you're inspecting'" or some such. Are we actually going to distribute the debug tool as an image referenced from the release image, like we distribute the must-gather, tools, etc. images today?

If the debug container were to be distributed as a part of the release image, that would solve the issue of version skew. The oc cli debug container invocation could use the debug container included in the current release image, and if none is available in the release image, gracefully exit. Thoughts?

sgreene570 · 2020-08-21T18:44:12Z

/assign @Miciah @danehans
/cc @frobware @mcurry-rh @knobunc

russellb · 2020-08-21T20:06:28Z

Follow up to https://issues.redhat.com/browse/NE-199.

I don't think this is publicly available, so could you update the commit message and PR description with whatever context and explanation is useful? Thanks!

sgreene570 · 2020-08-21T20:22:39Z

I don't think this is publicly available, so could you update the commit message and PR description with whatever context and explanation is useful? Thanks!

Done, thanks for the suggestion!

frobware · 2020-09-07T19:13:12Z

+within a cluster, the Ingress Operator will spawn a simple HTTP echo container, such as [Hello Openshift](https://hub.docker.com/r/openshift/hello-openshift/).
+The Ingress Operator will also create the necessary Service and Route resources to provide Ingress to the echo container within the `openshift-ingress-operator`
+namespace. Then, the Ingress Operator will periodically attempt to send a request to the echo container, and await a proper response. These requests will be sent
+so that they are routed outside of the cluster to the cloud load balancer first, in order to ensure that each peice of the Ingress "chain" is being tested.


We should also ensure that we cover HTTP/2 enabled routes (where we can) and will therefore need a custom certificate.

mentioned. Thanks!

RiRa12621 · 2020-09-10T17:18:28Z

+If the Ingress Operator is unable to send the request to the echo container, the Ingress Operator fails to receive a response
+from the echo container, or the entire request and response flow takes an unusual amount of time, the Ingress Operator will fire an alert via Prometheus.
+Additionally, the echo container will have several endpoint resources associated with it. A static route will be used to infer baseline cluster networking latency
+and jitter. A set of routes that are rotated in and out by periodic creation and deletion will be used to verify that route creation leads to functioning Ingress traffic


Have we considered to probe those routes via synthetic probes as available in the latest prometheus operator release and therefore coming to OCP sooner than later?

No, we have not yet considered synthetic probes. Could you provide a link to some docs or code so I can understand the advantage of using synthetic probes via the Prometheus operator?

It was added pretty recently with this PR.
The idea is basically that you can define blackbox probes using https://github.com/prometheus/blackbox_exporter directly via a CR.

This would avoid a need for networking edge to come up with tooling to probe the routes and evaluate those probes.
Instead a synthetic probe would come by default to clusters with the cluster monitoring operator.

cc @s-urbaniak who has more context on prometheus operator and CMO inner workins as well as potential timelines.
Serg, please also correct me if I explained something wrong.

@RiRa12621 Thanks for the explanation! For this case, Synthetic Probes sound like a pretty enticing option. Do your or @s-urbaniak have an idea of when we can expect synthetic probes to be added to OCP via the CMO? 😄

@s-urbaniak can probably put a finger on it better, but as far as I can see, it should land in the next promtheus-operator release, which if I'm not wrong should be ready in OCP 4.8, but that's just my external view on it and I would like @s-urbaniak to confirm that.

Cool. This enhancement is targeted for 4.7, so we might have to start off without synthetic probes, but we could definitely look into switching over to using them in the future 😃

Set the "request-port" header on hello-openshift HTTP responses so that clients reaching the hello-openshift application know which port the HTTP listener was accessed on. This will be useful for the Ingress Operator fault detection work. See openshift/enhancements#438 and https://issues.redhat.com/browse/NE-199 for additional context.

smarterclayton · 2020-12-17T15:15:01Z

/hold

on image placement (i'd also like to see this updated and merged now that this is actually in shipping code).

sgreene570 · 2021-02-12T17:51:55Z

commit 3e87942 fixes a typo and resolves lint errors.

This enhancement proposal outlines an extension to the OpenShift Ingress Operator that gives the Ingress Operator the ability to alert OpenShift users when Cluster Ingress appears to be malfunctioning. This enhancement proposal also outlines the details of an Ingress "debug" container that can be invoked on an OpenShift cluster to pinpoint potential Ingress issues.

Miciah · 2021-02-12T18:15:53Z

/hold cancel
/lgtm
/approve

openshift-ci-robot · 2021-02-12T18:16:08Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [Miciah]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 17, 2020

openshift-ci-robot requested review from derekwaynecarr and stbenjam August 17, 2020 15:54

sgreene570 force-pushed the NE-199 branch 3 times, most recently from 06250e1 to f096cf9 Compare August 17, 2020 19:42

wking reviewed Aug 18, 2020

View reviewed changes

sgreene570 force-pushed the NE-199 branch 3 times, most recently from 8fbf71e to e6f9b8c Compare August 19, 2020 19:23

sgreene570 changed the title ~~[WIP] Add ingress fault detection proposal~~ Add ingress fault detection proposal Aug 20, 2020

openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 20, 2020

sgreene570 force-pushed the NE-199 branch from e6f9b8c to ac6acf6 Compare August 20, 2020 19:49

openshift-ci-robot assigned danehans and Miciah Aug 21, 2020

openshift-ci-robot requested review from frobware, knobunc and mcurry-rh August 21, 2020 18:44

sgreene570 force-pushed the NE-199 branch from ac6acf6 to fdf2232 Compare August 21, 2020 20:16

frobware reviewed Sep 7, 2020

View reviewed changes

sgreene570 force-pushed the NE-199 branch from fdf2232 to 52c0873 Compare September 8, 2020 20:14

RiRa12621 reviewed Sep 10, 2020

View reviewed changes

sgreene570 mentioned this pull request Oct 14, 2020

hello-openshift: Echo listening port in HTTP responses via headers openshift/origin#25609

Merged

sgreene570 force-pushed the NE-199 branch 2 times, most recently from 866b45f to 7bdb8b2 Compare October 15, 2020 14:19

sgreene570 force-pushed the NE-199 branch from 7bdb8b2 to f597fcc Compare November 23, 2020 21:18

openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 17, 2020

sgreene570 force-pushed the NE-199 branch 2 times, most recently from 6b25b85 to 8538daa Compare January 5, 2021 20:12

smarterclayton mentioned this pull request Jan 7, 2021

Bug 1766287: Improve usability of hello-openshift imagestream openshift/cluster-samples-operator#350

Merged