Add ingress fault detection proposal#438
Conversation
06250e1 to
f096cf9
Compare
| ## Proposal | ||
|
|
||
| Once the Ingress Operator has successfully deployed an [Ingress Controller](https://github.com/openshift/cluster-ingress-operator/blob/master/README.md) | ||
| within a cluster, the Ingress Operator will spawn a simple HTTP echo container, such as [Hello Openshift](https://hub.docker.com/r/openshift/hello-openshift/). |
There was a problem hiding this comment.
Before moving in this direction, we probably want to get hello-openshift images for different architectures (rhbz#1766287). Although I guess we'll be adding this image to the core images referenced from the release image, so it gets mirrored into disconnected environments, and in that case ART's current arch-specific images feeding arch-specific release images should cover the arch issue.
There was a problem hiding this comment.
@wking thanks for bringing this concern to my attention. I feel that hello-openshift should become a part of the release image if and when we go forward with this proposal, given that hello-openshift is only ~ 6 mb in size, and that it is useful as a testing application for OCP users.
Alternatively, the Ingress fault detection setup could use a base image already included within the release image, and create a simple socat echo server at runtime on top of the base image, as seen in the router's CI. I would prefer to have hello-openshift available, but I am mentioning this method of creating an echo container if getting hello-openshift into the release image turns out to be problematic.
There was a problem hiding this comment.
I wish I'd caught this before (alas), but hello-openshift probably shouldn't have been chosen. I would have expected the ingress operator image to contain an alternate command that exposed a specific HTTP server, and the operator image to start that variant.
I'd like to see us remove hello-openshift in a future release and switch to using a command off the operator (ingress-operator serve-healthcheck) with the appropriate inside the operator binary. Teams should use their operator images and operator commands as swiss army knives for this sort of internal testing.
There was a problem hiding this comment.
I see. @wking correct me if im wrong, but there was some other motivation besides this enhancement for adding hello-openshift to the relase image, correct?
That being said, the ingress-operator could certainly switch over to an alternative command baked into the ingress-operators image. Just wondering if other components (cluster-samples?) would also be affected by the removal of hello-openshift from future release images.
There was a problem hiding this comment.
Trevor may have been referring to the docker problem - that was fixed in a different way (e2e tests verifying http should be using the agnhost image provided by kube), and hello-openshift was purged from there.
I don't know of any reason in product it should remain and I'd like to see it excised. Operators doing testing should test using a known quantity (a subcommand they check), not a general purpose example image that can't be changed to exercise test behavior.
There was a problem hiding this comment.
hello-openshift can be summarily removed from the samples operator - if we haven't shipped it in 4.6 we definitely should not ship it in 4.7. If we did ship it, we should mark it deprecated so no customer depends on it.
There was a problem hiding this comment.
The hello-openshift imagestream fixes the remaining use cases of "openshift/hello-openshift", both wrt pulling from Docker Hub and wrt multi-arch. At least console and serverless currently still reference this in code and or tests.
There was a problem hiding this comment.
@yselkowitz would it be possible for the components you mentioned to switch over to an alternative image (via a sub-command, etc). for 4.8? Are we definitely removing hello-openshift in 4.8?
There was a problem hiding this comment.
as per @smarterclayton, hello-openshift will be removed in 4.8. I have updated this enhancement to explain that the ingress operator will pivot off of hello-openshift in future releases.
|
|
||
| ### Version Skew Strategy | ||
|
|
||
| N/A |
There was a problem hiding this comment.
Will the debug container depend on particular cluster-version-linked details? E.g. what happens if you launch a 4.7 debug container into a 4.1 cluster? Or a 4.20 cluster where some aspects of the ingress/router system have changed beyond recognition? Or if a cloud provider has grown a new LB flavor and the 4.20 clusters have pivoted to using that new flavor to back routes? I expect in most of these situations, the behavior we want is "debug tool gracefully errors out and says 'you might want to use a debug tool extracted from the cluster you're inspecting'" or some such. Are we actually going to distribute the debug tool as an image referenced from the release image, like we distribute the must-gather, tools, etc. images today?
There was a problem hiding this comment.
If the debug container were to be distributed as a part of the release image, that would solve the issue of version skew. The oc cli debug container invocation could use the debug container included in the current release image, and if none is available in the release image, gracefully exit. Thoughts?
8fbf71e to
e6f9b8c
Compare
|
/assign @Miciah @danehans |
I don't think this is publicly available, so could you update the commit message and PR description with whatever context and explanation is useful? Thanks! |
Done, thanks for the suggestion! |
| within a cluster, the Ingress Operator will spawn a simple HTTP echo container, such as [Hello Openshift](https://hub.docker.com/r/openshift/hello-openshift/). | ||
| The Ingress Operator will also create the necessary Service and Route resources to provide Ingress to the echo container within the `openshift-ingress-operator` | ||
| namespace. Then, the Ingress Operator will periodically attempt to send a request to the echo container, and await a proper response. These requests will be sent | ||
| so that they are routed outside of the cluster to the cloud load balancer first, in order to ensure that each peice of the Ingress "chain" is being tested. |
There was a problem hiding this comment.
We should also ensure that we cover HTTP/2 enabled routes (where we can) and will therefore need a custom certificate.
There was a problem hiding this comment.
mentioned. Thanks!
| If the Ingress Operator is unable to send the request to the echo container, the Ingress Operator fails to receive a response | ||
| from the echo container, or the entire request and response flow takes an unusual amount of time, the Ingress Operator will fire an alert via Prometheus. | ||
| Additionally, the echo container will have several endpoint resources associated with it. A static route will be used to infer baseline cluster networking latency | ||
| and jitter. A set of routes that are rotated in and out by periodic creation and deletion will be used to verify that route creation leads to functioning Ingress traffic |
There was a problem hiding this comment.
Have we considered to probe those routes via synthetic probes as available in the latest prometheus operator release and therefore coming to OCP sooner than later?
There was a problem hiding this comment.
No, we have not yet considered synthetic probes. Could you provide a link to some docs or code so I can understand the advantage of using synthetic probes via the Prometheus operator?
There was a problem hiding this comment.
It was added pretty recently with this PR.
The idea is basically that you can define blackbox probes using https://github.com/prometheus/blackbox_exporter directly via a CR.
This would avoid a need for networking edge to come up with tooling to probe the routes and evaluate those probes.
Instead a synthetic probe would come by default to clusters with the cluster monitoring operator.
cc @s-urbaniak who has more context on prometheus operator and CMO inner workins as well as potential timelines.
Serg, please also correct me if I explained something wrong.
There was a problem hiding this comment.
@RiRa12621 Thanks for the explanation! For this case, Synthetic Probes sound like a pretty enticing option. Do your or @s-urbaniak have an idea of when we can expect synthetic probes to be added to OCP via the CMO? 😄
There was a problem hiding this comment.
@s-urbaniak can probably put a finger on it better, but as far as I can see, it should land in the next promtheus-operator release, which if I'm not wrong should be ready in OCP 4.8, but that's just my external view on it and I would like @s-urbaniak to confirm that.
There was a problem hiding this comment.
Cool. This enhancement is targeted for 4.7, so we might have to start off without synthetic probes, but we could definitely look into switching over to using them in the future 😃
Set the "request-port" header on hello-openshift HTTP responses so that clients reaching the hello-openshift application know which port the HTTP listener was accessed on. This will be useful for the Ingress Operator fault detection work. See openshift/enhancements#438 and https://issues.redhat.com/browse/NE-199 for additional context.
866b45f to
7bdb8b2
Compare
|
/hold on image placement (i'd also like to see this updated and merged now that this is actually in shipping code). |
6b25b85 to
8538daa
Compare
|
commit |
This enhancement proposal outlines an extension to the OpenShift Ingress Operator that gives the Ingress Operator the ability to alert OpenShift users when Cluster Ingress appears to be malfunctioning. This enhancement proposal also outlines the details of an Ingress "debug" container that can be invoked on an OpenShift cluster to pinpoint potential Ingress issues.
|
/hold cancel |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Miciah The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This enhancement proposal outlines an extension to the OpenShift Ingress Operator that gives the Ingress Operator the ability to alert OpenShift users when cluster Routes appear to be malfunctioning. This enhancement proposal also outlines the details of an Ingress "debug" container that can be invoked on an OpenShift cluster to pinpoint potential Ingress issues.
Problems with OpenShift routes can be difficult to diagnose for both OpenShift cluster administrators as well as OpenShift developers. The main idea of this EP is to make issues with Cluster Ingress more apparent to cluster administrators, as well to help minimize time spent debugging cluster Ingress issues by OpenShift developers.
Related to https://issues.redhat.com/browse/NE-199 (Red Hat internal).