Skip to content

Add ingress fault detection proposal#438

Merged
openshift-merge-robot merged 1 commit into
openshift:masterfrom
sgreene570:NE-199
Feb 12, 2021
Merged

Add ingress fault detection proposal#438
openshift-merge-robot merged 1 commit into
openshift:masterfrom
sgreene570:NE-199

Conversation

@sgreene570

@sgreene570 sgreene570 commented Aug 17, 2020

Copy link
Copy Markdown
Contributor

This enhancement proposal outlines an extension to the OpenShift Ingress Operator that gives the Ingress Operator the ability to alert OpenShift users when cluster Routes appear to be malfunctioning. This enhancement proposal also outlines the details of an Ingress "debug" container that can be invoked on an OpenShift cluster to pinpoint potential Ingress issues.

Problems with OpenShift routes can be difficult to diagnose for both OpenShift cluster administrators as well as OpenShift developers. The main idea of this EP is to make issues with Cluster Ingress more apparent to cluster administrators, as well to help minimize time spent debugging cluster Ingress issues by OpenShift developers.

Related to https://issues.redhat.com/browse/NE-199 (Red Hat internal).

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 17, 2020
@sgreene570 sgreene570 force-pushed the NE-199 branch 3 times, most recently from 06250e1 to f096cf9 Compare August 17, 2020 19:42
## Proposal

Once the Ingress Operator has successfully deployed an [Ingress Controller](https://github.com/openshift/cluster-ingress-operator/blob/master/README.md)
within a cluster, the Ingress Operator will spawn a simple HTTP echo container, such as [Hello Openshift](https://hub.docker.com/r/openshift/hello-openshift/).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before moving in this direction, we probably want to get hello-openshift images for different architectures (rhbz#1766287). Although I guess we'll be adding this image to the core images referenced from the release image, so it gets mirrored into disconnected environments, and in that case ART's current arch-specific images feeding arch-specific release images should cover the arch issue.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wking thanks for bringing this concern to my attention. I feel that hello-openshift should become a part of the release image if and when we go forward with this proposal, given that hello-openshift is only ~ 6 mb in size, and that it is useful as a testing application for OCP users.

Alternatively, the Ingress fault detection setup could use a base image already included within the release image, and create a simple socat echo server at runtime on top of the base image, as seen in the router's CI. I would prefer to have hello-openshift available, but I am mentioning this method of creating an echo container if getting hello-openshift into the release image turns out to be problematic.

@smarterclayton smarterclayton Dec 17, 2020

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish I'd caught this before (alas), but hello-openshift probably shouldn't have been chosen. I would have expected the ingress operator image to contain an alternate command that exposed a specific HTTP server, and the operator image to start that variant.

I'd like to see us remove hello-openshift in a future release and switch to using a command off the operator (ingress-operator serve-healthcheck) with the appropriate inside the operator binary. Teams should use their operator images and operator commands as swiss army knives for this sort of internal testing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. @wking correct me if im wrong, but there was some other motivation besides this enhancement for adding hello-openshift to the relase image, correct?
That being said, the ingress-operator could certainly switch over to an alternative command baked into the ingress-operators image. Just wondering if other components (cluster-samples?) would also be affected by the removal of hello-openshift from future release images.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trevor may have been referring to the docker problem - that was fixed in a different way (e2e tests verifying http should be using the agnhost image provided by kube), and hello-openshift was purged from there.

I don't know of any reason in product it should remain and I'd like to see it excised. Operators doing testing should test using a known quantity (a subcommand they check), not a general purpose example image that can't be changed to exercise test behavior.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hello-openshift can be summarily removed from the samples operator - if we haven't shipped it in 4.6 we definitely should not ship it in 4.7. If we did ship it, we should mark it deprecated so no customer depends on it.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hello-openshift imagestream fixes the remaining use cases of "openshift/hello-openshift", both wrt pulling from Docker Hub and wrt multi-arch. At least console and serverless currently still reference this in code and or tests.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yselkowitz would it be possible for the components you mentioned to switch over to an alternative image (via a sub-command, etc). for 4.8? Are we definitely removing hello-openshift in 4.8?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as per @smarterclayton, hello-openshift will be removed in 4.8. I have updated this enhancement to explain that the ingress operator will pivot off of hello-openshift in future releases.


### Version Skew Strategy

N/A

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the debug container depend on particular cluster-version-linked details? E.g. what happens if you launch a 4.7 debug container into a 4.1 cluster? Or a 4.20 cluster where some aspects of the ingress/router system have changed beyond recognition? Or if a cloud provider has grown a new LB flavor and the 4.20 clusters have pivoted to using that new flavor to back routes? I expect in most of these situations, the behavior we want is "debug tool gracefully errors out and says 'you might want to use a debug tool extracted from the cluster you're inspecting'" or some such. Are we actually going to distribute the debug tool as an image referenced from the release image, like we distribute the must-gather, tools, etc. images today?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the debug container were to be distributed as a part of the release image, that would solve the issue of version skew. The oc cli debug container invocation could use the debug container included in the current release image, and if none is available in the release image, gracefully exit. Thoughts?

@sgreene570 sgreene570 force-pushed the NE-199 branch 3 times, most recently from 8fbf71e to e6f9b8c Compare August 19, 2020 19:23
@sgreene570 sgreene570 changed the title [WIP] Add ingress fault detection proposal Add ingress fault detection proposal Aug 20, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 20, 2020
@sgreene570

Copy link
Copy Markdown
Contributor Author

/assign @Miciah @danehans
/cc @frobware @mcurry-rh @knobunc

@russellb

Copy link
Copy Markdown
Contributor

Follow up to https://issues.redhat.com/browse/NE-199.

I don't think this is publicly available, so could you update the commit message and PR description with whatever context and explanation is useful? Thanks!

@sgreene570

Copy link
Copy Markdown
Contributor Author

I don't think this is publicly available, so could you update the commit message and PR description with whatever context and explanation is useful? Thanks!

Done, thanks for the suggestion!

within a cluster, the Ingress Operator will spawn a simple HTTP echo container, such as [Hello Openshift](https://hub.docker.com/r/openshift/hello-openshift/).
The Ingress Operator will also create the necessary Service and Route resources to provide Ingress to the echo container within the `openshift-ingress-operator`
namespace. Then, the Ingress Operator will periodically attempt to send a request to the echo container, and await a proper response. These requests will be sent
so that they are routed outside of the cluster to the cloud load balancer first, in order to ensure that each peice of the Ingress "chain" is being tested.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also ensure that we cover HTTP/2 enabled routes (where we can) and will therefore need a custom certificate.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mentioned. Thanks!

If the Ingress Operator is unable to send the request to the echo container, the Ingress Operator fails to receive a response
from the echo container, or the entire request and response flow takes an unusual amount of time, the Ingress Operator will fire an alert via Prometheus.
Additionally, the echo container will have several endpoint resources associated with it. A static route will be used to infer baseline cluster networking latency
and jitter. A set of routes that are rotated in and out by periodic creation and deletion will be used to verify that route creation leads to functioning Ingress traffic

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we considered to probe those routes via synthetic probes as available in the latest prometheus operator release and therefore coming to OCP sooner than later?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we have not yet considered synthetic probes. Could you provide a link to some docs or code so I can understand the advantage of using synthetic probes via the Prometheus operator?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was added pretty recently with this PR.
The idea is basically that you can define blackbox probes using https://github.com/prometheus/blackbox_exporter directly via a CR.

This would avoid a need for networking edge to come up with tooling to probe the routes and evaluate those probes.
Instead a synthetic probe would come by default to clusters with the cluster monitoring operator.

cc @s-urbaniak who has more context on prometheus operator and CMO inner workins as well as potential timelines.
Serg, please also correct me if I explained something wrong.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RiRa12621 Thanks for the explanation! For this case, Synthetic Probes sound like a pretty enticing option. Do your or @s-urbaniak have an idea of when we can expect synthetic probes to be added to OCP via the CMO? 😄

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@s-urbaniak can probably put a finger on it better, but as far as I can see, it should land in the next promtheus-operator release, which if I'm not wrong should be ready in OCP 4.8, but that's just my external view on it and I would like @s-urbaniak to confirm that.

@sgreene570 sgreene570 Sep 11, 2020

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. This enhancement is targeted for 4.7, so we might have to start off without synthetic probes, but we could definitely look into switching over to using them in the future 😃

sgreene570 added a commit to sgreene570/origin that referenced this pull request Oct 14, 2020
Set the "request-port" header on hello-openshift HTTP
responses so that clients reaching the hello-openshift
application know which port the HTTP listener was accessed
on. This will be useful for the Ingress Operator fault
detection work. See openshift/enhancements#438
and https://issues.redhat.com/browse/NE-199 for additional context.
@sgreene570 sgreene570 force-pushed the NE-199 branch 2 times, most recently from 866b45f to 7bdb8b2 Compare October 15, 2020 14:19
@smarterclayton

Copy link
Copy Markdown
Contributor

/hold

on image placement (i'd also like to see this updated and merged now that this is actually in shipping code).

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 17, 2020
@sgreene570 sgreene570 force-pushed the NE-199 branch 2 times, most recently from 6b25b85 to 8538daa Compare January 5, 2021 20:12
Comment thread enhancements/ingress/ingress-fault-detection.md Outdated
@sgreene570

Copy link
Copy Markdown
Contributor Author

commit 3e87942 fixes a typo and resolves lint errors.

Comment thread enhancements/ingress/ingress-fault-detection.md
Comment thread enhancements/ingress/ingress-fault-detection.md Outdated
Comment thread enhancements/ingress/ingress-fault-detection.md Outdated
Comment thread enhancements/ingress/ingress-fault-detection.md Outdated
Comment thread enhancements/ingress/ingress-fault-detection.md Outdated
This enhancement proposal outlines
an extension to the OpenShift Ingress Operator
that gives the Ingress Operator the ability to alert
OpenShift users when Cluster Ingress appears to be
malfunctioning. This enhancement proposal also outlines the
details of an Ingress "debug" container that can be invoked
on an OpenShift cluster to pinpoint potential Ingress issues.
@Miciah

Miciah commented Feb 12, 2021

Copy link
Copy Markdown
Contributor

/hold cancel
/lgtm
/approve

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Feb 12, 2021
@openshift-ci-robot

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 12, 2021
@openshift-merge-robot openshift-merge-robot merged commit bea23d3 into openshift:master Feb 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.