security: branch policy and availability clarifications. #6940

Merged
htuch merged 6 commits into envoyproxy:master from htuch:sec-policy-master
May 23, 2019

@htuch
Member

@htuch htuch commented May 14, 2019

Signed-off-by: Harvey Tuch htuch@google.com

Signed-off-by: Harvey Tuch <htuch@google.com>
@htuch
Member Author

htuch commented May 14, 2019

@PiotrSikora @alyssawilk @mattklein123 for a first pass on this. I say some controversial things about availability; this is a strawman to start conversation. I think we may want to be stricter in our treatment of availability w.r.t. the security release process, but we also need to be realistic as to how often we invoke this process given the large number of availability issues we're still encountering.

Signed-off-by: Harvey Tuch <htuch@google.com>
Member

@mattklein123 mattklein123 left a comment

Thanks for this. Small comment to discuss.

/wait-any

highest priority concerns. Availability, in particular in areas relating to DoS and resource
exhaustion, is a serious security concern that we are making a best effort attempt to address in
Envoy today. We will not activate the security release process for vulnerabilities that only affect
availability until we are confident that Envoy is production hardened with respect to availability.
Member

I'm not really sure how we are ever going to be confident in this, and this is also very dependent on proper configuration and deployment details. I might just drop or slightly loosen this statement but up to you.

Member Author

I think a good example is something like Slowloris. Would we go through the CVE and security release process for a report of an attack like this? Do operators placing Envoy in edge ingress configurations expect guarantees around this and released versions?

My intuition is we need to incorporate availability into our threat model given edge use of Envoy, but at the same time, if we did this today, we would be doing half-a-dozen or so security releases a quarter, which is unsustainable for both the Envoy security team and our distributors; we'd be on a continuous CVE treadmill.

Member

I think that's true. My point, though, is that everyone has a different definition of what "availability" means, and I don't know that there will ever be consensus here. Today, I think in general we have all the gross knobs in place to prevent most basic attacks (if they are properly configured). We can spend a ton of time doing better and better, and likely better than most other publicly available solutions, but when are we done? What's the bar? That's my main point here.

Member Author

@htuch htuch May 14, 2019

Here's another concrete example. Let's say we hear that sending a header "foobar: baz" from a client causes Envoy to segfault. This is ordinarily a functional bug, but when Envoy is placed at the edge of a network for ingress purposes, it becomes a major availability one, as the network ingress becomes trivially DoS-able. What's the correct response from a security perspective?

Member

> What's the correct response from a security perspective?

We invoke the security procedure IMO.

Member Author

Added to agenda for 5/21/2019 meeting, let's continue the discussion there.

Contributor

I don't see how a data plane "packet of death" or trivial resource exhaustion leading to OOM could be treated with anything other than the highest priority, and stating that such types of attacks won't even trigger the security release process cannot possibly make our users or distributors happy.

Yes, this could result in a non-zero number of point releases, but those are growing pains, unfortunately, and the situation will get better over time.

I do agree that we need to be somewhat reasonable, and if the attack requires some less popular extension to be configured, then perhaps we need to treat it with less urgency than attacks on the HTTP or TCP proxy, but the boundary should be defined somewhere, perhaps by defining different tiers of support for various extensions?

We could also state that a control plane "packet of death" doesn't trigger the security release process, since the control plane should be trusted, but a "packet of death" from either the client or the backend (including health-check responses) should be fair game, IMHO.

As for the attacks reported publicly on GitHub, and not to envoy-security@, we could potentially treat them as "actively exploited issues", and trigger the security release process without embargo / waiting period.

Member

+1 to what @PiotrSikora said. This is my exact thinking also.

Member Author

@htuch htuch May 16, 2019

I fundamentally agree that availability is an important consideration, but let me play the role of devil's advocate in order to allow us to better hash out these issues and try and surface some of the underlying tensions that might make this less clear cut.

I think it's reasonable to treat attacks that result in process compromise or leak of data as distinct and far more serious than availability attacks. The whole point of priorities is... you prioritize. I.e., a CVSS score of "critical" has a higher priority than a "medium" vulnerability. So, clearly something like Heartbleed would be higher priority than an availability attack, and it's not helpful to claim that all security issues are the highest priority.

At the end of the day, security is a practice that must consider economics. There is a finite amount of bandwidth we have in each quarter for security releases in the Envoy project and for distributors. Doing a security release imposes a high cost on the organizations performing it; I've seen this first hand. Also, you don't want to be the boy who cried wolf, or your signal disappears. So, given that you have the bandwidth for N security releases in a quarter (5 at most perhaps?), the question is: do you want to burn this on lower priority or higher priority issues?

I think we have to be very careful about how we express the policy in terms of predicating it on configuration. My main concern is that Envoy in its "default configuration" state is pretty terrible for resource exhaustion attacks. Should this be improved? Maybe, but we have to balance concerns with backwards compatibility of configuration and Envoy having a performant configuration with minimal surprising behavior out-of-the-box.

Leaving aside extensions, which are admittedly something we can treat with different tiered support (which will need to be explicit), we'll need to be clear on what configuration options need to be set to make this work.

If a resource exhaustion or QoD attack uses only core features (no extensions), but requires some configuration that seems uncommon, do we reduce priority? I think the answer is yes, but then how do we ascertain objectively what is and isn't common? The Envoy community is large and we have no objective data to make these decisions on.

Also, I'm curious if the following question can be answered today: I am an Envoy user with X MB of RAM. What configuration options must I set in Envoy to guarantee the absence of resource DoS?

If we can't answer ^^ in a precise way, I think we have a trivial resource exhaustion issue inherent in Envoy, so creating a policy that ignores this reality is not practical at this point.

I'm hoping that we can get to this kind of detail in our policy; this is the objective of this PR.

Member Author

Availability wording updated based on the community call and various conversations I've had 1:1 with folks. I'm hoping that what we have now addresses @PiotrSikora's and @alyssawilk's (and probably others') concerns, while balancing them against the reality of Envoy's availability status quo. PTAL.

mattklein123 previously approved these changes May 23, 2019
Member

@mattklein123 mattklein123 left a comment

Thanks, this looks great to me.

alyssawilk previously approved these changes May 23, 2019
Contributor

@alyssawilk alyssawilk left a comment

LGTM

I think we really ought to have a best practices doc for configuring Envoy to be safe for availability if we're going to say things like this, but no need to block on it.

@mattklein123
Member

> I think we really ought to have a best practices doc for configuring Envoy to be safe for availability if we're going to say things like this, but no need to block on it.

+100. I think we need a FAQ entry on config best practices when tuning for perf, edge safety, etc.

Signed-off-by: Harvey Tuch <htuch@google.com>
PiotrSikora previously approved these changes May 23, 2019
Contributor

@PiotrSikora PiotrSikora left a comment

LGTM, with one small nit.

appear to present a risk profile that is significantly greater than the current Envoy availability
hardening status quo. Examples of disclosures that would elicit this response:
* QoD; where a single query from a client can bring down an Envoy server.
* High amplification attacks, where very little traffic, e.g. that delivered by a single client, can
Contributor

Nit: "amplification attack" in the context of DDoS is usually used to describe attacks where the attacker sends small packets, with a spoofed source IP address set to that of the victim, to vulnerable servers, which then send a much bigger response (hence the name) to the victim.

Having said that, I don't have a better suggestion. Perhaps simply "resource exhaustion attacks"?

Member Author

@PiotrSikora "highly asymmetric resource exhaustion attacks, where very little.."?

Contributor

Sounds good, thanks!

Signed-off-by: Harvey Tuch <htuch@google.com>
@htuch htuch dismissed stale reviews from PiotrSikora, alyssawilk, and mattklein123 via 977b30b May 23, 2019 20:02
@htuch htuch merged commit 5ec5680 into envoyproxy:master May 23, 2019
@htuch htuch deleted the sec-policy-master branch May 23, 2019 20:02
4 participants