security: branch policy and availability clarifications. by htuch · Pull Request #6940 · envoyproxy/envoy

htuch · 2019-05-14T20:05:39Z

Signed-off-by: Harvey Tuch <htuch@google.com>

htuch · 2019-05-14T20:07:06Z

@PiotrSikora @alyssawilk @mattklein123 for a first pass on this. I say some controversial things about availability; this is a strawman to start conversation. I think we may want to be stricter in our treatment of availability w.r.t. the security release process, but we also need to be realistic as to how often we invoke this process given the large number of availability issues we're still encountering.

mattklein123

Thanks for this. Small comment to discuss.

/wait-any

mattklein123 · 2019-05-14T22:40:22Z

SECURITY_RELEASE_PROCESS.md

+highest priority concerns. Availability, in particular in areas relating to DoS and resource
+exhaustion, is a serious security concern that we are making a best effort attempt to address in
+Envoy today. We will not activate the security release process for vulnerabilities that only affect
+availability until we are confident that Envoy is production hardened with respect to availability.


I'm not really sure how we are ever going to be confident in this, and this is also very dependent on proper configuration and deployment details. I might just drop or slightly loosen this statement but up to you.

I think a good example is something like Slowloris. Would we go through the CVE and security release process for a report of an attack like this? Do operators placing Envoy in edge ingress configurations expect guarantees around this and released versions?

My intuition is we need to incorporate availability into our threat model given edge use of Envoy, but at the same time, if we did this today, we would be doing half-a-dozen or so security releases a quarter, which is unsustainable to both the Envoy security team and our distributors; we'd be on a continuous CVE treadmill.

I think that's true, my point though is that everyone has a different definition of what "availability" means, and I don't know there will ever be consensus here. Today, I think in general we have all the gross knobs in place to prevent most basic attacks (if they are properly configured). We can spend a ton of time doing better and better, and likely better than most other publicly available solutions, but when are we done? What's the bar? That's my main point here.

Here's another concrete example. Let's say we hear that sending a header "foobar: baz" from a client causes Envoy to segfault. This is ordinarily a functional bug, but when Envoy is placed at the edge of a network for ingress purposes, it becomes a major availability one, as the network ingress becomes trivially DoS-able. What's the correct response from a security perspective?

> What's the correct response from a security perspective?

We invoke the security procedure IMO.

Added to agenda for 5/21/2019 meeting, let's continue the discussion there.

I don't see how a data plane "packet of death" or trivial resource exhaustion leading to OOM could be treated with anything other than the highest priority, and stating that such types of attacks won't even trigger the security release process cannot possibly make our users or distributors happy.

Yes, this could result in a non-zero number of point releases, but those are growing pains, unfortunately, and the situation will get better over time.

I do agree that we need to be somewhat reasonable, and if the attack requires some less popular extension to be configured, then perhaps we need to treat it with less urgency than attacks on the HTTP or TCP proxy, but the boundary should be defined somewhere, perhaps by defining different tiers of support for various extensions?

We could also state that a control plane "packet of death" doesn't trigger the security release process, since the control plane should be trusted, but a "packet of death" from the client and/or backend (including health-check responses) should be fair game, IMHO.

As for the attacks reported publicly on GitHub, and not to envoy-security@, we could potentially treat them as "actively exploited issues", and trigger the security release process without embargo / waiting period.

+1 to what @PiotrSikora said. This is my exact thinking also.

I fundamentally agree that availability is an important consideration, but let me play the role of devil's advocate in order to allow us to better hash out these issues and try and surface some of the underlying tensions that might make this less clear cut.

I think it's reasonable to treat attacks that result in process compromise or leak of data as distinct and far more serious than availability attacks. The whole point of priorities is that you prioritize. For example, a CVSS score of "critical" has a higher priority than a "medium" vulnerability. So, clearly something like Heartbleed would be higher priority than an availability attack, and it's not helpful to claim that all security issues are the highest priority.

At the end of the day, security is a practice that must consider economics. There is a finite amount of bandwidth we have in each quarter for security releases in the Envoy project and for distributors. Doing a security release imposes a high cost on the organizations performing it; I've seen this firsthand. Also, you don't want to be the boy who cried wolf, or your signal disappears. So, given that you have the bandwidth for N security releases in a quarter (5 at most perhaps?), the question is whether you want to burn this on lower priority or higher priority issues.

I think we have to be very careful about how we express the policy in terms of predicating it on configuration. My main concern is that Envoy in its "default configuration" state is pretty terrible at withstanding resource exhaustion attacks. Should this be improved? Maybe, but we have to balance concerns with backwards compatibility of configuration and Envoy having a performant configuration with minimal surprising behavior out-of-the-box.

Leaving aside extensions, which are admittedly something we can treat with different tiered support (which will need to be explicit), we'll need to be clear on what configuration options need to be set to make this work.

If a resource exhaustion or QoD attack uses only core features (no extensions), but requires some configuration that seems uncommon, do we reduce priority? I think the answer is yes, but then how do we ascertain objectively what is and isn't common? The Envoy community is large and we have no objective data to make these decisions on.

Also, I'm curious if the following question can be answered today: I am an Envoy user with X MB of RAM. What configuration options must I set in Envoy to guarantee the absence of resource DoS?

If we can't answer ^^ in a precise way, I think we have a trivial resource exhaustion issue inherent in Envoy, so creating a policy that ignores this reality is not practical at this point.

I'm hoping that we can get to this kind of detail in our policy; that is the objective of this PR.

Availability wording updated based on the community call and various conversations I've had 1:1 with folks. I'm hoping that what we have now addresses @PiotrSikora's and @alyssawilk's (and probably others') concerns, while balancing them against the reality of Envoy's availability status quo. PTAL.

mattklein123

Thanks, this looks great to me.

alyssawilk

LGTM

I think we really ought to have a best practices doc for configuring Envoy to be safe for availability if we're going to say things like this, but no need to block on it.

mattklein123 · 2019-05-23T15:13:39Z

> I think we really ought to have a best practices doc for configuring Envoy to be safe for availability if we're going to say things like this, but no need to block on it.

+100. I think we need a FAQ entry on config best practices when tuning for perf, edge safety, etc.

PiotrSikora

LGTM, with one small nit.

PiotrSikora · 2019-05-23T18:16:39Z

SECURITY_RELEASE_PROCESS.md

+appear to present a risk profile that is significantly greater than the current Envoy availability
+hardening status quo. Examples of disclosures that would elicit this response:
+* QoD; where a single query from a client can bring down an Envoy server.
+* High amplification attacks, where very little traffic, e.g. that delivered by a single client, can


Nit: "amplification attack" in the context of DDoS usually describes attacks where the attacker sends small packets, with a spoofed source IP address set to that of the victim, to vulnerable servers, which then send a much bigger response (hence the name) to the victim.

Having said that, I don't have a better suggestion. Perhaps simply "resource exhaustion attacks"?

@PiotrSikora "highly asymmetric resource exhaustion attacks, where very little.."?

Sounds good, thanks!

security: branch policy and availability clarifications.

6a43d03

htuch requested review from PiotrSikora, alyssawilk and mattklein123 May 14, 2019 20:05

Typo.

3f82d82

htuch assigned PiotrSikora, mattklein123 and alyssawilk May 14, 2019

mattklein123 reviewed May 14, 2019

repokitteh-read-only bot added waiting:any and removed waiting:any labels May 14, 2019

mattklein123 added the waiting:any label May 15, 2019

repokitteh-read-only bot removed the waiting:any label May 15, 2019

mattklein123 added the waiting label May 16, 2019

htuch added 2 commits May 22, 2019 18:55

Merge remote-tracking branch 'upstream/master' into sec-policy-master

83ddbac

Clarify availability some more.

342aa3a

repokitteh-read-only bot removed the waiting label May 22, 2019

mattklein123 previously approved these changes May 23, 2019

alyssawilk previously approved these changes May 23, 2019

wip

1bd7c09

PiotrSikora previously approved these changes May 23, 2019

Rewording suggested by Piotr.

977b30b

htuch dismissed stale reviews from PiotrSikora, alyssawilk, and mattklein123 via 977b30b May 23, 2019 20:02

htuch merged commit 5ec5680 into envoyproxy:master May 23, 2019

htuch deleted the sec-policy-master branch May 23, 2019 20:02
