dev docs: clarify error handling philosophy by mattklein123 · Pull Request #854 · envoyproxy/envoy

mattklein123 · 2017-04-27T19:38:29Z

No description provided.

mattklein123 · 2017-04-27T19:38:35Z

htuch

Thanks for codifying this!

htuch · 2017-04-27T19:44:02Z

STYLE.md

+  happen should lead to process death, under the assumption that the additional burden of defensive
+  coding and testing is not an effective use of time for an error that should not happen given
+  proper system setup. Examples of these types of errors include not being able to open the shared
+  memory region, an invalid initial JSON config read from disk, system calls that should not fail


Can you provide 2 or 3 examples of these systems calls that shouldn't fail?

htuch · 2017-04-27T19:45:10Z

STYLE.md

+* OOM events (both memory and FDs) are considered fatal crashing errors. This rule is again based
+  on the philosophy that the engineering costs of properly handling these cases is not worth it.
+  Time is better spent designing proper system controls that shed load if resource usage becomes
+  too high, etc.


I think part of this philosophy is also that restarts are fast, reliable and cheap. Can you add something to this effect? This won't be universally true, so it'd help to have it as a strawman.

htuch · 2017-04-27T19:46:28Z

STYLE.md

+* Although we strongly recommend that any type of startup error leads to a fatal error, since this
+  is almost always a result of faulty configuration which should be caught during a canary process,
+  there may be cases in which we want some classes of startup errors to be non-fatal. For example,
+  if a misconfigured option is not necessary for server operation. Although this is discouraged, we


Do you want to give the example of the admin access path? Also, what's the philosophy on reporting this degraded mode and do we have a consistent story for signaling via logs or other means (the admin interface) the degraded state?

htuch · 2017-04-27T19:51:02Z

STYLE.md

+  Time is better spent designing proper system controls that shed load if resource usage becomes
+  too high, etc.
+* Although we strongly recommend that any type of startup error leads to a fatal error, since this
+  is almost always a result of faulty configuration which should be caught during a canary process,


Do we want to consider the situations where canary doesn't catch something (e.g. an unusual syscall being exercised post-config check) and we get into a crash loop? This can then lead to cascading failure. Is it considered better here to just crash loop all frontends or to have them operate in a degraded state?

I think this is covered by the fast startup aspect. If you have something this low rate and it lands in production, you should be able to roll back and have it not be a crisis, so I don't view this is a real operational issue. I will clarify.

mattklein123 · 2017-04-28T02:44:29Z

@htuch updated

htuch · 2017-04-28T13:45:32Z

STYLE.md

+  via `RELEASE_ASSERT(condition)` or `PANIC(message)` if there is no other sensible behavior, e.g.
+  in OOM (memory/FD) scenarios. Only `RELEASE_ASSERT(condition)` should be used to validate
+  conditions that might be imposed by the external environment. `ASSERT(condition)` should be used
+  to document (and check in debug-only builds) program invariants. Use `ASSERT` liberally, but do


I'm thinking there is a gray line here between external environment and program invariants. For example, memory corruption due to a security issue (a bug, deliberate buffer overflow etc.) might manifest as a violation of program invariants or as a detectable condition in the external environment (e.g. some library returning a highly unexpected error code or buffer contents). My understanding here is that we do want to RELEASE_ASSERT as there is no other sensible behavior. We would only catch this due to the RELEASE_ASSERT on the external environment however, since we compile out the program invariant checks for release.

The above isn't actually a question or request to change anything (other than to maybe elaborate a bit), just an observation. Maybe the wider question is how much defense-in-depth is acceptable for security reasons (as opposed to just being pedantic about man page return codes) and what level of RELEASE_ASSERT sanity checking is permissible vs. the performance cost it imposes.

That's a valid question, but one that will quickly get us into the weeds. Ultimately I think this comes down to experience, judgement, and code review. I will try to put something to that effect in.

htuch · 2017-04-28T13:50:05Z

STYLE.md

+* OOM events (both memory and FDs) are considered fatal crashing errors. This rule is again based
+  on the philosophy that the engineering costs of properly handling these cases is not worth it.
+  Time is better spent designing proper system controls that shed load if resource usage becomes
+  too high, etc.


Maybe explain that we should never silently ignore OOM (as covered in #567), but instead know that this is treated as an exception or RELEAST_ASSERT on return codes.

tschroed · 2017-04-28T13:54:10Z

STYLE.md

+  case is the --admin-address-path option). **If degraded mode error handling is implemented, we require
+  that there is complete test coverage for the degraded case.** Additionally, the user should be
+  aware of the degraded state minimally via an error log of level warn or greater, and optimally
+  via the increment of a stat.


I would use stronger language and require stats tracking degraded features.

htuch · 2017-04-28T13:55:35Z

STYLE.md

+  that there is complete test coverage for the degraded case.** Additionally, the user should be
+  aware of the degraded state minimally via an error log of level warn or greater, and optimally
+  via the increment of a stat.
+* There error handling philosophy described herein is based on the assumption that Envoy is deployed


Nit: s/There error/The error/

mattklein123 · 2017-04-28T15:40:17Z

@htuch @tschroed updated, please review.

htuch

LGTM, I think this will be very helpful in both ensuring we have a consistent error handling philosophy right now in the code base as we continue to add code and help inform some of the discussions around this.

htuch · 2017-04-28T15:43:34Z

STYLE.md

+  there may be cases in which we want some classes of startup errors to be non-fatal. For example,
+  if a misconfigured option is not necessary for server operation. Although this is discouraged, we
+  will discuss these on a case by case basis case basis during code review (an example of this
+  is the --admin-address-path option). **If degraded mode error handling is implemented, we require


Nit: --admin-address-path

htuch · 2017-04-28T15:45:30Z

STYLE.md

+  to document (and check in debug-only builds) program invariants. Use `ASSERT` liberally, but do
+  not use it for things that will crash in an obvious way in a subsequent line. E.g., do not do
+  `ASSERT(foo != nullptr); foo->doSomething();`. Note that there is a gray line between external
+  environment and program invariants. For example, memory corruption due to a security issue (a bug,


Reword: "external environment failures and program invariant violations."

Description: upgrade to bazel 3.1.0 to avoid #11210. Additionally, avoid issue from using mobile-install on the CI machines #853. Risk Level: low, finally a green build. Testing: CI is green! Signed-off-by: Jose Nino <jnino@lyft.com> Signed-off-by: JP Simard <jp@jpsim.com>

dev docs: clarify error handling philosophy

821f582

htuch reviewed Apr 27, 2017

View reviewed changes

htuch mentioned this pull request Apr 27, 2017

Backtrace on fatal signals. #855

Merged

comments

28e9f1b

htuch reviewed Apr 28, 2017

View reviewed changes

tschroed reviewed Apr 28, 2017

View reviewed changes

htuch reviewed Apr 28, 2017

View reviewed changes

mattklein123 added 2 commits April 28, 2017 08:21

Merge branch 'master' into failure_docs

7d0caab

comments

1e53106

htuch approved these changes Apr 28, 2017

View reviewed changes

comment

1fa2b3b

htuch approved these changes Apr 28, 2017

View reviewed changes

mattklein123 merged commit f935973 into master Apr 28, 2017

mattklein123 deleted the failure_docs branch April 28, 2017 17:06

Conversation

mattklein123 commented Apr 27, 2017

Uh oh!

mattklein123 commented Apr 27, 2017

Uh oh!

htuch left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mattklein123 commented Apr 28, 2017

Uh oh!

htuch Apr 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mattklein123 commented Apr 28, 2017

Uh oh!

htuch left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

htuch Apr 28, 2017 •

edited

Loading