dev docs: clarify error handling philosophy#854
Conversation
| happen should lead to process death, under the assumption that the additional burden of defensive | ||
| coding and testing is not an effective use of time for an error that should not happen given | ||
| proper system setup. Examples of these types of errors include not being able to open the shared | ||
| memory region, an invalid initial JSON config read from disk, system calls that should not fail |
There was a problem hiding this comment.
Can you provide 2 or 3 examples of these systems calls that shouldn't fail?
STYLE.md
Outdated
| * OOM events (both memory and FDs) are considered fatal crashing errors. This rule is again based | ||
| on the philosophy that the engineering costs of properly handling these cases is not worth it. | ||
| Time is better spent designing proper system controls that shed load if resource usage becomes | ||
| too high, etc. |
There was a problem hiding this comment.
I think part of this philosophy is also that restarts are fast, reliable and cheap. Can you add something to this effect? This won't be universally true, so it'd help to have it as a strawman.
| * Although we strongly recommend that any type of startup error leads to a fatal error, since this | ||
| is almost always a result of faulty configuration which should be caught during a canary process, | ||
| there may be cases in which we want some classes of startup errors to be non-fatal. For example, | ||
| if a misconfigured option is not necessary for server operation. Although this is discouraged, we |
There was a problem hiding this comment.
Do you want to give the example of the admin access path? Also, what's the philosophy on reporting this degraded mode and do we have a consistent story for signaling via logs or other means (the admin interface) the degraded state?
| Time is better spent designing proper system controls that shed load if resource usage becomes | ||
| too high, etc. | ||
| * Although we strongly recommend that any type of startup error leads to a fatal error, since this | ||
| is almost always a result of faulty configuration which should be caught during a canary process, |
There was a problem hiding this comment.
Do we want to consider the situations where canary doesn't catch something (e.g. an unusual syscall being exercised post-config check) and we get into a crash loop? This can then lead to cascading failure. Is it considered better here to just crash loop all frontends or to have them operate in a degraded state?
There was a problem hiding this comment.
I think this is covered by the fast startup aspect. If you have something this low rate and it lands in production, you should be able to roll back and have it not be a crisis, so I don't view this is a real operational issue. I will clarify.
|
@htuch updated |
| via `RELEASE_ASSERT(condition)` or `PANIC(message)` if there is no other sensible behavior, e.g. | ||
| in OOM (memory/FD) scenarios. Only `RELEASE_ASSERT(condition)` should be used to validate | ||
| conditions that might be imposed by the external environment. `ASSERT(condition)` should be used | ||
| to document (and check in debug-only builds) program invariants. Use `ASSERT` liberally, but do |
There was a problem hiding this comment.
I'm thinking there is a gray line here between external environment and program invariants. For example, memory corruption due to a security issue (a bug, deliberate buffer overflow etc.) might manifest as a violation of program invariants or as a detectable condition in the external environment (e.g. some library returning a highly unexpected error code or buffer contents). My understanding here is that we do want to RELEASE_ASSERT as there is no other sensible behavior. We would only catch this due to the RELEASE_ASSERT on the external environment however, since we compile out the program invariant checks for release.
The above isn't actually a question or request to change anything (other than to maybe elaborate a bit), just an observation. Maybe the wider question is how much defense-in-depth is acceptable for security reasons (as opposed to just being pedantic about man page return codes) and what level of RELEASE_ASSERT sanity checking is permissible vs. the performance cost it imposes.
There was a problem hiding this comment.
That's a valid question, but one that will quickly get us into the weeds. Ultimately I think this comes down to experience, judgement, and code review. I will try to put something to that effect in.
STYLE.md
Outdated
| * OOM events (both memory and FDs) are considered fatal crashing errors. This rule is again based | ||
| on the philosophy that the engineering costs of properly handling these cases is not worth it. | ||
| Time is better spent designing proper system controls that shed load if resource usage becomes | ||
| too high, etc. |
There was a problem hiding this comment.
Maybe explain that we should never silently ignore OOM (as covered in #567), but instead know that this is treated as an exception or RELEAST_ASSERT on return codes.
STYLE.md
Outdated
| case is the --admin-address-path option). **If degraded mode error handling is implemented, we require | ||
| that there is complete test coverage for the degraded case.** Additionally, the user should be | ||
| aware of the degraded state minimally via an error log of level warn or greater, and optimally | ||
| via the increment of a stat. |
There was a problem hiding this comment.
I would use stronger language and require stats tracking degraded features.
STYLE.md
Outdated
| that there is complete test coverage for the degraded case.** Additionally, the user should be | ||
| aware of the degraded state minimally via an error log of level warn or greater, and optimally | ||
| via the increment of a stat. | ||
| * There error handling philosophy described herein is based on the assumption that Envoy is deployed |
htuch
left a comment
There was a problem hiding this comment.
LGTM, I think this will be very helpful in both ensuring we have a consistent error handling philosophy right now in the code base as we continue to add code and help inform some of the discussions around this.
STYLE.md
Outdated
| there may be cases in which we want some classes of startup errors to be non-fatal. For example, | ||
| if a misconfigured option is not necessary for server operation. Although this is discouraged, we | ||
| will discuss these on a case by case basis case basis during code review (an example of this | ||
| is the --admin-address-path option). **If degraded mode error handling is implemented, we require |
STYLE.md
Outdated
| to document (and check in debug-only builds) program invariants. Use `ASSERT` liberally, but do | ||
| not use it for things that will crash in an obvious way in a subsequent line. E.g., do not do | ||
| `ASSERT(foo != nullptr); foo->doSomething();`. Note that there is a gray line between external | ||
| environment and program invariants. For example, memory corruption due to a security issue (a bug, |
There was a problem hiding this comment.
Reword: "external environment failures and program invariant violations."
No description provided.