Skip to content

dev docs: clarify error handling philosophy#854

Merged
mattklein123 merged 5 commits intomasterfrom
failure_docs
Apr 28, 2017
Merged

dev docs: clarify error handling philosophy#854
mattklein123 merged 5 commits intomasterfrom
failure_docs

Conversation

@mattklein123
Copy link
Copy Markdown
Member

No description provided.

@mattklein123
Copy link
Copy Markdown
Member Author

@htuch

Copy link
Copy Markdown
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for codifying this!

happen should lead to process death, under the assumption that the additional burden of defensive
coding and testing is not an effective use of time for an error that should not happen given
proper system setup. Examples of these types of errors include not being able to open the shared
memory region, an invalid initial JSON config read from disk, system calls that should not fail
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide 2 or 3 examples of these systems calls that shouldn't fail?

STYLE.md Outdated
* OOM events (both memory and FDs) are considered fatal crashing errors. This rule is again based
on the philosophy that the engineering costs of properly handling these cases is not worth it.
Time is better spent designing proper system controls that shed load if resource usage becomes
too high, etc.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think part of this philosophy is also that restarts are fast, reliable and cheap. Can you add something to this effect? This won't be universally true, so it'd help to have it as a strawman.

* Although we strongly recommend that any type of startup error leads to a fatal error, since this
is almost always a result of faulty configuration which should be caught during a canary process,
there may be cases in which we want some classes of startup errors to be non-fatal. For example,
if a misconfigured option is not necessary for server operation. Although this is discouraged, we
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to give the example of the admin access path? Also, what's the philosophy on reporting this degraded mode and do we have a consistent story for signaling via logs or other means (the admin interface) the degraded state?

Time is better spent designing proper system controls that shed load if resource usage becomes
too high, etc.
* Although we strongly recommend that any type of startup error leads to a fatal error, since this
is almost always a result of faulty configuration which should be caught during a canary process,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to consider the situations where canary doesn't catch something (e.g. an unusual syscall being exercised post-config check) and we get into a crash loop? This can then lead to cascading failure. Is it considered better here to just crash loop all frontends or to have them operate in a degraded state?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is covered by the fast startup aspect. If you have something this low rate and it lands in production, you should be able to roll back and have it not be a crisis, so I don't view this is a real operational issue. I will clarify.

@mattklein123
Copy link
Copy Markdown
Member Author

@htuch updated

via `RELEASE_ASSERT(condition)` or `PANIC(message)` if there is no other sensible behavior, e.g.
in OOM (memory/FD) scenarios. Only `RELEASE_ASSERT(condition)` should be used to validate
conditions that might be imposed by the external environment. `ASSERT(condition)` should be used
to document (and check in debug-only builds) program invariants. Use `ASSERT` liberally, but do
Copy link
Copy Markdown
Member

@htuch htuch Apr 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking there is a gray line here between external environment and program invariants. For example, memory corruption due to a security issue (a bug, deliberate buffer overflow etc.) might manifest as a violation of program invariants or as a detectable condition in the external environment (e.g. some library returning a highly unexpected error code or buffer contents). My understanding here is that we do want to RELEASE_ASSERT as there is no other sensible behavior. We would only catch this due to the RELEASE_ASSERT on the external environment however, since we compile out the program invariant checks for release.

The above isn't actually a question or request to change anything (other than to maybe elaborate a bit), just an observation. Maybe the wider question is how much defense-in-depth is acceptable for security reasons (as opposed to just being pedantic about man page return codes) and what level of RELEASE_ASSERT sanity checking is permissible vs. the performance cost it imposes.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a valid question, but one that will quickly get us into the weeds. Ultimately I think this comes down to experience, judgement, and code review. I will try to put something to that effect in.

STYLE.md Outdated
* OOM events (both memory and FDs) are considered fatal crashing errors. This rule is again based
on the philosophy that the engineering costs of properly handling these cases is not worth it.
Time is better spent designing proper system controls that shed load if resource usage becomes
too high, etc.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe explain that we should never silently ignore OOM (as covered in #567), but instead know that this is treated as an exception or RELEAST_ASSERT on return codes.

STYLE.md Outdated
case is the --admin-address-path option). **If degraded mode error handling is implemented, we require
that there is complete test coverage for the degraded case.** Additionally, the user should be
aware of the degraded state minimally via an error log of level warn or greater, and optimally
via the increment of a stat.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use stronger language and require stats tracking degraded features.

STYLE.md Outdated
that there is complete test coverage for the degraded case.** Additionally, the user should be
aware of the degraded state minimally via an error log of level warn or greater, and optimally
via the increment of a stat.
* There error handling philosophy described herein is based on the assumption that Envoy is deployed
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: s/There error/The error/

@mattklein123
Copy link
Copy Markdown
Member Author

@htuch @tschroed updated, please review.

Copy link
Copy Markdown
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I think this will be very helpful in both ensuring we have a consistent error handling philosophy right now in the code base as we continue to add code and help inform some of the discussions around this.

STYLE.md Outdated
there may be cases in which we want some classes of startup errors to be non-fatal. For example,
if a misconfigured option is not necessary for server operation. Although this is discouraged, we
will discuss these on a case by case basis case basis during code review (an example of this
is the --admin-address-path option). **If degraded mode error handling is implemented, we require
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: --admin-address-path

STYLE.md Outdated
to document (and check in debug-only builds) program invariants. Use `ASSERT` liberally, but do
not use it for things that will crash in an obvious way in a subsequent line. E.g., do not do
`ASSERT(foo != nullptr); foo->doSomething();`. Note that there is a gray line between external
environment and program invariants. For example, memory corruption due to a security issue (a bug,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reword: "external environment failures and program invariant violations."

@mattklein123 mattklein123 merged commit f935973 into master Apr 28, 2017
@mattklein123 mattklein123 deleted the failure_docs branch April 28, 2017 17:06
jpsim pushed a commit that referenced this pull request Nov 28, 2022
Description: upgrade to bazel 3.1.0 to avoid #11210. Additionally, avoid issue from using mobile-install on the CI machines #853.
Risk Level: low, finally a green build.
Testing: CI is green!

Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: JP Simard <jp@jpsim.com>
jpsim pushed a commit that referenced this pull request Nov 29, 2022
Description: upgrade to bazel 3.1.0 to avoid #11210. Additionally, avoid issue from using mobile-install on the CI machines #853.
Risk Level: low, finally a green build.
Testing: CI is green!

Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: JP Simard <jp@jpsim.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants