error handling proposal#153
Conversation
specification/error-handling.md
Outdated
|
|
||
| 1. APIs must not throw unhandled exceptions when the API is used incorrectly by | ||
| the end user. Smart defaults should be used so that the SDK generally works. | ||
| For instance, name like `empty` MUST be used when `null` value was passed as |
There was a problem hiding this comment.
Hmmm I think it's good practice to validate the required argument when end users call our API and throw exceptions when it violates certain constraints. Consider other cases like name being too long, I think it's better to throw exceptions than silently truncate it or use defaults.
There was a problem hiding this comment.
Would it be better if application doesn't start due to some environment variable was misconfigured (for resource API for instance) or if application sent a slightly inconsistent telemetry?
Same for span name. Would you rather return 500 to customer or use empty for the null string? (see also comment for @bogdandrutu 's question
There was a problem hiding this comment.
We had some conversations in Python SDK open-telemetry/opentelemetry-python#11 (comment).
One thing to consider: there is no simple way to determine if span name is too long since different exporters/backends might have different limitations. We don't want the users to see the application running fine for exporter A, and start to see unhandled exceptions when they switch to exporter B.
There was a problem hiding this comment.
We don't want the users to see the application running fine for exporter A, and start to see unhandled exceptions when they switch to exporter B.
+1
specification/error-handling.md
Outdated
|
|
||
| OpenTelemetry SDK must not throw or leak unhandled or user unhandled exceptions. | ||
|
|
||
| 1. APIs must not throw unhandled exceptions when the API is used incorrectly by |
There was a problem hiding this comment.
This is an interesting point. I think I have a different opinion:
- When a specific implementation is used no different checks are applied - means that if the API says a span name is valid if not null, then implementation cannot throw exception if empty.
- API can throw exception for things like null span name because that is documented.
Happy to be convinced otherwise, but my understanding is that as long as we apply the same checks across the API and SDK we can throw exception for obvious things.
There was a problem hiding this comment.
one problem that you may run into is that arguments you are passing to telemetry API may be received from external source. So you have a null string. If you are smart and read the doc - you will check for null and pass some random name to the API. If you are smart and haven't read the doc - API will crash and potentially bring this request processing with you.
Both outcomes can be worked around for by applying the same smart default as customer would need to anyway.
There was a problem hiding this comment.
Sure, but running a "blind" process with no monitor at all because we cannot export data to the metrics backend, is that better? I feel that fail fast approach should be used here and crashing during the initialization is very good in this case.
There was a problem hiding this comment.
I agree with @bogdandrutu on "API can throw exception for things like null span name because that is documented".
There was a problem hiding this comment.
Both outcomes can be worked around for by applying the same smart default as customer would need to anyway.
I agree with this. At the same time, we need a balance for reporting errors for truly fatal things (i.e. " cannot export data to the metrics backend"), but otherwise throwing exceptions should stay at a minimum level.
Specific case: I do think using a null name for Span could fallback to using a default name, instead of throwing. Documenting it is nice, but providing defaults for these kind of things is a good alternative ;)
There was a problem hiding this comment.
I agree re: failing fast at initialization/construction time, but not in ways that might crash the program (or thread or ____) at runtime.
iredelmeier
left a comment
There was a problem hiding this comment.
The doc currently mingles API and SDK error handling in a way that I expect to be confusing to multiple parties, particularly anyone either working on or using an alternative to the SDK.
Part of the problem may be upstream, i.e., that the API spec is missing logic and so the SDK is forced to backfill (and therefore error handle) it.
specification/error-handling.md
Outdated
| 4. Beware of any call to external callbacks or override-able interface. Expect | ||
| them to throw. | ||
|
|
||
| ## SDK self-diagnostics |
There was a problem hiding this comment.
While very valuable, self-diagnostics are inherently complicated. This feels like something that needs a larger discussion.
specification/error-handling.md
Outdated
| 4. Beware of any call to external callbacks or override-able interface. Expect | ||
| them to throw. | ||
|
|
||
| ## SDK self-diagnostics |
There was a problem hiding this comment.
I believe that user-facing instrumentation APIs should never return errors (or throw them).
One reason that users should never receive these, is that the'll be tempted to turn around and call the instrumentation library with the error.
The SDK will, however, encounter errors, and it a user has a right to know, but not with in-line code. In the opentracing Go library we addressed this by letting the application register a callback for receiving self diagnostics, including errors, that would allow the application to handle self diagnostics in a separate module.
There was a problem hiding this comment.
We are doing exactly the same at Dynatrace with our agent SDK:
The SDK API never throws, since we do not want to change the application's behavior and risk crashing it if they don't catch everything properly, just because monitoring wouldn't work. In order to not let our SDK users in the dark and give them guidance for troubleshooting we also offer a diagnostic callback. It provides immediate logging output in case of errors (see https://github.com/Dynatrace/OneAgent-SDK#logging-callback). This turned out to work well and be pretty helpful in the past.
|
This has gone a bit stale. I believe @SergeyKanzhelev is on paternity leave. Any opinions on how we resolve this, or otherwise move it forwards? |
|
@mwear I've added a note into guidance on returning no-op instead of |
|
Thanks @SergeyKanzhelev I'll bring this up at the Ruby SIG so folks are aware of it. |
I think there's a fourth option here: that the SDK can rescue errors and return non-null defaults when they're encountered. If it's possible to provide defaults for every argument ("do runtime type checks and provide defaults when they fail") then presumably it's possible to provide the default return value that you'd get by calling the function with all the default args. I don't mean to say this is the right solution, just that it's worth considering with the others. |
specification/error-handling.md
Outdated
| SDK authors MAY supply settings that allow end users to change the library's | ||
| default error handling behavior. Application developers may want to run with | ||
| strict error handling in a staging environment to catch invalid uses of the API, | ||
| or malformed config. |
There was a problem hiding this comment.
@bogdandrutu you brought up the auditing use case in the SIG call, what do you think about expanding on strict error handling here? E.g.:
Under certain circumstances, it's preferable for an application to fail to complete an operation than to complete it but fail to produce telemetry data.
SDK authors SHOULD provide a configuration option that allows the end user to change the library's default error handling behavior. When this option is enabled, the SDK MUST NOT suppress errors or return placeholder objects.
Application developers may also want to run with strict error handling during tests, or in a staging environment, to catch invalid uses of the API, or malformed config.
There was a problem hiding this comment.
If we want to standardize this I'd suggest to have a separate work stream on it. "MAY" can be enough for a first iteration
@c24t Is your comment the same as what is being suggested here: #153 (comment). If so, that is the approach that this document currently recommends. |
Yup. We're on the same page, #153 (comment) seemed to suggest those were the only three options. |
|
@iredelmeier, thank you for review. I feel that it is addressed now and ask to review again. This proposal was discussed on today's spec SIG and a few before this. And there is a general feeling that we need to document error handling principles.
This doc asks all implementations to be consistent in their error handling practices: "OpenTelemetry implementations MUST NOT throw unhandled exceptions at run time.". Even though it may be hard to control, this is what we believe must simplify operation with the OpenTelemetry given it's nature of a monitoring, not a business critical tool in most cases.
What forum do you suggest this discussion to be held on? |
tedsuo
left a comment
There was a problem hiding this comment.
Thanks @SergeyKanzhelev, this latest version looks great.
Co-Authored-By: Armin Ruech <armin.ruech@gmail.com>
|
this PR had a lot of bake time after was approved. Merging |
* error handling proposal * added self-monitoring * Reword principles * Reword guidance * Reword diagnostics * Reword exceptions * Add note on logs, callbacks * Formatting * formatting and a mention of ToString * dynamic languages * returning noops, not nulls * Reformat for open-telemetry#192 * Update specification/error-handling.md Co-Authored-By: Armin Ruech <armin.ruech@gmail.com>
* error handling proposal * added self-monitoring * Reword principles * Reword guidance * Reword diagnostics * Reword exceptions * Add note on logs, callbacks * Formatting * formatting and a mention of ToString * dynamic languages * returning noops, not nulls * Reformat for open-telemetry#192 * Update specification/error-handling.md Co-Authored-By: Armin Ruech <armin.ruech@gmail.com>
Co-authored-by: Armin Ruech <armin.ruech@dynatrace.com>
Fix #141
This proposal suggests the change comparing to the current API/SDK implementation in Java.