upstream: add transport socket failure reason to stream info and log#6018
upstream: add transport socket failure reason to stream info and log#6018lizan merged 24 commits intoenvoyproxy:masterfrom
Conversation
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
mattklein123
left a comment
There was a problem hiding this comment.
Thanks, a few high level questions before a full review.
/wait
|
|
||
| %UPSTREAM_TRANSPORT_FAILURE_REASON% | ||
| HTTP | ||
| The reason of upstream transport failure, this is usually upstream TLS connection failure with 503 response code. |
There was a problem hiding this comment.
Can you make this a little more clear? Is the latter part of the sentence the actual error message? Should it be in quotes? Also, do we want a string here, or should we have well defined error codes like we do for other failures?
There was a problem hiding this comment.
I considered well defined error codes but I chose a string here due to the transport socket extensibility and it is likely only used in logging purpose. We can make some well known errors as predefined strings though. No the latter part is not the actual message, will update the document.
| The reason of upstream transport failure, this is usually upstream TLS connection failure with 503 response code. | ||
|
|
||
| TCP | ||
| Not implemented ("-") |
There was a problem hiding this comment.
We could implement this, right? Is there a TODO somewhere?
There was a problem hiding this comment.
Added a TODO, will look into this later to see if I can implement within this PR.
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
mattklein123
left a comment
There was a problem hiding this comment.
Thanks generally looks good. I think I would prefer if we had some more documentation for the known messages that the built-in transport sockets emit. Can we add that?
Also, are there any relevant errors we should print for the non-TLS transport? I guess I'm wondering if we should do the equivalent of strerror to print out connection failure reasons, etc.? Might not be worth it in this PR but maybe a TODO somewhere if you think that would be relevant?
/wait
|
|
||
| namespace { | ||
|
|
||
| constexpr absl::string_view NotReadyReason{"SSL error: secret is not ready"}; |
There was a problem hiding this comment.
nit: What does this mean? Can you make the error message a bit better or potentially actually document all the potential errors for the built-in transport sockets somewhere?
There was a problem hiding this comment.
I hope now it is better, when SDS is configured there are chances that Cluster are up but Secret is not ready.
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Done.
Yes that's one of the reason I put it into Connection interface, left a TODO there. |
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
mattklein123
left a comment
There was a problem hiding this comment.
Thanks for the extra docs, it makes the feature really clear and obviously useful to me. A few small additional doc/error thoughts.
/wait
| Trouble shooting | ||
| ---------------- | ||
|
|
||
| When Envoy originates TLS when making connections to upstream clusters, its error will be logged into |
There was a problem hiding this comment.
nit: "..., any errors will be logged into the ..."
| :ref:`UPSTREAM_TRANSPORT_FAILURE_REASON<config_access_log_format_upstream_transport_failure_reason>` field. | ||
| Common errors are: | ||
|
|
||
| * ``Secret is not delivered via SDS``: Envoy is still waiting SDS to deliver key/cert or root CA. |
There was a problem hiding this comment.
nit: I think I would call this error "Secret not supplied by SDS" WDYT?
nit: "still waiting for"
docs/root/intro/version_history.rst
Outdated
| * access log: added a new flag for upstream retry count exceeded. | ||
| * access log: added a :ref:`gRPC filter <envoy_api_msg_config.filter.accesslog.v2.GrpcStatusFilter>` to allow filtering on gRPC status. | ||
| * access log: added a new flag for stream idle timeout. | ||
| * access log: added UPSTREAM_TRANSPORT_FAILURE_REASON for HTTP. |
There was a problem hiding this comment.
nit: can you ref link to both new proto field and text formatting docs?
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
|
@mattklein123 docs updated, let's merge #6142 first and I will update this PR. I believe they conflicts. |
mattklein123
left a comment
There was a problem hiding this comment.
Agreed on merging the other one first. Thanks!
| namespace { | ||
|
|
||
| constexpr absl::string_view NotReadyReason{"TLS error: Secret is not delivered via SDS"}; | ||
| constexpr absl::string_view NotReadyReason{"TLS error: Secret is not supplied by SDS"}; |
There was a problem hiding this comment.
Would be nice if there was a test that covers this?
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
* master: token bucket: several fixes (envoyproxy#6235) config: move logging of full response to trace logging (envoyproxy#6226) mysql_filter: add a warning about compatibility (envoyproxy#6234) upstream: add transport socket failure reason to stream info and log (envoyproxy#6018) IoHandle readv and writev (envoyproxy#6037) Signed-off-by: Michael Puncel <mpuncel@squareup.com>
This fixes a bug in the pre-tls_async_cert_validation codepath. The tls_async_cert_validation codepath is unaffected. Per both OpenSSL and BoringSSL documentation, if you use SSL_CTX_set_cert_verify_callback, the library expects to find the error code in X509_STORE_CTX. This error determines the TLS alert. X509_verify_cert does this for you, but Envoy's callback was not doing this when doSynchronousVerifyCertChain runs the extra checks in verifyCertAndUpdateStatus. This change fixes this and sets the generic X509_V_ERR_APPLICATION_VERIFICATION. Before this commit, the error would usually be X509_V_OK, but sometimes it would be something else. Envoy implements the allow_expired_certificate using the X509_STORE_CTX verify callback, via ignoreCertificateExpirationCallback. This callback, when it suppresses an error, will sometimes leave stray errors in X509_STORE_CTX. These are usually benign, *but* if Envoy then does extra checks, which fail, and then report some *other* failure, it will inadvertantly return the stray error to BoringSSL! That is why FailedClientCertAllowExpiredBadHashVerification expected a SSLV3_ALERT_CERTIFICATE_EXPIRED alert. The alert makes no sense because the connection sets allow_expired_certificate! Rather, the test actually caught an Envoy bug, but envoyproxy#6018 mistakenly codified it as expected behavior, rather than correctly diagnosing it as a bug. This commit fixes that bug. Unfortunately, OpenSSL/BoringSSL map X509_V_ERR_APPLICATION_VERIFICATION to handshake_failure rather than certificate_unknown, so this does change the alert returned by the pre-tls_async_cert_validation codepath in other cases. (certificate_unknown came from X509_V_OK hitting the default in a switch-case.) Arguably certificate_unknown is a better alert than handshake_failure, so I'm slightly inclined to fix BoringSSL to map that differently. But since Envoy needs to support an old BoringSSL, I've just updated the expectations for now. Either way, tls_async_cert_validation gives full control over the alert and will make this moot. Signed-off-by: David Benjamin <davidben@google.com>
This fixes a bug in the pre-tls_async_cert_validation codepath. The tls_async_cert_validation codepath is unaffected. Per both OpenSSL and BoringSSL documentation, if you use SSL_CTX_set_cert_verify_callback, the library expects to find the error code in X509_STORE_CTX. This error determines the TLS alert. X509_verify_cert does this for you, but Envoy's callback was not doing this when doSynchronousVerifyCertChain runs the extra checks in verifyCertAndUpdateStatus. This change fixes this and sets the generic X509_V_ERR_APPLICATION_VERIFICATION. Before this commit, the error would usually be X509_V_OK, but sometimes it would be something else. Envoy implements the allow_expired_certificate using the X509_STORE_CTX verify callback, via ignoreCertificateExpirationCallback. This callback, when it suppresses an error, will sometimes leave stray errors in X509_STORE_CTX. These are usually benign, *but* if Envoy then does extra checks, which fail, and then report some *other* failure, it will inadvertantly return the stray error to BoringSSL! That is why FailedClientCertAllowExpiredBadHashVerification expected a SSLV3_ALERT_CERTIFICATE_EXPIRED alert. The alert makes no sense because the connection sets allow_expired_certificate! Rather, the test actually caught an Envoy bug, but #6018 mistakenly codified it as expected behavior, rather than correctly diagnosing it as a bug. This commit fixes that bug. Unfortunately, OpenSSL/BoringSSL map X509_V_ERR_APPLICATION_VERIFICATION to handshake_failure rather than certificate_unknown, so this does change the alert returned by the pre-tls_async_cert_validation codepath in other cases. (certificate_unknown came from X509_V_OK hitting the default in a switch-case.) Arguably certificate_unknown is a better alert than handshake_failure, so I'm slightly inclined to fix BoringSSL to map that differently. But since Envoy needs to support an old BoringSSL, I've just updated the expectations for now. Either way, tls_async_cert_validation gives full control over the alert and will make this moot. Signed-off-by: David Benjamin <davidben@google.com>
Description:
Fixes #5603
Risk Level: Low (not changing flow, add more info)
Testing: unit test,
Docs Changes: Added
Release Notes: Added