Skip to content

upstream: add transport socket failure reason to stream info and log#6018

Merged
lizan merged 24 commits intoenvoyproxy:masterfrom
lizan:tls_details
Mar 10, 2019
Merged

upstream: add transport socket failure reason to stream info and log#6018
lizan merged 24 commits intoenvoyproxy:masterfrom
lizan:tls_details

Conversation

@lizan
Copy link
Copy Markdown
Member

@lizan lizan commented Feb 21, 2019

Description:
Fixes #5603

Risk Level: Low (not changing flow, add more info)
Testing: unit test,
Docs Changes: Added
Release Notes: Added

Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
lizan added 2 commits March 3, 2019 00:06
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
lizan added 2 commits March 4, 2019 19:37
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
@lizan lizan marked this pull request as ready for review March 5, 2019 08:20
@lizan lizan requested a review from htuch as a code owner March 5, 2019 08:20
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, a few high level questions before a full review.

/wait


%UPSTREAM_TRANSPORT_FAILURE_REASON%
HTTP
The reason of upstream transport failure, this is usually upstream TLS connection failure with 503 response code.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make this a little more clear? Is the latter part of the sentence the actual error message? Should it be in quotes? Also, do we want a string here, or should we have well defined error codes like we do for other failures?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered well defined error codes but I chose a string here due to the transport socket extensibility and it is likely only used in logging purpose. We can make some well known errors as predefined strings though. No the latter part is not the actual message, will update the document.

The reason of upstream transport failure, this is usually upstream TLS connection failure with 503 response code.

TCP
Not implemented ("-")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could implement this, right? Is there a TODO somewhere?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a TODO, will look into this later to see if I can implement within this PR.

Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks generally looks good. I think I would prefer if we had some more documentation for the known messages that the built-in transport sockets emit. Can we add that?

Also, are there any relevant errors we should print for the non-TLS transport? I guess I'm wondering if we should do the equivalent of strerror to print out connection failure reasons, etc.? Might not be worth it in this PR but maybe a TODO somewhere if you think that would be relevant?

/wait


namespace {

constexpr absl::string_view NotReadyReason{"SSL error: secret is not ready"};
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: What does this mean? Can you make the error message a bit better or potentially actually document all the potential errors for the built-in transport sockets somewhere?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope now it is better, when SDS is configured there are chances that Cluster are up but Secret is not ready.

lizan added 3 commits March 6, 2019 00:31
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
@lizan
Copy link
Copy Markdown
Member Author

lizan commented Mar 7, 2019

I think I would prefer if we had some more documentation for the known messages that the built-in transport sockets emit. Can we add that?

Done.

Also, are there any relevant errors we should print for the non-TLS transport? I guess I'm wondering if we should do the equivalent of strerror to print out connection failure reasons, etc.? Might not be worth it in this PR but maybe a TODO somewhere if you think that would be relevant?

Yes that's one of the reason I put it into Connection interface, left a TODO there.

Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
@lizan lizan requested a review from mattklein123 March 7, 2019 21:41
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the extra docs, it makes the feature really clear and obviously useful to me. A few small additional doc/error thoughts.

/wait

Trouble shooting
----------------

When Envoy originates TLS when making connections to upstream clusters, its error will be logged into
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "..., any errors will be logged into the ..."

:ref:`UPSTREAM_TRANSPORT_FAILURE_REASON<config_access_log_format_upstream_transport_failure_reason>` field.
Common errors are:

* ``Secret is not delivered via SDS``: Envoy is still waiting SDS to deliver key/cert or root CA.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think I would call this error "Secret not supplied by SDS" WDYT?
nit: "still waiting for"

* access log: added a new flag for upstream retry count exceeded.
* access log: added a :ref:`gRPC filter <envoy_api_msg_config.filter.accesslog.v2.GrpcStatusFilter>` to allow filtering on gRPC status.
* access log: added a new flag for stream idle timeout.
* access log: added UPSTREAM_TRANSPORT_FAILURE_REASON for HTTP.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you ref link to both new proto field and text formatting docs?

Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
@lizan
Copy link
Copy Markdown
Member Author

lizan commented Mar 8, 2019

@mattklein123 docs updated, let's merge #6142 first and I will update this PR. I believe they conflicts.

mattklein123
mattklein123 previously approved these changes Mar 8, 2019
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on merging the other one first. Thanks!

namespace {

constexpr absl::string_view NotReadyReason{"TLS error: Secret is not delivered via SDS"};
constexpr absl::string_view NotReadyReason{"TLS error: Secret is not supplied by SDS"};
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice if there was a test that covers this?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

lizan added 2 commits March 8, 2019 14:43
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
Signed-off-by: Lizan Zhou <lizan@tetrate.io>
@lizan lizan merged commit b41ba59 into envoyproxy:master Mar 10, 2019
@lizan lizan deleted the tls_details branch March 10, 2019 07:50
mpuncel added a commit to mpuncel/envoy that referenced this pull request Mar 11, 2019
* master:
  token bucket: several fixes (envoyproxy#6235)
  config: move logging of full response to trace logging (envoyproxy#6226)
  mysql_filter: add a warning about compatibility (envoyproxy#6234)
  upstream: add transport socket failure reason to stream info and log (envoyproxy#6018)
  IoHandle readv and writev (envoyproxy#6037)

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
davidben added a commit to davidben/envoy that referenced this pull request Sep 29, 2022
This fixes a bug in the pre-tls_async_cert_validation codepath. The
tls_async_cert_validation codepath is unaffected.

Per both OpenSSL and BoringSSL documentation, if you use
SSL_CTX_set_cert_verify_callback, the library expects to find the error
code in X509_STORE_CTX. This error determines the TLS alert.
X509_verify_cert does this for you, but Envoy's callback was not doing
this when doSynchronousVerifyCertChain runs the extra checks in
verifyCertAndUpdateStatus. This change fixes this and sets the generic
X509_V_ERR_APPLICATION_VERIFICATION.

Before this commit, the error would usually be X509_V_OK, but sometimes
it would be something else. Envoy implements the
allow_expired_certificate using the X509_STORE_CTX verify callback, via
ignoreCertificateExpirationCallback. This callback, when it suppresses
an error, will sometimes leave stray errors in X509_STORE_CTX. These are
usually benign, *but* if Envoy then does extra checks, which fail, and
then report some *other* failure, it will inadvertantly return the stray
error to BoringSSL!

That is why FailedClientCertAllowExpiredBadHashVerification expected a
SSLV3_ALERT_CERTIFICATE_EXPIRED alert. The alert makes no sense because
the connection sets allow_expired_certificate! Rather, the test actually
caught an Envoy bug, but envoyproxy#6018
mistakenly codified it as expected behavior, rather than correctly
diagnosing it as a bug. This commit fixes that bug.

Unfortunately, OpenSSL/BoringSSL map X509_V_ERR_APPLICATION_VERIFICATION
to handshake_failure rather than certificate_unknown, so this does
change the alert returned by the pre-tls_async_cert_validation codepath
in other cases. (certificate_unknown came from X509_V_OK hitting the
default in a switch-case.) Arguably certificate_unknown is a better
alert than handshake_failure, so I'm slightly inclined to fix BoringSSL
to map that differently. But since Envoy needs to support an old
BoringSSL, I've just updated the expectations for now. Either way,
tls_async_cert_validation gives full control over the alert and will
make this moot.

Signed-off-by: David Benjamin <davidben@google.com>
ggreenway pushed a commit that referenced this pull request Oct 3, 2022
This fixes a bug in the pre-tls_async_cert_validation codepath. The
tls_async_cert_validation codepath is unaffected.

Per both OpenSSL and BoringSSL documentation, if you use
SSL_CTX_set_cert_verify_callback, the library expects to find the error
code in X509_STORE_CTX. This error determines the TLS alert.
X509_verify_cert does this for you, but Envoy's callback was not doing
this when doSynchronousVerifyCertChain runs the extra checks in
verifyCertAndUpdateStatus. This change fixes this and sets the generic
X509_V_ERR_APPLICATION_VERIFICATION.

Before this commit, the error would usually be X509_V_OK, but sometimes
it would be something else. Envoy implements the
allow_expired_certificate using the X509_STORE_CTX verify callback, via
ignoreCertificateExpirationCallback. This callback, when it suppresses
an error, will sometimes leave stray errors in X509_STORE_CTX. These are
usually benign, *but* if Envoy then does extra checks, which fail, and
then report some *other* failure, it will inadvertantly return the stray
error to BoringSSL!

That is why FailedClientCertAllowExpiredBadHashVerification expected a
SSLV3_ALERT_CERTIFICATE_EXPIRED alert. The alert makes no sense because
the connection sets allow_expired_certificate! Rather, the test actually
caught an Envoy bug, but #6018
mistakenly codified it as expected behavior, rather than correctly
diagnosing it as a bug. This commit fixes that bug.

Unfortunately, OpenSSL/BoringSSL map X509_V_ERR_APPLICATION_VERIFICATION
to handshake_failure rather than certificate_unknown, so this does
change the alert returned by the pre-tls_async_cert_validation codepath
in other cases. (certificate_unknown came from X509_V_OK hitting the
default in a switch-case.) Arguably certificate_unknown is a better
alert than handshake_failure, so I'm slightly inclined to fix BoringSSL
to map that differently. But since Envoy needs to support an old
BoringSSL, I've just updated the expectations for now. Either way,
tls_async_cert_validation gives full control over the alert and will
make this moot.

Signed-off-by: David Benjamin <davidben@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants