AsyncClientFactory: use getThreadLocalCluster for thread safety#15227
AsyncClientFactory: use getThreadLocalCluster for thread safety#15227mattklein123 merged 12 commits intoenvoyproxy:mainfrom
Conversation
due to envoyproxy#14204, doing static cluster checks for AsyncClientFactory can only be done on the main thread. adds a ENVOY_BUG check to ensure that cluster checks are skipped when not running on the main thread. Signed-off-by: Michael Wozniak <wozz@koh.ms> Signed-off-by: wozz <wozz@users.noreply.github.com>
Signed-off-by: Michael Wozniak <wozz@koh.ms>
Signed-off-by: Michael Wozniak <wozz@koh.ms>
mattklein123
left a comment
There was a problem hiding this comment.
Thanks for adding this check. One question.
/wait
Signed-off-by: Michael Wozniak <wozz@koh.ms>
| ASSERT(Thread::MainThread::isMainThread(), | ||
| "async client factory cluster checks should only be performed on the main thread"); |
There was a problem hiding this comment.
I'm really sorry, I should have looked at this in more detail, but instead of using the clusters() api should we just be using the get() API which is thread local? This code seems broken. If we did that could we remove the skip clusters check param entirely?
There was a problem hiding this comment.
I guess we can use the get api, but the skip_clusters_check would become require_static_cluster? I'm not sure what the implications are of a cluster being statically defined vs provided by API or if there are cases that actually require statically defined clusters or if that requirement is due to some legacy behavior that's no longer necessary.
There was a problem hiding this comment.
actually, I think I can use this: https://github.com/envoyproxy/envoy/blob/main/source/common/config/utility.h#L116..L128
There was a problem hiding this comment.
Yes ^ exactly, thanks.
/wait
There was a problem hiding this comment.
the utility function has the same issue, relying on ClusterManager::clusters(), so I've just updated this one instance to use getThreadLocalCluster and maintain the same semantics as previously for the skip_cluster_checks param.
There was a problem hiding this comment.
The async client factory should only ever be used on the main thread. I think it might be worth asserting that, since you probably will get other kinds of broken behavior if you try use it on a worker thread.
There was a problem hiding this comment.
This is an example where it's currently being used on a worker thread: https://github.com/envoyproxy/envoy/blob/main/source/extensions/filters/http/ext_authz/config.cc#L80..L89
Does that need to change?
There was a problem hiding this comment.
Hmm, good point actually, I think what you have makes sense then. CC @chaoqin-li1123 for visibility into this PR.
Signed-off-by: Michael Wozniak <wozz@koh.ms>
Signed-off-by: Michael Wozniak <wozz@koh.ms>
Signed-off-by: Michael Wozniak <wozz@koh.ms>
Signed-off-by: Michael Wozniak <wozz@koh.ms>
Signed-off-by: Michael Wozniak <wozz@koh.ms>
|
|
||
| using AsyncClientFactoryPtr = std::unique_ptr<AsyncClientFactory>; | ||
|
|
||
| enum class AsyncClientFactoryClusterChecks { Skip, ValidateStatic, ValidateStaticDuringBootstrap }; |
There was a problem hiding this comment.
Where is ValidateStaticDuringBootstrap used in prod code? I don't see it?
There was a problem hiding this comment.
it's effectively used here: https://github.com/envoyproxy/envoy/blob/main/source/server/server.cc#L620..L621 in factoryForGrpcApiConfigSource
This is the place where getThreadLocalCluster doesn't work.
There was a problem hiding this comment.
OK got it. It really seems like we should be able to do something better here but I'm struggling to think of what that better thing is so I agree we can go with this right now. I will take a look once passing CI.
Signed-off-by: Michael Wozniak <wozz@koh.ms>
…ty (envoyproxy#15227)" This reverts commit 296d106. Signed-off-by: Michael Wozniak <wozz@koh.ms>
…ty (envoyproxy#15227)" (envoyproxy#15449) This reverts commit 296d106. Signed-off-by: Michael Wozniak <wozz@koh.ms> Signed-off-by: Auni Ahsan <auni@google.com>
Commit Message: AsyncClientFactory: use getThreadLocalCluster for thread safety
Additional Description: due to #14204, AsyncClientFactory should use
ClusterManager::getThreadLocalClusterto ensure that checking the cluster status can be performed on main or worker threads.Risk Level: low
Testing: existing tests
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Deprecated:]
[Optional API Considerations:]