xDS interop: collect pod logs#30594
Conversation
59f45cd to
987caf8
Compare
…ions (#30607) - Changes the order of waiting for pods to start: wait for the pods first, then for the deployment to transition to active. This should provide more useful information in the logs, showing exactly why the pod didn't start, instead of generic "Replicas not available" ref b/200293121. This also needed for #30594 - Add support for `check_result` callback in the retryer helpers - Completely replaces `retrying` with `tenacity`, ref b/200293121. Retrying is not longer maintained. - Improves the readability of timeout errors: now they contain the timeout (or the attempt number) exceeded, and information why the timeout failed (exception/check function): Before: > `tenacity.RetryError: RetryError[<Future at 0x7f8ce156bc18 state=finished returned dict>]` After: > `framework.helpers.retryers.RetryError: Retry error calling framework.infrastructure.k8s.KubernetesNamespace.get_pod: timeout 0:01:00 exceeded. Check result callback returned False.` - Improves the readability of the k8s wait operation errors: now the log includes colorized and formatted status of the k8s object being watched, instead of dumping the full k8s object. For example, here's how an error caused by using incorrect TD bootstrap image:
2902ada to
5e1eeb9
Compare
5e1eeb9 to
c056bc2
Compare
|
Logging not enabled, confirming the framework still works: |
gnossen
left a comment
There was a problem hiding this comment.
Awesome work! This is going to make troubleshooting issues so much easier!
|
|
||
| COLLECT_APP_LOGS = flags.DEFINE_bool( | ||
| 'collect_app_logs', | ||
| default=False, |
There was a problem hiding this comment.
Why not turn this on by default?
There was a problem hiding this comment.
I don't think we need this in most cases on local dev. It'll write to /tmp by default (this is an absl thing). It won't make sense for pre-provisioned pods (like url map tests) either.
Also, --log_dir is different per each environment (kokoro/staging/local), so I want setting collect_app_logs to be a conscious choice.
And there's always flagfiles to set this flags if needed.
tools/run_tests/xds_k8s_test_driver/framework/test_app/runners/base_runner.py
Outdated
Show resolved
Hide resolved
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| class PodLogCollector(threading.Thread): |
There was a problem hiding this comment.
There are a bunch of landmines when subclassing threading.Thread rather than just passing a callable to the threading.Thread constructor, but it looks like you've managed to avoid them all. No complaints on my end as long you've confirmed that it's all working as intended.
There was a problem hiding this comment.
Yea, I had an impression this was the case, but couldn't find any strong advices against it. Thanks for noticing! I'll give it another looks, just in case. Won't be a difficult refactoring to not extend a thread, but manage it as a field.
- Enables pod log collection in all PSM interop jobs implemented in grpc#30594. - Associate test suite runs with their own log file, so it's displayed on "Target Log" tab Cherry-pick conflicts resolved: - Removed authz test
- Enables pod log collection in all PSM interop jobs implemented in grpc#30594. - Associate test suite runs with their own log file, so it's displayed on "Target Log" tab Cherry-pick conflicts resolved: - Removed authz test
- Enables pod log collection in all PSM interop jobs implemented in grpc#30594. - Associate test suite runs with their own log file, so it's displayed on "Target Log" tab Cherry-pick conflicts resolved: - Removed authz test - Removed set command from the xlang script that doesn't exist in the following branches
- Enables pod log collection in all PSM interop jobs implemented in grpc#30594. - Associate test suite runs with their own log file, so it's displayed on "Target Log" tab Cherry-pick conflicts resolved: - Removed authz test - Ignore xlang script as it hasn't been created
…30864) - Enables pod log collection in all PSM interop jobs implemented in #30594. - Associate test suite runs with their own log file, so it's displayed on "Target Log" tab Cherry-pick conflicts resolved: - Removed authz test - Removed set command from the xlang script that doesn't exist in the following branches
* Bump 1.49.x branch to 1.49.0.pre1 (#30615) * bump version to 1.49.0-pre1 * regenerate projects * [backport][v1.49.x] forkable fixes (#30646) * Fix forkable globals (#30608) * Fix forkable repeated registration (#30642) This fixes a bug that could occur on repeated grpc initialization (after a complete shutdown) Fixes #30640 * Drop support for ruby 2.5 (#30699) (#30762) * Drop ruby 2.5 support * Backport: "stabilize the C2P resolver URI scheme" to v1.49.x (#30654) * stabilize the C2P resolver URI scheme * Bump 1.49 branch to 1.49.0.pre2 (#30786) * bump version to 1.49.0-pre2 * regenerate projects * Update protobuf on ancillary packages (#30795) (#30805) * Bump release version on 1.49 to 1.49.0.pre3 (#30814) * bump version to 1.49.0-pre3 * regenerate projects * xDS interop: enable pod log collection in the buildscripts (#30735) (#30856) - Enables pod log collection in all PSM interop jobs implemented in #30594. - Associate test suite runs with their own log file, so it's displayed on "Target Log" tab * xDS interop: buildscripts: fix run_test return status (#30768) (#30875) To capture the return status of the test in run_test the last command must be the call to the test itself. This removes `set +x`, which makes the run_test always return success, and not propagate the test status. I can't find it, but this exact error bit us before. Looks like it leaked to other scripts. The good thing is if the test was executed, it's failure would still be picked up from the result xml. However, if the test framework didn't start in the first place, the result will be false positive. Example: https://source.cloud.google.com/results/invocations/98d3e679-ec8a-40bd-9f36-88179747b0d6/targets ``` /home/kbuilder/.pyenv/versions/k8s_xds_test_runner/bin/python3: Error while finding module specification for 'tests.authz_test' (ModuleNotFoundError: No module named 'tests') + set +x Failed test suites: 0 [ID: 3548168] Command finished after 625 secs, exit value: 0 ``` * client_channel: fix crash when cancelling a watch after SHUTDOWN (#30885) (#30928) * Support Python 3.11 (#30818) (#30944) * Support Python 3.11 * Update build images for 3.11 * Whoopsie * The architecture of this thing is garbage * Silence ownership warning * Account for change in git behavior * Fix directory * I am in great pain * Update Windows and arm linux * Agh * Clean up * Bump 1.49 branch to 1.49.0 (#30974) * bump version to 1.49.0 * regenerate projects * Update protobuf to v21.6 on 1.49.x (#31028) * Update third_party/protobuf * run tools/distrib/python/make_grpcio_tools.py * update build_handwritten.yaml * regenerate projects * Bump v1.49.x to v1.49.1 (#31037) * bump version to 1.49.1 * regenerate projects * Automated change: Fix sanity tests Co-authored-by: gnossen <gnossen@users.noreply.github.com> * Fix ruby windows ucrt build (#31053) Co-authored-by: apolcyn <apolcyn@google.com> Co-authored-by: AJ Heller <hork@google.com> Co-authored-by: Richard Belleville <rbellevi@google.com> Co-authored-by: Sergii Tkachenko <sergiitk@google.com> Co-authored-by: Mark D. Roth <roth@google.com> Co-authored-by: Richard Belleville <gnossen@gmail.com> Co-authored-by: gnossen <gnossen@users.noreply.github.com>
--collect_app_logsflag, and specify--log_dir.--log_dir(natively provided by absl)--followtobin/run_test_server.pyandbin/run_test_client.pyto follow pod logs printed to stdoutPortForwarderfrom k8s.py to its own fileThe collection itself will be enabled per-suite in #30735.