Skip to content

xDS interop: collect pod logs#30594

Merged
sergiitk merged 13 commits intogrpc:masterfrom
sergiitk:xds-interop-pod-logs
Aug 26, 2022
Merged

xDS interop: collect pod logs#30594
sergiitk merged 13 commits intogrpc:masterfrom
sergiitk:xds-interop-pod-logs

Conversation

@sergiitk
Copy link
Copy Markdown
Member

@sergiitk sergiitk commented Aug 16, 2022

  • Added support for pod log collection. To enable, set --collect_app_logs flag, and specify --log_dir.
  • Added support and helpers for operating on the --log_dir (natively provided by absl)
  • Added support for --follow to bin/run_test_server.py and bin/run_test_client.py to follow pod logs printed to stdout
  • Moved PortForwarder from k8s.py to its own file

The collection itself will be enabled per-suite in #30735.

@sergiitk sergiitk force-pushed the xds-interop-pod-logs branch 7 times, most recently from 59f45cd to 987caf8 Compare August 18, 2022 21:09
sergiitk added a commit that referenced this pull request Aug 23, 2022
…ions (#30607)

- Changes the order of waiting for pods to start: wait for the pods first, then for the deployment to transition to active. This should provide more useful information in the logs, showing exactly why the pod didn't start, instead of generic "Replicas not available" ref b/200293121. This also needed for #30594
- Add support for `check_result` callback in the retryer helpers
- Completely replaces `retrying` with `tenacity`, ref b/200293121. Retrying is not longer maintained.
- Improves the readability of timeout errors: now they contain the timeout (or the attempt number) exceeded, and information why the timeout failed (exception/check function):
  Before:  
  > `tenacity.RetryError: RetryError[<Future at 0x7f8ce156bc18 state=finished returned dict>]`
  
  After:
  > `framework.helpers.retryers.RetryError: Retry error calling framework.infrastructure.k8s.KubernetesNamespace.get_pod: timeout 0:01:00 exceeded. Check result callback returned False.`
- Improves the readability of the k8s wait operation errors: now the log includes colorized and formatted status of the k8s object being watched, instead of dumping the full k8s object. For example, here's how an error caused by using incorrect TD bootstrap image:
@sergiitk sergiitk force-pushed the xds-interop-pod-logs branch 2 times, most recently from 2902ada to 5e1eeb9 Compare August 23, 2022 21:19
@sergiitk sergiitk force-pushed the xds-interop-pod-logs branch from 5e1eeb9 to c056bc2 Compare August 24, 2022 17:15
@sergiitk sergiitk added release notes: no Indicates if PR should not be in release notes area/psm interop labels Aug 24, 2022
@sergiitk
Copy link
Copy Markdown
Member Author

sergiitk commented Aug 24, 2022

Logging not enabled, confirming the framework still works:

@sergiitk sergiitk marked this pull request as ready for review August 25, 2022 00:18
@sergiitk sergiitk requested a review from gnossen August 25, 2022 00:20
Copy link
Copy Markdown
Contributor

@gnossen gnossen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work! This is going to make troubleshooting issues so much easier!


COLLECT_APP_LOGS = flags.DEFINE_bool(
'collect_app_logs',
default=False,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not turn this on by default?

Copy link
Copy Markdown
Member Author

@sergiitk sergiitk Aug 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this in most cases on local dev. It'll write to /tmp by default (this is an absl thing). It won't make sense for pre-provisioned pods (like url map tests) either.
Also, --log_dir is different per each environment (kokoro/staging/local), so I want setting collect_app_logs to be a conscious choice.
And there's always flagfiles to set this flags if needed.

logger = logging.getLogger(__name__)


class PodLogCollector(threading.Thread):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a bunch of landmines when subclassing threading.Thread rather than just passing a callable to the threading.Thread constructor, but it looks like you've managed to avoid them all. No complaints on my end as long you've confirmed that it's all working as intended.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I had an impression this was the case, but couldn't find any strong advices against it. Thanks for noticing! I'll give it another looks, just in case. Won't be a difficult refactoring to not extend a thread, but manage it as a field.

sergiitk added a commit to sergiitk/grpc that referenced this pull request Sep 7, 2022
- Enables pod log collection in all PSM interop jobs implemented in grpc#30594.
- Associate test suite runs with their own log file, so it's displayed on "Target Log" tab

Cherry-pick conflicts resolved:
- Removed authz test
sergiitk added a commit to sergiitk/grpc that referenced this pull request Sep 7, 2022
- Enables pod log collection in all PSM interop jobs implemented in grpc#30594.
- Associate test suite runs with their own log file, so it's displayed on "Target Log" tab

Cherry-pick conflicts resolved:
- Removed authz test
sergiitk added a commit to sergiitk/grpc that referenced this pull request Sep 7, 2022
- Enables pod log collection in all PSM interop jobs implemented in grpc#30594.
- Associate test suite runs with their own log file, so it's displayed on "Target Log" tab

Cherry-pick conflicts resolved:
- Removed authz test
- Removed set command from the xlang script that doesn't exist in the
  following branches
sergiitk added a commit to sergiitk/grpc that referenced this pull request Sep 7, 2022
- Enables pod log collection in all PSM interop jobs implemented in grpc#30594.
- Associate test suite runs with their own log file, so it's displayed on "Target Log" tab

Cherry-pick conflicts resolved:
- Removed authz test
- Ignore xlang script as it hasn't been created
ejona86 pushed a commit that referenced this pull request Sep 7, 2022
…30860)

- Enables pod log collection in all PSM interop jobs implemented in #30594.
- Associate test suite runs with their own log file, so it's displayed on "Target Log" tab

Cherry-pick conflicts resolved:
- Removed authz test
ejona86 pushed a commit that referenced this pull request Sep 7, 2022
…30861)

- Enables pod log collection in all PSM interop jobs implemented in #30594.
- Associate test suite runs with their own log file, so it's displayed on "Target Log" tab

Cherry-pick conflicts resolved:
- Removed authz test
ejona86 pushed a commit that referenced this pull request Sep 7, 2022
…30862)

- Enables pod log collection in all PSM interop jobs implemented in #30594.
- Associate test suite runs with their own log file, so it's displayed on "Target Log" tab

Cherry-pick conflicts resolved:
- Removed authz test
ejona86 pushed a commit that referenced this pull request Sep 7, 2022
…30863)

- Enables pod log collection in all PSM interop jobs implemented in #30594.
- Associate test suite runs with their own log file, so it's displayed on "Target Log" tab

Cherry-pick conflicts resolved:
- Removed authz test
ejona86 pushed a commit that referenced this pull request Sep 7, 2022
…30864)

- Enables pod log collection in all PSM interop jobs implemented in #30594.
- Associate test suite runs with their own log file, so it's displayed on "Target Log" tab

Cherry-pick conflicts resolved:
- Removed authz test
- Removed set command from the xlang script that doesn't exist in the
  following branches
ejona86 pushed a commit that referenced this pull request Sep 7, 2022
…30865)

- Enables pod log collection in all PSM interop jobs implemented in #30594.
- Associate test suite runs with their own log file, so it's displayed on "Target Log" tab

Cherry-pick conflicts resolved:
- Removed authz test
- Ignore xlang script as it hasn't been created
ejona86 pushed a commit that referenced this pull request Sep 7, 2022
…30857)

- Enables pod log collection in all PSM interop jobs implemented in #30594.
- Associate test suite runs with their own log file, so it's displayed on "Target Log" tab
ejona86 pushed a commit that referenced this pull request Sep 7, 2022
…30856)

- Enables pod log collection in all PSM interop jobs implemented in #30594.
- Associate test suite runs with their own log file, so it's displayed on "Target Log" tab
veblush added a commit that referenced this pull request Sep 20, 2022
* Bump 1.49.x branch to 1.49.0.pre1 (#30615)

* bump version to 1.49.0-pre1

* regenerate projects

* [backport][v1.49.x] forkable fixes (#30646)

* Fix forkable globals (#30608)

* Fix forkable repeated registration (#30642)

This fixes a bug that could occur on repeated grpc initialization (after a complete shutdown)

Fixes #30640

* Drop support for ruby 2.5 (#30699) (#30762)

* Drop ruby 2.5 support

* Backport: "stabilize the C2P resolver URI scheme" to v1.49.x (#30654)

* stabilize the C2P resolver URI scheme

* Bump 1.49 branch to 1.49.0.pre2 (#30786)

* bump version to 1.49.0-pre2

* regenerate projects

* Update protobuf on ancillary packages (#30795) (#30805)

* Bump release version on 1.49 to 1.49.0.pre3 (#30814)

* bump version to 1.49.0-pre3

* regenerate projects

* xDS interop: enable pod log collection in the buildscripts (#30735) (#30856)

- Enables pod log collection in all PSM interop jobs implemented in #30594.
- Associate test suite runs with their own log file, so it's displayed on "Target Log" tab

* xDS interop: buildscripts: fix run_test return status (#30768) (#30875)

To capture the return status of the test in run_test the last command must be the call to the test itself.
This removes `set +x`, which makes the run_test always return success, and not propagate the test status.

I can't find it, but this exact error bit us before. Looks like it leaked to other scripts.
The good thing is if the test was executed, it's failure would still be picked up from the result xml.

However, if the test framework didn't start in the first place, the result will be false positive.
Example: https://source.cloud.google.com/results/invocations/98d3e679-ec8a-40bd-9f36-88179747b0d6/targets

```
/home/kbuilder/.pyenv/versions/k8s_xds_test_runner/bin/python3: Error while finding module specification for 'tests.authz_test' (ModuleNotFoundError: No module named 'tests')
+ set +x
Failed test suites: 0

[ID: 3548168] Command finished after 625 secs, exit value: 0
```

* client_channel: fix crash when cancelling a watch after SHUTDOWN (#30885) (#30928)

* Support Python 3.11 (#30818) (#30944)

* Support Python 3.11

* Update build images for 3.11

* Whoopsie

* The architecture of this thing is garbage

* Silence ownership warning

* Account for change in git behavior

* Fix directory

* I am in great pain

* Update Windows and arm linux

* Agh

* Clean up

* Bump 1.49 branch to 1.49.0 (#30974)

* bump version to 1.49.0

* regenerate projects

* Update protobuf to v21.6 on 1.49.x (#31028)

* Update third_party/protobuf

* run tools/distrib/python/make_grpcio_tools.py

* update build_handwritten.yaml

* regenerate projects

* Bump v1.49.x to v1.49.1 (#31037)

* bump version to 1.49.1

* regenerate projects

* Automated change: Fix sanity tests

Co-authored-by: gnossen <gnossen@users.noreply.github.com>

* Fix ruby windows ucrt build (#31053)

Co-authored-by: apolcyn <apolcyn@google.com>
Co-authored-by: AJ Heller <hork@google.com>
Co-authored-by: Richard Belleville <rbellevi@google.com>
Co-authored-by: Sergii Tkachenko <sergiitk@google.com>
Co-authored-by: Mark D. Roth <roth@google.com>
Co-authored-by: Richard Belleville <gnossen@gmail.com>
Co-authored-by: gnossen <gnossen@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/psm interop imported Specifies if the PR has been imported to the internal repository release notes: no Indicates if PR should not be in release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants