alxtkr77 commented Dec 1, 2025

Summary

Add automatic pod log collection when model monitoring system tests fail to help debug CI failures.

Changes Made

  • Add collect_monitoring_pod_logs() and helper methods to TestMLRunSystemModelMonitoring
  • Add pytest hook in conftest.py to trigger log collection on test failure
  • Collect logs from monitoring pods (stream, controller, writer, serving)
  • Collect filtered error logs from mlrun-api pods that mention the test project
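
The last bullet describes filtering mlrun-api logs down to error lines that mention the test project. The PR's actual filter is not shown here, so the following is only a minimal sketch of one way to do it. The function name `filter_error_lines`, the marker words, and the sample log lines are all hypothetical, and plain case-insensitive substring matching is an assumption.

```python
from typing import Iterable, List


def filter_error_lines(
    log_text: str,
    project_name: str,
    markers: Iterable[str] = ("error", "exception", "traceback"),
) -> List[str]:
    """Keep lines that mention the project AND contain an error marker.

    Hypothetical helper: the marker list and the substring matching are
    assumptions, not the behavior of the merged code.
    """
    needle = project_name.lower()
    marker_list = [m.lower() for m in markers]
    kept = []
    for line in log_text.splitlines():
        lowered = line.lower()
        if needle in lowered and any(m in lowered for m in marker_list):
            kept.append(line)
    return kept


# Invented sample log, for illustration only.
SAMPLE_LOG = "\n".join(
    [
        "[info] Listing functions: project=sys-test-proj",
        "[error] Failed to deploy function: project=sys-test-proj",
        "[error] Failed to deploy function: project=other-proj",
        "[debug] Heartbeat ok",
    ]
)

if __name__ == "__main__":
    for line in filter_error_lines(SAMPLE_LOG, "sys-test-proj"):
        print(line)
```

Requiring both conditions keeps the output small: an API pod serves every project in the cluster, so error lines from unrelated projects would otherwise drown out the failing test's own.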

Testing

  • Lint passes
  • Manual verification of log collection on test failure

Reference

  • Jira: ML-11480

Add automatic collection and logging of pod logs when model monitoring
system tests fail, to help debug CI failures.

Changes:
- Add collect_monitoring_pod_logs() to TestMLRunSystemModelMonitoring
- Collect logs from monitoring pods (stream, controller, writer, serving)
- Collect filtered error logs from mlrun-api pods mentioning test project
- Add pytest hook in conftest.py to trigger on test failure
- Requires MLRUN_SYSTEM_TEST_KUBECONFIG_PATH env var to enable
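
As a rough illustration of how a conftest.py hook can trigger collection on failure while staying off unless the env var is set, here is a minimal sketch. It uses pytest's standard `pytest_runtest_logreport` hook, which may differ from the hook the PR actually uses. `should_collect` and `collect_logs_for` are hypothetical names, and limiting collection to the "call" phase is an assumption.

```python
import os
from typing import Mapping, Optional

KUBECONFIG_ENV = "MLRUN_SYSTEM_TEST_KUBECONFIG_PATH"


def log_collection_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Collection is on only when the kubeconfig env var is set and non-empty."""
    environ = os.environ if environ is None else environ
    return bool(environ.get(KUBECONFIG_ENV, "").strip())


def should_collect(report, environ: Optional[Mapping[str, str]] = None) -> bool:
    """True for a failure in the test body when collection is enabled.

    Assumption: setup and teardown failures are ignored in this sketch.
    """
    return bool(
        report.when == "call"
        and report.failed
        and log_collection_enabled(environ)
    )


def collect_logs_for(nodeid: str) -> None:
    """Hypothetical placeholder for the real pod log collection."""
    print(f"collecting pod logs for failed test: {nodeid}")


def pytest_runtest_logreport(report):
    # pytest calls this hook once per phase (setup, call, teardown)
    # with a TestReport exposing .when, .failed, and .nodeid.
    if should_collect(report):
        collect_logs_for(report.nodeid)
```

Gating on the env var means local runs without cluster access behave exactly as before, and only CI jobs that export a kubeconfig path pay the cost of log collection.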

ML-11480

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
liranbg left a comment

I think this can be generalized so that the rest of the test suites also dump their logs on error. Overall, you could collect logs from all system pods with certain labels (the ones mlrun sets across its resources), which would be helpful when debugging any failing test.

assaf758 commented Dec 2, 2025

> I think this can be generalized so that the rest of the test suites also dump their logs on error. Overall, you could collect logs from all system pods with certain labels (the ones mlrun sets across its resources), which would be helpful when debugging any failing test.

Yep, I second that!

Move pod log collection from model_monitoring-specific to all system tests:

- Add collect_pod_logs_on_failure() to TestMLRunSystem base class
- Project pods (name contains project_name): collect full logs
- System pods (mlrun-api-*): collect time-bounded logs via since_seconds
- Add autouse fixture to track test start time for duration calculation
- Remove duplicate code from model_monitoring/__init__.py
- Delete model_monitoring/conftest.py (functionality moved to system level)
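
The split between project pods and system pods can be sketched as a small planning step. This is an illustration under assumptions, not the merged implementation: `plan_log_collection`, `since_seconds_for`, the 30-second margin, and the pod names in the usage example are hypothetical. In the PR the start timestamp would come from the autouse fixture, and the resulting value would be passed as `since_seconds` to a log read call such as the Kubernetes Python client's `read_namespaced_pod_log`.

```python
import math
from typing import Dict, Iterable, Optional


def since_seconds_for(test_start: float, now: float, margin: int = 30) -> int:
    """Seconds of log history covering the test run, plus a safety margin.

    The margin is an assumption; it absorbs clock skew and startup lag.
    """
    return max(1, math.ceil(now - test_start) + margin)


def plan_log_collection(
    pod_names: Iterable[str],
    project_name: str,
    test_start: float,
    now: float,
) -> Dict[str, Optional[int]]:
    """Map each relevant pod to its since_seconds value.

    None means "collect full logs" (project pods); an int bounds the
    window (system pods). Pods matching neither rule are skipped.
    """
    plan: Dict[str, Optional[int]] = {}
    for name in pod_names:
        if project_name in name:
            plan[name] = None  # project pod: full logs
        elif name.startswith("mlrun-api-"):
            plan[name] = since_seconds_for(test_start, now)
    return plan


if __name__ == "__main__":
    pods = ["sys-test-proj-serving-abc", "mlrun-api-chief-0", "unrelated-pod-x"]
    print(plan_log_collection(pods, "sys-test-proj", test_start=1000.0, now=1042.3))
```

Project pods live only as long as the test project, so their full logs are relevant. The API pods are long-running and shared, which is why bounding their logs to the test's duration keeps the dump readable.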

This addresses reviewer feedback to make pod log collection available
for debugging failures in any system test suite, not just model monitoring.

assaf758 merged commit bf8a3b4 into mlrun:development on Dec 2, 2025
13 checks passed