qa: do not report errors on stderr as cluster log findings by batrick · Pull Request #66366 · ceph/ceph

batrick · 2025-11-21T13:20:38Z

Fixes: https://tracker.ceph.com/issues/73953

Fixes: https://tracker.ceph.com/issues/73953
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>

batrick · 2025-12-01T16:06:45Z

2025-11-30T07:55:01.246 INFO:tasks.cephadm:Checking cluster log for badness...
2025-11-30T07:55:01.246 DEBUG:teuthology.orchestra.run.smithi016:> sudo grep -E '\[ERR\]|\[WRN\]|\[SEC\]' /var/log/ceph/a94969ea-cdc1-11f0-87aa-adfe0268badd/ceph.log | grep -E -v '\(MDS_ALL_DOWN\)' | grep -E -v '\(MDS_UP_LESS_THAN_MAX\)' | grep -E -v FS_DEGRADED | grep -E -v 'fs.*is degraded' | grep -E -v 'filesystem is degraded' | grep -E -v FS_INLINE_DATA_DEPRECATED | grep -E -v FS_WITH_FAILED_MDS | grep -E -v MDS_ALL_DOWN | grep -E -v 'filesystem is offline' | grep -E -v 'is offline because no MDS' | grep -E -v MDS_DAMAGE | grep -E -v MDS_DEGRADED | grep -E -v MDS_FAILED | grep -E -v MDS_INSUFFICIENT_STANDBY | grep -E -v 'insufficient standby MDS daemons available' | grep -E -v MDS_UP_LESS_THAN_MAX | grep -E -v 'online, but wants' | grep -E -v 'filesystem is online with fewer MDS than max_mds' | grep -E -v POOL_APP_NOT_ENABLED | grep -E -v 'do not have an application enabled' | grep -E -v 'overall HEALTH_' | grep -E -v 'Replacing daemon' | grep -E -v 'deprecated feature inline_data' | grep -E -v BLUESTORE_SLOW_OP_ALERT | grep -E -v 'slow operation indications in BlueStore' | grep -E -v 'experiencing slow operations in BlueStore' | grep -E -v MGR_MODULE_ERROR | grep -E -v OSD_DOWN | grep -E -v 'osd.* is down' | grep -E -v PG_AVAILABILITY | grep -E -v PG_DEGRADED | grep -E -v 'Reduced data availability' | grep -E -v 'Degraded data redundancy' | grep -E -v 'pg .* is stuck inactive' | grep -E -v 'pg .* is .*degraded' | grep -E -v 'pg .* is stuck peering' | head -n 1
2025-11-30T07:55:01.275 INFO:teuthology.orchestra.run.smithi016.stderr:grep: /var/log/ceph/a94969ea-cdc1-11f0-87aa-adfe0268badd/ceph.log: No such file or directory

/teuthology/pdonnell-2025-11-30_03:13:50-fs-wip-pdonnell-testing-20251126.180742-debug-distro-default-smithi/8630995/teuthology.log

works as expected
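For readers skimming the long grep pipeline in the log above, the following is a minimal Python sketch of the same idea: keep lines tagged `[ERR]`, `[WRN]`, or `[SEC]`, drop any line matching an ignorelist pattern, and report only the first survivor (the `head -n 1` step). The function name `first_bad_line` and the sample log lines are invented for illustration and are not taken from the Ceph QA code; Python's `re` syntax is assumed to be close enough to `grep -E` for patterns of this kind.

```python
import re

# Severity tags the check looks for in the cluster log.
BADNESS = re.compile(r"\[ERR\]|\[WRN\]|\[SEC\]")


def first_bad_line(lines, ignorelist):
    """Return the first line matching BADNESS and no ignorelist regex, else None."""
    excludes = [re.compile(pattern) for pattern in ignorelist]
    for line in lines:
        if not BADNESS.search(line):
            continue  # not an error, warning, or security line
        if any(rx.search(line) for rx in excludes):
            continue  # explicitly ignored, like each `grep -E -v` stage
        return line  # equivalent of `head -n 1`
    return None


# Invented sample data, for illustration only.
sample = [
    "cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)",
    "cluster [INF] osd.1 boot",
    "cluster [ERR] something unexpected",
]
result = first_bad_line(sample, [r"OSD_DOWN", r"overall HEALTH_"])
print(result)
```

Note that this in-memory version has no equivalent of the failure in the log above: when the log file itself is missing, the shell pipeline produces nothing on stdout and only grep's message on stderr.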

idryomov · 2025-12-01T16:17:13Z

qa/tasks/ceph.py

            stdout = r.stdout.getvalue().decode()
            if stdout:
                return stdout
-            stderr = r.stderr.getvalue()


2025-11-20T19:44:37.966 INFO:teuthology.orchestra.run.smithi045.stderr:grep: /var/log/ceph/573c76ee-c649-11f0-877f-adfe0268badd/ceph.log: No such file or directory

It was added in #48539 specifically to catch cases where the cluster log file doesn't exist for some reason ;) Unfortunately, this was the case for all cephadm-based jobs for years before it was noticed and fixed in #54312, and it caused high-profile issues like https://tracker.ceph.com/issues/63389 to get missed.
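The trade-off described in this comment can be sketched with two small functions. This is a simplified model, not the code in qa/tasks/ceph.py: the diff only shows the `stderr` read being removed, so the fallback logic, the function names, and the `<fsid>` placeholder below are assumptions made for illustration.

```python
def finding_with_stderr(stdout: bytes, stderr: bytes):
    """Assumed existing behaviour: fall back to stderr when stdout is empty."""
    out = stdout.decode()
    if out:
        return out
    err = stderr.decode()
    return err or None


def finding_stdout_only(stdout: bytes, stderr: bytes):
    """Behaviour suggested by the PR title: stderr is never a finding."""
    return stdout.decode() or None


# grep's complaint when the cluster log does not exist (fsid replaced by a placeholder).
missing = b"grep: /var/log/ceph/<fsid>/ceph.log: No such file or directory\n"

print(finding_with_stderr(b"", missing))  # the grep error is reported as a finding
print(finding_stdout_only(b"", missing))  # None, so the check passes silently
```

With the stdout-only variant, a missing log file is indistinguishable from a clean log, which is the false-pass scenario the comment warns about.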

My main problem with it is that it obscures the real error for teuthology, but that's more of a problem with teuthology, I suppose.

It's more with this bit in the task than with teuthology itself IMO -- the failure could certainly be signaled in a more sophisticated way than just returning grep's stderr. But I'd argue that an obscure error is better than no error and a false-pass result.

Sure, but the error is only valid if it's from a test which would otherwise pass, no? In this case, the cephadm task didn't even start successfully.

Anyway, I will probably update this to give a better error so moving this to draft.

Sure, but the error is only valid if it's from a test which would otherwise pass, no?

Yes, that is the part where it's a problem with teuthology itself.
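One way the "better error" mentioned in this thread might look is to keep the missing-log case fatal but report it as its own failure type rather than as a log finding. The sketch below is only an illustration under that assumption; `ClusterLogUnavailable` and `check_cluster_log` are hypothetical names and do not appear in the PR or in teuthology.

```python
class ClusterLogUnavailable(Exception):
    """Hypothetical error: the cluster log could not be scanned at all."""


def check_cluster_log(stdout: bytes, stderr: bytes):
    """Return the first bad log line, None if the log is clean.

    Raises ClusterLogUnavailable when the scan itself failed, so the
    caller can tell "log is missing" apart from "log contains badness".
    """
    out = stdout.decode().strip()
    if out:
        return out
    err = stderr.decode().strip()
    if err:
        raise ClusterLogUnavailable(f"cluster log scan failed: {err}")
    return None


try:
    check_cluster_log(b"", b"grep: ceph.log: No such file or directory\n")
except ClusterLogUnavailable as exc:
    message = str(exc)
    print(message)
```

This keeps the property that a missing log never yields a passing result, while giving the failure a message that names the scan problem instead of presenting grep's stderr as cluster log content.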

github-actions · 2026-01-30T21:04:51Z

This pull request has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs for another 30 days.
If you are a maintainer or core committer, please follow-up on this pull request to identify what steps should be taken by the author to move this proposed change forward.
If you are the author of this pull request, thank you for your proposed contribution. If you believe this change is still appropriate, please ensure that any feedback has been addressed and ask for a code review.

github-actions · 2026-03-01T23:02:43Z

This pull request has been automatically closed because there has been no activity for 90 days. Please feel free to reopen this pull request (or open a new one) if the proposed change is still appropriate. Thank you for your contribution!

qa: do not report errors on stderr as cluster log findings

e363623

Fixes: https://tracker.ceph.com/issues/73953
Signed-off-by: Patrick Donnelly <pdonnell@ibm.com>

batrick requested a review from athanatos November 21, 2025 13:20

batrick requested a review from a team as a code owner November 21, 2025 13:20

batrick added wip-pdonnell-testing wip-pdonnell-testing2 labels Nov 21, 2025

github-actions bot added cephadm tests labels Nov 21, 2025

batrick added needs-review and removed tests cephadm labels Nov 21, 2025

idryomov reviewed Dec 1, 2025


batrick marked this pull request as draft December 1, 2025 19:15

batrick added tests and removed needs-review wip-pdonnell-testing wip-pdonnell-testing2 labels Dec 1, 2025

github-actions bot added the stale label Jan 30, 2026

github-actions bot closed this Mar 1, 2026

batrick wants to merge 1 commit into ceph:main from batrick:qa-cluster-log-check
