qa/suites/upgrade/reef-x: sync log-ignorelist with quincy-x by idryomov · Pull Request #60969 · ceph/ceph

idryomov · 2024-12-05T20:41:19Z

Daemons are terminated by cephadm during the upgrade, so health checks like OSD_DOWN must be ignored. Since there shouldn't be any fundamental difference between upgrading from quincy and upgrading from reef, make quincy-x and reef-x ignorelists the same.

Fixes: https://tracker.ceph.com/issues/69135

idryomov · 2024-12-05T20:44:43Z

This is essentially the equivalent of #58277 for reef-x.

Daemons are terminated by cephadm during the upgrade, so health checks like OSD_DOWN must be ignored. Since there shouldn't be any fundamental difference between upgrading from quincy and upgrading from reef, make quincy-x and reef-x ignorelists the same.

Fixes: https://tracker.ceph.com/issues/69135
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

Cover warnings like

    "[WRN] osd.4 (root=default,host=smithi184) is down" in cluster log

which OSD_DOWN doesn't.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

Cover warnings like

    "[WRN] pg 2.7 is active+undersized+degraded, acting [6,7]" in cluster log

This is based on commit 4a4fc7b ("qa: ignore pg availability/degraded warnings").

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

Cover warnings like

    "[WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)" in cluster log

MDS_ALL_DOWN is already ignored in ignorelist_health.yaml for reef-x. Not sure why it's not ignored for quincy-x -- ignorelist_health.yaml isn't present there at all.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

Cover warnings like

    "[WRN] Health check failed: all OSDs are running squid or later but require_osd_release < squid (OSD_UPGRADE_FINISHED)" in cluster log

They are inherently transient and should ideally be delayed for a grace period instead of being raised immediately just to be ignored.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

Cover warnings like

    "[WRN] Health check failed: Telemetry requires re-opt-in (TELEMETRY_CHANGED)" in cluster log
    "[WRN] telemetry module includes new collections; please re-opt-in to new collections with `ceph telemetry on`" in cluster log

Re-opt-in can happen in a respective workunit (test_telemetry_quincy_x.sh or test_telemetry_reef_x.sh), but it gets run only at the very end after both "workload" and "upgrade-sequence" complete. Over an hour passes in the interim:

    2024-12-08T00:06:31.197 INFO:teuthology.task.print:**** done end upgrade, wait...
    ...
    2024-12-08T01:28:38.588 INFO:tasks.workunit:Running workunit test_telemetry_reef_x.sh...

The existing list is now duplicated in 0-start.yaml, so replace it entirely.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

idryomov · 2024-12-09T19:53:01Z

@athanatos @ljflores @rzarzynski I have been doing some QA on my own and just pushed a set of additions to log-ignorelist for more rare warnings -- they are noted in commit messages. Apologies for not being quicker ;)

idryomov · 2024-12-10T15:23:19Z

Pushed a minor amendment to cover one more POOL_APP_NOT_ENABLED variant:

       - do not have an application enabled
       - application not enabled
+      - or freeform for custom applications
       - POOL_APP_NOT_ENABLED

Cover warnings like

    "[WRN] POOL_FULL: 2 pool(s) full" in cluster log
    "[WRN] pool 'test-librbd-smithi184-145008-24' is full (running out of quota)" in cluster log
    "[WRN] Health detail: HEALTH_WARN 2 pool(s) full" in cluster log

POOL_FULL is already ignored, but only in a parenthesized form. The "... (XYZ)" vs "XYZ: ..." variety isn't specific to POOL_FULL, so get rid of parentheses throughout.

While at it, drop POOL_APP_NOT_ENABLED, PG_AVAILABILITY and MON_DOWN which are duplicated in *-start.yaml.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
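The difference between the "... (XYZ)" and "XYZ: ..." forms can be illustrated with a small sketch. It uses Python's `re` module as a stand-in for the egrep filter; for these simple patterns the two behave the same. The helper name `is_ignored` is hypothetical, and the parenthesized sample line is an assumption modeled on the FS_DEGRADED message quoted earlier in this thread, not a line taken from a real log.

```python
import re


def is_ignored(line: str, pattern: str) -> bool:
    """Return True if the ignorelist pattern matches anywhere in the line."""
    return re.search(pattern, line) is not None


# Prefix form, quoted in the commit message above.
prefix_form = "[WRN] POOL_FULL: 2 pool(s) full"
# Parenthesized form: an assumed example, modeled on the FS_DEGRADED message.
paren_form = "[WRN] Health check failed: 2 pool(s) full (POOL_FULL)"

if __name__ == "__main__":
    for pattern in (r"\(POOL_FULL\)", r"POOL_FULL"):
        for line in (prefix_form, paren_form):
            print(f"{pattern!r:18} {is_ignored(line, pattern)!s:5} {line}")
```

The parenthesized entry only covers the second line, while the bare check name covers both, which is why dropping the parentheses is the safer choice.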

Cover warnings like

    "[WRN] Health detail: HEALTH_WARN 1 pool(s) do not have an application enabled" in cluster log
    "[WRN] application not enabled on pool 'cephfs_metadata'" in cluster log
    "[WRN] use 'ceph osd pool application enable <pool-name> <app-name>', where <app-name> is 'cephfs', 'rbd', 'rgw', or freeform for custom applications." in cluster log

and also the non-parenthesized form.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
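As a rough check of why one health check needs several entries, the sketch below runs the three messages from this commit against the four POOL_APP_NOT_ENABLED entries from the amendment above. Python's `re.search` stands in for the egrep filter, and the helper `unignored` is a hypothetical name used only for illustration.

```python
import re
from typing import List

# Messages quoted in the commit message above.
LINES = [
    "Health detail: HEALTH_WARN 1 pool(s) do not have an application enabled",
    "application not enabled on pool 'cephfs_metadata'",
    "use 'ceph osd pool application enable <pool-name> <app-name>', "
    "where <app-name> is 'cephfs', 'rbd', 'rgw', "
    "or freeform for custom applications.",
]

# Ignorelist entries from the amendment to this PR.
ENTRIES = [
    "do not have an application enabled",
    "application not enabled",
    "or freeform for custom applications",
    "POOL_APP_NOT_ENABLED",
]


def unignored(lines: List[str], entries: List[str]) -> List[str]:
    """Return the lines that no ignorelist entry matches."""
    return [l for l in lines if not any(re.search(e, l) for e in entries)]


if __name__ == "__main__":
    print(len(unignored(LINES, ["POOL_APP_NOT_ENABLED"])))  # check name only
    print(len(unignored(LINES, ENTRIES)))                   # all four entries
```

With only the check name, none of the three detail lines is filtered, because none of them mentions POOL_APP_NOT_ENABLED.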

idryomov · 2024-12-10T21:06:51Z

... and another minor amendment to cover one more POOL_FULL variant:

+      - pool\(s\) full
       - POOL_FULL
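The escaping in `pool\(s\) full` matters because an unescaped `(s)` is a regex group, not literal parentheses. A minimal sketch, again using Python's `re` as a stand-in for egrep (the two agree on these patterns); the helper name `matches` is hypothetical:

```python
import re


def matches(pattern: str, line: str) -> bool:
    """Return True if the pattern is found anywhere in the line."""
    return re.search(pattern, line) is not None


# Message quoted in the POOL_FULL commit message in this PR.
detail = "[WRN] Health detail: HEALTH_WARN 2 pool(s) full"

if __name__ == "__main__":
    print(matches(r"POOL_FULL", detail))        # check name is absent here
    print(matches(r"pool(s) full", detail))     # "(s)" is a group: wants "pools full"
    print(matches(r"pool\(s\) full", detail))   # escaped: literal parentheses
```

The unescaped pattern would match the text "pools full" instead, so it silently fails to ignore the real message.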

idryomov · 2024-12-10T22:24:20Z

I'm going to stop here -- my assumption that the ignorelist in quincy-x was good and simply needed to be copied over to reef-x turned out to be very wrong. Grepping through qa/, I see many other jobs in different suites that are similarly flaky when it comes to log-ignorelist, so it's a systemic issue.

I think we need to rethink the approach to these ignorelists because reliably ignoring a single health check can take 3-4 ignorelist entries to account for the summary and the free-form detail, and even then it's not easy to get right due to special characters getting reinterpreted on the way between a yaml snippet and the egrep -v filter in teuthology.

- One idea that comes to mind is adding a small teuthology task that would process a new health-ignorelist stanza and silence health checks with `ceph health mute` instead of letting everything fire and get logged only to tediously hunt down the possible patterns. This should be 100% reliable and would mimic what we expect users to do in such cases.
- Another is to have a more regular output -- perhaps the free-form detail could grow a prefix when logged to the cluster log? Also, it would be good to settle on a common way of handling singular vs plural count -- currently it's a mix of "X mons" (no parentheses), "X pool(s)" and some places go all the way to distinguish "is" and "are".
- Finally, more health checks need to have a grace period. "all OSDs are running squid or later but require_osd_release < squid (OSD_UPGRADE_FINISHED)" is a perfect example: it fires ~20 seconds before cephadm gets to running `ceph osd require-osd-release squid`.
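A minimal sketch of the first idea, assuming a list of health check codes taken from the proposed health-ignorelist stanza. The function name, its parameters and the way the stanza would be parsed are all hypothetical; no such teuthology task exists in this PR. The only real interface used is the `ceph health mute <code> [<ttl>] [--sticky]` command, and the sketch only builds the argument lists rather than running anything.

```python
from typing import Iterable, List, Optional


def build_mute_commands(
    codes: Iterable[str],
    ttl: Optional[str] = None,
    sticky: bool = False,
) -> List[List[str]]:
    """Build one `ceph health mute` argv per health check code.

    Hypothetical helper: a real task would run these on a cluster node.
    """
    commands = []
    for code in codes:
        argv = ["ceph", "health", "mute", code]
        if ttl:
            argv.append(ttl)          # e.g. "1h"; the mute expires afterwards
        if sticky:
            argv.append("--sticky")   # keep the mute even if the check clears
        commands.append(argv)
    return commands


if __name__ == "__main__":
    # Codes as they might appear in a health-ignorelist stanza (assumed layout).
    for argv in build_mute_commands(["OSD_DOWN", "POOL_FULL"], ttl="1h"):
        print(" ".join(argv))
```

Muting by code sidesteps the pattern matching entirely, so the summary and free-form detail variants no longer need separate entries.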

@rzarzynski I'd like to propose this as a topic for the next CDM, but I think the owner would be @ceph/core.

idryomov · 2024-12-11T10:59:44Z

jenkins test api

ljflores · 2024-12-16T19:49:43Z

@idryomov I added this to the backlog so we remember to discuss it: https://pad.ceph.com/p/cdm-backlog#L6

rzarzynski

Yes, let's bring this to CDM!

idryomov · 2025-01-10T21:56:21Z

Merging this as is as another job that would have been taken care of by this PR just got added to https://tracker.ceph.com/issues/69135.

Naveenaidu · 2025-01-12T13:50:32Z

Rados approved: https://tracker.ceph.com/issues/69215#note-2

idryomov requested review from a team and ljflores December 5, 2024 20:41

github-actions bot added the tests label Dec 5, 2024

idryomov added the core label Dec 5, 2024

idryomov force-pushed the wip-69135 branch from b339bdf to bb5893f Compare December 5, 2024 22:54

idryomov added 2 commits December 7, 2024 18:35

qa/suites/upgrade/*-x: add "is down" to log-ignorelist

05dca26

qa/suites/upgrade/*-x: add more PG states to log-ignorelist

8fa6cf7

idryomov force-pushed the wip-69135 branch from 32da83a to b833d54 Compare December 7, 2024 17:42

athanatos approved these changes Dec 9, 2024

idryomov added 3 commits December 9, 2024 17:53

ljflores approved these changes Dec 9, 2024

rzarzynski approved these changes Dec 9, 2024

ljflores added the needs-qa label Dec 9, 2024

idryomov force-pushed the wip-69135 branch from b833d54 to e2e615b Compare December 9, 2024 19:43

athanatos self-requested a review December 9, 2024 19:54

athanatos approved these changes Dec 9, 2024

SrinivasaBharath added the wip-bharath9-testing label Dec 10, 2024

idryomov force-pushed the wip-69135 branch from e2e615b to d7200ee Compare December 10, 2024 15:22

idryomov added 2 commits December 10, 2024 22:02

idryomov force-pushed the wip-69135 branch from d7200ee to 1513644 Compare December 10, 2024 21:05

rzarzynski approved these changes Dec 16, 2024

idryomov merged commit 8eac760 into ceph:main Jan 10, 2025

idryomov deleted the wip-69135 branch January 10, 2025 21:56

idryomov mentioned this pull request Jan 11, 2025

squid: qa/suites/upgrade/reef-x: sync log-ignorelist with quincy-x #61335

Merged

idryomov merged 8 commits into ceph:main from idryomov:wip-69135

6 participants
