qa/suites/upgrade/reef-x: sync log-ignorelist with quincy-x#60969

Merged
idryomov merged 8 commits into ceph:main from idryomov:wip-69135 on Jan 10, 2025

Conversation

@idryomov idryomov commented Dec 5, 2024

Daemons are terminated by cephadm during the upgrade, so health checks like OSD_DOWN must be ignored. Since there shouldn't be any fundamental difference between upgrading from quincy and upgrading from reef, make quincy-x and reef-x ignorelists the same.

Fixes: https://tracker.ceph.com/issues/69135

@idryomov idryomov requested review from a team and ljflores December 5, 2024 20:41
@github-actions github-actions bot added the tests label Dec 5, 2024
@idryomov idryomov added the core label Dec 5, 2024
idryomov commented Dec 5, 2024

This is essentially the equivalent of #58277 for reef-x.

Daemons are terminated by cephadm during the upgrade, so health checks
like OSD_DOWN must be ignored.  Since there shouldn't be any fundamental
difference between upgrading from quincy and upgrading from reef, make
quincy-x and reef-x ignorelists the same.

Fixes: https://tracker.ceph.com/issues/69135
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Cover warnings like

  [WRN] osd.4 (root=default,host=smithi184) is down" in cluster log

which OSD_DOWN doesn't.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Cover warnings like

  [WRN] pg 2.7 is active+undersized+degraded, acting [6,7]" in cluster log

This is based on commit 4a4fc7b ("qa: ignore pg
availability/degraded warnings").

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Cover warnings like

  [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)" in cluster log

MDS_ALL_DOWN is already ignored in ignorelist_health.yaml for reef-x.
Not sure why it's not ignored for quincy-x -- ignorelist_health.yaml
isn't present there at all.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Cover warnings like

  [WRN] Health check failed: all OSDs are running squid or later but require_osd_release < squid (OSD_UPGRADE_FINISHED)" in cluster log

They are inherently transient and should ideally be delayed for a grace
period instead of being raised immediately just to be ignored.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
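
The grace period idea from the commit message above can be sketched as follows. This is a minimal illustration only: the `GracedCheck` class and its `update` method are hypothetical names, and nothing here reflects how Ceph actually raises health checks.

```python
class GracedCheck:
    """Hypothetical health check that fires only after a grace period."""

    def __init__(self, grace_seconds):
        self.grace = grace_seconds
        self.since = None  # time the condition was first seen, or None

    def update(self, condition, now):
        """Return True only once the condition has held for the grace period."""
        if not condition:
            # Condition cleared: forget when it started.
            self.since = None
            return False
        if self.since is None:
            self.since = now
        return now - self.since >= self.grace


check = GracedCheck(grace_seconds=60)
print(check.update(True, now=0))    # condition just appeared
print(check.update(True, now=20))   # still inside the grace period
print(check.update(False, now=25))  # cleared before the period expired
print(check.update(True, now=90))   # reappeared: the clock restarts
```

In this sketch a condition that clears within the grace period never produces a warning, so there would be nothing to add to an ignorelist.
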
Cover warnings like

  [WRN] Health check failed: Telemetry requires re-opt-in (TELEMETRY_CHANGED)" in cluster log
  [WRN] telemetry module includes new collections; please re-opt-in to new collections with `ceph telemetry on`" in cluster log

Re-opt-in can happen in a respective workunit
(test_telemetry_quincy_x.sh or test_telemetry_reef_x.sh), but it gets
run only at the very end after both "workload" and "upgrade-sequence"
complete.  Over an hour passes in the interim:

  2024-12-08T00:06:31.197 INFO:teuthology.task.print:**** done end upgrade, wait...
  ...
  2024-12-08T01:28:38.588 INFO:tasks.workunit:Running workunit test_telemetry_reef_x.sh...

The existing list is now duplicated in 0-start.yaml, so replace it
entirely.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
idryomov commented Dec 9, 2024

@athanatos @ljflores @rzarzynski I have been doing some QA on my own and just pushed a set of additions to log-ignorelist for rarer warnings -- they are noted in the commit messages. Apologies for not being quicker ;)

@athanatos athanatos self-requested a review December 9, 2024 19:54
@idryomov

Pushed a minor amendment to cover one more POOL_APP_NOT_ENABLED variant:

       - do not have an application enabled
       - application not enabled
+      - or freeform for custom applications
       - POOL_APP_NOT_ENABLED

Cover warnings like

  [WRN] POOL_FULL: 2 pool(s) full" in cluster log
  [WRN] pool 'test-librbd-smithi184-145008-24' is full (running out of quota)" in cluster log
  [WRN] Health detail: HEALTH_WARN 2 pool(s) full" in cluster log

POOL_FULL is already ignored, but only in a parenthesized form.  The
"... (XYZ)" vs "XYZ: ..." variety isn't specific to POOL_FULL, so get
rid of parentheses throughout.  While at it, drop POOL_APP_NOT_ENABLED,
PG_AVAILABILITY and MON_DOWN which are duplicated in *-start.yaml.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
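
To illustrate why a parenthesized entry misses the "XYZ: ..." form, here is a minimal Python sketch. It assumes, as a simplification, that each ignorelist entry is applied as a regular expression to each cluster log line. The helper `is_ignored` is hypothetical and is not teuthology code, and the `summary` line is constructed for illustration, modeled on the "Health check failed: ... (XYZ)" lines quoted in the other commit messages.

```python
import re


def is_ignored(line, patterns):
    """Return True if any ignorelist pattern matches the log line."""
    return any(re.search(p, line) for p in patterns)


# Constructed example of the "... (XYZ)" summary form (assumption).
summary = "[WRN] Health check failed: 2 pool(s) full (POOL_FULL)"
# The "XYZ: ..." form, taken from the commit message above.
detail = "[WRN] POOL_FULL: 2 pool(s) full"

old = [r"\(POOL_FULL\)"]  # parenthesized entry
new = ["POOL_FULL"]       # bare entry

print(is_ignored(summary, old), is_ignored(detail, old))  # True False
print(is_ignored(summary, new), is_ignored(detail, new))  # True True
```
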
Cover warnings like

  [WRN] Health detail: HEALTH_WARN 1 pool(s) do not have an application enabled" in cluster log
  [WRN] application not enabled on pool 'cephfs_metadata'" in cluster log
  [WRN] use 'ceph osd pool application enable <pool-name> <app-name>', where <app-name> is 'cephfs', 'rbd', 'rgw', or freeform for custom applications." in cluster log

and also the non-parenthesized form.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
@idryomov

... and another minor amendment to cover one more POOL_FULL variant:

+      - pool\(s\) full
       - POOL_FULL
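
The backslashes in the added entry matter because unescaped parentheses form a group in an extended regular expression. A small Python sketch of just the regex layer follows; Python's `re` treats parentheses the same way as egrep here, but YAML and shell quoting are not modeled, and the variable names are illustrative.

```python
import re

# Log line taken from the commit message quoted earlier in this conversation.
line = "[WRN] Health detail: HEALTH_WARN 2 pool(s) full"

unescaped = "pool(s) full"    # (s) is a group, so this matches "pools full"
escaped = r"pool\(s\) full"   # matches the literal text "pool(s) full"

print(bool(re.search(unescaped, line)))  # False
print(bool(re.search(escaped, line)))    # True
print(bool(re.search(unescaped, "2 pools full")))  # True
```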

idryomov commented Dec 10, 2024

I'm going to stop here -- my assumption that the ignorelist in quincy-x was good and simply needed to be copied over to reef-x turned out to be very wrong. Grepping through qa/, I see many other jobs in different suites that are similarly flaky when it comes to log-ignorelist, so it's a systemic issue.

I think we need to rethink the approach to these ignorelists: reliably ignoring a single health check can take 3-4 ignorelist entries to account for both the summary and the free-form detail, and even then it's not easy to get right because special characters get reinterpreted on the way from a yaml snippet to the egrep -v filter in teuthology:

  • One idea that comes to mind is adding a small teuthology task that would process a new health-ignorelist stanza and silence health checks with ceph health mute instead of letting everything fire and get logged only to tediously hunt down the possible patterns. This should be 100% reliable and would mimic what we expect users to do in such cases.
  • Another is to have a more regular output -- perhaps the free-form detail could grow a prefix when logged to the cluster log? Also, it would be good to settle on a common way of handling singular vs plural counts -- currently it's a mix of "X mons" (no parentheses) and "X pool(s)", and some places go all the way to distinguishing "is" from "are".
  • Finally, more health checks need to have a grace period. all OSDs are running squid or later but require_osd_release < squid (OSD_UPGRADE_FINISHED) is a perfect example: it fires ~20 seconds before cephadm gets to running ceph osd require-osd-release squid.
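
The first idea above could look roughly like the following Python sketch. It only builds the command lines; the function name `build_mute_commands`, the config layout, and the choice of a TTL and `--sticky` are assumptions for illustration, not an existing teuthology task. The `ceph health mute <code> [<ttl>] [--sticky]` form itself is the documented CLI syntax.

```python
def build_mute_commands(config, ttl=None, sticky=True):
    """Turn a hypothetical health-ignorelist stanza into ceph CLI argv lists."""
    commands = []
    for code in config.get("health-ignorelist", []):
        argv = ["ceph", "health", "mute", code]
        if ttl:
            argv.append(ttl)         # optional expiry, e.g. "2h"
        if sticky:
            argv.append("--sticky")  # keep the mute even if the check clears
        commands.append(argv)
    return commands


# Stanza contents are examples drawn from checks mentioned in this PR.
config = {"health-ignorelist": ["OSD_DOWN", "POOL_FULL", "OSD_UPGRADE_FINISHED"]}
for argv in build_mute_commands(config, ttl="2h"):
    print(" ".join(argv))
```

A real task would also have to run these commands against the cluster at the right point in the job; that part is not covered here.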

@rzarzynski I'd like to propose this as a topic for the next CDM, but I think the owner would be @ceph/core.

@idryomov

jenkins test api

@ljflores

freeform

@idryomov I added this to the backlog so we remember to discuss it: https://pad.ceph.com/p/cdm-backlog#L6

@rzarzynski rzarzynski left a comment

Yes, let's bring this to CDM!

@idryomov

Merging this as is, since another job that this PR would have taken care of was just added to https://tracker.ceph.com/issues/69135.

@Naveenaidu

Rados approved: https://tracker.ceph.com/issues/69215#note-2
