qa/suites/upgrade/reef-x: sync log-ignorelist with quincy-x#60969
This is essentially the equivalent of #58277 for reef-x.
Daemons are terminated by cephadm during the upgrade, so health checks like OSD_DOWN must be ignored. Since there shouldn't be any fundamental difference between upgrading from quincy and upgrading from reef, make quincy-x and reef-x ignorelists the same.

Fixes: https://tracker.ceph.com/issues/69135
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Cover warnings like

    "[WRN] osd.4 (root=default,host=smithi184) is down" in cluster log

which OSD_DOWN doesn't.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Cover warnings like

    "[WRN] pg 2.7 is active+undersized+degraded, acting [6,7]" in cluster log

This is based on commit 4a4fc7b ("qa: ignore pg availability/degraded warnings").

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Cover warnings like

    "[WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)" in cluster log

MDS_ALL_DOWN is already ignored in ignorelist_health.yaml for reef-x. Not sure why it's not ignored for quincy-x -- ignorelist_health.yaml isn't present there at all.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Cover warnings like

    "[WRN] Health check failed: all OSDs are running squid or later but require_osd_release < squid (OSD_UPGRADE_FINISHED)" in cluster log

They are inherently transient and should ideally be delayed for a grace period instead of being raised immediately just to be ignored.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Cover warnings like

    "[WRN] Health check failed: Telemetry requires re-opt-in (TELEMETRY_CHANGED)" in cluster log
    "[WRN] telemetry module includes new collections; please re-opt-in to new collections with `ceph telemetry on`" in cluster log

Re-opt-in can happen in a respective workunit (test_telemetry_quincy_x.sh or test_telemetry_reef_x.sh), but it gets run only at the very end after both "workload" and "upgrade-sequence" complete. Over an hour passes in the interim:

    2024-12-08T00:06:31.197 INFO:teuthology.task.print:**** done end upgrade, wait...
    ...
    2024-12-08T01:28:38.588 INFO:tasks.workunit:Running workunit test_telemetry_reef_x.sh...

The existing list is now duplicated in 0-start.yaml, so replace it entirely.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
@athanatos @ljflores @rzarzynski I have been doing some QA on my own and just pushed a set of additions to log-ignorelist for more rare warnings -- they are noted in commit messages. Apologies for not being quicker ;)

Pushed a minor amendment to cover one more
Cover warnings like

    "[WRN] POOL_FULL: 2 pool(s) full" in cluster log
    "[WRN] pool 'test-librbd-smithi184-145008-24' is full (running out of quota)" in cluster log
    "[WRN] Health detail: HEALTH_WARN 2 pool(s) full" in cluster log

POOL_FULL is already ignored, but only in a parenthesized form. The "... (XYZ)" vs "XYZ: ..." variety isn't specific to POOL_FULL, so get rid of parentheses throughout.

While at it, drop POOL_APP_NOT_ENABLED, PG_AVAILABILITY and MON_DOWN which are duplicated in *-start.yaml.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
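For illustration only, here is a minimal Python sketch of why a parenthesized entry misses the "XYZ: ..." form. It assumes that each ignorelist entry is applied as an unanchored regular expression against each cluster log line; the helper `is_ignored` and the parenthesized sample line are hypothetical and are not taken from the QA suite or from teuthology.

```python
import re

# Quoted in the commit message above.
colon_form = "[WRN] POOL_FULL: 2 pool(s) full"
# Hypothetical line, modeled on the "... (XYZ)" shape of other health warnings.
paren_form = "[WRN] Health check failed: 2 pool(s) full (POOL_FULL)"


def is_ignored(line, entries):
    """Return True if any entry matches the line as an unanchored regex.

    Assumption: this only approximates how ignorelist entries are applied.
    """
    return any(re.search(entry, line) for entry in entries)


old_entries = [r"\(POOL_FULL\)"]  # parenthesized form only
new_entries = ["POOL_FULL"]       # parentheses dropped

print(is_ignored(paren_form, old_entries))  # True
print(is_ignored(colon_form, old_entries))  # False: "POOL_FULL:" has no parentheses
print(is_ignored(paren_form, new_entries))  # True
print(is_ignored(colon_form, new_entries))  # True
```

Under that assumption, the bare health check code matches both forms, which is consistent with dropping the parentheses throughout.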
Cover warnings like

    "[WRN] Health detail: HEALTH_WARN 1 pool(s) do not have an application enabled" in cluster log
    "[WRN] application not enabled on pool 'cephfs_metadata'" in cluster log
    "[WRN] use 'ceph osd pool application enable <pool-name> <app-name>', where <app-name> is 'cephfs', 'rbd', 'rgw', or freeform for custom applications." in cluster log

and also the non-parenthesized form.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
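A rough sketch of why the health check code alone does not cover these lines: none of the three messages quoted above contains the string POOL_APP_NOT_ENABLED, so each free-form line needs its own entry. The entries in `with_details` and the helper `unmatched` are hypothetical examples, not the entries added by this PR, and the matching model (unanchored regex search per line) is an assumption.

```python
import re

# The three cluster log messages quoted in the commit message above.
log_lines = [
    "[WRN] Health detail: HEALTH_WARN 1 pool(s) do not have an application enabled",
    "[WRN] application not enabled on pool 'cephfs_metadata'",
    "[WRN] use 'ceph osd pool application enable <pool-name> <app-name>', "
    "where <app-name> is 'cephfs', 'rbd', 'rgw', or freeform for custom applications.",
]


def unmatched(lines, entries):
    """Return the lines that no entry matches (unanchored regex search)."""
    patterns = [re.compile(entry) for entry in entries]
    return [line for line in lines if not any(p.search(line) for p in patterns)]


# Only the health check code.
code_only = ["POOL_APP_NOT_ENABLED"]

# Hypothetical extra entries, one per free-form message.
with_details = code_only + [
    "do not have an application enabled",
    "application not enabled on pool",
    "use 'ceph osd pool application enable",
]

print(len(unmatched(log_lines, code_only)))     # 3: the code appears in none of them
print(len(unmatched(log_lines, with_details)))  # 0
```

The sketch sidesteps regex metacharacters on purpose: a substring such as "pool(s)" would be read as a group, not as literal parentheses, unless escaped.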
... and another minor amendment to cover one more
I'm going to stop here -- my assumption that the ignorelist in quincy-x was good and simply needed to be copied over to reef-x turned out to be very wrong. Grepping through I think we need to rethink the approach to these ignorelists because reliably ignoring a single health check can take 3-4 ignorelist entries to account for the summary and the free-form detail and even then it's not easy to get right due to special characters getting reinterpreted on the way between a yaml snippet and
@rzarzynski I'd like to propose this as a topic for the next CDM, but I think the owner would be @ceph/core.

jenkins test api

@idryomov I added this to the backlog so we remember to discuss it: https://pad.ceph.com/p/cdm-backlog#L6
rzarzynski left a comment:
Yes, let's bring this to CDM!
Merging this as is, since another job that would have been taken care of by this PR just got added to https://tracker.ceph.com/issues/69135.

Rados approved: https://tracker.ceph.com/issues/69215#note-2