pacific: qa/suites/orch: whitelist warnings that are expected in test environments#55523

Merged
ljflores merged 1 commit into ceph:pacific from ljflores:wip-tracker-64343-pacific
Feb 16, 2024

Conversation

@ljflores
Member

@ljflores ljflores commented Feb 9, 2024

Semi-backport of 00fc796 (#55507). Some changes had to be made, though, for yaml files and warnings that are specific to pacific.

The motivation is that we are still testing stuff for pacific, i.e. in https://trello.com/c/3cEnuGqr/1952-wip-yuri10-testing-2024-02-08-0854-pacific, so we'll need clean results.

Fixes: https://tracker.ceph.com/issues/64343

@ljflores ljflores requested a review from a team as a code owner February 9, 2024 21:22
@github-actions github-actions bot added this to the pacific milestone Feb 9, 2024
@ljflores ljflores changed the title qa/suites/orch: whitelist warnings that are expected in test environments pacific: qa/suites/orch: whitelist warnings that are expected in test environments Feb 9, 2024
@ljflores ljflores requested review from athanatos and markhpc February 9, 2024 21:28
@ljflores ljflores force-pushed the wip-tracker-64343-pacific branch from 57bd6ab to 84e7142 Compare February 12, 2024 21:55
@ljflores ljflores force-pushed the wip-tracker-64343-pacific branch from 84e7142 to bedfc49 Compare February 13, 2024 17:14
@markhpc
Member

markhpc commented Feb 13, 2024

@ljflores looks like a bigger set of changes than for main? Happy to approve this once it passes QA. Anything you need from me in the meantime?

@ljflores
Member Author

@markhpc yeah, this changeset is a bit different since there are some tests that are specific to pacific. I wrote that this is a "partial" backport in the commit message.

I'll link test results here once I have them. (Trello ref: https://trello.com/c/3cEnuGqr/1952-wip-yuri10-testing-2024-02-08-0854-pacific)

@ljflores ljflores requested review from a team February 13, 2024 19:05
Contributor

@ronen-fr ronen-fr left a comment

LGTM (apart from one question)

- mons down
- flag\(s\) set
- out of quorum
- PG_
Contributor

Is that one restrictive enough?

Member Author

@ronen-fr I can make it more specific

Contributor

Thrash tests really trigger a lot of pg related warnings. @ronen-fr Is there one specifically you want to disallow?

Member Author

@ronen-fr I think Sam's comment makes sense. Let's keep it as is, especially since we whitelist that exact string in many of our other tests:

$ git grep "PG_"
basic/tasks/rados_api_tests.yaml:    - \(PG_AVAILABILITY\)
basic/tasks/rados_api_tests.yaml:    - \(PG_DEGRADED\)
basic/tasks/rados_cls_all.yaml:    - \(PG_AVAILABILITY\)
basic/tasks/rados_python.yaml:    - \(PG_
basic/tasks/repair_test.yaml:      - \(PG_
basic/tasks/scrub_test.yaml:    - \(PG_
dashboard/tasks/dashboard.yaml:        - \(PG_
mgr/tasks/crash.yaml:        - \(PG_
mgr/tasks/failover.yaml:        - \(PG_
mgr/tasks/insights.yaml:        - \(PG_
mgr/tasks/module_selftest.yaml:        - \(PG_
mgr/tasks/progress.yaml:        - \(PG_
mgr/tasks/prometheus.yaml:        - \(PG_
mgr/tasks/workunits.yaml:        - \(PG_
monthrash/ceph.yaml:# slow mons -> slow peering -> PG_AVAILABILITY
monthrash/ceph.yaml:      - \(PG_AVAILABILITY\)
monthrash/workloads/rados_api_tests.yaml:      - \(PG_
monthrash/workloads/rados_mon_workunits.yaml:    - \(PG_
multimon/tasks/mon_clock_with_skews.yaml:    - \(PG_
multimon/tasks/mon_recovery.yaml:      - \(PG_AVAILABILITY\)
objectstore/backends/ceph_objectstore_tool.yaml:      - \(PG_
perf/ceph.yaml:      - \(PG_
rest/mgr-restful.yaml:      - \(PG_
singleton-bluestore/all/cephtool.yaml:    - \(PG_
singleton-bluestore/all/cephtool.yaml:    - \(SMALLER_PG_NUM\)
singleton-nomsgr/all/balancer.yaml:      - \(PG_AVAILABILITY\)
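
To make the "restrictive enough" question concrete, here is a minimal sketch, assuming each ignorelist entry is applied as a regular expression searched against a cluster log line (the escaped parentheses in the grep output suggest this, but the harness code is not shown in this thread). The sample lines and the `matched_codes` helper are hypothetical. The PG health codes come from the listing above, and `MON_DOWN` is added only as a non-PG control.

```python
import re

# Illustrative log lines: the message wording is made up for this sketch;
# only the trailing "(CODE)" convention mirrors the patterns listed above.
sample_lines = [
    "cluster [WRN] Health check failed: example message (PG_AVAILABILITY)",
    "cluster [WRN] Health check failed: example message (PG_DEGRADED)",
    "cluster [WRN] Health check failed: example message (SMALLER_PG_NUM)",
    "cluster [WRN] Health check failed: example message (MON_DOWN)",
]


def matched_codes(pattern, lines):
    """Return the health codes of the lines that the pattern would silence."""
    rx = re.compile(pattern)
    # The code is whatever sits inside the last pair of parentheses.
    return [line.rsplit("(", 1)[1].rstrip(")") for line in lines if rx.search(line)]


# From broadest to narrowest: bare substring, anchored prefix, exact code.
for pattern in (r"PG_", r"\(PG_", r"\(PG_AVAILABILITY\)"):
    print(f"{pattern!r:>24} silences {matched_codes(pattern, sample_lines)}")
```

Under these assumptions, the unescaped `PG_` from the review snippet also catches `SMALLER_PG_NUM`, while the `\(PG_` form used elsewhere in the suites only catches codes that begin with `PG_`.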

@athanatos
Contributor

athanatos commented Feb 14, 2024

s/whitelist/ignorelist in PR/commit message. Never mind, it's a backport.

@vshankar
Contributor

Changes look fine, but waiting for RCA on fs related issue I detailed in https://trello.com/c/3cEnuGqr/1952-wip-yuri10-testing-2024-02-08-0854-pacific.

On PTO today - will have a look tomorrow.

@ljflores ljflores force-pushed the wip-tracker-64343-pacific branch from bedfc49 to 41aa181 Compare February 15, 2024 19:37
@ljflores
Member Author

ljflores commented Feb 15, 2024

Made some final adjustments for the rados suite based on latest test results.

All results found here: https://pulpito.ceph.com/?branch=wip-yuri10-testing-2024-02-08-0854-pacific

All expected warnings on the core side have been addressed (unless there's something I missed due to a nondeterministic test scenario).

Remaining warnings are from MDS or Cephadm daemons.

Will have a final summary posted soon.

@ljflores
Member Author

ljflores commented Feb 15, 2024

Failures look acceptable on the core side: https://tracker.ceph.com/projects/rados/wiki/PACIFIC

There were some new warnings from cephadm, but @adk3798 had a look and didn't view them as problematic to testing the release, so I simply raised some new tickets to track them here:

@vshankar on the latest run (https://pulpito.ceph.com/lflores-2024-02-15_17:32:20-rados-wip-yuri10-testing-2024-02-08-0854-pacific-distro-default-smithi/), I see these new MDS/filesystem warnings. Do you want to take care of whitelisting those in this PR, or raise tracker tickets and handle them separately?

  1. https://pulpito.ceph.com/lflores-2024-02-15_17:32:20-rados-wip-yuri10-testing-2024-02-08-0854-pacific-distro-default-smithi/7561653
cluster [WRN] Replacing daemon mds.a.smithi005.lvnvyk as rank 0 with standby daemon mds.user_test_fs.smithi005.shkleu" in cluster log
  2. https://pulpito.ceph.com/lflores-2024-02-15_17:32:20-rados-wip-yuri10-testing-2024-02-08-0854-pacific-distro-default-smithi/7561657
cluster [WRN] Health check failed: insufficient standby MDS daemons available (MDS_INSUFFICIENT_STANDBY)" in cluster log
  3. https://pulpito.ceph.com/lflores-2024-02-15_17:32:20-rados-wip-yuri10-testing-2024-02-08-0854-pacific-distro-default-smithi/7561662
cluster [WRN] Health detail: HEALTH_WARN 1 filesystem with deprecated feature inline_data" in cluster log
  4. https://pulpito.ceph.com/lflores-2024-02-15_17:32:20-rados-wip-yuri10-testing-2024-02-08-0854-pacific-distro-default-smithi/7561664
cluster [WRN] daemon mds.cephfs.smithi091.sntctf compat changed unexpectedly" in cluster log

@vshankar
Contributor

Failures look acceptable on the core side: https://tracker.ceph.com/projects/rados/wiki/PACIFIC

There were some new warnings from cephadm, but @adk3798 had a look and didn't view them as problematic to testing the release, so I simply raised some new tickets to track them here:

@vshankar on the latest run (https://pulpito.ceph.com/lflores-2024-02-15_17:32:20-rados-wip-yuri10-testing-2024-02-08-0854-pacific-distro-default-smithi/), I see these new MDS/filesystem warnings. Do you want to take care of whitelisting those in this PR, or raise tracker tickets and handle them separately?

I'm going through the failures now. If they aren't related to any underlying cephfs issue, we can add those to the ignore list in this change.

  1. https://pulpito.ceph.com/lflores-2024-02-15_17:32:20-rados-wip-yuri10-testing-2024-02-08-0854-pacific-distro-default-smithi/7561653
cluster [WRN] Replacing daemon mds.a.smithi005.lvnvyk as rank 0 with standby daemon mds.user_test_fs.smithi005.shkleu" in cluster log

test_nfs.py::test_cluster_set_reset_user_config() creates a cephfs volume user_test_fs. cephadm deploys MDS daemons and an MDS becomes active for the filesystem, but the mons then choose a standby MDS to replace the active MDS:

2024-02-15T18:20:15.691 INFO:journalctl@ceph.mon.a.smithi005.stdout:Feb 15 18:20:15 smithi005 ceph-f6315b2e-cc2d-11ee-95ba-87774f69a715-mon-a[38343]: cluster 2024-02-15T18:20:14.544744+0000 mon.a (mon.0) 508 : cluster [INF] Dropping low affinity active daemon mds.a.smithi005.lvnvyk in favor of higher affinity standby.
2024-02-15T18:20:15.691 INFO:journalctl@ceph.mon.a.smithi005.stdout:Feb 15 18:20:15 smithi005 ceph-f6315b2e-cc2d-11ee-95ba-87774f69a715-mon-a[38343]: cluster 2024-02-15T18:20:14.544763+0000 mon.a (mon.0) 509 : cluster [WRN] Replacing daemon mds.a.smithi005.lvnvyk as rank 0 with standby daemon mds.user_test_fs.smithi005.shkleu

@adk3798 I don't see config set mds_join_fs ... being executed in teuthology.log, but I see src/pybind/mgr/cephadm/services/cephadmservice.py::MdsService::config() does config set mds_join_fs ..., so that's getting applied somehow.

  2. https://pulpito.ceph.com/lflores-2024-02-15_17:32:20-rados-wip-yuri10-testing-2024-02-08-0854-pacific-distro-default-smithi/7561657
cluster [WRN] Health check failed: insufficient standby MDS daemons available (MDS_INSUFFICIENT_STANDBY)" in cluster log

See: #55601 (review)

(under discussion, but looks like we might have to silence this warning)


  3. https://pulpito.ceph.com/lflores-2024-02-15_17:32:20-rados-wip-yuri10-testing-2024-02-08-0854-pacific-distro-default-smithi/7561662
cluster [WRN] Health detail: HEALTH_WARN 1 filesystem with deprecated feature inline_data" in cluster log

  4. https://pulpito.ceph.com/lflores-2024-02-15_17:32:20-rados-wip-yuri10-testing-2024-02-08-0854-pacific-distro-default-smithi/7561664
cluster [WRN] daemon mds.cephfs.smithi091.sntctf compat changed unexpectedly" in cluster log

All warnings seem like a fallout from #54312 to me.

@vshankar
Contributor

Failures look acceptable on the core side: https://tracker.ceph.com/projects/rados/wiki/PACIFIC

There were some new warnings from cephadm, but @adk3798 had a look and didn't view them as problematic to testing the release, so I simply raised some new tickets to track them here:

@vshankar on the latest run (https://pulpito.ceph.com/lflores-2024-02-15_17:32:20-rados-wip-yuri10-testing-2024-02-08-0854-pacific-distro-default-smithi/), I see these new MDS/filesystem warnings. Do you want to take care of whitelisting those in this PR, or raise tracker tickets and handle them separately?

We'll want to tackle that separately: it's about time we fixed such unnecessary warnings, since they also show up in main branch runs. So, I'll create a tracker for that. As far as this PR is concerned, we have two options:

  • Merge this as it is with the fs related warnings (I'll approve the change)
  • Add the warnings to white/ignore list to silence them

If time permits, can we add these warnings to the ignore list? @ljflores

@adk3798
Contributor

adk3798 commented Feb 16, 2024

@adk3798 I don't see config set mds_join_fs ... being executed in teuthology.log, but I see src/pybind/mgr/cephadm/services/cephadmservice.py::MdsService::config() does config set mds_join_fs ..., so that's getting applied somehow.

Yeah, if cephadm deployed it, it should have run mds_join_fs for it. I'm never sure in some tests where things are deployed through the roles in teuthology if it's going through the normal path or not though.

  2. https://pulpito.ceph.com/lflores-2024-02-15_17:32:20-rados-wip-yuri10-testing-2024-02-08-0854-pacific-distro-default-smithi/7561657
cluster [WRN] Health check failed: insufficient standby MDS daemons available (MDS_INSUFFICIENT_STANDBY)" in cluster log

We scale the fs down to a single MDS during upgrade, so I think this warning is expected and is just popping up now because of the change to get log scraping working, as you were thinking.

@ljflores
Member Author

Adding to ignorelist...
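
As a rough illustration of what silencing these four warnings involves, the sketch below checks a set of candidate patterns against the warning lines quoted earlier in this thread. The patterns in `candidate_ignorelist` and the `unmatched` helper are hypothetical and are not the entries actually committed in this PR. The sketch also assumes regex search semantics, which may differ in detail from how the test harness scrapes the cluster log.

```python
import re

# Warning lines as quoted in this thread.
warnings = [
    "cluster [WRN] Replacing daemon mds.a.smithi005.lvnvyk as rank 0 "
    "with standby daemon mds.user_test_fs.smithi005.shkleu",
    "cluster [WRN] Health check failed: insufficient standby MDS daemons "
    "available (MDS_INSUFFICIENT_STANDBY)",
    "cluster [WRN] Health detail: HEALTH_WARN 1 filesystem with deprecated "
    "feature inline_data",
    "cluster [WRN] daemon mds.cephfs.smithi091.sntctf compat changed unexpectedly",
]

# Hypothetical candidate entries, one per warning (not taken from the commit).
candidate_ignorelist = [
    r"Replacing daemon mds",
    r"\(MDS_INSUFFICIENT_STANDBY\)",
    r"deprecated feature inline_data",
    r"compat changed unexpectedly",
]


def unmatched(lines, patterns):
    """Return the lines that no pattern matches, i.e. those still reported."""
    compiled = [re.compile(p) for p in patterns]
    return [line for line in lines if not any(rx.search(line) for rx in compiled)]


remaining = unmatched(warnings, candidate_ignorelist)
print(f"{len(remaining)} warning(s) would still be reported")
```

Checking candidate patterns against real log lines like this before pushing can catch an entry that is too narrow to match, which is cheaper than discovering it in the next test run.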

@ljflores ljflores force-pushed the wip-tracker-64343-pacific branch from 41aa181 to 6472e62 Compare February 16, 2024 16:51
Contributor

@vshankar vshankar left a comment

…ents

Semi-backport of 00fc796. Some changes
had to be made though for yaml files and warnings that are specific to pacific.

Fixes: https://tracker.ceph.com/issues/64343
Signed-off-by: Laura Flores <lflores@ibm.com>
@ljflores ljflores force-pushed the wip-tracker-64343-pacific branch from 6472e62 to 275f1a4 Compare February 16, 2024 19:09
@ljflores
Member Author

Latest test results are pretty clean, barring accepted cephadm warnings and one more FS warning. I've tracked the FS warning here (https://tracker.ceph.com/issues) so that final QA for 16.2.15 is no longer blocked, but we can still reference it in the result summary.

@ljflores ljflores merged commit 81bd20d into ceph:pacific Feb 16, 2024
@ljflores ljflores deleted the wip-tracker-64343-pacific branch February 16, 2024 22:23
- \(CACHE_POOL_NO_HIT_SET\)
- \(PG_
- \(OSD_
- mons down:
Member

So this causes https://tracker.ceph.com/issues/64452

I don't think we need a colon behind mons down
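
A minimal sketch of why the trailing colon matters, assuming the entry is used as a regular expression against log lines. The sample `MON_DOWN` line and the `matches` helper are illustrative assumptions, not taken from the linked tracker.

```python
import re

# Illustrative MON_DOWN log line; the exact wording is an assumption.
line = "cluster [WRN] Health check failed: 1/3 mons down, quorum a,b (MON_DOWN)"


def matches(pattern, text):
    """True if the ignorelist-style pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None


print("with colon:   ", matches(r"mons down:", line))
print("without colon:", matches(r"mons down", line))
```

Separately, in YAML a list item written as `- mons down:` is parsed as a mapping with the key `mons down` rather than as a plain string, so the colon can change the type of the entry as well as what it matches.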

7 participants