qa/tasks/ceph_manager.py: increase test_pool_min_size timeout#47138
Conversation
In `ceph_manager.py`:
1. When `test_pool_min_size` hits the case where not all PGs are active or peered, dump each PG whose state is neither active nor peered.
2. Improve the log message in `inject_pause()`.
3. Add logging to `test_map_discontinuity()`.
4. In `choose_action()`, add more logging around `chance_down`.
5. Add more logging to `primary_affinity()`, `thrash_pg_upmap_items()`, and `thrash_pg_upmap()`.
6. Make `self.is_clean()` dump the PGs that are not active+clean.
Signed-off-by: Kamoltat <ksirivad@redhat.com>
300 jobs were scheduled with the commits from this PR.
In `test_pool_min_size()`:
1. Provided buffer time before we check for recovery in `ceph_manager.wait_for_recovery()`.
2. Increased the timeout in `ceph_manager.wait_for_clean()`.
3. Increased the sleep time for `ceph_manager.all_active_or_peered()`.
Fixes:
https://tracker.ceph.com/issues/49777
https://tracker.ceph.com/issues/54511
https://tracker.ceph.com/issues/51904
Signed-off-by: Kamoltat <ksirivad@redhat.com>
8761dfa to ed73288
jenkins test windows

jenkins test api
ljflores
left a comment
Looks really good! I just had a suggestion for the logging. As for testing, I feel like the 300 tests you have already run are sufficient.
    def dump_pgs_not_active_peered(self, pgs):
        for pg in pgs:
            if (not pg['state'].count('active')) and (not pg['state'].count('peered')):
                self.log('PG %s is not active or peered' % pg['pgid'])
                self.log(pg)
Can we put all the "not active or peered" pgs into a list and dump them at the end instead of logging them one at a time?
@ljflores
For me, it is easier to look at each PG's stats individually rather than in one big list. Also, I was following the convention already in use before me: https://github.com/ceph/ceph/blob/main/qa/tasks/ceph_manager.py#L2689-L2717
Let me know what you think!
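For reference, the list-based variant the reviewer suggested could be sketched as a standalone helper. This is a hedged illustration, not code from the PR; the PG dicts mimic entries from `ceph pg dump` as used in the snippet above, and the original code's `pg['state'].count('active')` substring check is equivalent to `'active' in pg['state']`:

```python
def pgs_not_active_peered(pgs):
    """Return the PGs whose state string contains neither
    'active' nor 'peered', so they can be logged in one batch
    instead of one log line per PG."""
    return [pg for pg in pgs
            if 'active' not in pg['state'] and 'peered' not in pg['state']]

# Example usage with pg-dump-shaped dicts:
pgs = [
    {'pgid': '1.0', 'state': 'active+clean'},
    {'pgid': '1.1', 'state': 'down'},
    {'pgid': '1.2', 'state': 'peered'},
]
stuck = pgs_not_active_peered(pgs)
print([pg['pgid'] for pg in stuck])
```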
@ljflores @NitzanMordhai PTAL
neha-ojha
left a comment
Let's verify from the teuthology logs that test_pool_min_size is performing the sequence of steps correctly.
Problem
In tests that thrash OSDs with EC pools, using
`test_pool_min_size` as one of the actions, there are some
edge cases that make OSD recovery take longer than normal.
One of these edge cases is killing and reviving two OSDs
consecutively that share the same acting set with a PG. This
can leave the PG in the `down` state temporarily while it
waits for more OSDs in its acting set to come back up. The
amount of time we give the recovery process is too short,
which leads to issues like https://tracker.ceph.com/issues/49777
and https://tracker.ceph.com/issues/51904. Moreover,
in some cases we also did not give the test enough time
to track the OSD recovery process properly,
as we can observe from issues like https://tracker.ceph.com/issues/54511.
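The risky interleaving described above can be reasoned about from PG stats: a PG can go `down` when every OSD just killed appears in its acting set. A minimal sketch of that check (the function name is illustrative and not from `ceph_manager.py`; the dict shape loosely follows `ceph pg dump` output):

```python
def pgs_at_risk(pg_stats, killed_osds):
    """Return the pgids whose acting set contains every OSD
    we just killed.  Such a PG can temporarily go 'down'
    while it waits for OSDs in its acting set to come back up."""
    killed = set(killed_osds)
    return [pg['pgid'] for pg in pg_stats
            if killed.issubset(pg['acting'])]

# Example: killing osd.0 and osd.1 back-to-back puts PG 2.0 at risk,
# since both are in its acting set.
stats = [
    {'pgid': '2.0', 'acting': [0, 1, 2]},
    {'pgid': '2.1', 'acting': [3, 4, 5]},
]
print(pgs_at_risk(stats, [0, 1]))
```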
Solution
1. Provided buffer time before we check for recovery in `ceph_manager.wait_for_recovery()`. This addresses https://tracker.ceph.com/issues/54511.
2. Increased the timeout in `ceph_manager.wait_for_clean()`. This addresses https://tracker.ceph.com/issues/51904.
3. Increased the sleep time for `ceph_manager.all_active_or_peered()`. This addresses https://tracker.ceph.com/issues/49777.
4. Improved logging in `ceph_manager.py` to make future debugging easier for developers.
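Generically, the shape of the fix is: sleep for a buffer before the first recovery check, then poll with a longer overall timeout. A simplified sketch under stated assumptions (function and parameter names here are illustrative, not the exact ones in `ceph_manager.py`; `sleep` and `clock` are injectable only to make the sketch testable):

```python
import time

def wait_for_recovery(is_recovered, timeout=600, buffer_time=30, poll=5,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll is_recovered() until it returns True or the timeout expires.

    buffer_time gives recovery a head start before the first check,
    avoiding a false negative right after OSDs are revived.
    """
    sleep(buffer_time)            # let recovery begin before checking
    deadline = clock() + timeout  # timeout budget starts after the buffer
    while not is_recovered():
        if clock() > deadline:
            raise TimeoutError('recovery did not complete in %ds' % timeout)
        sleep(poll)
```

Widening `timeout` and `buffer_time` trades a slower failure signal for fewer false failures on slow-recovering EC pools, which is the trade-off the PR makes.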
Fixes:
https://tracker.ceph.com/issues/49777
https://tracker.ceph.com/issues/54511
https://tracker.ceph.com/issues/51904
Signed-off-by: Kamoltat <ksirivad@redhat.com>
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.