…_deletes_crimson

Reverts 9c2d11a.

PeeringListener::rebuild_missing_set_with_deletes should only be invoked upon advancing from an OSDMap where CEPH_OSDMAP_RECOVERY_DELETES is not set to one where it is. That shouldn't be possible for a crimson cluster as CEPH_OSDMAP_RECOVERY_DELETES should be set as long as require_osd_release >= luminous, which was quite a few versions ago. OSDMonitor::create_initial() defaults to squid, or as old as quincy with config options.

I'm not sure this actually worked as it uses get0() on a seastar::future<>, which won't correctly deal with the thread-local interrupt_cond created by the interruptor::async wrapper in PG::do_peering_event.

Additionally, PeeringListener::rebuild_missing_set_with_deletes would have been invoked under PeeringState::on_new_interval() while processing an AdvMap event, but PG::handle_advance_map doesn't actually invoke peering_state.advance_map under seastar::async or interruptor::async.

Signed-off-by: Samuel Just <sjust@redhat.com>
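For readers unfamiliar with the flag, the condition in the commit message (rebuild only when advancing from a map without CEPH_OSDMAP_RECOVERY_DELETES to one with it) can be sketched as a small predicate. This is an illustration only: `kRecoveryDeletes` and `needs_missing_set_rebuild` are hypothetical names, and the bit value is a placeholder rather than the one defined in the Ceph headers.

```cpp
#include <cstdint>

// Hypothetical stand-in for the CEPH_OSDMAP_RECOVERY_DELETES bit. The real
// value is defined in the Ceph headers and is not reproduced here.
constexpr std::uint32_t kRecoveryDeletes = 1u << 0;

// True only for the transition "flag unset in the old map, set in the new
// map". Every other combination means no rebuild would be required.
constexpr bool needs_missing_set_rebuild(std::uint32_t old_flags,
                                         std::uint32_t new_flags) {
  return !(old_flags & kRecoveryDeletes) && (new_flags & kRecoveryDeletes);
}
```

If the flag is already set in every map a crimson OSD can see, as the commit message argues, this predicate never returns true, which is why the reverted code path is considered unreachable.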
|
crimson-rados/thrash --limit 10 sjust-2024-07-06_06:23:37-crimson-rados:thrash-wip-sjust-crimson-testing-2024-07-05-distro-default-smithi: one notable failure, looks good otherwise
Full crimson-rados run pending https://pulpito.ceph.com/sjust-2024-07-07_00:37:55-crimson-rados-wip-sjust-crimson-testing-2024-07-05-distro-default-smithi/ |
|
jenkins test make check |
|
jenkins test api |
|
jenkins test make check arm64 |
    pg->handle_initialize(rctx);
    pg->handle_activate_map(rctx);
    return interruptor::async([this, pg] {
    return interruptor::async([this, pg, &shard_services] {
      pg->do_peering_event(evt, ctx);
      complete_rctx(shard_services, pg).get();
Should we call complete_rctx within do_peering_event to avoid future incorrect users?
I'd rather leave it as is -- pg_advance_map does potentially multiple events before calling complete_rctx.
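The design point in this reply, several events accumulating into one context with a single completion call at the end, can be sketched roughly as below. All names (`RecordingCtx`, `do_event`, `complete`, `advance_through`) are hypothetical stand-ins and not the crimson types; the sketch is synchronous, whereas the real code works with futures.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a peering context that accumulates side effects.
struct RecordingCtx {
  std::vector<std::string> pending;    // effects queued by events
  std::vector<std::string> submitted;  // effects flushed so far
  int flushes = 0;                     // number of completion calls
};

// Each event only records work into the context; it does not flush.
inline void do_event(RecordingCtx& ctx, const std::string& name) {
  ctx.pending.push_back(name);
}

// The caller flushes once, after running as many events as it needs.
inline void complete(RecordingCtx& ctx) {
  for (auto& e : ctx.pending) {
    ctx.submitted.push_back(std::move(e));
  }
  ctx.pending.clear();
  ++ctx.flushes;
}

// Mirrors the "multiple events, then one completion" shape described above.
inline RecordingCtx advance_through(const std::vector<std::string>& events) {
  RecordingCtx ctx;
  for (const auto& e : events) {
    do_event(ctx, e);
  }
  complete(ctx);
  return ctx;
}
```

Folding the completion call into each event would flush after every step, which is the behaviour the reply wants to avoid for callers that process several events in a row.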
|
jenkins test api |
See comment and https://tracker.ceph.com/issues/66708. Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
…:process stage Otherwise, transactions and messages might be submitted out of order. Signed-off-by: Samuel Just <sjust@redhat.com>
a36f5ac to 048e341
|
jenkins test make check |
|
Looking at the bluestore failures, this one looks new (may relate to #58463 because of InternalClientRequest edit): Same here: Also this one, although Looks unrelated to me, what do you think? |
048e341 to 059564c
|
I removed the Fixes: line from the last commit message as I saw an instance of the AsyncReserver crash with this PR merged in another test run. I'd still like to go ahead and merge this as it's a useful cleanup and does seem to have greatly reduced the incidence rate, but there seems to be another way for the reservation cancel and request to reorder. Will update bug/post new PR when I have that figured out. |
Yeah, that one was a bug in the other PR, pushed a fix.
That's a crash during AlienStore::open_collection as part of load_pgs -- https://tracker.ceph.com/issues/66294. I'm testing a fix for that one now -- coll_map needs a mutex or we get undefined behavior as it's accessed from different reactor threads concurrently. |
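As a rough illustration of the kind of guard described here, the sketch below wraps a map in a mutex so that lookup and insertion happen under one lock. It is not the AlienStore code: `Collection`, `GuardedCollMap`, and `open_or_create` are hypothetical names, and `std::mutex` is an assumption; the actual fix may use a different primitive.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical stand-in for a collection handle.
struct Collection {
  std::string cid;
};

class GuardedCollMap {
 public:
  // Returns the existing entry or creates one. The lock covers both the
  // lookup and the insert, so concurrent callers cannot race on the map.
  std::shared_ptr<Collection> open_or_create(const std::string& cid) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto& slot = map_[cid];
    if (!slot) {
      slot = std::make_shared<Collection>(Collection{cid});
    }
    return slot;
  }

  std::size_t size() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return map_.size();
  }

 private:
  mutable std::mutex mutex_;
  std::map<std::string, std::shared_ptr<Collection>> map_;
};
```

Without the lock, two threads inserting into the same `std::map` at once is a data race, which matches the undefined behavior mentioned in the comment.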
|
jenkins test api |
|
Modifying PR to address the crash I mentioned above -- let's hold off on merging it until I've got the new version tested. |
…tions Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Samuel Just <sjust@redhat.com>
- Adds ShardServices::singleton_orderer_t mechanism to ensure that OSDSingleton calls are completed in order.
- Updates ShardServices accessors invoked from PeeringListener handlers to use orderer.
- Updates PGListener handlers and complete_rctx to use orderer.

Fixes: https://tracker.ceph.com/issues/66316
Signed-off-by: Samuel Just <sjust@redhat.com>
059564c to e12e92c
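The general idea of an orderer, running calls in the order they were issued even if they become ready out of order, can be sketched as follows. This is not the singleton_orderer_t implementation from the PR, whose internals are not shown in this conversation; the `Orderer` class, its ticket scheme, and its method names are assumptions made for illustration, and the sketch is single-threaded and synchronous.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

// Hypothetical orderer: callers reserve a slot when they issue a call and
// hand over the work when it becomes ready. Work runs strictly in slot order.
class Orderer {
 public:
  // Reserve a position in the order at submission time.
  std::uint64_t reserve() { return next_ticket_++; }

  // Mark a reserved slot as ready, then run every ready task at the head of
  // the order. Tasks behind a slot that is not yet ready stay queued.
  void ready(std::uint64_t ticket, std::function<void()> task) {
    assert(ticket >= next_to_run_);  // a slot must not be completed twice
    ready_.emplace(ticket, std::move(task));
    while (!ready_.empty() && ready_.begin()->first == next_to_run_) {
      auto fn = std::move(ready_.begin()->second);
      ready_.erase(ready_.begin());
      ++next_to_run_;
      fn();
    }
  }

 private:
  std::uint64_t next_ticket_ = 0;  // next slot to hand out
  std::uint64_t next_to_run_ = 0;  // lowest slot that has not run yet
  std::map<std::uint64_t, std::function<void()>> ready_;
};
```

In this sketch a reservation cancel issued after a request would always run after it, regardless of which one becomes ready first, which is the reordering hazard the commit message sets out to prevent.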
|
10 job rados/thrash run, no failures: https://pulpito.ceph.com/sjust-2024-07-23_08:42:01-crimson-rados:thrash-wip-sjust-crimson-testing-2024-07-22-distro-default-smithi/ |
|
https://pulpito.ceph.com/matan-2024-07-24_07:30:51-crimson-rados-wip-sjust-crimson-testing-2024-07-22-distro-crimson-smithi/ |
|
jenkins test windows |
…server crimson: peering event processing fixes, wait for async operations started during peering events Reviewed-by: Matan Breizman <mbreizma@redhat.com>