Bug #55141


thrashers/fastread: assertion failure: rollback_info_trimmed_to == head

Added by Radoslaw Zarzynski almost 4 years ago. Updated 9 months ago.

Status: In Progress
Priority: Normal
% Done: 0%
Backport: reef,squid
Regression: No
Severity: 3 - minor

Description

From the /home/teuthworker/archive/yuriw-2022-03-29_21:35:32-rados-wip-yuri5-testing-2022-03-29-1152-quincy-distro-default-smithi/6767850/teuthology.log:

2022-03-30T01:34:59.499 INFO:tasks.ceph.osd.2.smithi080.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.1.0-125-g9053ed98/rpm/el8/BUILD/ceph-17.1.0-125-g9053ed98/src/osd/PGLog.h: In function 'void PGLog::IndexedLog::claim_log_and_clear_rollback_info(const pg_log_t&)' thread 7fa658b53700 time 2022-03-30T01:34:59.459273+0000
2022-03-30T01:34:59.499 INFO:tasks.ceph.osd.2.smithi080.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.1.0-125-g9053ed98/rpm/el8/BUILD/ceph-17.1.0-125-g9053ed98/src/osd/PGLog.h: 286: FAILED ceph_assert(rollback_info_trimmed_to == head)
2022-03-30T01:34:59.499 INFO:tasks.ceph.osd.2.smithi080.stderr: ceph version 17.1.0-125-g9053ed98 (9053ed984698b7140d91d3195fcba61aa554fe69) quincy (stable)
2022-03-30T01:34:59.499 INFO:tasks.ceph.osd.2.smithi080.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x55d6569c2464]
2022-03-30T01:34:59.500 INFO:tasks.ceph.osd.2.smithi080.stderr: 2: ceph-osd(+0x5d7685) [0x55d6569c2685]
2022-03-30T01:34:59.500 INFO:tasks.ceph.osd.2.smithi080.stderr: 3: (PeeringState::Stray::react(MLogRec const&)+0x3d0) [0x55d656de5390]
2022-03-30T01:34:59.500 INFO:tasks.ceph.osd.2.smithi080.stderr: 4: (boost::statechart::simple_state<PeeringState::Stray, PeeringState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x280) [0x55d656e1a6c0]
2022-03-30T01:34:59.500 INFO:tasks.ceph.osd.2.smithi080.stderr: 5: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x74) [0x55d656b90c54]
2022-03-30T01:34:59.500 INFO:tasks.ceph.osd.2.smithi080.stderr: 6: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d6) [0x55d656b84ea6]
2022-03-30T01:34:59.501 INFO:tasks.ceph.osd.2.smithi080.stderr: 7: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x175) [0x55d656afa3c5]
2022-03-30T01:34:59.501 INFO:tasks.ceph.osd.2.smithi080.stderr: 8: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x55d656d915b6]
2022-03-30T01:34:59.501 INFO:tasks.ceph.osd.2.smithi080.stderr: 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xaf8) [0x55d656aec0e8]
2022-03-30T01:34:59.501 INFO:tasks.ceph.osd.2.smithi080.stderr: 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55d6571f1a64]
2022-03-30T01:34:59.502 INFO:tasks.ceph.osd.2.smithi080.stderr: 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55d6571f2e04]
2022-03-30T01:34:59.502 INFO:tasks.ceph.osd.2.smithi080.stderr: 12: /lib64/libpthread.so.0(+0x817f) [0x7fa684d7017f]
2022-03-30T01:34:59.502 INFO:tasks.ceph.osd.2.smithi080.stderr: 13: clone()
2022-03-30T01:34:59.502 INFO:tasks.ceph.osd.2.smithi080.stderr:*** Caught signal (Aborted) **
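
For context, the failing check lives in PGLog::IndexedLog::claim_log_and_clear_rollback_info(), which runs when a stray OSD discards its own log and adopts the authoritative one sent by the primary (see PeeringState::Stray::react(MLogRec) in the backtrace). A minimal sketch of the invariant being asserted, with heavily simplified stand-in types (illustrative only, not the actual Ceph source):

  #include <cassert>
  #include <cstdint>

  // Simplified stand-ins for Ceph's eversion_t / pg_log_t; only the shape of
  // the check at src/osd/PGLog.h:286 is taken from the report above.
  struct eversion_t {
    uint64_t epoch = 0, version = 0;
    bool operator==(const eversion_t&) const = default;
  };

  struct pg_log_t {
    eversion_t head;             // newest entry in the log
    eversion_t can_rollback_to;  // crt: newest entry we can still roll back past
    // ... log entries and rollback metadata elided ...
  };

  struct IndexedLogSketch : pg_log_t {
    // How far rollback metadata has been trimmed; the invariant is that
    // trimming has caught up with head before the old log is discarded.
    eversion_t rollback_info_trimmed_to;

    void claim_log_and_clear_rollback_info(const pg_log_t& other) {
      assert(rollback_info_trimmed_to == head);  // the failing ceph_assert
      static_cast<pg_log_t&>(*this) = other;     // adopt the authoritative log
    }
  };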

Related issues: 2 (1 open, 1 closed)

Related to RADOS - Bug #60084: crash: void std::list<pg_log_entry_t, mempool::pool_allocator<(mempool::pool_index_t), pg_log_entry_t> >::_M_insert<pg_log_entry_t const&>(std::_List_iterator<pg_log_entry_t>, pg_log_entry_t const&) (New)

Has duplicate RADOS - Bug #57913: Thrashosd: timeout 120 ceph --cluster ceph osd pool rm unique_pool_2 unique_pool_2 --yes-i-really-really-mean-it (Duplicate, Nitzan Mordechai)

Actions #1

Updated by Laura Flores almost 4 years ago

Leading up to the ceph_assert failure in osd.2:

/a/yuriw-2022-03-29_21:35:32-rados-wip-yuri5-testing-2022-03-29-1152-quincy-distro-default-smithi/6767850/remote/smithi080/log/ceph-osd.2.log.gz

2022-03-30T01:34:59.458+0000 7fa65e35e700 10 osd.2 pg_epoch: 844 pg[6.13s1( v 824'2298 lc 771'314 (0'0,824'2298] local-lis/les=840/841 n=1265 ec=782/768 lis/c=809/782 les/c/f=810/783/0 sis=840) [3,2,0]/[NONE,2,0]p2(1) async=[3(0)] r=1 lpr=840 pi=[782,840)/2 crt=824'2298 lcod 0'0 mlcod 0'0 activating+undersized+degraded+remapped m=898 mbc={0={(0+0)=838,(0+1)=350,(0+2)=10,(1+0)=1},1={(0+1)=898,(1+0)=177,(1+1)=124},2={(0+1)=897,(1+0)=177,(1+1)=125}}] search_for_missing 6:c87ee54b:::benchmark_data_smithi080_195491_object1379:head 770'88 is on osd.7(0)
2022-03-30T01:34:59.459+0000 7fa65e35e700 10 osd.2 pg_epoch: 844 pg[6.13s1( v 824'2298 lc 771'314 (0'0,824'2298] local-lis/les=840/841 n=1265 ec=782/768 lis/c=809/782 les/c/f=810/783/0 sis=840) [3,2,0]/[NONE,2,0]p2(1) async=[3(0)] r=1 lpr=840 pi=[782,840)/2 crt=824'2298 lcod 0'0 mlcod 0'0 activating+undersized+degraded+remapped m=898 mbc={0={(0+0)=838,(0+1)=349,(0+2)=11,(1+0)=1},1={(0+1)=898,(1+0)=177,(1+1)=124},2={(0+1)=897,(1+0)=177,(1+1)=125}}] search_for_missing 6:c883b7b1:::benchmark_data_smithi080_195491_object38312:head 804'2230 also missing on osd.7(0) (last_update 783'2113 < needed 804'2230)
2022-03-30T01:34:59.459+0000 7fa65e35e700 10 osd.2 pg_epoch: 844 pg[6.13s1( v 824'2298 lc 771'314 (0'0,824'2298] local-lis/les=840/841 n=1265 ec=782/768 lis/c=809/782 les/c/f=810/783/0 sis=840) [3,2,0]/[NONE,2,0]p2(1) async=[3(0)] r=1 lpr=840 pi=[782,840)/2 crt=824'2298 lcod 0'0 mlcod 0'0 activating+undersized+degraded+remapped m=898 mbc={0={(0+0)=838,(0+1)=349,(0+2)=11,(1+0)=1},1={(0+1)=898,(1+0)=177,(1+1)=124},2={(0+1)=897,(1+0)=177,(1+1)=125}}] search_for_missing 6:c88b487d:::benchmark_data_smithi080_195491_object187:head 770'12 is on osd.7(0)
2022-03-30T01:34:59.462+0000 7fa680793700  1 -- [v2:172.21.15.80:6826/205141,v1:172.21.15.80:6827/205141] <== osd.0 v2:172.21.15.80:6818/33883 46 ==== PGlog(6.0s2 log log((792'2126,828'2271], crt=828'2271) pi ([0,0] all_participants= intervals=) pg_lease(ru 0.000000000s ub 1576.709106445s int 16.000000000s) e845/844) v6 ==== 51420+0+0 (crc 0 0 0) 0x55d65e75a000 con 0x55d65dc4e000
2022-03-30T01:34:59.462+0000 7fa680793700 15 osd.2 845 enqueue_peering_evt 6.0s2 epoch_sent: 845 epoch_requested: 844 MLogRec from 0(0) log log((792'2126,828'2271], crt=828'2271) pi ([0,0] all_participants= intervals=) pg_lease(ru 0.000000000s ub 1576.709106445s int 16.000000000s) +create_info
2022-03-30T01:34:59.462+0000 7fa680793700 20 osd.2 op_wq(0) _enqueue OpSchedulerItem(6.0s2 PGPeeringEvent(epoch_sent: 845 epoch_requested: 844 MLogRec from 0(0) log log((792'2126,828'2271], crt=828'2271) pi ([0,0] all_participants= intervals=) pg_lease(ru 0.000000000s ub 1576.709106445s int 16.000000000s) +create_info) prio 255 cost 10 e845)
2022-03-30T01:34:59.463+0000 7fa658b53700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.1.0-125-g9053ed98/rpm/el8/BUILD/ceph-17.1.0-125-g9053ed98/src/osd/PGLog.h: In function 'void PGLog::IndexedLog::claim_log_and_clear_rollback_info(const pg_log_t&)' thread 7fa658b53700 time 2022-03-30T01:34:59.459273+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.1.0-125-g9053ed98/rpm/el8/BUILD/ceph-17.1.0-125-g9053ed98/src/osd/PGLog.h: 286: FAILED ceph_assert(rollback_info_trimmed_to == head)

Actions #2

Updated by Neha Ojha almost 4 years ago

  • Assignee set to Nitzan Mordechai
Actions #3

Updated by Sridhar Seshasayee almost 4 years ago

Observed this in a pacific run:
/a/yuriw-2022-06-15_18:29:33-rados-wip-yuri4-testing-2022-06-15-1000-pacific-distro-default-smithi/6881247

Although the job was marked dead, the crash information is available from smithi093 logs.

Test Description:
rados/thrash-erasure-code-overwrites/{bluestore-bitmap ceph clusters/{fixed-2 openstack} fast/normal mon_election/connectivity msgr-failures/osd-dispatch-delay rados recovery-overrides/{more-async-partial-recovery} supported-random-distro$/{centos_8} thrashers/pggrow thrashosds-health workloads/ec-small-objects-overwrites}

Backtrace:


  -424> 2022-06-15T20:31:19.951+0000 7f3fddfd2700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.9-239-g224fc22e/rpm/el8/BUILD/ceph-16.2.9-239-g224fc22e/src/osd/PGLog.h: In function 'void PGLog::IndexedLog::claim_log_and_clear_rollback_info(const pg_log_t&)' thread 7f3fddfd2700 time 2022-06-15T20:31:19.949134+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.9-239-g224fc22e/rpm/el8/BUILD/ceph-16.2.9-239-g224fc22e/src/osd/PGLog.h: 286: FAILED ceph_assert(rollback_info_trimmed_to == head)

 ceph version 16.2.9-239-g224fc22e (224fc22e07cebeecc3e08055cfd6105b1a30f173) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x557bf1bb69f4]
 2: ceph-osd(+0x580c0e) [0x557bf1bb6c0e]
 3: (PeeringState::Stray::react(MLogRec const&)+0x230) [0x557bf1f6d320]
 4: (boost::statechart::simple_state<PeeringState::Stray, PeeringState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xd5) [0x557bf1f99855]
 5: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x5b) [0x557bf1d80dab]
 6: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d1) [0x557bf1d758f1]
 7: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x29c) [0x557bf1cebedc]
 8: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x557bf1f1f1f6]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x557bf1cddc88]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x557bf235b8b4]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x557bf235e794]
 12: /lib64/libpthread.so.0(+0x81ca) [0x7f4005e491ca]
 13: clone()
Actions #4

Updated by Radoslaw Zarzynski over 3 years ago

  • Backport changed from quincy to pacific,quincy
Actions #5

Updated by Radoslaw Zarzynski over 3 years ago

  • Priority changed from High to Normal

Lowering the priority as we haven't seen a recurrence recently.

Actions #6

Updated by Radoslaw Zarzynski over 3 years ago

  • Has duplicate Bug #57913: Thrashosd: timeout 120 ceph --cluster ceph osd pool rm unique_pool_2 unique_pool_2 --yes-i-really-really-mean-it added
Actions #7

Updated by Radoslaw Zarzynski over 3 years ago

Well, just found a new occurrence.

Actions #8

Updated by Nitzan Mordechai over 3 years ago

Radoslaw Zarzynski wrote:

Well, just found a new occurrence.

Where can I find it?

Actions #9

Updated by Radoslaw Zarzynski over 3 years ago

Nitzan Mordechai wrote:

Radoslaw Zarzynski wrote:

Well, just found a new occurrence.

Where can I find it?

https://tracker.ceph.com/issues/57913#note-2

Actions #10

Updated by Kamoltat (Junior) Sirivadhna about 3 years ago

/a/yuriw-2023-03-02_00:09:05-rados-wip-yuri11-testing-2023-03-01-1424-distro-default-smithi/7191380

Actions #11

Updated by Nitzan Mordechai about 3 years ago

I'm probably missing something here, but I'll try to summarize my findings.

/a/yuriw-2023-03-02_00:09:05-rados-wip-yuri11-testing-2023-03-01-1424-distro-default-smithi/7191380

osd.2 hit the assert rollback_info_trimmed_to == head.
From the coredump:
head = {version = 1834, epoch = 682, ...}
rollback_info_trimmed_to = {version = 0, epoch = 0, ...}

but the pg info (note pgid.shard = 1):

(gdb) 
$10 = {pgid = {pgid = {m_pool = 5, m_seed = 36, static calc_name_buf_size = 36 '$'}, shard = {id = 1 '\001', static NO_SHARD = {id = -1 '\377', static NO_SHARD = <same as static member of an already seen type>}}, static calc_name_buf_size = 40 '('}, last_update = {version = 1834, epoch = 682, __pad = 0}, 
  last_complete = {version = 1834, epoch = 682, __pad = 0}, last_epoch_started = 686, last_interval_started = 685, last_user_version = 1253, log_tail = {version = 0, epoch = 0, __pad = 0}, last_backfill = {static POOL_META = -1, static POOL_TEMP_START = -2, oid = {name = ""}, snap = {val = 0}, hash = 0, 
    max = false, nibblewise_key_cache = 0, hash_reverse_bits = 0, pool = -9223372036854775808, nspace = "", key = ""}, purged_snaps = {_size = 0, m = std::map with 0 elements}, stats = {version = {version = 1253, epoch = 611, __pad = 0}, reported_seq = 1288, reported_epoch = 611, state = 12846082, 
    last_fresh = {tv = {tv_sec = 1677735243, tv_nsec = 167430097}}, last_change = {tv = {tv_sec = 1677735237, tv_nsec = 341281420}}, last_active = {tv = {tv_sec = 1677735243, tv_nsec = 167430097}}, last_peered = {tv = {tv_sec = 1677735243, tv_nsec = 167430097}}, last_clean = {tv = {tv_sec = 1677735235, 
        tv_nsec = 101799085}}, last_unstale = {tv = {tv_sec = 1677735243, tv_nsec = 167430097}}, last_undegraded = {tv = {tv_sec = 1677735237, tv_nsec = 130030467}}, last_fullsized = {tv = {tv_sec = 1677735237, tv_nsec = 124288530}}, log_start = {version = 0, epoch = 0, __pad = 0}, ondisk_log_start = {
      version = 0, epoch = 0, __pad = 0}, created = 598, last_epoch_clean = 599, parent = {m_pool = 0, m_seed = 0, static calc_name_buf_size = 36 '$'}, parent_split_bits = 6, last_scrub = {version = 0, epoch = 0, __pad = 0}, last_deep_scrub = {version = 0, epoch = 0, __pad = 0}, last_scrub_stamp = {tv = {
        tv_sec = 1677735223, tv_nsec = 1701357}}, last_deep_scrub_stamp = {tv = {tv_sec = 1677735223, tv_nsec = 1701357}}, last_clean_scrub_stamp = {tv = {tv_sec = 1677735223, tv_nsec = 1701357}}, last_scrub_duration = 0, stats = {sum = {num_bytes = 20529152, num_objects = 0, num_object_clones = 0, 
        num_object_copies = 0, num_objects_missing_on_primary = 0, num_objects_degraded = 0, num_objects_unfound = 0, num_rd = 0, num_rd_kb = 0, num_wr = 0, num_wr_kb = 0, num_scrub_errors = 0, num_objects_recovered = 0, num_bytes_recovered = 0, num_keys_recovered = 0, num_shallow_scrub_errors = 0, 
        num_deep_scrub_errors = 0, num_objects_dirty = 0, num_whiteouts = 0, num_objects_omap = 0, num_objects_hit_set_archive = 0, num_objects_misplaced = 0, num_bytes_hit_set_archive = 0, num_flush = 0, num_flush_kb = 0, num_evict = 0, num_evict_kb = 0, num_promote = 0, num_flush_mode_high = 0, 
        num_flush_mode_low = 0, num_evict_mode_some = 0, num_evict_mode_full = 0, num_objects_pinned = 0, num_objects_missing = 0, num_legacy_snapsets = 0, num_large_omap_objects = 0, num_objects_manifest = 0, num_omap_bytes = 0, num_omap_keys = 0, num_objects_repaired = 0}}, log_size = 1253, 
    log_dups_size = 0, ondisk_log_size = 1253, objects_scrubbed = 0, scrub_duration = 0, up = std::vector of length 3, capacity 4 = {7, 2, 5}, acting = std::vector of length 3, capacity 4 = {2147483647, 3, 5}, avail_no_missing = std::vector of length 3, capacity 3 = {{static NO_OSD = 2147483647, osd = 7, 
        shard = {id = 0 '\000', static NO_SHARD = {id = -1 '\377', static NO_SHARD = <same as static member of an already seen type>}}}, {static NO_OSD = 2147483647, osd = 0, shard = {id = 2 '\002', static NO_SHARD = {id = -1 '\377', static NO_SHARD = <same as static member of an already seen type>}}}, {
        static NO_OSD = 2147483647, osd = 2, shard = {id = 1 '\001', static NO_SHARD = {id = -1 '\377', static NO_SHARD = <same as static member of an already seen type>}}}}, object_location_counts = std::map with 2 elements = {[std::set with 3 elements = {[0] = {static NO_OSD = 2147483647, osd = 0, shard = {
            id = 2 '\002', static NO_SHARD = {id = -1 '\377', static NO_SHARD = <same as static member of an already seen type>}}}, [1] = {static NO_OSD = 2147483647, osd = 2, shard = {id = 1 '\001', static NO_SHARD = {id = -1 '\377', static NO_SHARD = <same as static member of an already seen type>}}}, 
        [2] = {static NO_OSD = 2147483647, osd = 7, shard = {id = 0 '\000', static NO_SHARD = {id = -1 '\377', static NO_SHARD = <same as static member of an already seen type>}}}}] = 827, [std::set with 2 elements = {[0] = {static NO_OSD = 2147483647, osd = 2, shard = {id = 1 '\001', static NO_SHARD = {
              id = -1 '\377', static NO_SHARD = <same as static member of an already seen type>}}}, [1] = {static NO_OSD = 2147483647, osd = 7, shard = {id = 0 '\000', static NO_SHARD = {id = -1 '\377', static NO_SHARD = <same as static member of an already seen type>}}}}] = 426}, mapping_epoch = 685, 
    blocked_by = std::vector of length 0, capacity 0, purged_snaps = {_size = 0, m = std::map with 0 elements}, last_became_active = {tv = {tv_sec = 1677735237, tv_nsec = 337804713}}, last_became_peered = {tv = {tv_sec = 1677735237, tv_nsec = 337804713}}, up_primary = 7, acting_primary = 3, 
    snaptrimq_len = 0, objects_trimmed = 0, snaptrim_duration = 0, scrub_sched_status = {m_scheduled_at = {tv = {tv_sec = 1677735301, tv_nsec = 828622384}}, m_duration_seconds = 0, m_sched_status = pg_scrub_sched_status_t::scheduled, m_is_active = false, m_is_deep = scrub_level_t::shallow, 
      m_is_periodic = true}, stats_invalid = true, dirty_stats_invalid = false, omap_stats_invalid = false, hitset_stats_invalid = false, hitset_bytes_stats_invalid = false, pin_stats_invalid = false, manifest_stats_invalid = false}, history = {epoch_created = 683, epoch_pool_created = 598, 
    last_epoch_started = 665, last_interval_started = 664, last_epoch_clean = 665, last_interval_clean = 664, last_epoch_split = 683, last_epoch_marked_full = 0, same_up_since = 612, same_interval_since = 685, same_primary_since = 685, last_scrub = {version = 1796, epoch = 656, __pad = 0}, last_deep_scrub = {
      version = 0, epoch = 0, __pad = 0}, last_scrub_stamp = {tv = {tv_sec = 1677735336, tv_nsec = 306800488}}, last_deep_scrub_stamp = {tv = {tv_sec = 1677735223, tv_nsec = 1701357}}, last_clean_scrub_stamp = {tv = {tv_sec = 1677735336, tv_nsec = 306800488}}, prior_readable_until_ub = {__r = 0}}, hit_set = {
    current_last_update = {version = 0, epoch = 0, __pad = 0}, history = empty std::__cxx11::list}}

Actions #12

Updated by Nitzan Mordechai about 3 years ago

Since this is an EC pool, the NO_SHARD is confusing; we are not maintaining rollback_info_trimmed_to on replicas. Looking into why we have NO_SHARD.
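
For anyone puzzling over the dump: gdb prints static class members inline with every value, so "static NO_SHARD = ..." appears next to every shard_id_t even when a real shard id is set. A rough sketch of the identifiers involved (reduced and simplified from Ceph's shard_id_t / pg_shard_t, not the real definitions):

  #include <cstdint>

  struct shard_id_t {
    int8_t id;                              // 0..k+m-1 for EC shards
    static constexpr int8_t NO_SHARD = -1;  // prints as '\377' in gdb
  };

  struct pg_shard_t {
    int32_t osd;       // 2147483647 (NO_OSD) when unset
    shard_id_t shard;  // NO_SHARD for replicated pools
  };

  // Replicated pools address peers as plain osd.N (shard == NO_SHARD), while
  // EC pools address them per shard, e.g. osd.2(1) in the dump above.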

Actions #13

Updated by Radoslaw Zarzynski about 3 years ago

  • Status changed from New to In Progress
Actions #14

Updated by Laura Flores over 2 years ago

  • Related to Bug #60084: crash: void std::list<pg_log_entry_t, mempool::pool_allocator<(mempool::pool_index_t), pg_log_entry_t> >::_M_insert<pg_log_entry_t const&>(std::_List_iterator<pg_log_entry_t>, pg_log_entry_t const&) added
Actions #15

Updated by Laura Flores over 2 years ago

/a/yuriw-2023-10-06_20:29:18-rados-wip-yuri6-testing-2023-10-06-0904-quincy-distro-default-smithi/7415444

2023-10-07T01:48:49.228 INFO:tasks.ceph.osd.2.smithi088.stderr:2023-10-07T01:48:49.220+0000 7fe693607700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6-1287-gfedcea84/rpm/el8/BUILD/ceph-17.2.6-1287-gfedcea84/src/osd/PGLog.h: In function 'void PGLog::IndexedLog::claim_log_and_clear_rollback_info(const pg_log_t&)' thread 7fe693607700 time 2023-10-07T01:48:49.217145+0000
2023-10-07T01:48:49.228 INFO:tasks.ceph.osd.2.smithi088.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6-1287-gfedcea84/rpm/el8/BUILD/ceph-17.2.6-1287-gfedcea84/src/osd/PGLog.h: 286: FAILED ceph_assert(rollback_info_trimmed_to == head)
2023-10-07T01:48:49.228 INFO:tasks.ceph.osd.2.smithi088.stderr:
2023-10-07T01:48:49.228 INFO:tasks.ceph.osd.2.smithi088.stderr: ceph version 17.2.6-1287-gfedcea84 (fedcea84a4bd31f0708715b39e04a135187af2ea) quincy (stable)
2023-10-07T01:48:49.228 INFO:tasks.ceph.osd.2.smithi088.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x562d684d4f25]
2023-10-07T01:48:49.228 INFO:tasks.ceph.osd.2.smithi088.stderr: 2: ceph-osd(+0x59a0eb) [0x562d684d50eb]
2023-10-07T01:48:49.229 INFO:tasks.ceph.osd.2.smithi088.stderr: 3: (PGLog::IndexedLog::claim_log_and_clear_rollback_info(pg_log_t const&)+0x36) [0x562d68964266]
2023-10-07T01:48:49.229 INFO:tasks.ceph.osd.2.smithi088.stderr: 4: (PGLog::reset_backfill_claim_log(pg_log_t const&, PGLog::LogEntryHandler*)+0x105) [0x562d68964965]
2023-10-07T01:48:49.229 INFO:tasks.ceph.osd.2.smithi088.stderr: 5: (PeeringState::Stray::react(MLogRec const&)+0x2aa) [0x562d6893a84a]
2023-10-07T01:48:49.229 INFO:tasks.ceph.osd.2.smithi088.stderr: 6: (boost::statechart::simple_state<PeeringState::Stray, PeeringState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x109) [0x562d68969e19]
2023-10-07T01:48:49.229 INFO:tasks.ceph.osd.2.smithi088.stderr: 7: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x73) [0x562d686b9e63]
2023-10-07T01:48:49.229 INFO:tasks.ceph.osd.2.smithi088.stderr: 8: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x129) [0x562d6869ca69]
2023-10-07T01:48:49.229 INFO:tasks.ceph.osd.2.smithi088.stderr: 9: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x2e5) [0x562d685f43e5]
2023-10-07T01:48:49.229 INFO:tasks.ceph.osd.2.smithi088.stderr: 10: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x59) [0x562d688e0869]
2023-10-07T01:48:49.229 INFO:tasks.ceph.osd.2.smithi088.stderr: 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x112f) [0x562d68612d1f]
2023-10-07T01:48:49.229 INFO:tasks.ceph.osd.2.smithi088.stderr: 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x435) [0x562d68d56305]
2023-10-07T01:48:49.229 INFO:tasks.ceph.osd.2.smithi088.stderr: 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x562d68d58a24]
2023-10-07T01:48:49.230 INFO:tasks.ceph.osd.2.smithi088.stderr: 14: /lib64/libpthread.so.0(+0x814a) [0x7fe6bd62914a]
2023-10-07T01:48:49.230 INFO:tasks.ceph.osd.2.smithi088.stderr: 15: clone()

Actions #16

Updated by Laura Flores about 2 years ago

/a/yuriw-2024-01-23_19:22:22-rados-wip-yuri5-testing-2024-01-11-1300-pacific-distro-default-smithi/7529622

Actions #17

Updated by Laura Flores about 2 years ago

/a/yuriw-2024-03-04_20:52:58-rados-reef-release-distro-default-smithi/7581575

Actions #18

Updated by Laura Flores about 2 years ago

  • Backport changed from pacific,quincy to quincy,reef
Actions #19

Updated by Radoslaw Zarzynski about 2 years ago

  • Backport changed from quincy,reef to quincy,reef,squid
Actions #20

Updated by Nitzan Mordechai over 1 year ago

/a/skanta-2024-10-16_06:36:01-rados-wip-bharath14-testing-2024-10-16-0618-squid-distro-default-smithi/7949965

Actions #21

Updated by Laura Flores over 1 year ago

Nitzan Mordechai wrote in #note-12:

Since this is an EC pool, the NO_SHARD is confusing; we are not maintaining rollback_info_trimmed_to on replicas. Looking into why we have NO_SHARD.

@Nitzan Mordechai have you been able to look at this one any further? It could be a good candidate for the RADOS call.

Actions #22

Updated by Konstantin Shalygin about 1 year ago

  • Backport changed from quincy,reef,squid to reef,squid
Actions #23

Updated by Bill Scales 9 months ago

I've debugged a version of this assert for optimized EC pools. There are some differences there, so I'm not sure the bug I'm fixing is a duplicate of this one, but it is worthwhile adding details about what I found, as it might help debug the non-optimized EC pools problem the next time it occurs:

This is the assert that was hit:

2025-06-19T09:24:16.353+0000 7fc3e032b640 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/20.3.0-976-gbb0e624d/rpm/el9/BUILD/ceph-20.3.0-976-gbb0e624d/src/osd/PGLog.h: In function 'void PGLog::IndexedLog::claim_log_and_clear_rollback_info(const pg_log_t&)' thread 7fc3e032b640 time 2025-06-19T09:24:16.352081+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/20.3.0-976-gbb0e624d/rpm/el9/BUILD/ceph-20.3.0-976-gbb0e624d/src/osd/PGLog.h: 323: FAILED ceph_assert(rollback_info_trimmed_to == head)

Looking backwards for the last log entry from the same thread 7fc3e032b640:

2025-06-19T09:24:16.351+0000 7fc3e032b640 10 osd.1 pg_epoch: 276 pg[3.1cs0( v 216'1024 (0'0,216'1024] lb MIN local-lis/les=123/124 n=36 ec=221/32 lis/c=255/255 les/c/f=256/256/0 sis=275 pruub=9.883993149s) [1,8,6]/[10,8,6]p10(0) r=-1 lpr=275 pi=[123,275)/6 crt=220'1032 lcod 0'0 remapped NOTIFY pruub 12.712587357s@ mbc={} ps=[32~2,37~1,40~1,44~1,49~1,4c~1,52~1,59~1,5b~1]] state<Started/Stray>: got info+log from osd.10(0) 3.1cs0( v 232'1056 (0'0,232'1056] lb MIN local-lis/les=275/276 n=0 ec=221/32 lis/c=255/255 les/c/f=256/256/0 sis=275) log((0'0,232'1056], crt=232'1056)

From this log entry (versions print as epoch'version) you can see that the local log head is 216'1024 while crt is ahead of it: the local pg state shows crt=220'1032, and the incoming log from osd.10 has crt=232'1056. crt being ahead of the log head means that we will hit the assert.

How did crt get ahead of the log head? Looking further back in the log I found that several things had happened to cause this:

1. The PG had split very recently. This created a log where the most recent log entry was earlier than the head (when a PG splits, the log entries are divided between the resulting PGs, but both PGs keep the same head, tail and crt).
2. There was an incomplete (divergent) write that got rolled back. In my case it was a partial write (new for optimized EC pools), but I think it could equally have been a log entry that belonged to the other PG. This caused both the head and crt to roll backwards (see "rewind from head").
3. The OSD then got restarted, so we re-read the info and the log from disk.

In my case I found that the update to info at step 2 had been checkpointed to disk, but the update to crt had not. This turned out to be because PGLog::rewind_divergent_log does not mark the log as dirty if there are no modifications to log entries. We are in a corner case where we have updated crt, but because the divergence is for entries not in the log, there are no divergent log entries and so PGLog::rewind_divergent_log does not set dirty_from. This means it has made a change to the log (updated crt) but has not scheduled a checkpoint; a simplified sketch of this flow follows below.

In the rare case where the OSD then restarts, this missing checkpoint causes problems.
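
To make that corner case concrete, here is a hedged sketch of the control flow described above. The structure and names (LogSketch, pop_divergent_entries, checkpoint_scheduled) are invented stand-ins, not the real PGLog::rewind_divergent_log:

  #include <compare>
  #include <cstdint>
  #include <list>

  struct eversion_t {
    uint64_t epoch = 0, version = 0;
    auto operator<=>(const eversion_t&) const = default;  // epoch, then version
  };

  struct LogSketch {
    eversion_t head;
    eversion_t can_rollback_to;         // crt
    std::list<eversion_t> entries;      // newest at the back
    bool checkpoint_scheduled = false;  // stand-in for dirty_from/dirty_to
  };

  // Drop divergent entries newer than newhead; report whether any existed.
  static bool pop_divergent_entries(LogSketch& log, eversion_t newhead) {
    bool dropped = false;
    while (!log.entries.empty() && newhead < log.entries.back()) {
      log.entries.pop_back();
      dropped = true;
    }
    return dropped;
  }

  void rewind_divergent_log_sketch(LogSketch& log, eversion_t newhead) {
    log.head = newhead;                 // persisted separately via pg_info_t
    if (newhead < log.can_rollback_to)
      log.can_rollback_to = newhead;    // crt rolled back in memory...
    if (pop_divergent_entries(log, newhead))
      log.checkpoint_scheduled = true;  // ...but a checkpoint only happens here
    // Corner case from the analysis above: after a PG split the divergent
    // entries may live in the sibling PG, so nothing is popped, no checkpoint
    // is scheduled, and the crt change is lost on an OSD restart. The head
    // update survives (it was persisted with the info), so the re-read log
    // ends up with crt ahead of head, and the next
    // claim_log_and_clear_rollback_info trips
    // ceph_assert(rollback_info_trimmed_to == head).
  }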

Actions #24

Updated by Radoslaw Zarzynski 9 months ago

Hi @Nitzan Mordechai! Could you please take a look at Bill's detailed analysis?

Actions #25

Updated by Bill Scales 9 months ago

See https://github.com/ceph/ceph-ci/commit/f8da0b7e1fec4db60ec27e45515e82f24d4e6abd, which is the fix for optimized EC pools. I think it's likely that it will fix the non-optimized EC pools case as well. The fix isn't yet in a pull request.

Actions #26

Updated by Nitzan Mordechai 9 months ago

@Bill Scales, what am I missing in https://github.com/ceph/ceph-ci/commit/f8da0b7e1fec4db60ec27e45515e82f24d4e6abd? It doesn't look like we are touching rollback_info_trimmed_to, only checking for dirty logs and issuing a message.

Actions #27

Updated by Radoslaw Zarzynski 9 months ago

Per an off-the-record conversation with @Alex Ainscow, there is a fix for this problem (debugged by @Bill Scales). It will be part of the 2nd batch of the FastEC fixups.
