Bug #55141


thrashers/fastread: assertion failure: rollback_info_trimmed_to == head

Added by Radoslaw Zarzynski almost 4 years ago. Updated 9 months ago.

Status: In Progress
Priority: Normal
% Done: 0%
Backport: reef,squid
Regression: No
Severity: 3 - minor

Description

From the /home/teuthworker/archive/yuriw-2022-03-29_21:35:32-rados-wip-yuri5-testing-2022-03-29-1152-quincy-distro-default-smithi/6767850/teuthology.log:

2022-03-30T01:34:59.499 INFO:tasks.ceph.osd.2.smithi080.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.1.0-125-g9053ed98/rpm/el8/BUILD/ceph-17.1.0-125-g9053ed98/src/osd/PGLog.h: In function 'void PGLog::IndexedLog::claim_log_and_clear_rollback_info(const pg_log_t&)' thread 7fa658b53700 time 2022-03-30T01:34:59.459273+0000
2022-03-30T01:34:59.499 INFO:tasks.ceph.osd.2.smithi080.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.1.0-125-g9053ed98/rpm/el8/BUILD/ceph-17.1.0-125-g9053ed98/src/osd/PGLog.h: 286: FAILED ceph_assert(rollback_info_trimmed_to == head)
2022-03-30T01:34:59.499 INFO:tasks.ceph.osd.2.smithi080.stderr: ceph version 17.1.0-125-g9053ed98 (9053ed984698b7140d91d3195fcba61aa554fe69) quincy (stable)
2022-03-30T01:34:59.499 INFO:tasks.ceph.osd.2.smithi080.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x55d6569c2464]
2022-03-30T01:34:59.500 INFO:tasks.ceph.osd.2.smithi080.stderr: 2: ceph-osd(+0x5d7685) [0x55d6569c2685]
2022-03-30T01:34:59.500 INFO:tasks.ceph.osd.2.smithi080.stderr: 3: (PeeringState::Stray::react(MLogRec const&)+0x3d0) [0x55d656de5390]
2022-03-30T01:34:59.500 INFO:tasks.ceph.osd.2.smithi080.stderr: 4: (boost::statechart::simple_state<PeeringState::Stray, PeeringState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x280) [0x55d656e1a6c0]
2022-03-30T01:34:59.500 INFO:tasks.ceph.osd.2.smithi080.stderr: 5: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x74) [0x55d656b90c54]
2022-03-30T01:34:59.500 INFO:tasks.ceph.osd.2.smithi080.stderr: 6: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d6) [0x55d656b84ea6]
2022-03-30T01:34:59.501 INFO:tasks.ceph.osd.2.smithi080.stderr: 7: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x175) [0x55d656afa3c5]
2022-03-30T01:34:59.501 INFO:tasks.ceph.osd.2.smithi080.stderr: 8: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x55d656d915b6]
2022-03-30T01:34:59.501 INFO:tasks.ceph.osd.2.smithi080.stderr: 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xaf8) [0x55d656aec0e8]
2022-03-30T01:34:59.501 INFO:tasks.ceph.osd.2.smithi080.stderr: 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55d6571f1a64]
2022-03-30T01:34:59.502 INFO:tasks.ceph.osd.2.smithi080.stderr: 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55d6571f2e04]
2022-03-30T01:34:59.502 INFO:tasks.ceph.osd.2.smithi080.stderr: 12: /lib64/libpthread.so.0(+0x817f) [0x7fa684d7017f]
2022-03-30T01:34:59.502 INFO:tasks.ceph.osd.2.smithi080.stderr: 13: clone()
2022-03-30T01:34:59.502 INFO:tasks.ceph.osd.2.smithi080.stderr:*** Caught signal (Aborted) **
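
For context, the failing check lives in PGLog::IndexedLog::claim_log_and_clear_rollback_info(), which runs when a stray OSD discards its own log and adopts the authoritative one sent by the primary (see PeeringState::Stray::react(MLogRec) in the backtrace). A minimal sketch of the invariant being asserted, with heavily simplified stand-in types (illustrative only, not the actual Ceph source):

  #include <cassert>
  #include <cstdint>

  // Simplified stand-ins for Ceph's eversion_t / pg_log_t; only the shape of
  // the check at src/osd/PGLog.h:286 is taken from the report above.
  struct eversion_t {
    uint64_t epoch = 0, version = 0;
    bool operator==(const eversion_t&) const = default;
  };

  struct pg_log_t {
    eversion_t head;             // newest entry in the log
    eversion_t can_rollback_to;  // crt: newest entry we can still roll back past
    // ... log entries and rollback metadata elided ...
  };

  struct IndexedLogSketch : pg_log_t {
    // How far rollback metadata has been trimmed; the invariant is that
    // trimming has caught up with head before the old log is discarded.
    eversion_t rollback_info_trimmed_to;

    void claim_log_and_clear_rollback_info(const pg_log_t& other) {
      assert(rollback_info_trimmed_to == head);  // the failing ceph_assert
      static_cast<pg_log_t&>(*this) = other;     // adopt the authoritative log
    }
  };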

Related issues: 2 (1 open, 1 closed)

Related to RADOS - Bug #60084: crash: void std::list<pg_log_entry_t, mempool::pool_allocator<(mempool::pool_index_t), pg_log_entry_t> >::_M_insert<pg_log_entry_t const&>(std::_List_iterator<pg_log_entry_t>, pg_log_entry_t const&) (New)

Has duplicate RADOS - Bug #57913: Thrashosd: timeout 120 ceph --cluster ceph osd pool rm unique_pool_2 unique_pool_2 --yes-i-really-really-mean-it (Duplicate, Nitzan Mordechai)

Actions #1

Updated by Laura Flores almost 4 years ago

Leading up to the ceph_assert failure in osd.2:

/a/yuriw-2022-03-29_21:35:32-rados-wip-yuri5-testing-2022-03-29-1152-quincy-distro-default-smithi/6767850/remote/smithi080/log/ceph-osd.2.log.gz

2022-03-30T01:34:59.458+0000 7fa65e35e700 10 osd.2 pg_epoch: 844 pg[6.13s1( v 824'2298 lc 771'314 (0'0,824'2298] local-lis/les=840/841 n=1265 ec=782/768 lis/c=809/782 les/c/f=810/783/0 sis=840) [3,2,0]/[NONE,2,0]p2(1) async=[3(0)] r=1 lpr=840 pi=[782,840)/2 crt=824'2298 lcod 0'0 mlcod 0'0 activating+undersized+degraded+remapped m=898 mbc={0={(0+0)=838,(0+1)=350,(0+2)=10,(1+0)=1},1={(0+1)=898,(1+0)=177,(1+1)=124},2={(0+1)=897,(1+0)=177,(1+1)=125}}] search_for_missing 6:c87ee54b:::benchmark_data_smithi080_195491_object1379:head 770'88 is on osd.7(0)
2022-03-30T01:34:59.459+0000 7fa65e35e700 10 osd.2 pg_epoch: 844 pg[6.13s1( v 824'2298 lc 771'314 (0'0,824'2298] local-lis/les=840/841 n=1265 ec=782/768 lis/c=809/782 les/c/f=810/783/0 sis=840) [3,2,0]/[NONE,2,0]p2(1) async=[3(0)] r=1 lpr=840 pi=[782,840)/2 crt=824'2298 lcod 0'0 mlcod 0'0 activating+undersized+degraded+remapped m=898 mbc={0={(0+0)=838,(0+1)=349,(0+2)=11,(1+0)=1},1={(0+1)=898,(1+0)=177,(1+1)=124},2={(0+1)=897,(1+0)=177,(1+1)=125}}] search_for_missing 6:c883b7b1:::benchmark_data_smithi080_195491_object38312:head 804'2230 also missing on osd.7(0) (last_update 783'2113 < needed 804'2230)
2022-03-30T01:34:59.459+0000 7fa65e35e700 10 osd.2 pg_epoch: 844 pg[6.13s1( v 824'2298 lc 771'314 (0'0,824'2298] local-lis/les=840/841 n=1265 ec=782/768 lis/c=809/782 les/c/f=810/783/0 sis=840) [3,2,0]/[NONE,2,0]p2(1) async=[3(0)] r=1 lpr=840 pi=[782,840)/2 crt=824'2298 lcod 0'0 mlcod 0'0 activating+undersized+degraded+remapped m=898 mbc={0={(0+0)=838,(0+1)=349,(0+2)=11,(1+0)=1},1={(0+1)=898,(1+0)=177,(1+1)=124},2={(0+1)=897,(1+0)=177,(1+1)=125}}] search_for_missing 6:c88b487d:::benchmark_data_smithi080_195491_object187:head 770'12 is on osd.7(0)
2022-03-30T01:34:59.462+0000 7fa680793700  1 -- [v2:172.21.15.80:6826/205141,v1:172.21.15.80:6827/205141] <== osd.0 v2:172.21.15.80:6818/33883 46 ==== PGlog(6.0s2 log log((792'2126,828'2271], crt=828'2271) pi ([0,0] all_participants= intervals=) pg_lease(ru 0.000000000s ub 1576.709106445s int 16.000000000s) e845/844) v6 ==== 51420+0+0 (crc 0 0 0) 0x55d65e75a000 con 0x55d65dc4e000
2022-03-30T01:34:59.462+0000 7fa680793700 15 osd.2 845 enqueue_peering_evt 6.0s2 epoch_sent: 845 epoch_requested: 844 MLogRec from 0(0) log log((792'2126,828'2271], crt=828'2271) pi ([0,0] all_participants= intervals=) pg_lease(ru 0.000000000s ub 1576.709106445s int 16.000000000s) +create_info
2022-03-30T01:34:59.462+0000 7fa680793700 20 osd.2 op_wq(0) _enqueue OpSchedulerItem(6.0s2 PGPeeringEvent(epoch_sent: 845 epoch_requested: 844 MLogRec from 0(0) log log((792'2126,828'2271], crt=828'2271) pi ([0,0] all_participants= intervals=) pg_lease(ru 0.000000000s ub 1576.709106445s int 16.000000000s) +create_info) prio 255 cost 10 e845)
2022-03-30T01:34:59.463+0000 7fa658b53700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.1.0-125-g9053ed98/rpm/el8/BUILD/ceph-17.1.0-125-g9053ed98/src/osd/PGLog.h: In function 'void PGLog::IndexedLog::claim_log_and_clear_rollback_info(const pg_log_t&)' thread 7fa658b53700 time 2022-03-30T01:34:59.459273+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.1.0-125-g9053ed98/rpm/el8/BUILD/ceph-17.1.0-125-g9053ed98/src/osd/PGLog.h: 286: FAILED ceph_assert(rollback_info_trimmed_to == head)

Actions #2

Updated by Neha Ojha almost 4 years ago

  • Assignee set to Nitzan Mordechai
Actions #3

Updated by Sridhar Seshasayee almost 4 years ago

Observed this in a pacific run:
/a/yuriw-2022-06-15_18:29:33-rados-wip-yuri4-testing-2022-06-15-1000-pacific-distro-default-smithi/6881247

Although the job was marked dead, the crash information is available from smithi093 logs.

Test Description:
rados/thrash-erasure-code-overwrites/{bluestore-bitmap ceph clusters/{fixed-2 openstack} fast/normal mon_election/connectivity msgr-failures/osd-dispatch-delay rados recovery-overrides/{more-async-partial-recovery} supported-random-distro$/{centos_8} thrashers/pggrow thrashosds-health workloads/ec-small-objects-overwrites}

Backtrace:


  -424> 2022-06-15T20:31:19.951+0000 7f3fddfd2700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.9-239-g224fc22e/rpm/el8/BUILD/ceph-16.2.9-239-g224fc22e/src/osd/PGLog.h: In function 'void PGLog::IndexedLog::claim_log_and_clear_rollback_info(const pg_log_t&)' thread 7f3fddfd2700 time 2022-06-15T20:31:19.949134+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.9-239-g224fc22e/rpm/el8/BUILD/ceph-16.2.9-239-g224fc22e/src/osd/PGLog.h: 286: FAILED ceph_assert(rollback_info_trimmed_to == head)

 ceph version 16.2.9-239-g224fc22e (224fc22e07cebeecc3e08055cfd6105b1a30f173) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x557bf1bb69f4]
 2: ceph-osd(+0x580c0e) [0x557bf1bb6c0e]
 3: (PeeringState::Stray::react(MLogRec const&)+0x230) [0x557bf1f6d320]
 4: (boost::statechart::simple_state<PeeringState::Stray, PeeringState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xd5) [0x557bf1f99855]
 5: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x5b) [0x557bf1d80dab]
 6: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d1) [0x557bf1d758f1]
 7: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x29c) [0x557bf1cebedc]
 8: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x557bf1f1f1f6]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x557bf1cddc88]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x557bf235b8b4]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x557bf235e794]
 12: /lib64/libpthread.so.0(+0x81ca) [0x7f4005e491ca]
 13: clone()
Actions #4

Updated by Radoslaw Zarzynski over 3 years ago

  • Backport changed from quincy to pacific,quincy
Actions #5

Updated by Radoslaw Zarzynski over 3 years ago

  • Priority changed from High to Normal

Lowering the priority as we haven't seen a recurrence recently.

Actions #6

Updated by Radoslaw Zarzynski over 3 years ago

  • Has duplicate Bug #57913: Thrashosd: timeout 120 ceph --cluster ceph osd pool rm unique_pool_2 unique_pool_2 --yes-i-really-really-mean-it added
Actions #7

Updated by Radoslaw Zarzynski over 3 years ago

Well, just found a new occurrence.

Actions #8

Updated by Nitzan Mordechai over 3 years ago

Radoslaw Zarzynski wrote:

Well, just found a new occurrence.

Where can I find it?

Actions #9

Updated by Radoslaw Zarzynski over 3 years ago

Nitzan Mordechai wrote:

Radoslaw Zarzynski wrote:

Well, just found a new occurrence.

Where can I find it?

https://tracker.ceph.com/issues/57913#note-2

Actions #10

Updated by Kamoltat (Junior) Sirivadhna about 3 years ago

/a/yuriw-2023-03-02_00:09:05-rados-wip-yuri11-testing-2023-03-01-1424-distro-default-smithi/7191380

Actions #11

Updated by Nitzan Mordechai about 3 years ago

I'm probably missing something here, but I'll try to summarize my findings.

/a/yuriw-2023-03-02_00:09:05-rados-wip-yuri11-testing-2023-03-01-1424-distro-default-smithi/7191380

osd.2 hit the assert rollback_info_trimmed_to == head.
From the coredump:
head = {version = 1834, epoch = 682, ...}
rollback_info_trimmed_to = {version = 0, epoch = 0, ...}

but the pg info (note pgid.shard = 1):

(gdb) 
$10 = {pgid = {pgid = {m_pool = 5, m_seed = 36, static calc_name_buf_size = 36 '$'}, shard = {id = 1 '\001', static NO_SHARD = {id = -1 '\377', static NO_SHARD = <same as static member of an already seen type>}}, static calc_name_buf_size = 40 '('}, last_update = {version = 1834, epoch = 682, __pad = 0}, 
  last_complete = {version = 1834, epoch = 682, __pad = 0}, last_epoch_started = 686, last_interval_started = 685, last_user_version = 1253, log_tail = {version = 0, epoch = 0, __pad = 0}, last_backfill = {static POOL_META = -1, static POOL_TEMP_START = -2, oid = {name = ""}, snap = {val = 0}, hash = 0, 
    max = false, nibblewise_key_cache = 0, hash_reverse_bits = 0, pool = -9223372036854775808, nspace = "", key = ""}, purged_snaps = {_size = 0, m = std::map with 0 elements}, stats = {version = {version = 1253, epoch = 611, __pad = 0}, reported_seq = 1288, reported_epoch = 611, state = 12846082, 
    last_fresh = {tv = {tv_sec = 1677735243, tv_nsec = 167430097}}, last_change = {tv = {tv_sec = 1677735237, tv_nsec = 341281420}}, last_active = {tv = {tv_sec = 1677735243, tv_nsec = 167430097}}, last_peered = {tv = {tv_sec = 1677735243, tv_nsec = 167430097}}, last_clean = {tv = {tv_sec = 1677735235, 
        tv_nsec = 101799085}}, last_unstale = {tv = {tv_sec = 1677735243, tv_nsec = 167430097}}, last_undegraded = {tv = {tv_sec = 1677735237, tv_nsec = 130030467}}, last_fullsized = {tv = {tv_sec = 1677735237, tv_nsec = 124288530}}, log_start = {version = 0, epoch = 0, __pad = 0}, ondisk_log_start = {
      version = 0, epoch = 0, __pad = 0}, created = 598, last_epoch_clean = 599, parent = {m_pool = 0, m_seed = 0, static calc_name_buf_size = 36 '$'}, parent_split_bits = 6, last_scrub = {version = 0, epoch = 0, __pad = 0}, last_deep_scrub = {version = 0, epoch = 0, __pad = 0}, last_scrub_stamp = {tv = {
        tv_sec = 1677735223, tv_nsec = 1701357}}, last_deep_scrub_stamp = {tv = {tv_sec = 1677735223, tv_nsec = 1701357}}, last_clean_scrub_stamp = {tv = {tv_sec = 1677735223, tv_nsec = 1701357}}, last_scrub_duration = 0, stats = {sum = {num_bytes = 20529152, num_objects = 0, num_object_clones = 0, 
        num_object_copies = 0, num_objects_missing_on_primary = 0, num_objects_degraded = 0, num_objects_unfound = 0, num_rd = 0, num_rd_kb = 0, num_wr = 0, num_wr_kb = 0, num_scrub_errors = 0, num_objects_recovered = 0, num_bytes_recovered = 0, num_keys_recovered = 0, num_shallow_scrub_errors = 0, 
        num_deep_scrub_errors = 0, num_objects_dirty = 0, num_whiteouts = 0, num_objects_omap = 0, num_objects_hit_set_archive = 0, num_objects_misplaced = 0, num_bytes_hit_set_archive = 0, num_flush = 0, num_flush_kb = 0, num_evict = 0, num_evict_kb = 0, num_promote = 0, num_flush_mode_high = 0, 
        num_flush_mode_low = 0, num_evict_mode_some = 0, num_evict_mode_full = 0, num_objects_pinned = 0, num_objects_missing = 0, num_legacy_snapsets = 0, num_large_omap_objects = 0, num_objects_manifest = 0, num_omap_bytes = 0, num_omap_keys = 0, num_objects_repaired = 0}}, log_size = 1253, 
    log_dups_size = 0, ondisk_log_size = 1253, objects_scrubbed = 0, scrub_duration = 0, up = std::vector of length 3, capacity 4 = {7, 2, 5}, acting = std::vector of length 3, capacity 4 = {2147483647, 3, 5}, avail_no_missing = std::vector of length 3, capacity 3 = {{static NO_OSD = 2147483647, osd = 7, 
        shard = {id = 0 '\000', static NO_SHARD = {id = -1 '\377', static NO_SHARD = <same as static member of an already seen type>}}}, {static NO_OSD = 2147483647, osd = 0, shard = {id = 2 '\002', static NO_SHARD = {id = -1 '\377', static NO_SHARD = <same as static member of an already seen type>}}}, {
        static NO_OSD = 2147483647, osd = 2, shard = {id = 1 '\001', static NO_SHARD = {id = -1 '\377', static NO_SHARD = <same as static member of an already seen type>}}}}, object_location_counts = std::map with 2 elements = {[std::set with 3 elements = {[0] = {static NO_OSD = 2147483647, osd = 0, shard = {
            id = 2 '\002', static NO_SHARD = {id = -1 '\377', static NO_SHARD = <same as static member of an already seen type>}}}, [1] = {static NO_OSD = 2147483647, osd = 2, shard = {id = 1 '\001', static NO_SHARD = {id = -1 '\377', static NO_SHARD = <same as static member of an already seen type>}}}, 
        [2] = {static NO_OSD = 2147483647, osd = 7, shard = {id = 0 '\000', static NO_SHARD = {id = -1 '\377', static NO_SHARD = <same as static member of an already seen type>}}}}] = 827, [std::set with 2 elements = {[0] = {static NO_OSD = 2147483647, osd = 2, shard = {id = 1 '\001', static NO_SHARD = {
              id = -1 '\377', static NO_SHARD = <same as static member of an already seen type>}}}, [1] = {static NO_OSD = 2147483647, osd = 7, shard = {id = 0 '\000', static NO_SHARD = {id = -1 '\377', static NO_SHARD = <same as static member of an already seen type>}}}}] = 426}, mapping_epoch = 685, 
    blocked_by = std::vector of length 0, capacity 0, purged_snaps = {_size = 0, m = std::map with 0 elements}, last_became_active = {tv = {tv_sec = 1677735237, tv_nsec = 337804713}}, last_became_peered = {tv = {tv_sec = 1677735237, tv_nsec = 337804713}}, up_primary = 7, acting_primary = 3, 
    snaptrimq_len = 0, objects_trimmed = 0, snaptrim_duration = 0, scrub_sched_status = {m_scheduled_at = {tv = {tv_sec = 1677735301, tv_nsec = 828622384}}, m_duration_seconds = 0, m_sched_status = pg_scrub_sched_status_t::scheduled, m_is_active = false, m_is_deep = scrub_level_t::shallow, 
      m_is_periodic = true}, stats_invalid = true, dirty_stats_invalid = false, omap_stats_invalid = false, hitset_stats_invalid = false, hitset_bytes_stats_invalid = false, pin_stats_invalid = false, manifest_stats_invalid = false}, history = {epoch_created = 683, epoch_pool_created = 598, 
    last_epoch_started = 665, last_interval_started = 664, last_epoch_clean = 665, last_interval_clean = 664, last_epoch_split = 683, last_epoch_marked_full = 0, same_up_since = 612, same_interval_since = 685, same_primary_since = 685, last_scrub = {version = 1796, epoch = 656, __pad = 0}, last_deep_scrub = {
      version = 0, epoch = 0, __pad = 0}, last_scrub_stamp = {tv = {tv_sec = 1677735336, tv_nsec = 306800488}}, last_deep_scrub_stamp = {tv = {tv_sec = 1677735223, tv_nsec = 1701357}}, last_clean_scrub_stamp = {tv = {tv_sec = 1677735336, tv_nsec = 306800488}}, prior_readable_until_ub = {__r = 0}}, hit_set = {
    current_last_update = {version = 0, epoch = 0, __pad = 0}, history = empty std::__cxx11::list}}

Actions #12

Updated by Nitzan Mordechai about 3 years ago

Since this is an EC pool, the NO_SHARD is confusing; we are not maintaining rollback_info_trimmed_to on replicas. Looking into why we have NO_SHARD.
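
For anyone puzzling over the dump: gdb prints static class members inline with every value, so "static NO_SHARD = ..." appears next to every shard_id_t even when a real shard id is set. A rough sketch of the identifiers involved (reduced and simplified from Ceph's shard_id_t / pg_shard_t, not the real definitions):

  #include <cstdint>

  struct shard_id_t {
    int8_t id;                              // 0..k+m-1 for EC shards
    static constexpr int8_t NO_SHARD = -1;  // prints as '\377' in gdb
  };

  struct pg_shard_t {
    int32_t osd;       // 2147483647 (NO_OSD) when unset
    shard_id_t shard;  // NO_SHARD for replicated pools
  };

  // Replicated pools address peers as plain osd.N (shard == NO_SHARD), while
  // EC pools address them per shard, e.g. osd.2(1) in the dump above.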

Actions #13

Updated by Radoslaw Zarzynski about 3 years ago

  • Status changed from New to In Progress
Actions #14

Updated by Laura Flores over 2 years ago

  • Related to Bug #60084: crash: void std::list<pg_log_entry_t, mempool::pool_allocator<(mempool::pool_index_t), pg_log_entry_t> >::_M_insert<pg_log_entry_t const&>(std::_List_iterator<pg_log_entry_t>, pg_log_entry_t const&) added
Actions #15

Updated by Laura Flores over 2 years ago

/a/yuriw-2023-10-06_20:29:18-rados-wip-yuri6-testing-2023-10-06-0904-quincy-distro-default-smithi/7415444

2023-10-07T01:48:49.228 INFO:tasks.ceph.osd.2.smithi088.stderr:2023-10-07T01:48:49.220+0000 7fe693607700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6-1287-gfedcea84/rpm/el8/BUILD/ceph-17.2.6-1287-gfedcea84/src/osd/PGLog.h: In function 'void PGLog::IndexedLog::claim_log_and_clear_rollback_info(const pg_log_t&)' thread 7fe693607700 time 2023-10-07T01:48:49.217145+0000
2023-10-07T01:48:49.228 INFO:tasks.ceph.osd.2.smithi088.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.6-1287-gfedcea84/rpm/el8/BUILD/ceph-17.2.6-1287-gfedcea84/src/osd/PGLog.h: 286: FAILED ceph_assert(rollback_info_trimmed_to == head)
2023-10-07T01:48:49.228 INFO:tasks.ceph.osd.2.smithi088.stderr:
2023-10-07T01:48:49.228 INFO:tasks.ceph.osd.2.smithi088.stderr: ceph version 17.2.6-1287-gfedcea84 (fedcea84a4bd31f0708715b39e04a135187af2ea) quincy (stable)
2023-10-07T01:48:49.228 INFO:tasks.ceph.osd.2.smithi088.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x562d684d4f25]
2023-10-07T01:48:49.228 INFO:tasks.ceph.osd.2.smithi088.stderr: 2: ceph-osd(+0x59a0eb) [0x562d684d50eb]
2023-10-07T01:48:49.229 INFO:tasks.ceph.osd.2.smithi088.stderr: 3: (PGLog::IndexedLog::claim_log_and_clear_rollback_info(pg_log_t const&)+0x36) [0x562d68964266]
2023-10-07T01:48:49.229 INFO:tasks.ceph.osd.2.smithi088.stderr: 4: (PGLog::reset_backfill_claim_log(pg_log_t const&, PGLog::LogEntryHandler*)+0x105) [0x562d68964965]
2023-10-07T01:48:49.229 INFO:tasks.ceph.osd.2.smithi088.stderr: 5: (PeeringState::Stray::react(MLogRec const&)+0x2aa) [0x562d6893a84a]
2023-10-07T01:48:49.229 INFO:tasks.ceph.osd.2.smithi088.stderr: 6: (boost::statechart::simple_state<PeeringState::Stray, PeeringState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x109) [0x562d68969e19]
2023-10-07T01:48:49.229 INFO:tasks.ceph.osd.2.smithi088.stderr: 7: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x73) [0x562d686b9e63]
2023-10-07T01:48:49.229 INFO:tasks.ceph.osd.2.smithi088.stderr: 8: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x129) [0x562d6869ca69]
2023-10-07T01:48:49.229 INFO:tasks.ceph.osd.2.smithi088.stderr: 9: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x2e5) [0x562d685f43e5]
2023-10-07T01:48:49.229 INFO:tasks.ceph.osd.2.smithi088.stderr: 10: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x59) [0x562d688e0869]
2023-10-07T01:48:49.229 INFO:tasks.ceph.osd.2.smithi088.stderr: 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x112f) [0x562d68612d1f]
2023-10-07T01:48:49.229 INFO:tasks.ceph.osd.2.smithi088.stderr: 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x435) [0x562d68d56305]
2023-10-07T01:48:49.229 INFO:tasks.ceph.osd.2.smithi088.stderr: 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x562d68d58a24]
2023-10-07T01:48:49.230 INFO:tasks.ceph.osd.2.smithi088.stderr: 14: /lib64/libpthread.so.0(+0x814a) [0x7fe6bd62914a]
2023-10-07T01:48:49.230 INFO:tasks.ceph.osd.2.smithi088.stderr: 15: clone()

Actions #16

Updated by Laura Flores about 2 years ago

/a/yuriw-2024-01-23_19:22:22-rados-wip-yuri5-testing-2024-01-11-1300-pacific-distro-default-smithi/7529622

Actions #17

Updated by Laura Flores about 2 years ago

/a/yuriw-2024-03-04_20:52:58-rados-reef-release-distro-default-smithi/7581575

Actions #18

Updated by Laura Flores about 2 years ago

  • Backport changed from pacific,quincy to quincy,reef
Actions #19

Updated by Radoslaw Zarzynski about 2 years ago

  • Backport changed from quincy,reef to quincy,reef,squid
Actions #20

Updated by Nitzan Mordechai over 1 year ago

/a/skanta-2024-10-16_06:36:01-rados-wip-bharath14-testing-2024-10-16-0618-squid-distro-default-smithi/7949965

Actions #21

Updated by Laura Flores over 1 year ago

Nitzan Mordechai wrote in #note-12:

Since this is an EC pool, the NO_SHARD is confusing; we are not maintaining rollback_info_trimmed_to on replicas. Looking into why we have NO_SHARD.

@Nitzan Mordechai have you been able to look at this one any further? It could be a good candidate for the RADOS call.

Actions #22

Updated by Konstantin Shalygin about 1 year ago

  • Backport changed from quincy,reef,squid to reef,squid
Actions #23

Updated by Bill Scales 9 months ago

I've debugged a version of this assert for optimized EC pools. There are some differences there, so I'm not sure the bug I'm fixing is a duplicate of this one, but it is worthwhile adding details about what I found, as it might help debug the non-optimized EC pools problem the next time it occurs:

This is the assert that was hit:

2025-06-19T09:24:16.353+0000 7fc3e032b640 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/20.3.0-976-gbb0e624d/rpm/el9/BUILD/ceph-20.3.0-976-gbb0e624d/src/osd/PGLog.h: In function 'void PGLog::IndexedLog::claim_log_and_clear_rollback_info(const pg_log_t&)' thread 7fc3e032b640 time 2025-06-19T09:24:16.352081+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/20.3.0-976-gbb0e624d/rpm/el9/BUILD/ceph-20.3.0-976-gbb0e624d/src/osd/PGLog.h: 323: FAILED ceph_assert(rollback_info_trimmed_to == head)

Looking backwards for the last log entry from the same thread 7fc3e032b640:

2025-06-19T09:24:16.351+0000 7fc3e032b640 10 osd.1 pg_epoch: 276 pg[3.1cs0( v 216'1024 (0'0,216'1024] lb MIN local-lis/les=123/124 n=36 ec=221/32 lis/c=255/255 les/c/f=256/256/0 sis=275 pruub=9.883993149s) [1,8,6]/[10,8,6]p10(0) r=-1 lpr=275 pi=[123,275)/6 crt=220'1032 lcod 0'0 remapped NOTIFY pruub 12.712587357s@ mbc={} ps=[32~2,37~1,40~1,44~1,49~1,4c~1,52~1,59~1,5b~1]] state<Started/Stray>: got info+log from osd.10(0) 3.1cs0( v 232'1056 (0'0,232'1056] lb MIN local-lis/les=275/276 n=0 ec=221/32 lis/c=255/255 les/c/f=256/256/0 sis=275) log((0'0,232'1056], crt=232'1056)

From this log entry (versions print as epoch'version) you can see that the local log head is 216'1024 while crt is ahead of it: the local pg state shows crt=220'1032, and the incoming log from osd.10 has crt=232'1056. crt being ahead of the log head means that we will hit the assert.

How did crt get ahead of the log head? Looking further back in the log I found that several things had happened to cause this:

1. The PG had split very recently. This created a log where the most recent log entry was earlier than the head (when a PG splits, the log entries are divided between the resulting PGs, but both PGs keep the same head, tail and crt).
2. There was an incomplete (divergent) write that got rolled back. In my case it was a partial write (new for optimized EC pools), but I think it could equally have been a log entry that belonged to the other PG. This caused both the head and crt to roll backwards (see "rewind from head").
3. The OSD then got restarted, so we re-read the info and the log from disk.

In my case I found that the update to info at step 2 had been checkpointed to disk, but the update to crt had not. This turned out to be because PGLog::rewind_divergent_log does not mark the log as dirty if there are no modifications to log entries. We are in a corner case where we have updated crt, but because the divergence is for entries not in the log, there are no divergent log entries and so PGLog::rewind_divergent_log does not set dirty_from. This means it has made a change to the log (updated crt) but has not scheduled a checkpoint; a simplified sketch of this flow follows below.

In the rare case where the OSD then restarts, this missing checkpoint causes problems.
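
To make that corner case concrete, here is a hedged sketch of the control flow described above. The structure and names (LogSketch, pop_divergent_entries, checkpoint_scheduled) are invented stand-ins, not the real PGLog::rewind_divergent_log:

  #include <compare>
  #include <cstdint>
  #include <list>

  struct eversion_t {
    uint64_t epoch = 0, version = 0;
    auto operator<=>(const eversion_t&) const = default;  // epoch, then version
  };

  struct LogSketch {
    eversion_t head;
    eversion_t can_rollback_to;         // crt
    std::list<eversion_t> entries;      // newest at the back
    bool checkpoint_scheduled = false;  // stand-in for dirty_from/dirty_to
  };

  // Drop divergent entries newer than newhead; report whether any existed.
  static bool pop_divergent_entries(LogSketch& log, eversion_t newhead) {
    bool dropped = false;
    while (!log.entries.empty() && newhead < log.entries.back()) {
      log.entries.pop_back();
      dropped = true;
    }
    return dropped;
  }

  void rewind_divergent_log_sketch(LogSketch& log, eversion_t newhead) {
    log.head = newhead;                 // persisted separately via pg_info_t
    if (newhead < log.can_rollback_to)
      log.can_rollback_to = newhead;    // crt rolled back in memory...
    if (pop_divergent_entries(log, newhead))
      log.checkpoint_scheduled = true;  // ...but a checkpoint only happens here
    // Corner case from the analysis above: after a PG split the divergent
    // entries may live in the sibling PG, so nothing is popped, no checkpoint
    // is scheduled, and the crt change is lost on an OSD restart. The head
    // update survives (it was persisted with the info), so the re-read log
    // ends up with crt ahead of head, and the next
    // claim_log_and_clear_rollback_info trips
    // ceph_assert(rollback_info_trimmed_to == head).
  }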

Actions #24

Updated by Radoslaw Zarzynski 9 months ago

Hi @Nitzan Mordechai! Could you please take a look at Bill's detailed analysis?

Actions #25

Updated by Bill Scales 9 months ago

See https://github.com/ceph/ceph-ci/commit/f8da0b7e1fec4db60ec27e45515e82f24d4e6abd, which is the fix for optimized EC pools. I think it's likely that it will fix the non-optimized EC pools case as well. The fix isn't yet in a pull request.

Actions #26

Updated by Nitzan Mordechai 9 months ago

@Bill Scales, what am I missing in https://github.com/ceph/ceph-ci/commit/f8da0b7e1fec4db60ec27e45515e82f24d4e6abd? It doesn't look like we are touching rollback_info_trimmed_to, only checking for dirty logs and issuing a message.

Actions #27

Updated by Radoslaw Zarzynski 9 months ago

Per an off-the-record conversation with @Alex Ainscow, there is a fix for this problem (debugged by @Bill Scales). It will be part of the 2nd batch of the FastEC fixups.
