Bug #45702
closedPGLog::read_log_and_missing: ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t()))
100%
Description
/a/yuriw-2020-05-22_19:55:53-rados-wip-yuri-master_5.22.20-distro-basic-smithi/5083350
2020-05-23T00:02:13.848 INFO:tasks.ceph.osd.6.smithi096.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-1826-gc3321b7686f/rpm/el8/BUILD/ceph-16.0.0-1826-gc3321b7686f/src/osd/PGLog.h: In function 'static void PGLog::read_log_and_missing(ObjectStore*, ObjectStore::CollectionHandle&, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&, std::ostringstream&, bool, bool*, const DoutPrefixProvider*, std::set<std::__cxx11::basic_string<char> >*, bool) [with missing_type = pg_missing_set<true>; ObjectStore::CollectionHandle = boost::intrusive_ptr<ObjectStore::CollectionImpl>; std::ostringstream = std::__cxx11::basic_ostringstream<char>]' thread 7fd298b2ff00 time 2020-05-23T00:02:13.846636+0000 2020-05-23T00:02:13.848 INFO:tasks.ceph.osd.6.smithi096.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-1826-gc3321b7686f/rpm/el8/BUILD/ceph-16.0.0-1826-gc3321b7686f/src/osd/PGLog.h: 1481: FAILED ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t())) 2020-05-23T00:02:13.850 INFO:tasks.ceph.osd.6.smithi096.stderr: ceph version 16.0.0-1826-gc3321b7686f (c3321b7686f181e1bcb805a1fb24baced390ae4c) pacific (dev) 2020-05-23T00:02:13.850 INFO:tasks.ceph.osd.6.smithi096.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x556707bc9e20] 2020-05-23T00:02:13.850 INFO:tasks.ceph.osd.6.smithi096.stderr: 2: (()+0x52103a) [0x556707bca03a] 2020-05-23T00:02:13.850 INFO:tasks.ceph.osd.6.smithi096.stderr: 3: (void PGLog::read_log_and_missing<pg_missing_set<true> >(ObjectStore*, boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t, pg_info_t const&, PGLog::IndexedLog&, pg_missing_set<true>&, std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, bool, bool*, DoutPrefixProvider const*, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*, bool)+0x2656) [0x556707d970a6] 2020-05-23T00:02:13.851 INFO:tasks.ceph.osd.6.smithi096.stderr: 4: (PG::read_state(ObjectStore*)+0x1306) [0x556707d7e476] 2020-05-23T00:02:13.851 INFO:tasks.ceph.osd.6.smithi096.stderr: 5: (OSD::load_pgs()+0xa47) [0x556707cc59e7] 2020-05-23T00:02:13.851 INFO:tasks.ceph.osd.6.smithi096.stderr: 6: (OSD::init()+0x270f) [0x556707cf182f] 2020-05-23T00:02:13.851 INFO:tasks.ceph.osd.6.smithi096.stderr: 7: (main()+0x4798) [0x556707c2ef08] 2020-05-23T00:02:13.851 INFO:tasks.ceph.osd.6.smithi096.stderr: 8: (__libc_start_main()+0xf3) [0x7fd29556a873] 2020-05-23T00:02:13.851 INFO:tasks.ceph.osd.6.smithi096.stderr: 9: (_start()+0x2e) [0x556707c6b0ce] 2020-05-23T00:02:13.851 INFO:tasks.ceph.osd.6.smithi096.stderr:*** Caught signal (Aborted) **
Updated by Sage Weil about 5 years ago
- Project changed from Ceph to RADOS
- Category deleted (
OSD)
Updated by Neha Ojha about 5 years ago
- Related to Bug #49136: osd/PGLog.h: FAILED ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t())) added
Updated by Neha Ojha about 5 years ago
https://tracker.ceph.com/issues/49136 may have a different root cause, so marking is it as related and not duplicate for now.
Updated by Neha Ojha about 5 years ago
- Related to Bug #20874: osd/PGLog.h: 1386: FAILED assert(miter == missing.get_items().end() || (miter->second.need == i->versi on && miter->second.have == eversion_t())) added
Updated by Neha Ojha about 5 years ago
- Related to deleted (Bug #49136: osd/PGLog.h: FAILED ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t())))
Updated by Neha Ojha about 5 years ago
- Has duplicate Bug #49136: osd/PGLog.h: FAILED ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t())) added
Updated by Neha Ojha about 5 years ago
- Priority changed from Immediate to Urgent
Updated by Neha Ojha about 5 years ago
- Assignee set to Neha Ojha
- Priority changed from Urgent to High
Updated by Neha Ojha almost 5 years ago
- Backport set to pacific
/a/yuriw-2021-03-25_20:03:40-rados-wip-yuri8-testing-2021-03-25-1042-pacific-distro-basic-smithi/5999051
Updated by Sridhar Seshasayee over 4 years ago
Observed the assert on master:
/a/sseshasa-2021-07-14_10:37:09-rados-wip-sseshasa-testing-2021-07-14-1320-distro-basic-smithi/6270005
Updated by Neha Ojha over 4 years ago
/a/yuriw-2021-08-06_16:31:19-rados-wip-yuri-master-8.6.21-distro-basic-smithi/6324561 - no logs
Updated by Laura Flores about 4 years ago
/a/yuriw-2022-02-09_22:52:18-rados-wip-yuri5-testing-2022-02-09-1322-pacific-distro-default-smithi/6672070
Updated by Laura Flores over 3 years ago
/a/yuriw-2022-06-23_21:29:45-rados-wip-yuri4-testing-2022-06-22-1415-pacific-distro-default-smithi/6895353
Updated by Ronen Friedman over 3 years ago
Is the code under debug_verify_stored_missing maintained?
Updated by Radoslaw Zarzynski over 3 years ago
Even if we don't want to deep dive into right now, we should refactor the assertion:
if (debug_verify_stored_missing) {
auto miter = missing.get_items().find(i->soid);
if (i->is_delete()) {
ceph_assert(miter == missing.get_items().end() ||
(miter->second.need == i->version &&
miter->second.have == eversion_t()));
} else {
ceph_assert(miter != missing.get_items().end());
ceph_assert(miter->second.need == i->version);
ceph_assert(miter->second.have == eversion_t());
}
checked.insert(i->soid);
}
Updated by Radoslaw Zarzynski over 3 years ago
Ronen: @debug_verify_stored_missing a an input parameter with default value. Unfortunately, it doesn't look like a dead one:
void PG::read_state(ObjectStore *store)
{
// ...
recovery_state.init_from_disk_state(
std::move(info_from_disk),
std::move(past_intervals_from_disk),
[this, store] (PGLog &pglog) {
ostringstream oss;
pglog.read_log_and_missing(
store,
ch,
pgmeta_oid,
info,
oss,
cct->_conf->osd_ignore_stale_divergent_priors,
cct->_conf->osd_debug_verify_missing_on_start);
Updated by Radoslaw Zarzynski over 3 years ago
- Assignee changed from Neha Ojha to Nitzan Mordechai
We're poking with read_log_and_missing() pretty recently (the dups issue). Does it ring a bell?
Updated by Nitzan Mordechai over 3 years ago
- Status changed from New to Fix Under Review
Updated by Laura Flores about 2 years ago
/a/lflores-2024-01-10_23:43:40-rados-wip-yuri11-testing-2024-01-10-1124-pacific-distro-default-smithi/7512901
Updated by Nitzan Mordechai over 1 year ago
/a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812729
Updated by Nitzan Mordechai over 1 year ago
from logs in: /a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812729
osd.8 is the one that hit the assert: (./remote/smithi012/log/ceph-osd.8.log.gz)
2024-07-22T16:35:16.906+0000 7f7abf9a5640 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.0.0-5164-gc5a6b481/rpm/el9/BUILD/ceph-19.0.0-5164-gc5a6b481/src/osd/PGLog.h: In function 'static void PGLog: :read_log_and_missing(ceph::common::CephContext*, ObjectStore*, ObjectStore::CollectionHandle&, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&, std::ostringstream&, bool, bool*, const DoutPrefixProvider*, std::set<std::__cxx11::basic_string<char> >*, bool) [with missing_type = pg_missing_set<t rue>; ObjectStore::CollectionHandle = boost::intrusive_ptr<ObjectStore::CollectionImpl>; std::ostringstream = std::__cxx11::basic_ostringstream<char>]' thread 7f7abf9a5640 time 2024-07-22T16:35:16.905750+0000 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.0.0-5164-gc5a6b481/rpm/el9/BUILD/ceph-19.0.0-5164-gc5a6b481/src/osd/PGLog.h: 1581: FAILED ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t())) ceph version 19.0.0-5164-gc5a6b481 (c5a6b4817430650136a4e4cb6fdd11bb824a2b16) squid (dev) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x11e) [0x562894fc183e] 2: ceph-osd(+0x3fd9fa) [0x562894fc19fa] 3: ceph-osd(+0x38fc53) [0x562894f53c53] 4: (OSD::load_pgs()+0x606) [0x562895128ee6] 5: (OSD::init()+0x21b7) [0x56289511e857] 6: main() 7: /lib64/libc.so.6(+0x29590) [0x7f7abfe29590] 8: __libc_start_main() 9: _start()
we are trying to find oid: benchmark_data_smithi135_39125_object19245
osd.8 also prints before the assertion the read_log_and_missing items that we are decoding from the omap:
2024-07-22T16:35:16.869+0000 7f7abf9a5640 20 read_log_and_missing 41'1162 (0'0) modify 3:33146baf:::benchmark_data_smithi135_39125_object19245:head by client.4681.0:19246 2024-07-22T16:31:57.677819+0000 0 ObjectCleanRegions clean_offsets: [(65536, 18446744073709486079)], clean_omap: true, new_object: false 2024-07-22T16:35:16.890+0000 7f7abf9a5640 20 read_log_and_missing 119'6295 (41'1162) delete 3:33146baf:::benchmark_data_smithi135_39125_object19245:head by client.4681.0:122116 2024-07-22T16:33:49.603339+0000 0 ObjectCleanRegions clean_offsets: [(65536, 18446744073709486079)], clean_omap: false, new_object:>
so we have 2 objects with different versions inside the missing, when we are not finding the oid from the getattr, we are finding that oid in the missing, but we are not matching the i->version to the miter that found (first one) version.
Updated by Matan Breizman about 1 year ago
- Has duplicate Bug #69540: src/osd/PGLog.h: 1574: FAILED ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t())) added
Updated by Lee Sanders about 1 year ago · Edited
I've seen this twice under #69540 which has been marked as a duplicate of #45702
/a/ljsanders-2025-03-06_16:33:48-rados-wip-aainscow-for-lee-sanders26-distro-default-smithi/8173891
and
/a/ljsanders-2025-03-15_10:21:04-rados-wip-jamiepryde-for-lee-sanders6-distro-default-smithi/8191486/
But #45702 is marked as Fix Under Review. The PR: https://github.com/ceph/ceph/pull/48088
has been closed by the bot due to inactivity.
How do we get this moving?
Updated by Nitzan Mordechai 12 months ago · Edited
ok, need to refresh my analyze: recreate it again with some more debug messages (/a/nmordech-2025-03-26_10:34:16-rados:thrash-erasure-code-wip-nitzan-pglog-missing-validation-centos9-only-distro-default-smithi/8210737/)
2025-03-26T11:58:29.156+0000 7fa0818188c0 15 read_log_and_missing missing 1233'15635 (1099'7269) delete 7:f22ae0ef:::benchmark_data_smithi157_109865_object117014:head by client.17885.0:251872 2025-03-26T11:58:05.713439+0000 0 ObjectCleanRegions clean_offsets: [65536~184467440737094 86079], clean_omap: false, new_object: false 2025-03-26T11:58:29.156+0000 7fa0818188c0 15 Found missing entry: need=1233'15635 vs expected 1233'15635, have=1099'7269 vs expected 0'0 2025-03-26T11:58:29.170+0000 7fa0818188c0 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/20.0.0-692-g5a05f4a9/rpm/el9/BUILD/ceph-20.0.0-692-g5a05f4a9/src/osd/PGLog.h: In func tion 'static void PGLog::read_log_and_missing(ceph::common::CephContext*, ObjectStore*, ObjectStore::CollectionHandle&, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&, std::ostringstream&, bool, bool*, const DoutPrefixProvider*, std::set<std::__cxx11::basic_string<cha r> >*, bool) [with missing_type = pg_missing_set<true>; ObjectStore::CollectionHandle = boost::intrusive_ptr<ObjectStore::CollectionImpl>; std::ostringstream = std::__cxx11::basic_ostringstream<char>]' thread 7fa0818188c0 time 2025-03-26T11:58:29.169006+0000 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/20.0.0-692-g5a05f4a9/rpm/el9/BUILD/ceph-20.0.0-692-g5a05f4a9/src/osd/PGLog.h: 1582: FAILED ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t()))
we found missing record (delete) that hit:
ceph_assert(miter missing.get_items().end() ||
(miter->second.need i->version &&
miter->second.have == eversion_t()));
- need: 1233'15635 (the delete op’s version)
- have: 1099'7269 (a non‑default on‑disk version)
the "have" should be the default value 0'0, but it's not.
if we check the log before - A modify op is logged as:
read_log_and_missing 1099'7269 (0'0) modify 7:f22ae0ef:::benchmark_data_smithi157_109865_object117014:head by client.17885.0:117015 ...
At this stage, the object’s on‑disk version is 1099'7269 while the modify log reports a default “have” (0'0) in the parenthesis.
Later, a delete op is logged as:
read_log_and_missing 1233'15635 (1099'7269) delete 7:f22ae0ef:::benchmark_data_smithi157_109865_object117014:head by client.17885.0:251872 ...
The missing record gets updated so that the “need” field becomes 1233'15635 (the delete op’s version) but the “have” field remains 1099'7269—the on‑disk version from the earlier modify op.
We are updating the existing missing record with a deleted record, but the "have" was not reset during the update.
i found PR: https://github.com/ceph/ceph/pull/29893 that introduced several changes:
The missing.add() interface and the pg_missing_item constructor originally accepted a “new_object” parameter. That flag was used to mark an item as “new” (e.g. calling clean_regions.mark_object_new() or even mark_fully_dirty()) when the “have” field was the default.
The PR removed the “new_object” parameter and the associated logic. As a result, when processing a delete op, if the object exists on disk the code now passes the on‑disk version as “have” (1099'7269) rather than resetting it to eversion_t() (0'0).
removing the new_object is probably the root cause of that missing record that failed the validation.
Updated by Sridhar Seshasayee 12 months ago
/a/skanta-2025-03-27_08:02:07-rados-wip-bharath10-testing-2025-03-27-0430-distro-default-smithi/8212845
Updated by Nitzan Mordechai 12 months ago
- Pull request ID changed from 48088 to 62705
I'm updating the PR to 62705
Updated by Konstantin Shalygin 12 months ago
- Target version set to v20.0.0
- Source set to Development
- Affected Versions v16.2.15, v17.2.8, v18.2.5, v19.2.3 added
- Affected Versions deleted (
v16.0.0)
Updated by Radoslaw Zarzynski 11 months ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to reef,squid,tentacle
Merged in main!
Updated by Upkeep Bot 11 months ago
- Copied to Backport #71222: reef: PGLog::read_log_and_missing: ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t())) added
Updated by Upkeep Bot 11 months ago
- Copied to Backport #71223: tentacle: PGLog::read_log_and_missing: ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t())) added
Updated by Upkeep Bot 11 months ago
- Copied to Backport #71224: squid: PGLog::read_log_and_missing: ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t())) added
Updated by Upkeep Bot 9 months ago
- Merge Commit set to 121761dea8cfe2f6acdb2676344340da1d93b69b
- Fixed In set to v20.3.0-180-g121761dea8c
- Upkeep Timestamp set to 2025-07-09T18:56:42+00:00
Updated by Upkeep Bot 8 months ago
- Fixed In changed from v20.3.0-180-g121761dea8c to v20.3.0-180-g121761dea8
- Upkeep Timestamp changed from 2025-07-09T18:56:42+00:00 to 2025-07-14T18:12:47+00:00
Updated by Konstantin Shalygin 7 months ago
- Status changed from Pending Backport to Resolved
- % Done changed from 0 to 100