Project

General

Profile

Actions

Bug #45702

closed

PGLog::read_log_and_missing: ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t()))

Added by Brad Hubbard almost 6 years ago. Updated 7 months ago.

Status:
Resolved
Priority:
Normal
Category:
-
Target version:
% Done:

100%

Source:
Development
Backport:
reef,squid,tentacle
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v20.3.0-180-g121761dea8
Released In:
Upkeep Timestamp:
2025-07-14T18:12:47+00:00

Description

/a/yuriw-2020-05-22_19:55:53-rados-wip-yuri-master_5.22.20-distro-basic-smithi/5083350

2020-05-23T00:02:13.848 INFO:tasks.ceph.osd.6.smithi096.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-1826-gc3321b7686f/rpm/el8/BUILD/ceph-16.0.0-1826-gc3321b7686f/src/osd/PGLog.h: In function 'static void PGLog::read_log_and_missing(ObjectStore*, ObjectStore::CollectionHandle&, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&, std::ostringstream&, bool, bool*, const DoutPrefixProvider*, std::set<std::__cxx11::basic_string<char> >*, bool) [with missing_type = pg_missing_set<true>; ObjectStore::CollectionHandle = boost::intrusive_ptr<ObjectStore::CollectionImpl>; std::ostringstream = std::__cxx11::basic_ostringstream<char>]' thread 7fd298b2ff00 time 2020-05-23T00:02:13.846636+0000
2020-05-23T00:02:13.848 INFO:tasks.ceph.osd.6.smithi096.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-1826-gc3321b7686f/rpm/el8/BUILD/ceph-16.0.0-1826-gc3321b7686f/src/osd/PGLog.h: 1481: FAILED ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t()))
2020-05-23T00:02:13.850 INFO:tasks.ceph.osd.6.smithi096.stderr: ceph version 16.0.0-1826-gc3321b7686f (c3321b7686f181e1bcb805a1fb24baced390ae4c) pacific (dev)
2020-05-23T00:02:13.850 INFO:tasks.ceph.osd.6.smithi096.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x556707bc9e20]
2020-05-23T00:02:13.850 INFO:tasks.ceph.osd.6.smithi096.stderr: 2: (()+0x52103a) [0x556707bca03a]
2020-05-23T00:02:13.850 INFO:tasks.ceph.osd.6.smithi096.stderr: 3: (void PGLog::read_log_and_missing<pg_missing_set<true> >(ObjectStore*, boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t, pg_info_t const&, PGLog::IndexedLog&, pg_missing_set<true>&, std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, bool, bool*, DoutPrefixProvider const*, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*, bool)+0x2656) [0x556707d970a6]
2020-05-23T00:02:13.851 INFO:tasks.ceph.osd.6.smithi096.stderr: 4: (PG::read_state(ObjectStore*)+0x1306) [0x556707d7e476]
2020-05-23T00:02:13.851 INFO:tasks.ceph.osd.6.smithi096.stderr: 5: (OSD::load_pgs()+0xa47) [0x556707cc59e7]
2020-05-23T00:02:13.851 INFO:tasks.ceph.osd.6.smithi096.stderr: 6: (OSD::init()+0x270f) [0x556707cf182f]
2020-05-23T00:02:13.851 INFO:tasks.ceph.osd.6.smithi096.stderr: 7: (main()+0x4798) [0x556707c2ef08]
2020-05-23T00:02:13.851 INFO:tasks.ceph.osd.6.smithi096.stderr: 8: (__libc_start_main()+0xf3) [0x7fd29556a873]
2020-05-23T00:02:13.851 INFO:tasks.ceph.osd.6.smithi096.stderr: 9: (_start()+0x2e) [0x556707c6b0ce]
2020-05-23T00:02:13.851 INFO:tasks.ceph.osd.6.smithi096.stderr:*** Caught signal (Aborted) **

Related issues 6 (0 open6 closed)

Related to RADOS - Bug #20874: osd/PGLog.h: 1386: FAILED assert(miter == missing.get_items().end() || (miter->second.need == i->versi on && miter->second.have == eversion_t()))Can't reproduceJosh Durgin08/02/2017

Actions
Has duplicate RADOS - Bug #49136: osd/PGLog.h: FAILED ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t()))Duplicate

Actions
Has duplicate RADOS - Bug #69540: src/osd/PGLog.h: 1574: FAILED ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t()))Duplicate

Actions
Copied to RADOS - Backport #71222: reef: PGLog::read_log_and_missing: ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t()))ResolvedNitzan MordechaiActions
Copied to RADOS - Backport #71223: tentacle: PGLog::read_log_and_missing: ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t()))ResolvedNitzan MordechaiActions
Copied to RADOS - Backport #71224: squid: PGLog::read_log_and_missing: ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t()))ResolvedNitzan MordechaiActions
Actions #1

Updated by Brad Hubbard almost 6 years ago

  • Description updated (diff)
Actions #2

Updated by Sage Weil about 5 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
Actions #3

Updated by Neha Ojha about 5 years ago

  • Related to Bug #49136: osd/PGLog.h: FAILED ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t())) added
Actions #4

Updated by Neha Ojha about 5 years ago

https://tracker.ceph.com/issues/49136 may have a different root cause, so marking is it as related and not duplicate for now.

Actions #5

Updated by Neha Ojha about 5 years ago

  • Related to Bug #20874: osd/PGLog.h: 1386: FAILED assert(miter == missing.get_items().end() || (miter->second.need == i->versi on && miter->second.have == eversion_t())) added
Actions #6

Updated by Neha Ojha about 5 years ago

  • Related to deleted (Bug #49136: osd/PGLog.h: FAILED ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t())))
Actions #7

Updated by Neha Ojha about 5 years ago

  • Has duplicate Bug #49136: osd/PGLog.h: FAILED ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t())) added
Actions #8

Updated by Neha Ojha about 5 years ago

  • Priority changed from Immediate to Urgent
Actions #9

Updated by Neha Ojha about 5 years ago

  • Assignee set to Neha Ojha
  • Priority changed from Urgent to High
Actions #10

Updated by Neha Ojha almost 5 years ago

  • Backport set to pacific

/a/yuriw-2021-03-25_20:03:40-rados-wip-yuri8-testing-2021-03-25-1042-pacific-distro-basic-smithi/5999051

Actions #11

Updated by Sridhar Seshasayee over 4 years ago

Observed the assert on master:
/a/sseshasa-2021-07-14_10:37:09-rados-wip-sseshasa-testing-2021-07-14-1320-distro-basic-smithi/6270005

Actions #12

Updated by Neha Ojha over 4 years ago

/a/yuriw-2021-08-06_16:31:19-rados-wip-yuri-master-8.6.21-distro-basic-smithi/6324561 - no logs

Actions #13

Updated by Neha Ojha about 4 years ago

  • Priority changed from High to Normal
Actions #14

Updated by Laura Flores about 4 years ago

/a/yuriw-2022-02-09_22:52:18-rados-wip-yuri5-testing-2022-02-09-1322-pacific-distro-default-smithi/6672070

Actions #15

Updated by Laura Flores over 3 years ago

/a/yuriw-2022-06-23_21:29:45-rados-wip-yuri4-testing-2022-06-22-1415-pacific-distro-default-smithi/6895353

Actions #17

Updated by Radoslaw Zarzynski over 3 years ago

Even if we don't want to deep dive into right now, we should refactor the assertion:

            if (debug_verify_stored_missing) {
              auto miter = missing.get_items().find(i->soid);
              if (i->is_delete()) {
                ceph_assert(miter == missing.get_items().end() ||
                       (miter->second.need == i->version &&
                        miter->second.have == eversion_t()));
              } else {
                ceph_assert(miter != missing.get_items().end());
                ceph_assert(miter->second.need == i->version);
                ceph_assert(miter->second.have == eversion_t());
              }
              checked.insert(i->soid);
            } 
Actions #18

Updated by Radoslaw Zarzynski over 3 years ago

Ronen: @debug_verify_stored_missing a an input parameter with default value. Unfortunately, it doesn't look like a dead one:

void PG::read_state(ObjectStore *store)
{
  // ...
  recovery_state.init_from_disk_state(
    std::move(info_from_disk),
    std::move(past_intervals_from_disk),
    [this, store] (PGLog &pglog) {
      ostringstream oss;
      pglog.read_log_and_missing(
        store,
        ch,
        pgmeta_oid,
        info,
        oss,
        cct->_conf->osd_ignore_stale_divergent_priors,
        cct->_conf->osd_debug_verify_missing_on_start);
Actions #19

Updated by Radoslaw Zarzynski over 3 years ago

  • Assignee changed from Neha Ojha to Nitzan Mordechai

We're poking with read_log_and_missing() pretty recently (the dups issue). Does it ring a bell?

Actions #20

Updated by Nitzan Mordechai over 3 years ago

  • Pull request ID set to 48088
Actions #21

Updated by Nitzan Mordechai over 3 years ago

  • Status changed from New to Fix Under Review
Actions #22

Updated by Laura Flores about 2 years ago

/a/lflores-2024-01-10_23:43:40-rados-wip-yuri11-testing-2024-01-10-1124-pacific-distro-default-smithi/7512901

Actions #23

Updated by Laura Flores about 2 years ago

  • Tags set to test-failure
Actions #24

Updated by Nitzan Mordechai over 1 year ago

/a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812729

Actions #25

Updated by Radoslaw Zarzynski over 1 year ago

scrub note: reopened the PR.

Actions #26

Updated by Konstantin Shalygin over 1 year ago

  • Backport deleted (pacific)
Actions #27

Updated by Nitzan Mordechai over 1 year ago

from logs in: /a/yuriw-2024-07-22_15:45:18-rados-wip-yuri2-testing-2024-07-19-0817-distro-default-smithi/7812729

osd.8 is the one that hit the assert: (./remote/smithi012/log/ceph-osd.8.log.gz)

2024-07-22T16:35:16.906+0000 7f7abf9a5640 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.0.0-5164-gc5a6b481/rpm/el9/BUILD/ceph-19.0.0-5164-gc5a6b481/src/osd/PGLog.h: In function 'static void PGLog:
:read_log_and_missing(ceph::common::CephContext*, ObjectStore*, ObjectStore::CollectionHandle&, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&, std::ostringstream&, bool, bool*, const DoutPrefixProvider*, std::set<std::__cxx11::basic_string<char> >*, bool) [with missing_type = pg_missing_set<t
rue>; ObjectStore::CollectionHandle = boost::intrusive_ptr<ObjectStore::CollectionImpl>; std::ostringstream = std::__cxx11::basic_ostringstream<char>]' thread 7f7abf9a5640 time 2024-07-22T16:35:16.905750+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.0.0-5164-gc5a6b481/rpm/el9/BUILD/ceph-19.0.0-5164-gc5a6b481/src/osd/PGLog.h: 1581: FAILED ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t()))

 ceph version 19.0.0-5164-gc5a6b481 (c5a6b4817430650136a4e4cb6fdd11bb824a2b16) squid (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x11e) [0x562894fc183e]
 2: ceph-osd(+0x3fd9fa) [0x562894fc19fa]
 3: ceph-osd(+0x38fc53) [0x562894f53c53]
 4: (OSD::load_pgs()+0x606) [0x562895128ee6]
 5: (OSD::init()+0x21b7) [0x56289511e857]
 6: main()
 7: /lib64/libc.so.6(+0x29590) [0x7f7abfe29590]
 8: __libc_start_main()
 9: _start()

we are trying to find oid: benchmark_data_smithi135_39125_object19245

osd.8 also prints before the assertion the read_log_and_missing items that we are decoding from the omap:

2024-07-22T16:35:16.869+0000 7f7abf9a5640 20 read_log_and_missing 41'1162 (0'0) modify   3:33146baf:::benchmark_data_smithi135_39125_object19245:head by client.4681.0:19246 2024-07-22T16:31:57.677819+0000 0 ObjectCleanRegions clean_offsets: [(65536, 18446744073709486079)], clean_omap: true, new_object: false
2024-07-22T16:35:16.890+0000 7f7abf9a5640 20 read_log_and_missing 119'6295 (41'1162) delete   3:33146baf:::benchmark_data_smithi135_39125_object19245:head by client.4681.0:122116 2024-07-22T16:33:49.603339+0000 0 ObjectCleanRegions clean_offsets: [(65536, 18446744073709486079)], clean_omap: false, new_object:>

so we have 2 objects with different versions inside the missing, when we are not finding the oid from the getattr, we are finding that oid in the missing, but we are not matching the i->version to the miter that found (first one) version.

Actions #28

Updated by Matan Breizman about 1 year ago

  • Has duplicate Bug #69540: src/osd/PGLog.h: 1574: FAILED ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t())) added
Actions #29

Updated by Lee Sanders about 1 year ago · Edited

@Nitzan Mordechai

I've seen this twice under #69540 which has been marked as a duplicate of #45702

/a/ljsanders-2025-03-06_16:33:48-rados-wip-aainscow-for-lee-sanders26-distro-default-smithi/8173891
and
/a/ljsanders-2025-03-15_10:21:04-rados-wip-jamiepryde-for-lee-sanders6-distro-default-smithi/8191486/

But #45702 is marked as Fix Under Review. The PR: https://github.com/ceph/ceph/pull/48088
has been closed by the bot due to inactivity.

How do we get this moving?

Actions #30

Updated by Radoslaw Zarzynski 12 months ago

I've reopneded the fux.

Actions #31

Updated by Nitzan Mordechai 12 months ago · Edited

ok, need to refresh my analyze: recreate it again with some more debug messages (/a/nmordech-2025-03-26_10:34:16-rados:thrash-erasure-code-wip-nitzan-pglog-missing-validation-centos9-only-distro-default-smithi/8210737/)

2025-03-26T11:58:29.156+0000 7fa0818188c0 15 read_log_and_missing  missing 1233'15635 (1099'7269) delete   7:f22ae0ef:::benchmark_data_smithi157_109865_object117014:head by client.17885.0:251872 2025-03-26T11:58:05.713439+0000 0 ObjectCleanRegions clean_offsets: [65536~184467440737094
86079], clean_omap: false, new_object: false
2025-03-26T11:58:29.156+0000 7fa0818188c0 15 Found missing entry: need=1233'15635 vs expected 1233'15635, have=1099'7269 vs expected 0'0
2025-03-26T11:58:29.170+0000 7fa0818188c0 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/20.0.0-692-g5a05f4a9/rpm/el9/BUILD/ceph-20.0.0-692-g5a05f4a9/src/osd/PGLog.h: In func
tion 'static void PGLog::read_log_and_missing(ceph::common::CephContext*, ObjectStore*, ObjectStore::CollectionHandle&, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&, std::ostringstream&, bool, bool*, const DoutPrefixProvider*, std::set<std::__cxx11::basic_string<cha
r> >*, bool) [with missing_type = pg_missing_set<true>; ObjectStore::CollectionHandle = boost::intrusive_ptr<ObjectStore::CollectionImpl>; std::ostringstream = std::__cxx11::basic_ostringstream<char>]' thread 7fa0818188c0 time 2025-03-26T11:58:29.169006+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/20.0.0-692-g5a05f4a9/rpm/el9/BUILD/ceph-20.0.0-692-g5a05f4a9/src/osd/PGLog.h: 1582: FAILED ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t()))

we found missing record (delete) that hit:
ceph_assert(miter missing.get_items().end() ||
(miter->second.need i->version &&
miter->second.have == eversion_t()));

the record have:
  • need: 1233'15635 (the delete op’s version)
  • have: 1099'7269 (a non‑default on‑disk version)

the "have" should be the default value 0'0, but it's not.

if we check the log before - A modify op is logged as:

read_log_and_missing 1099'7269 (0'0) modify   7:f22ae0ef:::benchmark_data_smithi157_109865_object117014:head by client.17885.0:117015 ...

At this stage, the object’s on‑disk version is 1099'7269 while the modify log reports a default “have” (0'0) in the parenthesis.

Later, a delete op is logged as:

read_log_and_missing 1233'15635 (1099'7269) delete   7:f22ae0ef:::benchmark_data_smithi157_109865_object117014:head by client.17885.0:251872 ...

The missing record gets updated so that the “need” field becomes 1233'15635 (the delete op’s version) but the “have” field remains 1099'7269—the on‑disk version from the earlier modify op.

We are updating the existing missing record with a deleted record, but the "have" was not reset during the update.

i found PR: https://github.com/ceph/ceph/pull/29893 that introduced several changes:
The missing.add() interface and the pg_missing_item constructor originally accepted a “new_object” parameter. That flag was used to mark an item as “new” (e.g. calling clean_regions.mark_object_new() or even mark_fully_dirty()) when the “have” field was the default.

The PR removed the “new_object” parameter and the associated logic. As a result, when processing a delete op, if the object exists on disk the code now passes the on‑disk version as “have” (1099'7269) rather than resetting it to eversion_t() (0'0).

removing the new_object is probably the root cause of that missing record that failed the validation.

Actions #32

Updated by Sridhar Seshasayee 12 months ago

/a/skanta-2025-03-27_08:02:07-rados-wip-bharath10-testing-2025-03-27-0430-distro-default-smithi/8212845

Actions #33

Updated by Nitzan Mordechai 12 months ago

  • Pull request ID changed from 48088 to 62705

I'm updating the PR to 62705

Actions #34

Updated by Laura Flores 12 months ago

PR under review...

Actions #35

Updated by Konstantin Shalygin 12 months ago

  • Target version set to v20.0.0
  • Source set to Development
  • Affected Versions v16.2.15, v17.2.8, v18.2.5, v19.2.3 added
  • Affected Versions deleted (v16.0.0)
Actions #36

Updated by Laura Flores 11 months ago

Awaiting review.

Actions #37

Updated by Laura Flores 11 months ago

Fix was approved and taken into testing.

Actions #38

Updated by Radoslaw Zarzynski 11 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to reef,squid,tentacle

Merged in main!

Actions #39

Updated by Upkeep Bot 11 months ago

  • Copied to Backport #71222: reef: PGLog::read_log_and_missing: ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t())) added
Actions #40

Updated by Upkeep Bot 11 months ago

  • Copied to Backport #71223: tentacle: PGLog::read_log_and_missing: ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t())) added
Actions #41

Updated by Upkeep Bot 11 months ago

  • Copied to Backport #71224: squid: PGLog::read_log_and_missing: ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t())) added
Actions #42

Updated by Upkeep Bot 11 months ago

  • Tags (freeform) set to backport_processed
Actions #43

Updated by Upkeep Bot 9 months ago

  • Merge Commit set to 121761dea8cfe2f6acdb2676344340da1d93b69b
  • Fixed In set to v20.3.0-180-g121761dea8c
  • Upkeep Timestamp set to 2025-07-09T18:56:42+00:00
Actions #44

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v20.3.0-180-g121761dea8c to v20.3.0-180-g121761dea8
  • Upkeep Timestamp changed from 2025-07-09T18:56:42+00:00 to 2025-07-14T18:12:47+00:00
Actions #45

Updated by Konstantin Shalygin 7 months ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100
Actions

Also available in: Atom PDF