
osd: rework EC for the sake of integration with crimson #54930

Merged
ljflores merged 45 commits into ceph:main from rzarzynski:wip-osd-ec-rework
Feb 15, 2024
Conversation

@rzarzynski
Contributor

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@athanatos
Contributor

@rzarzynski Did you want a review on this one, or should I wait for the split up PRs?

@rzarzynski changed the title from "crimson/osd: add support for EC pools" to "osd: rework EC for the sake of integration with crimson" on Dec 20, 2023
@rzarzynski
Contributor Author

@athanatos: just pushed out the crimson commits.

@rzarzynski
Contributor Author

jenkins retest this please

@rzarzynski
Contributor Author

Oops, it got an undetected merge conflict with scrub_validator.cc:

/home/jenkins-build/build/workspace/ceph-pull-requests/src/crimson/osd/scrub/scrub_validator.cc:100:10: error: call to deleted member function 'push_back'
      bl.push_back(xiter->second);
      ~~~^~~~~~~~~
/home/jenkins-build/build/workspace/ceph-pull-requests/src/include/rados/buffer.h:1051:10: note: candidate function has been explicitly deleted
    void push_back(ptr_node&&) = delete;
         ^
/home/jenkins-build/build/workspace/ceph-pull-requests/src/include/rados/buffer.h:1049:10: note: candidate function has been explicitly deleted
    void push_back(const ptr_node&) = delete;

Fixing.

`read()` is simply too common for grep. This commit also ensures
there is no unnoticed new call site that could be problematic
for the upcoming recovery read rework.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
`pg` for `ECBackend*` is simply misleading.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
…kend

`complete_read_op()` being aware of `RecoveryMessages` was too much.

TODO:
  * rename or rework `RecoveryMessages`, since it's now also
    a callback for `ReadOp::on_complete`. I don't like that.
  * drop the `pair<RecoveryMessages*, read_request_t&>`.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radosław Zarzyński <rzarzyns@redhat.com>
Signed-off-by: Radosław Zarzyński <rzarzyns@redhat.com>
Signed-off-by: Radosław Zarzyński <rzarzyns@redhat.com>
Signed-off-by: Radosław Zarzyński <rzarzyns@redhat.com>
1. Move some of them to .cc and
2. switch their implementations to use lower-layer methods
   instead of touching `temp_contents` directly.

Signed-off-by: Radosław Zarzyński <rzarzyns@redhat.com>
@athanatos
Contributor

Should 'mon/OSDMonitor: let crimson handle ECPools' still be here with the crimson bits removed?

Contributor

@athanatos left a comment


LGTM otherwise -- sorry for the delay!

@rzarzynski added the needs-qa label and removed the DNM label on Jan 11, 2024
@rzarzynski
Contributor Author

jenkins test api

@ljflores
Member

@rzarzynski seeing a lot of failures in the rados suite that look like this:

/a/yuriw-2024-01-23_23:02:31-rados-wip-yuri5-testing-2024-01-23-0805-distro-default-smithi/7530052

2024-01-24T03:46:27.346 INFO:tasks.ceph.osd.3.smithi040.stderr:2024-01-24T03:46:27.343+0000 7fdcc7c08640 -1 ErasureCodeLrc: _minimum_to_decode not enough chunks in  to read 0,1,4,5
2024-01-24T03:46:27.346 INFO:tasks.ceph.osd.3.smithi040.stderr:2024-01-24T03:46:27.343+0000 7fdcc7c08640 -1 ErasureCodeLrc: _minimum_to_decode not enough chunks in 2,3,6,7 to read 0,1,4,5
2024-01-24T03:46:27.347 INFO:teuthology.orchestra.run.smithi040.stdout:ERROR: (22) Invalid argument
2024-01-24T03:46:27.347 INFO:teuthology.orchestra.run.smithi040.stdout:op_tracker tracking is not enabled now, so no ops are tracked currently, even those get stuck. Please enable "osd_enable_op_tracker", and the tracker will start to track new ops received afterwards.
2024-01-24T03:46:27.347 INFO:tasks.ceph.osd.11.smithi188.stderr:2024-01-24T03:46:27.344+0000 7f60b485b640 -1 ErasureCodeLrc: _minimum_to_decode not enough chunks in  to read 0,1,4,5
2024-01-24T03:46:27.347 INFO:tasks.ceph.osd.11.smithi188.stderr:2024-01-24T03:46:27.344+0000 7f60b485b640 -1 ErasureCodeLrc: _minimum_to_decode not enough chunks in 2,3,6,7 to read 0,1,4,5
2024-01-24T03:46:27.348 INFO:tasks.ceph.osd.3.smithi040.stderr:2024-01-24T03:46:27.344+0000 7fdcc7407640 -1 ErasureCodeLrc: _minimum_to_decode not enough chunks in  to read 0,1,4,5
2024-01-24T03:46:27.348 INFO:tasks.ceph.osd.3.smithi040.stderr:2024-01-24T03:46:27.345+0000 7fdcc7407640 -1 ErasureCodeLrc: _minimum_to_decode not enough chunks in 2,3,6,7 to read 0,1,4,5
2024-01-24T03:46:27.349 INFO:tasks.rados.rados.0.smithi040.stdout:15:  writing smithi04040358-15 from 0 to 245760 tid 1
2024-01-24T03:46:27.349 INFO:tasks.rados.rados.0.smithi040.stdout:16: write initial oid 16
2024-01-24T03:46:27.349 INFO:tasks.rados.rados.0.smithi040.stdout:16:  seq_num 15 ranges {0=262144}
2024-01-24T03:46:27.349 INFO:tasks.ceph.osd.1.smithi040.stderr:2024-01-24T03:46:27.347+0000 7faede669640 -1 log_channel(cluster) log [ERR] : Corruption detected: object 3:1b08be94:::smithi04040358-7:head is missing hash_info
2024-01-24T03:46:27.350 INFO:tasks.ceph.osd.7.smithi146.stderr:2024-01-24T03:46:27.345+0000 7f3da7e56640 -1 log_channel(cluster) log [ERR] : Corruption detected: object 3:d8fa1aab:::smithi04040358-4:head is missing hash_info
2024-01-24T03:46:27.350 INFO:tasks.ceph.osd.5.smithi146.stderr:2024-01-24T03:46:27.347+0000 7fbd1b459640 -1 log_channel(cluster) log [ERR] : Corruption detected: object 3:1b08be94:::smithi04040358-7:head is missing hash_info
2024-01-24T03:46:27.351 INFO:tasks.ceph.osd.11.smithi188.stderr:2024-01-24T03:46:27.347+0000 7f60bb869640 -1 log_channel(cluster) log [ERR] : Corruption detected: object 3:1575787c:::smithi04040358-3:head is missing hash_info
2024-01-24T03:46:27.351 INFO:tasks.ceph.osd.11.smithi188.stderr:2024-01-24T03:46:27.348+0000 7f60b7861640 -1 log_channel(cluster) log [ERR] : Corruption detected: object 3:1b08be94:::smithi04040358-7:head is missing hash_info
2024-01-24T03:46:27.351 INFO:tasks.ceph.osd.11.smithi188.stderr:2024-01-24T03:46:27.348+0000 7f60b7861640 -1 ErasureCodeLrc: _minimum_to_decode not enough chunks in  to read 0,1,4,5
2024-01-24T03:46:27.351 INFO:tasks.ceph.osd.11.smithi188.stderr:2024-01-24T03:46:27.348+0000 7f60b7861640 -1 ErasureCodeLrc: _minimum_to_decode not enough chunks in 2,3,6,7 to read 0,1,4,5
2024-01-24T03:46:27.351 INFO:tasks.rados.rados.0.smithi040.stdout:16:  writing smithi04040358-16 from 0 to 262144 tid 1
2024-01-24T03:46:27.351 INFO:tasks.rados.rados.0.smithi040.stdout: waiting on 16
2024-01-24T03:46:27.351 INFO:tasks.rados.rados.0.smithi040.stdout:1:  finishing write tid 1 to smithi04040358-1
2024-01-24T03:46:27.352 INFO:tasks.rados.rados.0.smithi040.stdout:1:  finishing write tid 2 to smithi04040358-1
2024-01-24T03:46:27.352 INFO:tasks.rados.rados.0.smithi040.stdout:2:  finishing write tid 1 to smithi04040358-2
2024-01-24T03:46:27.352 INFO:tasks.rados.rados.0.smithi040.stdout:4:  finishing write tid 1 to smithi04040358-4
2024-01-24T03:46:27.352 INFO:tasks.rados.rados.0.smithi040.stdout:3:  finishing write tid 1 to smithi04040358-3
2024-01-24T03:46:27.352 INFO:tasks.rados.rados.0.smithi040.stdout:2:  finishing write tid 2 to smithi04040358-2
2024-01-24T03:46:27.352 INFO:tasks.rados.rados.0.smithi040.stdout:4:  finishing write tid 2 to smithi04040358-4
2024-01-24T03:46:27.352 INFO:tasks.rados.rados.0.smithi040.stdout:3:  finishing write tid 2 to smithi04040358-3
2024-01-24T03:46:27.352 INFO:tasks.rados.rados.0.smithi040.stdout:8:  finishing write tid 1 to smithi04040358-8
2024-01-24T03:46:27.352 INFO:tasks.rados.rados.0.smithi040.stdout:8:  finishing write tid 2 to smithi04040358-8
2024-01-24T03:46:27.352 INFO:tasks.rados.rados.0.smithi040.stdout:8:  finishing write tid 3 to smithi04040358-8
2024-01-24T03:46:27.352 INFO:tasks.rados.rados.0.smithi040.stdout:8:  oid 8 updating version 0 to 2
2024-01-24T03:46:27.352 INFO:tasks.rados.rados.0.smithi040.stdout:8:  oid 8 version 2 is already newer than 1
2024-01-24T03:46:27.352 INFO:tasks.rados.rados.0.smithi040.stdout:update_object_version oid 8 v 2 (ObjNum 7 snap 0 seq_num 7) dirty exists
2024-01-24T03:46:27.353 INFO:tasks.rados.rados.0.smithi040.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.0.0-869-gb4ffca7a/rpm/el9/BUILD/ceph-19.0.0-869-gb4ffca7a/src/test/osd/RadosModel.h: In function 'virtual void WriteOp::_finish(TestOp::CallbackInfo*)' thread 7fe12f7fe640 time 2024-01-24T03:46:27.349274+0000
2024-01-24T03:46:27.353 INFO:tasks.rados.rados.0.smithi040.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.0.0-869-gb4ffca7a/rpm/el9/BUILD/ceph-19.0.0-869-gb4ffca7a/src/test/osd/RadosModel.h: 1071: FAILED ceph_assert(r >= 0)
2024-01-24T03:46:27.353 INFO:tasks.rados.rados.0.smithi040.stderr:Assertion details: r = -5
2024-01-24T03:46:27.353 INFO:tasks.rados.rados.0.smithi040.stderr: ceph version 19.0.0-869-gb4ffca7a (b4ffca7aef7c2dcab73fe4e0e0ab5d0681f4d743) squid (dev)
2024-01-24T03:46:27.353 INFO:tasks.rados.rados.0.smithi040.stderr: 1: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0x13f) [0x7fe14037ffc4]
2024-01-24T03:46:27.353 INFO:tasks.rados.rados.0.smithi040.stderr: 2: ceph_test_rados(+0x22998) [0x557482747998]
2024-01-24T03:46:27.353 INFO:tasks.rados.rados.0.smithi040.stderr: 3: (write_callback(void*, void*)+0x1e) [0x557482760d9e]
2024-01-24T03:46:27.353 INFO:tasks.rados.rados.0.smithi040.stderr: 4: /lib64/librados.so.2(+0xb2499) [0x7fe140e2b499]
2024-01-24T03:46:27.354 INFO:tasks.rados.rados.0.smithi040.stderr: 5: /lib64/librados.so.2(+0xb37ce) [0x7fe140e2c7ce]
2024-01-24T03:46:27.354 INFO:tasks.rados.rados.0.smithi040.stderr: 6: /lib64/librados.so.2(+0xb3bc3) [0x7fe140e2cbc3]
2024-01-24T03:46:27.354 INFO:tasks.rados.rados.0.smithi040.stderr: 7: /lib64/librados.so.2(+0x1351be) [0x7fe140eae1be]
2024-01-24T03:46:27.354 INFO:tasks.rados.rados.0.smithi040.stderr: 8: /lib64/librados.so.2(+0xcee7f) [0x7fe140e47e7f]
2024-01-24T03:46:27.354 INFO:tasks.rados.rados.0.smithi040.stderr: 9: /lib64/libstdc++.so.6(+0xdb924) [0x7fe13f8db924]
2024-01-24T03:46:27.354 INFO:tasks.rados.rados.0.smithi040.stderr: 10: /lib64/libc.so.6(+0x9f802) [0x7fe13f49f802]
2024-01-24T03:46:27.354 INFO:tasks.rados.rados.0.smithi040.stderr: 11: /lib64/libc.so.6(+0x3f450) [0x7fe13f43f450]

See more examples and the full run here: https://trello.com/c/0sIdLAdH/1937-wip-yuri5-testing-2024-01-23-0805

@rzarzynski
Contributor Author

Replicated locally. WIP.

@rzarzynski
Contributor Author

Found the root cause. Reworking.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
@rzarzynski
Contributor Author

@ljflores: pushed the fixups. They addressed the problem locally. Would you mind retesting at teuthology?

Member

@ljflores left a comment


@rzarzynski I see new test failures:

/a/yuriw-2024-02-05_16:33:30-rados-wip-yuri5-testing-2024-01-31-1657-distro-default-smithi/7547028

2024-02-05T17:15:46.204 DEBUG:teuthology.orchestra.run.smithi183:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph pg repair 11.0
2024-02-05T17:15:46.481 INFO:teuthology.orchestra.run.smithi183.stderr:instructing pg 11.0s0 on osd.1 to repair
2024-02-05T17:15:46.895 INFO:tasks.ceph.osd.1.smithi183.stderr:2024-02-05T17:15:46.893+0000 7f6ae72ae640 -1 log_channel(cluster) log [ERR] : 11.0 shard 1(0) soid 11:a0216fbc:::repair_test_obj:head : candidate had a read error, candidate had a missing hinfo key
2024-02-05T17:15:46.895 INFO:tasks.ceph.osd.1.smithi183.stderr:2024-02-05T17:15:46.893+0000 7f6ae72ae640 -1 log_channel(cluster) log [ERR] : 11.0s0 repair 0 missing, 1 inconsistent objects
2024-02-05T17:15:46.895 INFO:tasks.ceph.osd.1.smithi183.stderr:2024-02-05T17:15:46.893+0000 7f6ae72ae640 -1 log_channel(cluster) log [ERR] : 11.0 repair 1 errors, 1 fixed
2024-02-05T17:15:46.907 INFO:tasks.ceph.osd.1.smithi183.stderr:./src/osd/osd_types.cc: In function 'void coll_t::calc_str()' thread 7f6aeb2b6640 time 2024-02-05T17:15:46.904903+0000
2024-02-05T17:15:46.907 INFO:tasks.ceph.osd.1.smithi183.stderr:./src/osd/osd_types.cc: 950: ceph_abort_msg("unknown collection type")
2024-02-05T17:15:46.907 INFO:tasks.ceph.osd.1.smithi183.stderr: ceph version 19.0.0-1243-ge10f4b28 (e10f4b28ec4e5fd19161c663ab66bd3d25b667cc) squid (dev)
2024-02-05T17:15:46.907 INFO:tasks.ceph.osd.1.smithi183.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xc6) [0x55e803b2a6e6]
2024-02-05T17:15:46.907 INFO:tasks.ceph.osd.1.smithi183.stderr: 2: (coll_t::calc_str()+0x5f) [0x55e803eb1a3f]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 3: ceph-osd(+0x11e0ceb) [0x55e804914ceb]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 4: ceph-osd(+0x507e60) [0x55e803c3be60]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 5: (ECBackend::RecoveryBackend::handle_recovery_push(PushOp const&, RecoveryMessages*, bool)+0x68d) [0x55e804003e3d]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 6: (ECBackend::handle_recovery_push(PushOp const&, RecoveryMessages*, bool)+0x11a) [0x55e80400475a]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 7: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x1d5) [0x55e80400a1c5]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 8: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x56) [0x55e803e064e6]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 9: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x7ed) [0x55e803d5e4fd]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 10: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x197) [0x55e803cb28c7]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 11: (ceph::osd::scheduler::PGRecoveryMsg::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x198) [0x55e803efaf38]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xab3) [0x55e803cbd543]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x293) [0x55e8041b66a3]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 14: ceph-osd(+0xa82c04) [0x55e8041b6c04]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 15: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f6b09921b43]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 16: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7f6b099b3a00]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr:*** Caught signal (Aborted) **
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: in thread 7f6aeb2b6640 thread_name:tp_osd_tp
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr:2024-02-05T17:15:46.905+0000 7f6aeb2b6640 -1 ./src/osd/osd_types.cc: In function 'void coll_t::calc_str()' thread 7f6aeb2b6640 time 2024-02-05T17:15:46.904903+0000
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr:./src/osd/osd_types.cc: 950: ceph_abort_msg("unknown collection type")
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr:
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: ceph version 19.0.0-1243-ge10f4b28 (e10f4b28ec4e5fd19161c663ab66bd3d25b667cc) squid (dev)
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xc6) [0x55e803b2a6e6]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 2: (coll_t::calc_str()+0x5f) [0x55e803eb1a3f]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 3: ceph-osd(+0x11e0ceb) [0x55e804914ceb]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 4: ceph-osd(+0x507e60) [0x55e803c3be60]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 5: (ECBackend::RecoveryBackend::handle_recovery_push(PushOp const&, RecoveryMessages*, bool)+0x68d) [0x55e804003e3d]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 6: (ECBackend::handle_recovery_push(PushOp const&, RecoveryMessages*, bool)+0x11a) [0x55e80400475a]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 7: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x1d5) [0x55e80400a1c5]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 8: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x56) [0x55e803e064e6]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 9: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x7ed) [0x55e803d5e4fd]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 10: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x197) [0x55e803cb28c7]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 11: (ceph::osd::scheduler::PGRecoveryMsg::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x198) [0x55e803efaf38]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xab3) [0x55e803cbd543]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x293) [0x55e8041b66a3]
2024-02-05T17:15:46.910 INFO:tasks.ceph.osd.1.smithi183.stderr: 14: ceph-osd(+0xa82c04) [0x55e8041b6c04]

/a/yuriw-2024-02-05_16:33:30-rados-wip-yuri5-testing-2024-01-31-1657-distro-default-smithi/7547086

2024-02-05T18:02:58.731 INFO:tasks.ceph.osd.0.smithi039.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.0.0-1243-ge10f4b28/rpm/el9/BUILD/ceph-19.0.0-1243-ge10f4b28/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_failed_pull(const std::set<pg_shard_t>&, const hobject_t&, const eversion_t&)' thread 7fabf3f53640 time 2024-02-05T18:02:58.731967+0000
2024-02-05T18:02:58.731 INFO:tasks.ceph.osd.0.smithi039.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.0.0-1243-ge10f4b28/rpm/el9/BUILD/ceph-19.0.0-1243-ge10f4b28/src/osd/PrimaryLogPG.cc: 12482: FAILED ceph_assert(recovering.count(soid))
2024-02-05T18:02:58.732 INFO:tasks.ceph.osd.0.smithi039.stderr:2024-02-05T18:02:58.730+0000 7fabf3f53640 -1 osd.0 pg_epoch: 31 pg[1.es1( v 31'1958 (0'0,31'1958] local-lis/les=30/31 n=1954 ec=13/13 lis/c=30/14 les/c/f=31/15/0 sis=30) [1,0,3,2]/[NONE,0,3,2]p0(1) async=[1(0)] r=1 lpr=30 pi=[14,30)/3 luod=28'1957 crt=28'1957 lcod 28'1955 mlcod 0'0 active+recovering+undersized+degraded+remapped rops=1 mbc={0={(0+0)=1951},1={(1+0)=1951},2={(1+0)=1951},3={(1+0)=1951}}] continue_recovery_op: 1:7c326e16:::benchmark_data_smithi039_44481_object21:head has inconsistent hinfo
2024-02-05T18:02:58.734 INFO:tasks.ceph.osd.0.smithi039.stderr: ceph version 19.0.0-1243-ge10f4b28 (e10f4b28ec4e5fd19161c663ab66bd3d25b667cc) squid (dev)
2024-02-05T18:02:58.734 INFO:tasks.ceph.osd.0.smithi039.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x11e) [0x55bef43bef2a]
2024-02-05T18:02:58.734 INFO:tasks.ceph.osd.0.smithi039.stderr: 2: ceph-osd(+0x3f30e6) [0x55bef43bf0e6]
2024-02-05T18:02:58.734 INFO:tasks.ceph.osd.0.smithi039.stderr: 3: ceph-osd(+0x39b274) [0x55bef4367274]
2024-02-05T18:02:58.734 INFO:tasks.ceph.osd.0.smithi039.stderr: 4: (ECBackend::RecoveryBackend::continue_recovery_op(ECBackend::RecoveryBackend::RecoveryOp&, RecoveryMessages*)+0x129f) [0x55bef48b09ff]
2024-02-05T18:02:58.734 INFO:tasks.ceph.osd.0.smithi039.stderr: 5: (ECBackend::RecoveryBackend::run_recovery_op(ECRecoveryHandle&, int)+0xf83) [0x55bef48b3123]
2024-02-05T18:02:58.735 INFO:tasks.ceph.osd.0.smithi039.stderr: 6: (ECBackend::run_recovery_op(PGBackend::RecoveryHandle*, int)+0x23) [0x55bef48b3533]
2024-02-05T18:02:58.735 INFO:tasks.ceph.osd.0.smithi039.stderr: 7: (PrimaryLogPG::recover_replicas(unsigned long, ThreadPool::TPHandle&, bool*)+0x142e) [0x55bef466f49e]
2024-02-05T18:02:58.735 INFO:tasks.ceph.osd.0.smithi039.stderr: 8: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x25b) [0x55bef46650bb]
2024-02-05T18:02:58.735 INFO:tasks.ceph.osd.0.smithi039.stderr: 9: (OSD::do_recovery(PG*, unsigned int, unsigned long, int, ThreadPool::TPHandle&)+0x245) [0x55bef4549375]
2024-02-05T18:02:58.735 INFO:tasks.ceph.osd.0.smithi039.stderr: 10: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0xca) [0x55bef4794bba]
2024-02-05T18:02:58.735 INFO:tasks.ceph.osd.0.smithi039.stderr: 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xd07) [0x55bef45656f7]
2024-02-05T18:02:58.735 INFO:tasks.ceph.osd.0.smithi039.stderr: 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x2aa) [0x55bef4a63a8a]
2024-02-05T18:02:58.735 INFO:tasks.ceph.osd.0.smithi039.stderr: 13: ceph-osd(+0xa98034) [0x55bef4a64034]
2024-02-05T18:02:58.735 INFO:tasks.ceph.osd.0.smithi039.stderr: 14: /lib64/libc.so.6(+0x9f802) [0x7fac14e9f802]
2024-02-05T18:02:58.735 INFO:tasks.ceph.osd.0.smithi039.stderr: 15: /lib64/libc.so.6(+0x3f450) [0x7fac14e3f450]

The run is still going, but you can see all examples here: https://pulpito.ceph.com/yuriw-2024-02-05_16:33:30-rados-wip-yuri5-testing-2024-01-31-1657-distro-default-smithi/

@ljflores
Member

ljflores commented Feb 5, 2024

@rzarzynski another new failure that looks maybe related:
https://pulpito.ceph.com/yuriw-2024-02-05_16:33:30-rados-wip-yuri5-testing-2024-01-31-1657-distro-default-smithi/7547050

2024-02-05T20:31:46.604 INFO:tasks.workunit.client.0.smithi120.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:383: TEST_divergent_ec:  for i in $osds
2024-02-05T20:31:46.605 INFO:tasks.workunit.client.0.smithi120.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:385: TEST_divergent_ec:  ceph tell osd.2 debug kick_recovery_wq 0
2024-02-05T20:36:46.736 INFO:tasks.workunit.client.0.smithi120.stderr:2024-02-05T20:36:46.735+0000 7f12479e1640  0 monclient(hunting): authenticate timed out after 300
2024-02-05T20:36:46.736 INFO:tasks.workunit.client.0.smithi120.stderr:[errno 110] RADOS timed out (error connecting to the cluster)
2024-02-05T20:36:46.738 INFO:tasks.workunit.client.0.smithi120.stdout:reading divergent objects
2024-02-05T20:36:46.738 INFO:tasks.workunit.client.0.smithi120.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:388: TEST_divergent_ec:  echo 'reading divergent objects'
2024-02-05T20:36:46.738 INFO:tasks.workunit.client.0.smithi120.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:389: TEST_divergent_ec:  ceph pg dump pgs
2024-02-05T20:41:46.873 INFO:tasks.workunit.client.0.smithi120.stderr:2024-02-05T20:41:46.872+0000 7f7514c51640  0 monclient(hunting): authenticate timed out after 300
2024-02-05T20:41:46.873 INFO:tasks.workunit.client.0.smithi120.stderr:[errno 110] RADOS timed out (error connecting to the cluster)
2024-02-05T20:41:46.876 INFO:tasks.workunit.client.0.smithi120.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:390: TEST_divergent_ec:  expr 2 + 2
2024-02-05T20:41:46.878 INFO:tasks.workunit.client.0.smithi120.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:390: TEST_divergent_ec:  seq 1 4
2024-02-05T20:41:46.879 INFO:tasks.workunit.client.0.smithi120.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:390: TEST_divergent_ec:  for i in $(seq 1 $(expr $DIVERGENT_WRITE + $DIVERGENT_REMOVE))
2024-02-05T20:41:46.879 INFO:tasks.workunit.client.0.smithi120.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:392: TEST_divergent_ec:  rados -p test get existing_1 td/divergent-priors/existing
2024-02-05T20:46:46.916 INFO:tasks.workunit.client.0.smithi120.stderr:failed to fetch mon config (--no-mon-config to skip)
2024-02-05T20:46:46.917 INFO:tasks.workunit.client.0.smithi120.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:392: TEST_divergent_ec:  return 1

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
@rzarzynski
Contributor Author

@ljflores: pushed a new commit. It's a long shot but let's try.

@ljflores
Member

ljflores commented Feb 9, 2024

@rzarzynski I found some more failures, all logged here: https://pad.ceph.com/p/osd_rework_ec_crimson

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Member

@ljflores left a comment


@ljflores
Member

jenkins test api

1 similar comment
@yuriw
Contributor

yuriw commented Feb 14, 2024

jenkins test api

@yuriw
Contributor

yuriw commented Feb 14, 2024

@rzarzynski this PR is stuck at testing api

@rzarzynski
Contributor Author

@yuriw: @ronen-fr has mentioned similar issues with the API check during today's core sync.

@yuriw
Contributor

yuriw commented Feb 14, 2024

@rzarzynski This is ready for merge as soon as all checks pass
ref: https://trello.com/c/0sIdLAdH

@rzarzynski
Contributor Author

rzarzynski commented Feb 15, 2024

Looks like unrelated dashboard failure:

2024-02-14 22:37:17,369.369 INFO:__main__:> ./bin/ceph dashboard ac-user-show admin2
2024-02-14T22:37:17.603+0000 7fedc160f640 -1 WARNING: all dangerous and experimental features are enabled.
2024-02-14T22:37:17.643+0000 7fedc160f640 -1 WARNING: all dangerous and experimental features are enabled.
Error ENOENT: User 'admin2' does not exist
2024-02-14 22:37:17,953.953 INFO:__main__:> sh -c 'echo -n admin2 > /tmp/n4gzl74ihOVUAJEndTRE'
2024-02-14 22:37:17,956.956 INFO:__main__:> ./bin/ceph dashboard ac-user-create admin2 --force-password -i /tmp/n4gzl74ihOVUAJEndTRE
2024-02-14 22:37:18,872.872 INFO:__main__:> ./bin/ceph dashboard ac-user-set-roles admin2 administrator
2024-02-14 22:37:20,356.356 INFO:__main__:> ./bin/ceph dashboard ac-user-delete admin2
2024-02-14 22:37:21,068.068 INFO:__main__:> ./bin/ceph dashboard ac-user-show admin2
2024-02-14T22:37:21.291+0000 7f8429d7f640 -1 WARNING: all dangerous and experimental features are enabled.
2024-02-14T22:37:21.355+0000 7f8429d7f640 -1 WARNING: all dangerous and experimental features are enabled.
Error ENOENT: User 'admin2' does not exist
2024-02-14 22:37:21,678.678 INFO:__main__:> sh -c 'echo -n admin2 > /tmp/xUy6pxyq9m5CfPo5djek'

To Mr CheckBot: one more time, please!

@rzarzynski
Contributor Author

jenkins test api

@ljflores merged commit f50f7fd into ceph:main on Feb 15, 2024
@ronen-fr
Contributor

Hi @rzarzynski: the PR seems to have changed the way num_shards_repaired is counted,
causing a specific failure in osd-scrub-repair.sh.

I'll issue a PR to fix.
