osd: rework EC for the sake of integration with crimson (#54930)
Conversation
|
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved |
|
@rzarzynski Did you want a review on this one, or should I wait for the split up PRs? |
Force-pushed cee8f06 to c22ed9a (compare)
|
@athanatos: just pushed out the crimson commits. |
|
jenkins retest this please |
|
Oops, it got an undetected merge conflict. Fixing. |
`read()` is simply too common a name to grep for. This commit also ensures there is no unnoticed new call site, which could be problematic in light of the upcoming recovery read rework. Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
`pg` for `ECBackend*` is simply misleading. Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
`complete_read_op()` being aware of `RecoveryMessages` was too much.
TODO:
* rename or rework `RecoveryMessages`. It is now also
a callback for `ReadOp::on_complete`; I don't like that.
* drop the `pair<RecoveryMessages*, read_request_t&>`.
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
1. Move some of them to .cc and 2. switch their implementations to use lower-layer methods instead of touching `temp_contents` directly. Signed-off-by: Radosław Zarzyński <rzarzyns@redhat.com>
|
Should 'mon/OSDMonitor: let crimson handle ECPools' still be here with the crimson bits removed? |
athanatos left a comment:
LGTM otherwise -- sorry for the delay!
Force-pushed c22ed9a to 3486b07 (compare)
|
jenkins test api |
|
@rzarzynski seeing a lot of failures in the rados suite that look like this: /a/yuriw-2024-01-23_23:02:31-rados-wip-yuri5-testing-2024-01-23-0805-distro-default-smithi/7530052 See more examples and the full run here: https://trello.com/c/0sIdLAdH/1937-wip-yuri5-testing-2024-01-23-0805 |
|
Replicated locally. WIP. |
|
Found the root cause. Reworking. |
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
|
@ljflores: pushed the fixups. They addressed the problem locally. Would you mind retesting at teuthology? |
@rzarzynski I see new test failures:
/a/yuriw-2024-02-05_16:33:30-rados-wip-yuri5-testing-2024-01-31-1657-distro-default-smithi/7547028
2024-02-05T17:15:46.204 DEBUG:teuthology.orchestra.run.smithi183:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph pg repair 11.0
2024-02-05T17:15:46.481 INFO:teuthology.orchestra.run.smithi183.stderr:instructing pg 11.0s0 on osd.1 to repair
2024-02-05T17:15:46.895 INFO:tasks.ceph.osd.1.smithi183.stderr:2024-02-05T17:15:46.893+0000 7f6ae72ae640 -1 log_channel(cluster) log [ERR] : 11.0 shard 1(0) soid 11:a0216fbc:::repair_test_obj:head : candidate had a read error, candidate had a missing hinfo key
2024-02-05T17:15:46.895 INFO:tasks.ceph.osd.1.smithi183.stderr:2024-02-05T17:15:46.893+0000 7f6ae72ae640 -1 log_channel(cluster) log [ERR] : 11.0s0 repair 0 missing, 1 inconsistent objects
2024-02-05T17:15:46.895 INFO:tasks.ceph.osd.1.smithi183.stderr:2024-02-05T17:15:46.893+0000 7f6ae72ae640 -1 log_channel(cluster) log [ERR] : 11.0 repair 1 errors, 1 fixed
2024-02-05T17:15:46.907 INFO:tasks.ceph.osd.1.smithi183.stderr:./src/osd/osd_types.cc: In function 'void coll_t::calc_str()' thread 7f6aeb2b6640 time 2024-02-05T17:15:46.904903+0000
2024-02-05T17:15:46.907 INFO:tasks.ceph.osd.1.smithi183.stderr:./src/osd/osd_types.cc: 950: ceph_abort_msg("unknown collection type")
2024-02-05T17:15:46.907 INFO:tasks.ceph.osd.1.smithi183.stderr: ceph version 19.0.0-1243-ge10f4b28 (e10f4b28ec4e5fd19161c663ab66bd3d25b667cc) squid (dev)
2024-02-05T17:15:46.907 INFO:tasks.ceph.osd.1.smithi183.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xc6) [0x55e803b2a6e6]
2024-02-05T17:15:46.907 INFO:tasks.ceph.osd.1.smithi183.stderr: 2: (coll_t::calc_str()+0x5f) [0x55e803eb1a3f]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 3: ceph-osd(+0x11e0ceb) [0x55e804914ceb]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 4: ceph-osd(+0x507e60) [0x55e803c3be60]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 5: (ECBackend::RecoveryBackend::handle_recovery_push(PushOp const&, RecoveryMessages*, bool)+0x68d) [0x55e804003e3d]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 6: (ECBackend::handle_recovery_push(PushOp const&, RecoveryMessages*, bool)+0x11a) [0x55e80400475a]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 7: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x1d5) [0x55e80400a1c5]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 8: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x56) [0x55e803e064e6]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 9: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x7ed) [0x55e803d5e4fd]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 10: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x197) [0x55e803cb28c7]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 11: (ceph::osd::scheduler::PGRecoveryMsg::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x198) [0x55e803efaf38]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xab3) [0x55e803cbd543]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x293) [0x55e8041b66a3]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 14: ceph-osd(+0xa82c04) [0x55e8041b6c04]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 15: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f6b09921b43]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: 16: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7f6b099b3a00]
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr:*** Caught signal (Aborted) **
2024-02-05T17:15:46.908 INFO:tasks.ceph.osd.1.smithi183.stderr: in thread 7f6aeb2b6640 thread_name:tp_osd_tp
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr:2024-02-05T17:15:46.905+0000 7f6aeb2b6640 -1 ./src/osd/osd_types.cc: In function 'void coll_t::calc_str()' thread 7f6aeb2b6640 time 2024-02-05T17:15:46.904903+0000
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr:./src/osd/osd_types.cc: 950: ceph_abort_msg("unknown collection type")
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr:
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: ceph version 19.0.0-1243-ge10f4b28 (e10f4b28ec4e5fd19161c663ab66bd3d25b667cc) squid (dev)
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xc6) [0x55e803b2a6e6]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 2: (coll_t::calc_str()+0x5f) [0x55e803eb1a3f]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 3: ceph-osd(+0x11e0ceb) [0x55e804914ceb]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 4: ceph-osd(+0x507e60) [0x55e803c3be60]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 5: (ECBackend::RecoveryBackend::handle_recovery_push(PushOp const&, RecoveryMessages*, bool)+0x68d) [0x55e804003e3d]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 6: (ECBackend::handle_recovery_push(PushOp const&, RecoveryMessages*, bool)+0x11a) [0x55e80400475a]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 7: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x1d5) [0x55e80400a1c5]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 8: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x56) [0x55e803e064e6]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 9: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x7ed) [0x55e803d5e4fd]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 10: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x197) [0x55e803cb28c7]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 11: (ceph::osd::scheduler::PGRecoveryMsg::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x198) [0x55e803efaf38]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xab3) [0x55e803cbd543]
2024-02-05T17:15:46.909 INFO:tasks.ceph.osd.1.smithi183.stderr: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x293) [0x55e8041b66a3]
2024-02-05T17:15:46.910 INFO:tasks.ceph.osd.1.smithi183.stderr: 14: ceph-osd(+0xa82c04) [0x55e8041b6c04]
/a/yuriw-2024-02-05_16:33:30-rados-wip-yuri5-testing-2024-01-31-1657-distro-default-smithi/7547086
2024-02-05T18:02:58.731 INFO:tasks.ceph.osd.0.smithi039.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.0.0-1243-ge10f4b28/rpm/el9/BUILD/ceph-19.0.0-1243-ge10f4b28/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_failed_pull(const std::set<pg_shard_t>&, const hobject_t&, const eversion_t&)' thread 7fabf3f53640 time 2024-02-05T18:02:58.731967+0000
2024-02-05T18:02:58.731 INFO:tasks.ceph.osd.0.smithi039.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.0.0-1243-ge10f4b28/rpm/el9/BUILD/ceph-19.0.0-1243-ge10f4b28/src/osd/PrimaryLogPG.cc: 12482: FAILED ceph_assert(recovering.count(soid))
2024-02-05T18:02:58.732 INFO:tasks.ceph.osd.0.smithi039.stderr:2024-02-05T18:02:58.730+0000 7fabf3f53640 -1 osd.0 pg_epoch: 31 pg[1.es1( v 31'1958 (0'0,31'1958] local-lis/les=30/31 n=1954 ec=13/13 lis/c=30/14 les/c/f=31/15/0 sis=30) [1,0,3,2]/[NONE,0,3,2]p0(1) async=[1(0)] r=1 lpr=30 pi=[14,30)/3 luod=28'1957 crt=28'1957 lcod 28'1955 mlcod 0'0 active+recovering+undersized+degraded+remapped rops=1 mbc={0={(0+0)=1951},1={(1+0)=1951},2={(1+0)=1951},3={(1+0)=1951}}] continue_recovery_op: 1:7c326e16:::benchmark_data_smithi039_44481_object21:head has inconsistent hinfo
2024-02-05T18:02:58.734 INFO:tasks.ceph.osd.0.smithi039.stderr: ceph version 19.0.0-1243-ge10f4b28 (e10f4b28ec4e5fd19161c663ab66bd3d25b667cc) squid (dev)
2024-02-05T18:02:58.734 INFO:tasks.ceph.osd.0.smithi039.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x11e) [0x55bef43bef2a]
2024-02-05T18:02:58.734 INFO:tasks.ceph.osd.0.smithi039.stderr: 2: ceph-osd(+0x3f30e6) [0x55bef43bf0e6]
2024-02-05T18:02:58.734 INFO:tasks.ceph.osd.0.smithi039.stderr: 3: ceph-osd(+0x39b274) [0x55bef4367274]
2024-02-05T18:02:58.734 INFO:tasks.ceph.osd.0.smithi039.stderr: 4: (ECBackend::RecoveryBackend::continue_recovery_op(ECBackend::RecoveryBackend::RecoveryOp&, RecoveryMessages*)+0x129f) [0x55bef48b09ff]
2024-02-05T18:02:58.734 INFO:tasks.ceph.osd.0.smithi039.stderr: 5: (ECBackend::RecoveryBackend::run_recovery_op(ECRecoveryHandle&, int)+0xf83) [0x55bef48b3123]
2024-02-05T18:02:58.735 INFO:tasks.ceph.osd.0.smithi039.stderr: 6: (ECBackend::run_recovery_op(PGBackend::RecoveryHandle*, int)+0x23) [0x55bef48b3533]
2024-02-05T18:02:58.735 INFO:tasks.ceph.osd.0.smithi039.stderr: 7: (PrimaryLogPG::recover_replicas(unsigned long, ThreadPool::TPHandle&, bool*)+0x142e) [0x55bef466f49e]
2024-02-05T18:02:58.735 INFO:tasks.ceph.osd.0.smithi039.stderr: 8: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x25b) [0x55bef46650bb]
2024-02-05T18:02:58.735 INFO:tasks.ceph.osd.0.smithi039.stderr: 9: (OSD::do_recovery(PG*, unsigned int, unsigned long, int, ThreadPool::TPHandle&)+0x245) [0x55bef4549375]
2024-02-05T18:02:58.735 INFO:tasks.ceph.osd.0.smithi039.stderr: 10: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0xca) [0x55bef4794bba]
2024-02-05T18:02:58.735 INFO:tasks.ceph.osd.0.smithi039.stderr: 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xd07) [0x55bef45656f7]
2024-02-05T18:02:58.735 INFO:tasks.ceph.osd.0.smithi039.stderr: 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x2aa) [0x55bef4a63a8a]
2024-02-05T18:02:58.735 INFO:tasks.ceph.osd.0.smithi039.stderr: 13: ceph-osd(+0xa98034) [0x55bef4a64034]
2024-02-05T18:02:58.735 INFO:tasks.ceph.osd.0.smithi039.stderr: 14: /lib64/libc.so.6(+0x9f802) [0x7fac14e9f802]
2024-02-05T18:02:58.735 INFO:tasks.ceph.osd.0.smithi039.stderr: 15: /lib64/libc.so.6(+0x3f450) [0x7fac14e3f450]
The run is still going, but you can see all examples here: https://pulpito.ceph.com/yuriw-2024-02-05_16:33:30-rados-wip-yuri5-testing-2024-01-31-1657-distro-default-smithi/
|
@rzarzynski another new failure that looks maybe related: |
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
|
@ljflores: pushed a new commit. It's a long shot but let's try. |
|
@rzarzynski I found some more failures, all logged here: https://pad.ceph.com/p/osd_rework_ec_crimson |
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
|
jenkins test api |
|
jenkins test api |
|
@rzarzynski this PR is stuck at the api test |
|
@rzarzynski This is ready for merge as soon as all checks pass |
|
Looks like an unrelated dashboard failure. To Mr CheckBot: one more time, please! |
|
jenkins test api |
|
Hi @rzarzynski: the PR seems to have changed the way num_shards_repaired is counted; I'll issue a PR to fix it. |