Bug #48793 (closed): out of order op

Added by Neha Ojha about 5 years ago. Updated 8 months ago.

Status:
Resolved
Priority:
Immediate
Category:
-
Target version:
-
% Done:
0%

Source:
Backport:
pacific
Regression:
No
Severity:
3 - minor
Reviewed:
01/30/2021
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Tags (freeform):
Fixed In:
v17.0.0-462-g0dbc0b67bc
Released In:
v17.2.0~3136
Upkeep Timestamp:
2025-07-14T18:04:34+00:00

Description

2021-01-07T01:57:26.031 INFO:tasks.ceph.osd.5.smithi105.stderr:2021-01-07T01:57:26.020+0000 7f4e73168700 -1 osd.5 pg_epoch: 77 pg[3.16( v 77'4158 (77'4155,77'4158] local-lis/les=57/58 n=325 ec=50/20 lis/c=57/57 les/c/f=58/58/0 sis=57) [5,7,1] r=0 lpr=57 luod=77'4155 crt=77'4158 lcod 77'4154 mlcod 77'4154 active+clean+scrubbing+deep [ 3.16:  ] ] bad op order, already applied 78736 > this 78730
2021-01-07T01:57:26.035 INFO:tasks.ceph.osd.5.smithi105.stderr:/build/ceph-16.0.0-8716-g7b39daab/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)' thread 7f4e73168700 time 2021-01-07T01:57:26.026716+0000
2021-01-07T01:57:26.036 INFO:tasks.ceph.osd.5.smithi105.stderr:/build/ceph-16.0.0-8716-g7b39daab/src/osd/PrimaryLogPG.cc: 4043: ceph_abort_msg("out of order op")
2021-01-07T01:57:26.036 INFO:tasks.ceph.osd.5.smithi105.stderr: ceph version 16.0.0-8716-g7b39daab (7b39daab79ea1fd399e4aad34b344c16206c8b0c) pacific (dev)
2021-01-07T01:57:26.037 INFO:tasks.ceph.osd.5.smithi105.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe1) [0x5644b004fa79]
2021-01-07T01:57:26.037 INFO:tasks.ceph.osd.5.smithi105.stderr: 2: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x1a79) [0x5644b023fb59]
2021-01-07T01:57:26.037 INFO:tasks.ceph.osd.5.smithi105.stderr: 3: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x4006) [0x5644b0243c46]
2021-01-07T01:57:26.037 INFO:tasks.ceph.osd.5.smithi105.stderr: 4: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xd0a) [0x5644b024f1aa]
2021-01-07T01:57:26.038 INFO:tasks.ceph.osd.5.smithi105.stderr: 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x17b) [0x5644b00d6a6b]
2021-01-07T01:57:26.038 INFO:tasks.ceph.osd.5.smithi105.stderr: 6: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x6a) [0x5644b03373fa]
2021-01-07T01:57:26.038 INFO:tasks.ceph.osd.5.smithi105.stderr: 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xcd5) [0x5644b00f4865]
2021-01-07T01:57:26.038 INFO:tasks.ceph.osd.5.smithi105.stderr: 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x5644b076698c]
2021-01-07T01:57:26.038 INFO:tasks.ceph.osd.5.smithi105.stderr: 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5644b0769c40]
2021-01-07T01:57:26.039 INFO:tasks.ceph.osd.5.smithi105.stderr: 10: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f4e9e0ba6db]
2021-01-07T01:57:26.039 INFO:tasks.ceph.osd.5.smithi105.stderr: 11: clone()
rados/thrash/{0-size-min-size-overrides/3-size-2-min-size 1-pg-log-overrides/short_pg_log 2-recovery-overrides/{default} backoff/peering ceph clusters/{fixed-2 openstack} crc-failures/default d-balancer/on mon_election/connectivity msgr-failures/few msgr/async-v2only objectstore/bluestore-stupid rados supported-random-distro$/{ubuntu_latest} thrashers/default thrashosds-health workloads/radosbench-high-concurrency}

/a/nojha-2021-01-07_00:06:49-rados-master-distro-basic-smithi/5760910


Related issues (2): 0 open, 2 closed

Has duplicate: RADOS - Bug #47003: ceph_test_rados test error. Reponses out of order due to the connection drops data. (Duplicate)

Copied to: RADOS - Backport #49145: pacific: out of order op (Resolved, Neha Ojha)
Actions #1

Updated by Deepika Upadhyay about 5 years ago

http://qa-proxy.ceph.com/teuthology/ideepika-2020-12-18_14:27:53-rados:thrash-erasure-code-master-distro-basic-smithi/5718722/teuthology.log

New options added:

osd debug inject dispatch delay duration: 0.1
osd debug inject dispatch delay probability: 0

Actions #2

Updated by Neha Ojha about 5 years ago

  • Status changed from New to Triaged
  • Assignee set to Ronen Friedman
Actions #3

Updated by Neha Ojha about 5 years ago

/a/jafaj-2021-01-05_16:20:30-rados-wip-jan-testing-2021-01-05-1401-distro-basic-smithi/5756733 with logs

Actions #4

Updated by Neha Ojha about 5 years ago

  • Assignee deleted (Ronen Friedman)
  • Backport set to pacific

This is not related to https://github.com/ceph/ceph/pull/38111.

rados/thrash/{0-size-min-size-overrides/2-size-2-min-size 1-pg-log-overrides/normal_pg_log 2-recovery-overrides/{more-async-recovery} backoff/peering_and_degraded ceph clusters/{fixed-2 openstack} crc-failures/bad_map_crc_failure d-balancer/crush-compat mon_election/classic msgr-failures/osd-delay msgr/async objectstore/bluestore-comp-snappy rados supported-random-distro$/{centos_8} thrashers/mapgap thrashosds-health workloads/radosbench-high-concurrency}

/a/nojha-2021-01-26_00:09:12-rados-pacific-distro-basic-smithi/5828719 - no logs

Actions #5

Updated by Neha Ojha about 5 years ago

  • Status changed from Triaged to Need More Info
  • Priority changed from Urgent to Immediate

rados/thrash-erasure-code-big/{ceph cluster/{12-osds openstack} mon_election/classic msgr-failures/osd-delay objectstore/bluestore-comp-snappy rados recovery-overrides/{more-active-recovery} supported-random-distro$/{rhel_8} thrashers/default thrashosds-health workloads/ec-rados-plugin=jerasure-k=4-m=2}

/a/nojha-2021-01-26_00:09:12-rados-pacific-distro-basic-smithi/5828854 - no logs

Actions #6

Updated by Neha Ojha about 5 years ago

  • Status changed from Need More Info to New
Actions #7

Updated by Neha Ojha about 5 years ago

/a/jafaj-2021-01-05_16:20:30-rados-wip-jan-testing-2021-01-05-1401-distro-basic-smithi/5756733

The following log sequence shows the problem.

2021-01-05T23:58:23.764+0000 7fa98d622700 20 osd.5 pg_epoch: 531 pg[9.8( v 531'1354 (0'0,531'1354] local-lis/les=528/529 n=94 ec=520/520 lis/c=528/528 les/c/f=529/529/0 sis=528) [5,7] r=0 lpr=528 luod=531'1352 crt=531'1354 lcod 531'1351 mlcod 531'1351 active+clean+scrubbing [ 9.8:  ] ] do_op: op osd_op(client.9419.0:24883 9.8 9:130ad8eb:::benchmark_data_smithi040_141080_object3110:head [set-alloc-hint object_size 65536 write_size 8192,write 16384~8192 in=8192b] snapc 0=[] ondisk+write+known_if_redirected e531) v8
2021-01-05T23:58:23.764+0000 7fa98d622700 20 osd.5 pg_epoch: 531 pg[9.8( v 531'1354 (0'0,531'1354] local-lis/les=528/529 n=94 ec=520/520 lis/c=528/528 les/c/f=529/529/0 sis=528) [5,7] r=0 lpr=528 luod=531'1352 crt=531'1354 lcod 531'1351 mlcod 531'1351 active+clean+scrubbing [ 9.8:  ] ]  scrubber pg(9.8) write_blocked_by_scrub 9:130ad8eb:::benchmark_data_smithi040_141080_object3110:head can preempt? 0 already preempted? 1
2021-01-05T23:58:23.764+0000 7fa98d622700 20 osd.5 pg_epoch: 531 pg[9.8( v 531'1354 (0'0,531'1354] local-lis/les=528/529 n=94 ec=520/520 lis/c=528/528 les/c/f=529/529/0 sis=528) [5,7] r=0 lpr=528 luod=531'1352 crt=531'1354 lcod 531'1351 mlcod 531'1351 active+clean+scrubbing [ 9.8:  ] ] do_op: waiting for scrub

Then,

2021-01-05T23:58:23.773+0000 7fa99162a700 20 osd.5 pg_epoch: 531 pg[9.8( v 531'1358 (0'0,531'1358] local-lis/les=528/529 n=94 ec=520/520 lis/c=528/528 les/c/f=529/529/0 sis=528) [5,7] r=0 lpr=528 luod=531'1352 crt=531'1358 lcod 531'1351 mlcod 531'1351 active+clean+scrubbing [ 9.8:  ] ] do_op: op osd_op(client.9419.0:24888 9.8 9:130ad8eb:::benchmark_data_smithi040_141080_object3110:head [set-alloc-hint object_size 65536 write_size 8192,write 57344~8192 in=8192b] snapc 0=[] ondisk+write+known_if_redirected e531) v8

2021-01-05T23:58:23.773+0000 7fa99162a700 20 osd.5 pg_epoch: 531 pg[9.8( v 531'1358 (0'0,531'1358] local-lis/les=528/529 n=94 ec=520/520 lis/c=528/528 les/c/f=529/529/0 sis=528) [5,7] r=0 lpr=528 luod=531'1352 crt=531'1358 lcod 531'1351 mlcod 531'1351 active+clean+scrubbing [ 9.8:  ] ]  op order client.9419 tid 24888 last was 24887

But,

2021-01-05T23:58:23.777+0000 7fa99162a700 20 osd.5 pg_epoch: 531 pg[9.8( v 531'1359 (0'0,531'1359] local-lis/les=528/529 n=94 ec=520/520 lis/c=528/528 les/c/f=529/529/0 sis=528) [5,7] r=0 lpr=528 luod=531'1354 crt=531'1359 lcod 531'1353 mlcod 531'1353 active+clean+scrubbing [ 9.8:  ] ] do_op: op osd_op(client.9419.0:24883 9.8 9:130ad8eb:::benchmark_data_smithi040_141080_object3110:head [set-alloc-hint object_size 65536 write_size 8192,write 16384~8192 in=8192b] snapc 0=[] ondisk+write+known_if_redirected e531) v8

2021-01-05T23:58:23.777+0000 7fa99162a700 -1 osd.5 pg_epoch: 531 pg[9.8( v 531'1359 (0'0,531'1359] local-lis/les=528/529 n=94 ec=520/520 lis/c=528/528 les/c/f=529/529/0 sis=528) [5,7] r=0 lpr=528 luod=531'1354 crt=531'1359 lcod 531'1353 mlcod 531'1353 active+clean+scrubbing [ 9.8:  ] ] bad op order, already applied 24888 > this 24883
Actions #8

Updated by Neha Ojha about 5 years ago

  • Assignee set to Ronen Friedman

Ronen, can you please check if this is related to your scrub refactor?

Actions #9

Updated by Neha Ojha about 5 years ago

  • Status changed from New to Triaged
Actions #10

Updated by Ronen Friedman about 5 years ago

  • Status changed from Triaged to In Progress

In the revised scrub code there is a window in which:
- the scrub is marked as 'preempted',
- preemption is already disabled, but
- the scrub range is still set (start != end).

(In the original code, that sequence is part of a single block of execution.)

A write request arriving in that window will be blocked, but will not be restarted in time. (It is eventually restarted, since we do not clear the 'waiting for scrub' state, but much later and in the wrong order.)

Actions #11

Updated by Ronen Friedman about 5 years ago

  • Status changed from In Progress to Fix Under Review
  • Reviewed set to 01/30/2021
  • Pull request ID set to 39145
Actions #12

Updated by Neha Ojha about 5 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #13

Updated by Upkeep Bot about 5 years ago

Actions #14

Updated by Neha Ojha about 5 years ago

  • Has duplicate Bug #47003: ceph_test_rados test error. Reponses out of order due to the connection drops data. added
Actions #15

Updated by Nathan Cutler about 5 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Actions #16

Updated by Laura Flores almost 4 years ago

@Neha, @Ronen: this Octopus failure looks a lot like this tracker. Was the revised scrub code backported to Octopus, or is this something different?

/a/lflores-2022-05-04_18:59:38-rados-wip-55077-octopus-distro-default-smithi/6820808

2022-05-06T02:24:42.307 INFO:tasks.ceph.osd.0.smithi089.stderr:2022-05-06T02:24:42.305+0000 7f9d28b54700 -1 osd.0 pg_epoch: 41 pg[4.0( v 41'162 (41'159,41'162] local-lis/les=40/41 n=5 ec=21/21 lis/c=40/21 les/c/f=41/22/0 sis=40) [3,7,0]/[0,7] backfill=[3] r=0 lpr=40 pi=[21,40)/1 crt=41'162 lcod 41'161 mlcod 41'161 active+undersized+degraded+remapped+backfilling rops=1 mbc={255={}}] bad op order, already applied 2447 > this 2446
2022-05-06T02:24:42.309 INFO:tasks.rados.rados.0.smithi089.stdout:1072:  writing smithi08915060-44 from 608486 to 1170455 tid 1
2022-05-06T02:24:42.310 INFO:tasks.ceph.osd.0.smithi089.stderr:/build/ceph-15.2.16-87-g7f781e26/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)' thread 7f9d28b54700 time 2022-05-06T02:24:42.308750+0000
2022-05-06T02:24:42.310 INFO:tasks.ceph.osd.0.smithi089.stderr:/build/ceph-15.2.16-87-g7f781e26/src/osd/PrimaryLogPG.cc: 3982: ceph_abort_msg("out of order op")
2022-05-06T02:24:42.311 INFO:tasks.ceph.osd.0.smithi089.stderr: ceph version 15.2.16-87-g7f781e26 (7f781e26c2a8d3e1d920c6fd32e81b5ef3019541) octopus (stable)
2022-05-06T02:24:42.311 INFO:tasks.ceph.osd.0.smithi089.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe0) [0x558480d2e074]
2022-05-06T02:24:42.311 INFO:tasks.ceph.osd.0.smithi089.stderr: 2: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x193c) [0x558480f7f0bc]
2022-05-06T02:24:42.311 INFO:tasks.ceph.osd.0.smithi089.stderr: 3: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x3570) [0x558480f827e0]
2022-05-06T02:24:42.312 INFO:tasks.ceph.osd.0.smithi089.stderr: 4: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xe50) [0x558480f8d290]
2022-05-06T02:24:42.312 INFO:tasks.ceph.osd.0.smithi089.stderr: 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1c0) [0x558480dddc40]
2022-05-06T02:24:42.312 INFO:tasks.ceph.osd.0.smithi089.stderr: 6: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x66) [0x55848108b886]
2022-05-06T02:24:42.312 INFO:tasks.ceph.osd.0.smithi089.stderr: 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xe8d) [0x558480e0afdd]
2022-05-06T02:24:42.312 INFO:tasks.ceph.osd.0.smithi089.stderr: 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x403) [0x5584814f4453]
2022-05-06T02:24:42.313 INFO:tasks.ceph.osd.0.smithi089.stderr: 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5584814f7484]
2022-05-06T02:24:42.313 INFO:tasks.ceph.osd.0.smithi089.stderr: 10: (()+0x8609) [0x7f9d474fd609]
2022-05-06T02:24:42.313 INFO:tasks.ceph.osd.0.smithi089.stderr: 11: (clone()+0x43) [0x7f9d47067163]

Actions #17

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to 0dbc0b67bcf852692a0af692e64bfae11ba0a429
  • Fixed In set to v17.0.0-462-g0dbc0b67bc
  • Released In set to v17.2.0~3136
  • Upkeep Timestamp set to 2025-07-14T18:04:34+00:00