Bug #48793 (closed): out of order op

Added by Neha Ojha about 5 years ago. Updated 8 months ago.

Status:
Resolved
Priority:
Immediate
Category:
-
Target version:
-
% Done:
0%

Source:
Backport:
pacific
Regression:
No
Severity:
3 - minor
Reviewed:
01/30/2021
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Tags (freeform):
Fixed In:
v17.0.0-462-g0dbc0b67bc
Released In:
v17.2.0~3136
Upkeep Timestamp:
2025-07-14T18:04:34+00:00

Description

2021-01-07T01:57:26.031 INFO:tasks.ceph.osd.5.smithi105.stderr:2021-01-07T01:57:26.020+0000 7f4e73168700 -1 osd.5 pg_epoch: 77 pg[3.16( v 77'4158 (77'4155,77'4158] local-lis/les=57/58 n=325 ec=50/20 lis/c=57/57 les/c/f=58/58/0 sis=57) [5,7,1] r=0 lpr=57 luod=77'4155 crt=77'4158 lcod 77'4154 mlcod 77'4154 active+clean+scrubbing+deep [ 3.16:  ] ] bad op order, already applied 78736 > this 78730
2021-01-07T01:57:26.035 INFO:tasks.ceph.osd.5.smithi105.stderr:/build/ceph-16.0.0-8716-g7b39daab/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)' thread 7f4e73168700 time 2021-01-07T01:57:26.026716+0000
2021-01-07T01:57:26.036 INFO:tasks.ceph.osd.5.smithi105.stderr:/build/ceph-16.0.0-8716-g7b39daab/src/osd/PrimaryLogPG.cc: 4043: ceph_abort_msg("out of order op")
2021-01-07T01:57:26.036 INFO:tasks.ceph.osd.5.smithi105.stderr: ceph version 16.0.0-8716-g7b39daab (7b39daab79ea1fd399e4aad34b344c16206c8b0c) pacific (dev)
2021-01-07T01:57:26.037 INFO:tasks.ceph.osd.5.smithi105.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe1) [0x5644b004fa79]
2021-01-07T01:57:26.037 INFO:tasks.ceph.osd.5.smithi105.stderr: 2: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x1a79) [0x5644b023fb59]
2021-01-07T01:57:26.037 INFO:tasks.ceph.osd.5.smithi105.stderr: 3: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x4006) [0x5644b0243c46]
2021-01-07T01:57:26.037 INFO:tasks.ceph.osd.5.smithi105.stderr: 4: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xd0a) [0x5644b024f1aa]
2021-01-07T01:57:26.038 INFO:tasks.ceph.osd.5.smithi105.stderr: 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x17b) [0x5644b00d6a6b]
2021-01-07T01:57:26.038 INFO:tasks.ceph.osd.5.smithi105.stderr: 6: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x6a) [0x5644b03373fa]
2021-01-07T01:57:26.038 INFO:tasks.ceph.osd.5.smithi105.stderr: 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xcd5) [0x5644b00f4865]
2021-01-07T01:57:26.038 INFO:tasks.ceph.osd.5.smithi105.stderr: 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x5644b076698c]
2021-01-07T01:57:26.038 INFO:tasks.ceph.osd.5.smithi105.stderr: 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5644b0769c40]
2021-01-07T01:57:26.039 INFO:tasks.ceph.osd.5.smithi105.stderr: 10: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f4e9e0ba6db]
2021-01-07T01:57:26.039 INFO:tasks.ceph.osd.5.smithi105.stderr: 11: clone()
rados/thrash/{0-size-min-size-overrides/3-size-2-min-size 1-pg-log-overrides/short_pg_log 2-recovery-overrides/{default} backoff/peering ceph clusters/{fixed-2 openstack} crc-failures/default d-balancer/on mon_election/connectivity msgr-failures/few msgr/async-v2only objectstore/bluestore-stupid rados supported-random-distro$/{ubuntu_latest} thrashers/default thrashosds-health workloads/radosbench-high-concurrency}

/a/nojha-2021-01-07_00:06:49-rados-master-distro-basic-smithi/5760910


Related issues (2): 0 open, 2 closed

Has duplicate: RADOS - Bug #47003: ceph_test_rados test error. Reponses out of order due to the connection drops data. (Duplicate)

Copied to: RADOS - Backport #49145: pacific: out of order op (Resolved, Neha Ojha)
Actions #1

Updated by Deepika Upadhyay about 5 years ago

http://qa-proxy.ceph.com/teuthology/ideepika-2020-12-18_14:27:53-rados:thrash-erasure-code-master-distro-basic-smithi/5718722/teuthology.log

New options added:

osd debug inject dispatch delay duration: 0.1
osd debug inject dispatch delay probability: 0

Actions #2

Updated by Neha Ojha about 5 years ago

  • Status changed from New to Triaged
  • Assignee set to Ronen Friedman
Actions #3

Updated by Neha Ojha about 5 years ago

/a/jafaj-2021-01-05_16:20:30-rados-wip-jan-testing-2021-01-05-1401-distro-basic-smithi/5756733 with logs

Actions #4

Updated by Neha Ojha about 5 years ago

  • Assignee deleted (Ronen Friedman)
  • Backport set to pacific

This is not related to https://github.com/ceph/ceph/pull/38111.

rados/thrash/{0-size-min-size-overrides/2-size-2-min-size 1-pg-log-overrides/normal_pg_log 2-recovery-overrides/{more-async-recovery} backoff/peering_and_degraded ceph clusters/{fixed-2 openstack} crc-failures/bad_map_crc_failure d-balancer/crush-compat mon_election/classic msgr-failures/osd-delay msgr/async objectstore/bluestore-comp-snappy rados supported-random-distro$/{centos_8} thrashers/mapgap thrashosds-health workloads/radosbench-high-concurrency}

/a/nojha-2021-01-26_00:09:12-rados-pacific-distro-basic-smithi/5828719 - no logs

Actions #5

Updated by Neha Ojha about 5 years ago

  • Status changed from Triaged to Need More Info
  • Priority changed from Urgent to Immediate

rados/thrash-erasure-code-big/{ceph cluster/{12-osds openstack} mon_election/classic msgr-failures/osd-delay objectstore/bluestore-comp-snappy rados recovery-overrides/{more-active-recovery} supported-random-distro$/{rhel_8} thrashers/default thrashosds-health workloads/ec-rados-plugin=jerasure-k=4-m=2}

/a/nojha-2021-01-26_00:09:12-rados-pacific-distro-basic-smithi/5828854 - no logs

Actions #6

Updated by Neha Ojha about 5 years ago

  • Status changed from Need More Info to New
Actions #7

Updated by Neha Ojha about 5 years ago

/a/jafaj-2021-01-05_16:20:30-rados-wip-jan-testing-2021-01-05-1401-distro-basic-smithi/5756733

The following log sequence shows the problem.

2021-01-05T23:58:23.764+0000 7fa98d622700 20 osd.5 pg_epoch: 531 pg[9.8( v 531'1354 (0'0,531'1354] local-lis/les=528/529 n=94 ec=520/520 lis/c=528/528 les/c/f=529/529/0 sis=528) [5,7] r=0 lpr=528 luod=531'1352 crt=531'1354 lcod 531'1351 mlcod 531'1351 active+clean+scrubbing [ 9.8:  ] ] do_op: op osd_op(client.9419.0:24883 9.8 9:130ad8eb:::benchmark_data_smithi040_141080_object3110:head [set-alloc-hint object_size 65536 write_size 8192,write 16384~8192 in=8192b] snapc 0=[] ondisk+write+known_if_redirected e531) v8
2021-01-05T23:58:23.764+0000 7fa98d622700 20 osd.5 pg_epoch: 531 pg[9.8( v 531'1354 (0'0,531'1354] local-lis/les=528/529 n=94 ec=520/520 lis/c=528/528 les/c/f=529/529/0 sis=528) [5,7] r=0 lpr=528 luod=531'1352 crt=531'1354 lcod 531'1351 mlcod 531'1351 active+clean+scrubbing [ 9.8:  ] ]  scrubber pg(9.8) write_blocked_by_scrub 9:130ad8eb:::benchmark_data_smithi040_141080_object3110:head can preempt? 0 already preempted? 1
2021-01-05T23:58:23.764+0000 7fa98d622700 20 osd.5 pg_epoch: 531 pg[9.8( v 531'1354 (0'0,531'1354] local-lis/les=528/529 n=94 ec=520/520 lis/c=528/528 les/c/f=529/529/0 sis=528) [5,7] r=0 lpr=528 luod=531'1352 crt=531'1354 lcod 531'1351 mlcod 531'1351 active+clean+scrubbing [ 9.8:  ] ] do_op: waiting for scrub

Then,

2021-01-05T23:58:23.773+0000 7fa99162a700 20 osd.5 pg_epoch: 531 pg[9.8( v 531'1358 (0'0,531'1358] local-lis/les=528/529 n=94 ec=520/520 lis/c=528/528 les/c/f=529/529/0 sis=528) [5,7] r=0 lpr=528 luod=531'1352 crt=531'1358 lcod 531'1351 mlcod 531'1351 active+clean+scrubbing [ 9.8:  ] ] do_op: op osd_op(client.9419.0:24888 9.8 9:130ad8eb:::benchmark_data_smithi040_141080_object3110:head [set-alloc-hint object_size 65536 write_size 8192,write 57344~8192 in=8192b] snapc 0=[] ondisk+write+known_if_redirected e531) v8

2021-01-05T23:58:23.773+0000 7fa99162a700 20 osd.5 pg_epoch: 531 pg[9.8( v 531'1358 (0'0,531'1358] local-lis/les=528/529 n=94 ec=520/520 lis/c=528/528 les/c/f=529/529/0 sis=528) [5,7] r=0 lpr=528 luod=531'1352 crt=531'1358 lcod 531'1351 mlcod 531'1351 active+clean+scrubbing [ 9.8:  ] ]  op order client.9419 tid 24888 last was 24887

But,

2021-01-05T23:58:23.777+0000 7fa99162a700 20 osd.5 pg_epoch: 531 pg[9.8( v 531'1359 (0'0,531'1359] local-lis/les=528/529 n=94 ec=520/520 lis/c=528/528 les/c/f=529/529/0 sis=528) [5,7] r=0 lpr=528 luod=531'1354 crt=531'1359 lcod 531'1353 mlcod 531'1353 active+clean+scrubbing [ 9.8:  ] ] do_op: op osd_op(client.9419.0:24883 9.8 9:130ad8eb:::benchmark_data_smithi040_141080_object3110:head [set-alloc-hint object_size 65536 write_size 8192,write 16384~8192 in=8192b] snapc 0=[] ondisk+write+known_if_redirected e531) v8

2021-01-05T23:58:23.777+0000 7fa99162a700 -1 osd.5 pg_epoch: 531 pg[9.8( v 531'1359 (0'0,531'1359] local-lis/les=528/529 n=94 ec=520/520 lis/c=528/528 les/c/f=529/529/0 sis=528) [5,7] r=0 lpr=528 luod=531'1354 crt=531'1359 lcod 531'1353 mlcod 531'1353 active+clean+scrubbing [ 9.8:  ] ] bad op order, already applied 24888 > this 24883
Actions #8

Updated by Neha Ojha about 5 years ago

  • Assignee set to Ronen Friedman

Ronen, can you please check if this is related to your scrub refactor?

Actions #9

Updated by Neha Ojha about 5 years ago

  • Status changed from New to Triaged
Actions #10

Updated by Ronen Friedman about 5 years ago

  • Status changed from Triaged to In Progress

In the revised scrub code there is a window in which:
- the scrub is marked as 'preempted',
- preemption is already disabled, but
- the scrub range is still set (start != end).

(In the original code, that sequence is part of a single block of execution.)

A write request arriving in that window will be blocked, but will not be restarted in time. (It is eventually restarted, since we do not clear the 'waiting for scrub' state, but much later and in the wrong order.)

Actions #11

Updated by Ronen Friedman about 5 years ago

  • Status changed from In Progress to Fix Under Review
  • Reviewed set to 01/30/2021
  • Pull request ID set to 39145
Actions #12

Updated by Neha Ojha about 5 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #13

Updated by Upkeep Bot about 5 years ago

Actions #14

Updated by Neha Ojha about 5 years ago

  • Has duplicate Bug #47003: ceph_test_rados test error. Reponses out of order due to the connection drops data. added
Actions #15

Updated by Nathan Cutler about 5 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Actions #16

Updated by Laura Flores almost 4 years ago

@Neha, @Ronen: this Octopus failure looks a lot like this tracker. Was the revised scrub code backported to Octopus, or is this something different?

/a/lflores-2022-05-04_18:59:38-rados-wip-55077-octopus-distro-default-smithi/6820808

2022-05-06T02:24:42.307 INFO:tasks.ceph.osd.0.smithi089.stderr:2022-05-06T02:24:42.305+0000 7f9d28b54700 -1 osd.0 pg_epoch: 41 pg[4.0( v 41'162 (41'159,41'162] local-lis/les=40/41 n=5 ec=21/21 lis/c=40/21 les/c/f=41/22/0 sis=40) [3,7,0]/[0,7] backfill=[3] r=0 lpr=40 pi=[21,40)/1 crt=41'162 lcod 41'161 mlcod 41'161 active+undersized+degraded+remapped+backfilling rops=1 mbc={255={}}] bad op order, already applied 2447 > this 2446
2022-05-06T02:24:42.309 INFO:tasks.rados.rados.0.smithi089.stdout:1072:  writing smithi08915060-44 from 608486 to 1170455 tid 1
2022-05-06T02:24:42.310 INFO:tasks.ceph.osd.0.smithi089.stderr:/build/ceph-15.2.16-87-g7f781e26/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)' thread 7f9d28b54700 time 2022-05-06T02:24:42.308750+0000
2022-05-06T02:24:42.310 INFO:tasks.ceph.osd.0.smithi089.stderr:/build/ceph-15.2.16-87-g7f781e26/src/osd/PrimaryLogPG.cc: 3982: ceph_abort_msg("out of order op")
2022-05-06T02:24:42.311 INFO:tasks.ceph.osd.0.smithi089.stderr: ceph version 15.2.16-87-g7f781e26 (7f781e26c2a8d3e1d920c6fd32e81b5ef3019541) octopus (stable)
2022-05-06T02:24:42.311 INFO:tasks.ceph.osd.0.smithi089.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe0) [0x558480d2e074]
2022-05-06T02:24:42.311 INFO:tasks.ceph.osd.0.smithi089.stderr: 2: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x193c) [0x558480f7f0bc]
2022-05-06T02:24:42.311 INFO:tasks.ceph.osd.0.smithi089.stderr: 3: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x3570) [0x558480f827e0]
2022-05-06T02:24:42.312 INFO:tasks.ceph.osd.0.smithi089.stderr: 4: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xe50) [0x558480f8d290]
2022-05-06T02:24:42.312 INFO:tasks.ceph.osd.0.smithi089.stderr: 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1c0) [0x558480dddc40]
2022-05-06T02:24:42.312 INFO:tasks.ceph.osd.0.smithi089.stderr: 6: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x66) [0x55848108b886]
2022-05-06T02:24:42.312 INFO:tasks.ceph.osd.0.smithi089.stderr: 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xe8d) [0x558480e0afdd]
2022-05-06T02:24:42.312 INFO:tasks.ceph.osd.0.smithi089.stderr: 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x403) [0x5584814f4453]
2022-05-06T02:24:42.313 INFO:tasks.ceph.osd.0.smithi089.stderr: 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5584814f7484]
2022-05-06T02:24:42.313 INFO:tasks.ceph.osd.0.smithi089.stderr: 10: (()+0x8609) [0x7f9d474fd609]
2022-05-06T02:24:42.313 INFO:tasks.ceph.osd.0.smithi089.stderr: 11: (clone()+0x43) [0x7f9d47067163]

Actions #17

Updated by Upkeep Bot 8 months ago

  • Merge Commit set to 0dbc0b67bcf852692a0af692e64bfae11ba0a429
  • Fixed In set to v17.0.0-462-g0dbc0b67bc
  • Released In set to v17.2.0~3136
  • Upkeep Timestamp set to 2025-07-14T18:04:34+00:00