Project

General

Profile

Actions

Bug #71002

open

crimson: scan_for_backfill does not complete

Added by Samuel Just 11 months ago. Updated 6 months ago.

Status:
New
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Tags (freeform):
Merge Commit:
Fixed In:
Released In:
Upkeep Timestamp:

Description

sjust-2025-04-17_01:11:15-crimson-rados-wip-sjust-crimson-testing-2025-04-16-1744840247-distro-default-gibba/8245974/mvdlogs

DEBUG 2025-04-18 15:24:54,982 [shard 2:main] osd -  pg_epoch 152 pg[5.6( v 144'1193 (143'1189,144'1193] local-lis/les=146/147 n=1143 ec=112/112 lis/c=146/112 les/c/f=147/113/0 sis=146) [2,3,0]/[3,0] backfill=[2] r=0 lpr=146 pi=[112,146)/1 crt=144'1193 mlcod 0'0 active+undersized+remapped+backfilling  RecoveryBackend::scan_for_backfill: querying obj:5:6ff358e6:::benchmark_data_gibba003_45663_object17589:head
...
DEBUG 2025-04-18 15:24:54,982 [shard 2:main] osd -  pg_epoch 152 pg[5.6( v 144'1193 (143'1189,144'1193] local-lis/les=146/147 n=1143 ec=112/112 lis/c=146/112 les/c/f=147/113/0 sis=146) [2,3,0]/[3,0] backfill=[2] r=0 lpr=146 pi=[112,146)/1 crt=144'1193 mlcod 0'0 active+undersized+remapped+backfilling  ObjectContextLoader::load_and_lock: 5:6ff358e6:::benchmark_data_gibba003_45663_object17589:head 1
...
DEBUG 2025-04-18 15:24:54,982 [shard 2:main] osd -  pg_epoch 152 pg[5.6( v 144'1193 (143'1189,144'1193] local-lis/les=146/147 n=1143 ec=112/112 lis/c=146/112 les/c/f=147/113/0 sis=146) [2,3,0]/[3,0] backfill=[2] r=0 lpr=146 pi=[112,146)/1 crt=144'1193 mlcod 0'0 active+undersized+remapped+backfilling  ObjectContextLoader::load_and_lock_head: 5:6ff358e6:::benchmark_data_gibba003_45663_object17589:head 1
...
DEBUG 2025-04-18 15:24:54,983 [shard 2:main] osd -  pg_epoch 152 pg[5.6( v 144'1193 (143'1189,144'1193] local-lis/les=146/147 n=1143 ec=112/112 lis/c=146/112 les/c/f=147/113/0 sis=146) [2,3,0]/[3,0] backfill=[2] r=0 lpr=146 pi=[112,146)/1 crt=144'1193 mlcod 0'0 active+undersized+remapped+backfilling  RecoveryBackend::scan_for_backfill: querying obj:5:6ffcf217:::benchmark_data_gibba003_45663_object12439:head
DEBUG 2025-04-18 15:24:54,983 [shard 2:main] osd -  pg_epoch 152 pg[5.6( v 144'1193 (143'1189,144'1193] local-lis/les=146/147 n=1143 ec=112/112 lis/c=146/112 les/c/f=147/113/0 sis=146) [2,3,0]/[3,0] backfill=[2] r=0 lpr=146 pi=[112,146)/1 crt=144'1193 mlcod 0'0 active+undersized+remapped+backfilling  ObjectContextLoader::load_and_lock: 5:6ffcf217:::benchmark_data_gibba003_45663_object12439:head 1
DEBUG 2025-04-18 15:24:54,983 [shard 2:main] osd -  pg_epoch 152 pg[5.6( v 144'1193 (143'1189,144'1193] local-lis/les=146/147 n=1143 ec=112/112 lis/c=146/112 les/c/f=147/113/0 sis=146) [2,3,0]/[3,0] backfill=[2] r=0 lpr=146 pi=[112,146)/1 crt=144'1193 mlcod 0'0 active+undersized+remapped+backfilling  ObjectContextLoader::load_and_lock_head: 5:6ffcf217:::benchmark_data_gibba003_45663_object12439:head 1
...
DEBUG 2025-04-18 15:24:54,983 [shard 2:main] osd -  pg_epoch 152 pg[5.6( v 144'1193 (143'1189,144'1193] local-lis/les=146/147 n=1143 ec=112/112 lis/c=146/112 les/c/f=147/113/0 sis=146) [2,3,0]/[3,0] backfill=[2] r=0 lpr=146 pi=[112,146)/1 crt=144'1193 mlcod 0'0 active+undersized+remapped+backfilling  RecoveryBackend::scan_for_backfill: querying obj:5:6ffec17c:::benchmark_data_gibba003_45663_object9977:head
DEBUG 2025-04-18 15:24:54,983 [shard 2:main] osd -  pg_epoch 152 pg[5.6( v 144'1193 (143'1189,144'1193] local-lis/les=146/147 n=1143 ec=112/112 lis/c=146/112 les/c/f=147/113/0 sis=146) [2,3,0]/[3,0] backfill=[2] r=0 lpr=146 pi=[112,146)/1 crt=144'1193 mlcod 0'0 active+undersized+remapped+backfilling  ObjectContextLoader::load_and_lock: 5:6ffec17c:::benchmark_data_gibba003_45663_object9977:head 1
DEBUG 2025-04-18 15:24:54,983 [shard 2:main] osd -  pg_epoch 152 pg[5.6( v 144'1193 (143'1189,144'1193] local-lis/les=146/147 n=1143 ec=112/112 lis/c=146/112 les/c/f=147/113/0 sis=146) [2,3,0]/[3,0] backfill=[2] r=0 lpr=146 pi=[112,146)/1 crt=144'1193 mlcod 0'0 active+undersized+remapped+backfilling  ObjectContextLoader::load_and_lock_head: 5:6ffec17c:::benchmark_data_gibba003_45663_object9977:head
...
DEBUG 2025-04-18 15:24:54,984 [shard 2:main] osd -  pg_epoch 152 pg[5.6( v 144'1193 (143'1189,144'1193] local-lis/les=146/147 n=1143 ec=112/112 lis/c=146/112 les/c/f=147/113/0 sis=146) [2,3,0]/[3,0] backfill=[2] r=0 lpr=146 pi=[112,146)/1 crt=144'1193 mlcod 0'0 active+undersized+remapped+backfilling  RecoveryBackend::scan_for_backfill: found: 5:6ffec17c:::benchmark_data_gibba003_45663_object9977:head  118'618
...
DEBUG 2025-04-18 15:24:54,985 [shard 2:main] osd -  pg_epoch 152 pg[5.6( v 144'1193 (143'1189,144'1193] local-lis/les=146/147 n=1143 ec=112/112 lis/c=146/112 les/c/f=147/113/0 sis=146) [2,3,0]/[3,0] backfill=[2] r=0 lpr=146 pi=[112,146)/1 crt=144'1193 mlcod 0'0 active+undersized+remapped+backfilling  RecoveryBackend::scan_for_backfill: found: 5:6ffcf217:::benchmark_data_gibba003_45663_object12439:head  118'752
...
DEBUG 2025-04-18 15:24:54,985 [shard 2:main] osd -  pg_epoch 152 pg[5.6( v 144'1193 (143'1189,144'1193] local-lis/les=146/147 n=1143 ec=112/112 lis/c=146/112 les/c/f=147/113/0 sis=146) [2,3,0]/[3,0] backfill=[2] r=0 lpr=146 pi=[112,146)/1 crt=144'1193 mlcod 0'0 active+undersized+remapped+backfilling  RecoveryBackend::scan_for_backfill: found: 5:6ff358e6:::benchmark_data_gibba003_45663_object17589:head  120'1088
RecoveryBackend::interruptible_future<BackfillInterval>
RecoveryBackend::scan_for_backfill(
  const hobject_t start,
  [[maybe_unused]] const std::int64_t min,
  const std::int64_t max)
{
  LOG_PREFIX(RecoveryBackend::scan_for_backfill);
  DEBUGDPP("starting from {}", pg, start);
  auto version_map = seastar::make_lw_shared<std::map<hobject_t, eversion_t>>();
  auto&& [objects, next] = co_await backend->list_objects(start, max);
  co_await interruptor::parallel_for_each(objects, seastar::coroutine::lambda([FNAME, this, version_map]
    (const hobject_t& object) -> interruptible_future<> {
    DEBUGDPP("querying obj:{}", pg, object);
    auto obc_manager = pg.obc_loader.get_obc_manager(
      object, /* resolve_clone = */ false);

    auto found = co_await pg.obc_loader.load_and_lock(
      obc_manager, RWState::RWREAD
    ).si_then([] {
      return true;
    }).handle_error_interruptible(
      crimson::ct_error::enoent::handle([](auto) {
    return false;
      }),
      crimson::ct_error::assert_all("unexpected error")
    );
    if (!found) {
      // if the object does not exist here, it must have been removed
      // between the collection_list_partial and here.  This can happen
      // for the first item in the range, which is usually last_backfill.
      co_return;
    }

    if (obc_manager.get_obc()->obs.exists) {
      auto version = obc_manager.get_obc()->obs.oi.version;
      version_map->emplace(object, version);
      DEBUGDPP("found: {}  {}", pg,
               object, version);
      co_return;
    } else {
      // if the object does not exist here, it must have been removed
      // between the collection_list_partial and here.  This can happen
      // for the first item in the range, which is usually last_backfill.
      co_return;
    }
  }));
  BackfillInterval bi;
  bi.begin = std::move(start);
  bi.end = std::move(next);
  bi.objects = std::move(*version_map);
  DEBUGDPP("{} BackfillInterval filled, leaving, {}",
           "scan_for_backfill",
           pg, bi);
  co_return std::move(bi);
}

So, we query 3 objects and find 3 objects. We never actually see that last debug line.

Going to do some cleanups here and add more debugging to start.


Related issues 1 (1 open0 closed)

Related to crimson - Bug #73071: Crimson: RecoveryBackend::scan_for_backfill_primary() never endNew

Actions
Actions #1

Updated by Matan Breizman 11 months ago · Edited

This can be spotted from client side by single pg unable to go back to active+clean:

2025-05-07T08:55:06.990 INFO:tasks.ceph.ceph_manager.ceph:PG 3.8 is not active+clean
2025-05-07T08:55:06.991 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '3.8', 'version': "38'145", 'reported_seq': 322, 'reported_epoch': 74, 'state': 'active+undersized+degraded+remapped+backfilling', 'last_fresh': '2025-05-07T08:15:13.073487+0000', 'last_change': '2025-05-07T08:14:49.407749+0000', 'last_active': '2025-05-07T08:15:13.073487+0000', 'last_peered': '2025-05-07T08:15:13.073487+0000', 'last_clean': '2025-05-07T08:13:19.801282+0000', 'last_became_active': '2025-05-07T08:14:19.541883+0000', 'last_became_peered': '2025-05-07T08:14:19.541883+0000', 'last_unstale': '2025-05-07T08:15:13.073487+0000', 'last_undegraded': '2025-05-07T08:14:23.488215+0000', 'last_fullsized': '2025-05-07T08:14:19.098102+0000', 'mapping_epoch': 40, 'log_start': "38'141", 'ondisk_log_start': "38'141", 'created': 18, 'last_epoch_clean': 19, 'parent': '0.0', 'parent_split_bits': 0, 'last_scrub': "0'0", 'last_scrub_stamp': '2025-05-07T08:13:09.529059+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2025-05-07T08:13:09.529059+0000', 'last_clean_scrub_stamp': '2025-05-07T08:13:09.529059+0000', 'objects_scrubbed': 0, 'log_size': 4, 'log_dups_size': 141, 'ondisk_log_size': 4, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 3, 'last_scrub_duration': 0, 'scrub_schedule': '--', 'scrub_duration': 0, 'objects_trimmed': 0, 'snaptrim_duration': 0, 'stat_sum': {'num_bytes': 124239720, 'num_objects': 41, 'num_object_clones': 0, 'num_object_copies': 123, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 41, 'num_objects_misplaced': 0, 'num_objects_unfound': 0, 'num_objects_dirty': 0, 'num_whiteouts': 0, 'num_read': 80, 'num_read_kb': 39, 'num_write': 185, 'num_write_kb': 56477, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 1, 'num_bytes_recovered': 525480, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [1, 3, 2], 'acting': [1, 3], 'avail_no_missing': ['1', '3'], 'object_location_counts': [{'shards': '1,3', 'objects': 41}], 'blocked_by': [], 'up_primary': 1, 'acting_primary': 1, 'purged_snaps': []}
2025-05-07T08:55:06.991 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):

https://pulpito.ceph.com/teuthology-2025-05-06_20:56:03-crimson-rados-main-distro-crimson-smithi/8274153

Original report:
https://pulpito.ceph.com/sjust-2025-04-17_01:11:15-crimson-rados-wip-sjust-crimson-testing-2025-04-16-1744840247-distro-default-gibba/8245974

2025-04-18T16:05:14.442 INFO:tasks.ceph.ceph_manager.ceph:PG 5.6 is not active+clean
2025-04-18T16:05:14.442 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '5.6', 'version': "144'1193", 'reported_seq': 1912, 'reported_epoch': 162, 'state': 'active+undersized+remapped+backfilling
', 'last_fresh': '2025-04-18T15:25:05.388965+0000', 'last_change': '2025-04-18T15:24:54.981674+0000', 'last_active': '2025-04-18T15:25:05.388965+0000', 'last_peered': '2025-04-18T15:25:05.38
8965+0000', 'last_clean': '2025-04-18T15:22:38.317565+0000', 'last_became_active': '2025-04-18T15:24:49.716763+0000', 'last_became_peered': '2025-04-18T15:24:49.716763+0000', 'last_unstale':
 '2025-04-18T15:25:05.388965+0000', 'last_undegraded': '2025-04-18T15:25:05.388965+0000', 'last_fullsized': '2025-04-18T15:24:49.462433+0000', 'mapping_epoch': 146, 'log_start': "143'1189", 
'ondisk_log_start': "143'1189", 'created': 112, 'last_epoch_clean': 113, 'parent': '0.0', 'parent_split_bits': 0, 'last_scrub': "0'0", 'last_scrub_stamp': '2025-04-18T15:21:55.517270+0000', 
'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2025-04-18T15:21:55.517270+0000', 'last_clean_scrub_stamp': '2025-04-18T15:21:55.517270+0000', 'objects_scrubbed': 0, 'log_size': 4, 'log_
dups_size': 1189, 'ondisk_log_size': 4, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False,
 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 0, 'last_scrub_duration': 0, 'scrub_schedule': '--', 'scrub_duration': 0, 'objects_trimmed': 0, 'snaptrim_durat
ion': 0, 'stat_sum': {'num_bytes': 74907648, 'num_objects': 1143, 'num_object_clones': 0, 'num_object_copies': 3429, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objec
ts_degraded': 0, 'num_objects_misplaced': 0, 'num_objects_unfound': 0, 'num_objects_dirty': 0, 'num_whiteouts': 0, 'num_read': 0, 'num_read_kb': 0, 'num_write': 1193, 'num_write_kb': 74752, 
'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 298, 'num_bytes_recovered': 19529728, 'num_keys_recovered': 0, 'num_objects_omap': 
0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_fl
ush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_byt
es': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [2, 3, 0], 'acting': [3, 0], 'avail_no_missing': ['3', '0', '2'], 'object_location_counts': [{'shards': '0,3', 'objects': 1143}]
, 'blocked_by': [], 'up_primary': 2, 'acting_primary': 3, 'purged_snaps': []}
2025-04-18T16:05:14.444 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
Actions #2

Updated by Matan Breizman 10 months ago · Edited

https://pulpito.ceph.com/matan-2025-05-13_09:25:48-crimson-rados-wip-matanb-crimson-error-reply-distro-crimson-smithi/8282931

radosbench-high-concurrency + short_pg_log

2025-05-13T11:16:35.859 INFO:tasks.ceph.ceph_manager.ceph:PG 6.6 is not active+clean
2025-05-13T11:16:35.859 INFO:tasks.ceph.ceph_manager.ceph:{'pgid': '6.6', 'version': "143'434", 'reported_seq': 649, 'reported_epoch': 164, 'state': 'active+undersized+degraded+remapped+backfilling', 'last_fresh': '2025-05-13T10:36:24.199622+0000', 'last_change': '2025-05-13T10:35:53.613636+0000', 'last_active': '2025-05-13T10:36:24.199622+0000', 'last_peered': '2025-05-13T10:36:24.199622+0000', 'last_clean': '2025-05-13T10:35:29.165986+0000', 'last_became_active': '2025-05-13T10:35:48.310649+0000', 'last_became_peered': '2025-05-13T10:35:48.310649+0000', 'last_unstale': '2025-05-13T10:36:24.199622+0000', 'last_undegraded': '2025-05-13T10:35:48.489613+0000', 'last_fullsized': '2025-05-13T10:35:48.207426+0000', 'mapping_epoch': 145, 'log_start': "143'430", 'ondisk_log_start': "143'430", 'created': 106, 'last_epoch_clean': 122, 'parent': '0.0', 'parent_split_bits': 0, 'last_scrub': "0'0", 'last_scrub_stamp': '2025-05-13T10:34:29.299090+0000', 'last_deep_scrub': "0'0", 'last_deep_scrub_stamp': '2025-05-13T10:34:29.299090+0000', 'last_clean_scrub_stamp': '2025-05-13T10:34:29.299090+0000', 'objects_scrubbed': 0, 'log_size': 4, 'log_dups_size': 430, 'ondisk_log_size': 4, 'stats_invalid': False, 'dirty_stats_invalid': False, 'omap_stats_invalid': False, 'hitset_stats_invalid': False, 'hitset_bytes_stats_invalid': False, 'pin_stats_invalid': False, 'manifest_stats_invalid': False, 'snaptrimq_len': 0, 'last_scrub_duration': 0, 'scrub_schedule': '--', 'scrub_duration': 0, 'objects_trimmed': 0, 'snaptrim_duration': 0, 'stat_sum': {'num_bytes': 3555328, 'num_objects': 55, 'num_object_clones': 0, 'num_object_copies': 165, 'num_objects_missing_on_primary': 0, 'num_objects_missing': 0, 'num_objects_degraded': 55, 'num_objects_misplaced': 0, 'num_objects_unfound': 0, 'num_objects_dirty': 0, 'num_whiteouts': 0, 'num_read': 0, 'num_read_kb': 0, 'num_write': 434, 'num_write_kb': 3472, 'num_scrub_errors': 0, 'num_shallow_scrub_errors': 0, 'num_deep_scrub_errors': 0, 'num_objects_recovered': 19, 'num_bytes_recovered': 1187840, 'num_keys_recovered': 0, 'num_objects_omap': 0, 'num_objects_hit_set_archive': 0, 'num_bytes_hit_set_archive': 0, 'num_flush': 0, 'num_flush_kb': 0, 'num_evict': 0, 'num_evict_kb': 0, 'num_promote': 0, 'num_flush_mode_high': 0, 'num_flush_mode_low': 0, 'num_evict_mode_some': 0, 'num_evict_mode_full': 0, 'num_objects_pinned': 0, 'num_legacy_snapsets': 0, 'num_large_omap_objects': 0, 'num_objects_manifest': 0, 'num_omap_bytes': 0, 'num_omap_keys': 0, 'num_objects_repaired': 0}, 'up': [3, 1, 0], 'acting': [3, 0], 'avail_no_missing': ['3', '0'], 'object_location_counts': [{'shards': '0,3', 'objects': 55}], 'blocked_by': [], 'up_primary': 3, 'acting_primary': 3, 'purged_snaps': []}
Actions #3

Updated by Jose J Palacios Perez 6 months ago

2025-09-17T17:18:00.963 INFO:tasks.ceph.ceph_manager.ceph:PG 6.f is not active+clean
2025-09-17T17:18:00.963 INFO:tasks.ceph.ceph_manager.ceph:PG 6.e is not active+clean
2025-09-17T17:18:00.963 INFO:tasks.ceph.ceph_manager.ceph:PG 6.d is not active+clean
2025-09-17T17:18:00.964 INFO:tasks.ceph.ceph_manager.ceph:PG 6.c is not active+clean
Actions #4

Updated by Matan Breizman 6 months ago

  • Related to Bug #73071: Crimson: RecoveryBackend::scan_for_backfill_primary() never end added
Actions

Also available in: Atom PDF