Bug #70811

closed

osd: Recovery latency related perf counters are calculated incorrectly.

Added by Sridhar Seshasayee 12 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source: Development
Backport: reef, squid
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Tags (freeform): backport_processed
Fixed In: v20.0.0-1384-g99d9fb5558
Released In: v20.2.0~603
Upkeep Timestamp: 2025-11-01T01:28:44+00:00

Description

This was noticed while analyzing logs and inspecting the code.
It affects the PGRecovery, PGRecoveryContext and PGRecoveryMsg objects.

Example of the incorrect calculation for the PGRecovery object in OpSchedulerItem.cc:

void PGRecovery::run(
  OSD *osd,
  OSDShard *sdata,
  PGRef& pg,
  ThreadPool::TPHandle &handle)
{
  osd->logger->tinc(
    l_osd_recovery_queue_lat,
    time_queued - ceph_clock_now());
  osd->do_recovery(pg.get(), epoch_queued, reserved_pushes, priority, handle);
  pg->unlock();
}

The correct latency calculation is ceph_clock_now() - time_queued, i.e. the current time minus the time at which the item was queued.
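
As a minimal, standalone sketch of why the operand order matters: the snippet below uses std::chrono instead of Ceph's own clock types, and the helper names queue_latency and reversed_latency are hypothetical, not part of Ceph. It only illustrates the arithmetic; it is not the actual fix.

```cpp
#include <chrono>

// Assumption: a monotonic std::chrono clock stands in for ceph_clock_now().
using Clock = std::chrono::steady_clock;

// Hypothetical helper: time an item spent in the queue.
// Latency is "now minus enqueue time", so the result is non-negative
// whenever now >= time_queued.
inline Clock::duration queue_latency(Clock::time_point time_queued,
                                     Clock::time_point now) {
  return now - time_queued;
}

// Hypothetical helper mirroring the operand order reported in this bug
// (time_queued - now). For now > time_queued the result is negative,
// so it cannot represent a valid latency.
inline Clock::duration reversed_latency(Clock::time_point time_queued,
                                        Clock::time_point now) {
  return time_queued - now;
}
```

With an item queued at some time t and dequeued 5 ms later, queue_latency gives 5 ms, while reversed_latency gives -5 ms.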

Output from perf dump showing the incorrect values:

        "l_osd_recovery_push_queue_latency": {
            "avgcount": 55093,
            "sum": 6247005052.341587249,
            "avgtime": 113390.177560517
        },
        "l_osd_recovery_push_reply_queue_latency": {
            "avgcount": 130713,
            "sum": 18297766598.029444451,
            "avgtime": 139984.290759369
        },
        "l_osd_recovery_pull_queue_latency": {
            "avgcount": 130713,
            "sum": 18297766598.029444451,
            "avgtime": 139984.290759369
        },
        "l_osd_recovery_backfill_queue_latency": {
            "avgcount": 130723,
            "sum": 5907207331.184129853,
            "avgtime": 45188.737492133
        },
        "l_osd_recovery_backfill_remove_queue_latency": {
            "avgcount": 130723,
            "sum": 5907207331.184129853,
            "avgtime": 45188.737492133
        },
        "l_osd_recovery_scan_queue_latency": {
            "avgcount": 130764,
            "sum": 15980169803.779663494,
            "avgtime": 122206.186746961
        },
        "l_osd_recovery_queue_latency": {
            "avgcount": 80616,
            "sum": 16144014518.570876928,
            "avgtime": 200258.193393009
        },
        "l_osd_recovery_context_queue_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },

Related issues 2 (0 open, 2 closed)

Copied to RADOS - Backport #70903: reef: osd: Recovery latency related perf counters are calculated incorrectly. (Resolved, Sridhar Seshasayee)
Copied to RADOS - Backport #70904: squid: osd: Recovery latency related perf counters are calculated incorrectly. (Resolved, Sridhar Seshasayee)
Actions #1

Updated by Sridhar Seshasayee 12 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 62704
Actions #2

Updated by Sridhar Seshasayee 12 months ago

The counters after applying the fix:

        "l_osd_recovery_push_queue_latency": {
            "avgcount": 17977,
            "sum": 1.066402826,
            "avgtime": 0.000059320
        },
        "l_osd_recovery_push_reply_queue_latency": {
            "avgcount": 68344,
            "sum": 3.422215369,
            "avgtime": 0.000050073
        },
        "l_osd_recovery_pull_queue_latency": {
            "avgcount": 68386,
            "sum": 9.632210712,
            "avgtime": 0.000140850
        },
        "l_osd_recovery_backfill_queue_latency": {
            "avgcount": 68386,
            "sum": 9.632210712,
            "avgtime": 0.000140850
        },
        "l_osd_recovery_backfill_remove_queue_latency": {
            "avgcount": 68386,
            "sum": 9.632210712,
            "avgtime": 0.000140850
        },
        "l_osd_recovery_scan_queue_latency": {
            "avgcount": 68386,
            "sum": 9.632210712,
            "avgtime": 0.000140850
        },
        "l_osd_recovery_queue_latency": {
            "avgcount": 52590,
            "sum": 5.761666190,
            "avgtime": 0.000109558
        },
        "l_osd_recovery_context_queue_latency": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        },

Actions #3

Updated by Laura Flores 12 months ago

Approved for QA...

Actions #4

Updated by Sridhar Seshasayee 11 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to reef, squid
Actions #5

Updated by Upkeep Bot 11 months ago

  • Copied to Backport #70903: reef: osd: Recovery latency related perf counters are calculated incorrectly. added
Actions #6

Updated by Upkeep Bot 11 months ago

  • Copied to Backport #70904: squid: osd: Recovery latency related perf counters are calculated incorrectly. added
Actions #7

Updated by Upkeep Bot 11 months ago

  • Tags (freeform) set to backport_processed
Actions #8

Updated by Sridhar Seshasayee 10 months ago

  • Status changed from Pending Backport to Resolved
Actions #9

Updated by Upkeep Bot 9 months ago

  • Merge Commit set to 99d9fb5558ccdc34deacf2541587ee4775329ed1
  • Fixed In set to v20.0.0-1384-g99d9fb5558c
  • Upkeep Timestamp set to 2025-07-09T18:11:33+00:00
Actions #10

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v20.0.0-1384-g99d9fb5558c to v20.0.0-1384-g99d9fb5558
  • Upkeep Timestamp changed from 2025-07-09T18:11:33+00:00 to 2025-07-14T18:12:15+00:00
Actions #11

Updated by Upkeep Bot 5 months ago

  • Released In set to v20.2.0~603
  • Upkeep Timestamp changed from 2025-07-14T18:12:15+00:00 to 2025-11-01T01:28:44+00:00