Project

General

Profile

Actions

Bug #54489

closed

mon: ops get stuck in "resend forwarded message to leader"

Added by Jiaxing Fan about 4 years ago. Updated 5 months ago.

Status:
Resolved
Priority:
High
Category:
-
Target version:
-
% Done:

0%

Source:
Backport:
reef,squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Tags (freeform):
backport_processed
Fixed In:
v20.0.0-190-gb1e4a2b1d6
Released In:
v20.2.0~914
Upkeep Timestamp:
2025-11-01T01:28:24+00:00

Description

I hited this bug BUG #22114 in octopus.
"description": "log(2 entries from seq 1 at 2021-12-20T10:43:38.225243+0800)",
"initiated_at": "2021-12-20T10:43:40.490982+0800",
"age": 6306340.1792147979,
"duration": 6306340.179250556,

{
"time": "2021-12-28T08:07:22.952481+0800",
"event": "resend forwarded message to leader"
}, {
"time": "2021-12-29T08:07:28.266030+0800",
"event": "resend forwarded message to leader"
},
I think there is other reason for this bug: we don't remove forward requests belong to keepalive session when trimming mon sessions.
By the way, I also think routed_requests should be protected by a lock.


Related issues 3 (0 open3 closed)

Related to RADOS - Bug #22114: mon: ops get stuck in "resend forwarded message to leader"ResolvedKefu Chai

Actions
Copied to RADOS - Backport #70379: reef: mon: ops get stuck in "resend forwarded message to leader"ResolvedNitzan MordechaiActions
Copied to RADOS - Backport #70380: squid: mon: ops get stuck in "resend forwarded message to leader"ResolvedNitzan MordechaiActions
Actions #1

Updated by Radoslaw Zarzynski about 4 years ago

  • Related to Bug #22114: mon: ops get stuck in "resend forwarded message to leader" added
Actions #2

Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago

  • Priority changed from Normal to High

I think we encountered this downstream, so bumping ...

Actions #3

Updated by Laura Flores over 1 year ago

@Kamoltat (Junior) Sirivadhna can you attach the relevant BZ?

Actions #4

Updated by Radoslaw Zarzynski over 1 year ago

scrub note: bump up. Need to known the details.

Actions #6

Updated by Radoslaw Zarzynski over 1 year ago

  • Assignee set to Nitzan Mordechai

The fix for the related, old bug was adding some missed `mon->no_reply()` calls.
Perhaps we still have a similar missing.

Nitzan, would you mind taking a look? The attached BZ has some potentially useful logs.

Actions #7

Updated by Nitzan Mordechai over 1 year ago

@Kamoltat (Junior) Sirivadhna i couldn't find any "resend forwarded message to leader" events in the logs that the BZ attached, maybe I'm missing something?

the 2 op's that showing as stuck are "forward_request_leader" that never come back. I'll keep checking those logs, but if you have any hints, please let me know

Actions #8

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from New to Need More Info

scrub note: @Kamoltat (Junior) Sirivadhna looks the logs in the BZ don't fit the problem.
Would you mind confirming?

Actions #9

Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

@Nitzan Mordechai
So I saw in the BZ that two slow ops events were `forward_request_leader`and `forwarded`. To my understanding, I thought that because the mon gets stuck when they `forwarded` the request to the leader, it is endlessly waiting for something to come back so that it can take action with its current operation. However, nothing came back it seems not to drop the request, therefore, I think the problem in the BZ is similar to that of the tracker, the only difference is that the tracker's `resend forwarded message to leader` as opposed to `forwarded`, but in my opinion they do share a common problem.

Actions #10

Updated by Nitzan Mordechai over 1 year ago

@Kamoltat (Junior) Sirivadhna thanks for the info! will check that

Actions #11

Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

@Nitzan Mordechai I think this BZ downstream might be more relevant,
since the last event has `resend forwarded message to leader`

{
    "ops": [
        {
            "description": "log(1 entries from seq 2 at 2024-05-30T06:54:15.543263+0000)",
            "initiated_at": "2024-05-30T06:54:15.907602+0000",
            "age": 3723.1818099669999,
            "duration": 3723.1818310100002,
            "continuous": false,
            "type_data": {
                "events": [
                    {
                        "event": "initiated",
                        "time": "2024-05-30T06:54:15.907602+0000",
                        "duration": 3.4299999999999999e-07
                    },
                    {
                        "event": "throttled",
                        "time": "2024-05-30T06:54:15.907602+0000",
                        "duration": 4294967295.9999995
                    },
                    {
                        "event": "header_read",
                        "time": "2024-05-30T06:54:15.907602+0000",
                        "duration": 1.0716999999999999e-05
                    },
                    {
                        "event": "all_read",
                        "time": "2024-05-30T06:54:15.907612+0000",
                        "duration": 0.00068134100000000002
                    },
                    {
                        "event": "dispatched",
                        "time": "2024-05-30T06:54:15.908294+0000",
                        "duration": 8.9400000000000004e-07
                    },
                    {
                        "event": "mon:_ms_dispatch",
                        "time": "2024-05-30T06:54:15.908295+0000",
                        "duration": 1.6400000000000001e-07
                    },
                    {
                        "event": "mon:dispatch_op",
                        "time": "2024-05-30T06:54:15.908295+0000",
                        "duration": 3.5900000000000003e-07
                    },
                    {
                        "event": "psvc:dispatch",
                        "time": "2024-05-30T06:54:15.908295+0000",
                        "duration": 8.5199999999999997e-06
                    },
                    {
                        "event": "logm:preprocess_query",
                        "time": "2024-05-30T06:54:15.908304+0000",
                        "duration": 7.2600000000000002e-07
                    },
                    {
                        "event": "logm:preprocess_log",
                        "time": "2024-05-30T06:54:15.908304+0000",
                        "duration": 1.2514000000000001e-05
                    },
                    {
                        "event": "forward_request_leader",
                        "time": "2024-05-30T06:54:15.908317+0000",
                        "duration": 1.0344e-05
                    },
                    {
                        "event": "forwarded",
                        "time": "2024-05-30T06:54:15.908327+0000",
                        "duration": 5.3054240999999998
                    },
                    {
                        "event": "resend forwarded message to leader",
                        "time": "2024-05-30T06:54:21.213751+0000",
                        "duration": 5.3061496789999998
                    }
                ],
                "info": {
                    "seq": 136274,
                    "src_is_mon": false,
                    "source": "osd.21 v2:10.0.210.182:6808/1297045346",
                    "forwarded_to_leader": true
                }
            }
        }
    ],
    "num_ops": 1
}

Actions #12

Updated by Nitzan Mordechai over 1 year ago

@Kamoltat (Junior) Sirivadhna from where did you get that log snap ? maybe I'm missing some of those logs

Actions #13

Updated by Laura Flores over 1 year ago

@Kamoltat (Junior) Sirivadhna looking at this from bug scrub: any info on the above comment?

Actions #14

Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

https://bugzilla.redhat.com/show_bug.cgi?id=2283916 - Observing slow_ops on a mon daemon post site down tests in a 3 AZ cluster
https://bugzilla.redhat.com/show_bug.cgi?id=2174745 - [MDR][Stretch Cluster] Observing Slow Ops on one monitor post OSD replacement in a Stretch cluster

Actions #16

Updated by Laura Flores over 1 year ago

Bug scrub note: still need more info

Actions #17

Updated by Laura Flores over 1 year ago

Actions #18

Updated by Nitzan Mordechai over 1 year ago

still was not able to reproduce, working on it

Actions #19

Updated by Nitzan Mordechai about 1 year ago

  • Status changed from Need More Info to Fix Under Review
  • Pull request ID set to 61933
Actions #20

Updated by Radoslaw Zarzynski about 1 year ago

scrub note: approved, sent to QA.

Actions #21

Updated by Radoslaw Zarzynski about 1 year ago

scrub note: in QA.

Actions #22

Updated by Radoslaw Zarzynski about 1 year ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to reef,squid
Actions #23

Updated by Upkeep Bot about 1 year ago

  • Copied to Backport #70379: reef: mon: ops get stuck in "resend forwarded message to leader" added
Actions #24

Updated by Upkeep Bot about 1 year ago

  • Copied to Backport #70380: squid: mon: ops get stuck in "resend forwarded message to leader" added
Actions #25

Updated by Upkeep Bot about 1 year ago

  • Tags (freeform) set to backport_processed
Actions #26

Updated by Upkeep Bot 9 months ago

  • Merge Commit set to b1e4a2b1d60ed41f787db1709b62bfea5ed889fa
  • Fixed In set to v20.0.0-190-gb1e4a2b1d60
  • Upkeep Timestamp set to 2025-07-09T17:10:47+00:00
Actions #27

Updated by Upkeep Bot 8 months ago

  • Fixed In changed from v20.0.0-190-gb1e4a2b1d60 to v20.0.0-190-gb1e4a2b1d6
  • Upkeep Timestamp changed from 2025-07-09T17:10:47+00:00 to 2025-07-14T17:42:09+00:00
Actions #28

Updated by Nitzan Mordechai 7 months ago

  • Status changed from Pending Backport to Resolved
Actions #29

Updated by Upkeep Bot 5 months ago

  • Released In set to v20.2.0~914
  • Upkeep Timestamp changed from 2025-07-14T17:42:09+00:00 to 2025-11-01T01:28:24+00:00
Actions

Also available in: Atom PDF