Enhancement #66107
QuiesceDB: improve resiliency to lost agent acks
Status: Closed
Description
Investigating the case of a timed-out quiesce at https://pulpito.ceph.com/pdonnell-2024-05-17_04:08:01-fs-wip-pdonnell-testing-20240517.012301-debug-distro-default-smithi/7710717/
looking for quiescer errors:
$ grep "ERROR:tasks.quiescer" */teuthology.log ... 7710717/teuthology.log:2024-05-17T07:15:47.316 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't quiesce root with rc: 110 (ETIMEDOUT), stdout: ...
checking the outstanding ops:
$ grep "Outstanding op" 7710717/teuthology.log
2024-05-17T07:15:06.105 DEBUG:tasks.quiescer.fs.[cephfs]:Outstanding op at rank 0 (c) for a6585db1-1: 'client_request(client.4680:82962 create owner_uid=1000, owner_gid=1316 #0x10000003654/atomic.h 2024-05-17T07:14:18.439403+0000 caller_uid=1000, caller_gid=1316{6,36,1000,1316,})'
There are no quiesce_inode or quiesce_path operations outstanding, which points at a potential ack loss case. Let's look at the traces for acks from the quiesce db leader. Checking the effective quiesce cluster just before the failure in the teuthology log
2024-05-17T07:14:16.854 INFO:teuthology.orchestra.run.smithi059.stdout:{"epoch":19, ... ,"qdb_leader":4294,"qdb_cluster":[4232,4270,4273,4282,4294]}
leader's logs regarding the set in question
2024-05-17T07:14:19.891+0000 7f6cf4800640 20 quiesce.mgr.4294 <check_peer_reports> [a6585db1@133,file:/] up_to_date_peers: 5 min_reported_state: QS_QUIESCING max_reported_state: QS_QUIESCED peer_acks: {{QS_QUIESCED=4232,(19:133),QS_QUIESCED=4273,(19:133),QS_QUIESCED=4282,(19:133),QS_QUIESCED=4294,(19:133)}}
2024-05-17T07:14:19.891+0000 7f6cf4800640 20 quiesce.mgr.4294 <leader_upkeep_set> [a6585db1@133,file:/] min reported state for all (5) peers: QS_QUIESCING. Effective state: QS_QUIESCING
2024-05-17T07:15:02.293+0000 7f6cf4800640 20 quiesce.mgr.4294 <check_peer_reports> [a6585db1@133,file:/] up_to_date_peers: 5 min_reported_state: QS_QUIESCING max_reported_state: QS_QUIESCED peer_acks: {{QS_QUIESCED=4232,(19:133),QS_QUIESCED=4273,(19:133),QS_QUIESCED=4282,(19:133),QS_QUIESCED=4294,(19:133)}}
We see that the only missing agent ack is from the gid 4270. Checking its logs
2024-05-17T07:14:17.293+0000 7f145911b640 10 quiesce.mds.3 <quiesce_dispatch> got q-db[v:(19:133) sets:1/0] from 4294
2024-05-17T07:14:17.293+0000 7f145090a640 20 quiesce.mgr.4270 <quiesce_db_thread_main> notifying agent with db version (19:133)
2024-05-17T07:14:17.293+0000 7f145090a640 20 quiesce.agt <db_update> got a db update version (19:133) with 1 roots
2024-05-17T07:14:17.293+0000 7f145090a640 20 quiesce.mgr.4270 <quiesce_db_thread_main> synchronous agent ack: q-map[v:(19:133) roots:0/0]
2024-05-17T07:14:17.293+0000 7f145090a640 10 quiesce.mds.3 <operator()> sending ack q-map[v:(19:133) roots:0/0] to the leader 4294
2024-05-17T07:14:17.293+0000 7f145110b640 20 quiesce.agt <agent_thread_main> old = (19:132), current = (19:133)
2024-05-17T07:14:17.293+0000 7f145110b640 10 quiesce.mds.3 <operator()> submit_request: value:file:/
2024-05-17T07:14:17.293+0000 7f145110b640 20 quiesce.agt <operator()> completing request (rc=0) for 'file:/'
2024-05-17T07:14:17.293+0000 7f145110b640 20 quiesce.agt <set_upkeep_needed> current = (19:133), pending = (0:0)
2024-05-17T07:14:17.293+0000 7f145110b640 20 quiesce.agt <operator()> done with submit callback for 'file:/'
2024-05-17T07:14:17.293+0000 7f145110b640 10 quiesce.agt <agent_thread_main> got request handle < mds.3:352> for 'file:/'
2024-05-17T07:14:17.293+0000 7f145110b640 20 quiesce.agt <agent_thread_main> asyncrhonous ack for a new version: q-map[v:(19:133) roots:0/0]
2024-05-17T07:14:17.293+0000 7f145110b640 10 quiesce.mds.3 <operator()> sending ack q-map[v:(19:133) roots:0/0] to the leader 4294
2024-05-17T07:14:17.294+0000 7f145110b640 20 quiesce.agt <agent_thread_main> old = (0:0), current = (19:133)
2024-05-17T07:14:17.294+0000 7f145110b640 20 quiesce.agt <agent_thread_main> asyncrhonous ack for a new version: q-map[v:(19:133) roots:1/0]
2024-05-17T07:14:17.294+0000 7f145110b640 10 quiesce.mds.3 <operator()> sending ack q-map[v:(19:133) roots:1/0] to the leader 4294
Looking at the last line here, we see that the agent had submitted an asynchronous ack with 1 active root. The fact that the agent responds with an ack here means that the state of the active root is different from the state in the db version 19:133. So, it's the ack that we're missing. Where did it go? Let's review the messenger traces:
2024-05-17T07:14:17.294+0000 7f145110b640 10 quiesce.mds.3 <operator()> sending ack q-map[v:(19:133) roots:1/0] to the leader 4294
2024-05-17T07:14:17.294+0000 7f145110b640 1 -- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] send_to--> mds [v2:172.21.15.59:6836/1004932209,v1:172.21.15.59:6837/1004932209] -- -- ?+0 0x55fa959d3080
2024-05-17T07:14:17.294+0000 7f145110b640 1 -- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] --> [v2:172.21.15.59:6836/1004932209,v1:172.21.15.59:6837/1004932209] -- -- 0x55fa959d3080 con 0x55fa95282000
2024-05-17T07:14:17.294+0000 7f145c922640 0 -- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> [v2:172.21.15.59:6836/1004932209,v1:172.21.15.59:6837/1004932209] conn(0x55fa95282000 msgr2=0x55fa9533cc00 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send injecting socket failure
2024-05-17T07:14:17.294+0000 7f145c922640 1 -- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> [v2:172.21.15.59:6836/1004932209,v1:172.21.15.59:6837/1004932209] conn(0x55fa95282000 msgr2=0x55fa9533cc00 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send send error: (32) Broken pipe
2024-05-17T07:14:17.294+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> [v2:172.21.15.59:6836/1004932209,v1:172.21.15.59:6837/1004932209] conn(0x55fa95282000 0x55fa9533cc00 crc :-1 s=READY pgs=1305 cs=82 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_message error sending 0x55fa959d3080, (32) Broken pipe
2024-05-17T07:14:17.294+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> [v2:172.21.15.59:6836/1004932209,v1:172.21.15.59:6837/1004932209] conn(0x55fa95282000 0x55fa9533cc00 crc :-1 s=READY pgs=1305 cs=82 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_event send msg failed
2024-05-17T07:14:17.294+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> [v2:172.21.15.59:6836/1004932209,v1:172.21.15.59:6837/1004932209] conn(0x55fa95282000 0x55fa9533cc00 crc :-1 s=READY pgs=1305 cs=82 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_event send msg failed
2024-05-17T07:14:17.294+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> [v2:172.21.15.59:6836/1004932209,v1:172.21.15.59:6837/1004932209] conn(0x55fa95282000 0x55fa9533cc00 unknown :-1 s=READY pgs=1305 cs=82 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._fault initiating reconnect
2024-05-17T07:14:17.294+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> [v2:172.21.15.59:6836/1004932209,v1:172.21.15.59:6837/1004932209] conn(0x55fa95282000 0x55fa9533cc00 unknown :-1 s=BANNER_CONNECTING pgs=1305 cs=83 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=3 required=0
2024-05-17T07:14:17.294+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> conn(0x55fa95798900 0x55fa95564b00 unknown :-1 s=NONE pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0).accept
2024-05-17T07:14:17.294+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> conn(0x55fa95798900 0x55fa95564b00 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=3 required=0
OK, so here's the culprit: this message was dropped by the socket failure injection logic.
What could quiesce db do to overcome this? There are two things to keep in mind.
1. quiesce agent update optimization
The quiesce agent won't be notified of the same quiesce db version twice. With this in place, when the quiesce db re-transmits the same db version, the agent won't get notified and, as a result, won't send the diff map back.
It seems that this optimization is doing more harm than good, so we should consider relaxing it, at least for the case where the leader pushes the same version despite knowing that we've already acked it once.
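The relaxed rule could be sketched roughly as follows. This is a minimal illustrative model, not the Ceph code: the names (`VersionGuard`, `should_notify`, the `is_retransmission` flag) are hypothetical, and it only demonstrates the idea of re-notifying the agent when the leader re-pushes a version it should already have seen us ack.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical sketch of a version-dedup guard for the quiesce agent.
struct DbVersion {
  uint64_t epoch = 0;
  uint64_t set_version = 0;
  bool operator==(const DbVersion& o) const {
    return epoch == o.epoch && set_version == o.set_version;
  }
};

class VersionGuard {
  std::optional<DbVersion> last_notified;
 public:
  // Returns true if the agent should (re)process this db version.
  // Unlike a strict "notify once per version" rule, an explicit
  // retransmission of an already-seen version still triggers a
  // re-notification (and hence a re-ack), so a lost ack can be
  // recovered on the leader's next push of the same version.
  bool should_notify(const DbVersion& v, bool is_retransmission) {
    if (!last_notified || !(*last_notified == v)) {
      last_notified = v;
      return true;
    }
    // Same version again: only re-notify when the leader re-sent it,
    // implying it has not recorded our earlier ack.
    return is_retransmission;
  }
};
```

With the strict rule, the second delivery of `(19:133)` would be silently swallowed; here it produces a fresh ack whenever the leader signals a retransmission.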
2. quiesce manager active re-queries
Today this happens no more often than once a second, but it needs a trigger to get there. In this particular case, there was an opportunity:
2024-05-17T07:14:19.891+0000 7f6cf4800640 20 quiesce.mgr.4294 <leader_upkeep_set> [a6585db1@133,file:/] min reported state for all (5) peers: QS_QUIESCING. Effective state: QS_QUIESCING
2024-05-17T07:14:19.891+0000 7f6cf4800640 20 quiesce.mgr.4294 <leader_upkeep_db> updating peer 4232 with 0 sets modified in db version range (133..133]
2024-05-17T07:14:19.891+0000 7f6cf4800640 10 quiesce.mds.0 <operator()> sending listing q-db[v:(19:133) sets:0/0] to the peer 4232
2024-05-17T07:14:19.892+0000 7f6cf4800640 20 quiesce.mgr.4294 <leader_upkeep_db> updating peer 4270 with 0 sets modified in db version range (133..133]
2024-05-17T07:14:19.892+0000 7f6cf4800640 10 quiesce.mds.0 <operator()> sending listing q-db[v:(19:133) sets:0/0] to the peer 4270
2024-05-17T07:14:19.892+0000 7f6cf4800640 20 quiesce.mgr.4294 <leader_upkeep_db> updating peer 4273 with 0 sets modified in db version range (133..133]
2024-05-17T07:14:19.892+0000 7f6cf4800640 10 quiesce.mds.0 <operator()> sending listing q-db[v:(19:133) sets:0/0] to the peer 4273
2024-05-17T07:14:19.892+0000 7f6cf4800640 20 quiesce.mgr.4294 <leader_upkeep_db> updating peer 4282 with 0 sets modified in db version range (133..133]
2024-05-17T07:14:19.893+0000 7f6cf4800640 10 quiesce.mds.0 <operator()> sending listing q-db[v:(19:133) sets:0/0] to the peer 4282
We see that the leader re-sent the same version because it had an internal update (its own quiesce ack) and the 1-second timeout since the last communication had passed.
We need to improve this logic to let the qdb manager re-visit all active sets and resend the active set versions to any peers that haven't been heard from for over a second.
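The proposed upkeep rule could look something like the sketch below. Again, this is only an illustration under assumptions: `PeerTracker`, `on_ack`, and `stale_peers` are hypothetical names, and the real qdb manager would key this off its peer ack bookkeeping rather than a standalone map.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <vector>

using Clock = std::chrono::steady_clock;
using mds_gid_t = uint64_t;

// Hypothetical sketch: the leader tracks when each peer last acked and,
// during upkeep, re-sends the active set versions to every peer whose
// silence exceeds the limit (e.g. one second), covering lost acks.
class PeerTracker {
  std::map<mds_gid_t, Clock::time_point> last_heard;
 public:
  void on_ack(mds_gid_t peer, Clock::time_point now) {
    last_heard[peer] = now;
  }
  // Peers that should receive a retransmission of the current db version.
  std::vector<mds_gid_t> stale_peers(Clock::time_point now,
                                     Clock::duration silence_limit) const {
    std::vector<mds_gid_t> out;
    for (auto& [peer, t] : last_heard) {
      if (now - t > silence_limit) {
        out.push_back(peer);
      }
    }
    return out;
  }
};
```

In the scenario above, gid 4270's ack was lost at 07:14:17, so by 07:14:19 it would show up as stale and get the set version pushed again, prompting a fresh agent ack.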
Updated by Venky Shankar almost 2 years ago
- Category set to Correctness/Safety
- Assignee set to Dhairya Parmar
- Target version set to v20.0.0
- Backport set to squid
Updated by Greg Farnum almost 2 years ago
@Leonid Usov why is the message getting dropped? MDS connections are of type "lossless_peer" so I'd expect the message to be resent when the connection is reestablished.
I heard something about not using the normal messaging mechanisms in order to broadcast to standbys and I wonder if there's a configuration issue there or something.
Updated by Leonid Usov almost 2 years ago
@Greg Farnum here's how the message is sent, the code is in MDSRankQuiesce.cc:
std::lock_guard guard(mds_lock);
if (mdsmap->get_state_gid(membership.leader) == MDSMap::STATE_NULL) {
  dout(5) << "couldn't find the leader " << membership.leader << " in the map" << dendl;
  return -ENOENT;
}
auto addrs = mdsmap->get_info_gid(membership.leader).addrs;
auto ack_msg = make_message<MMDSQuiesceDbAck>();
dout(10) << "sending ack " << ack << " to the leader " << membership.leader << dendl;
ack_msg->encode_payload_from({me, std::move(ack)});
return send_message_mds(ack_msg, addrs);
The overload of send_message_mds that's used here is just one "layer" below the one used elsewhere:
int MDSRank::send_message_mds(const ref_t<Message>& m, mds_rank_t mds)
{
  if (!mdsmap->is_up(mds)) {
    dout(10) << "send_message_mds mds." << mds << " not up, dropping " << *m << dendl;
    return ENOENT;
  } else if (mdsmap->is_bootstrapping(mds)) {
    dout(5) << __func__ << "mds." << mds << " is bootstrapping, deferring " << *m << dendl;
    wait_for_bootstrapped_peer(mds, new C_MDS_RetrySendMessageMDS(this, mds, m));
    return 0;
  }

  // send mdsmap first?
  auto addrs = mdsmap->get_addrs(mds);
  if (mds != whoami && peer_mdsmap_epoch[mds] < mdsmap->get_epoch()) {
    auto _m = make_message<MMDSMap>(monc->get_fsid(), *mdsmap);
    send_message_mds(_m, addrs);
    peer_mdsmap_epoch[mds] = mdsmap->get_epoch();
  }

  // send message
  return send_message_mds(m, addrs);
}
I had to use the lower-level version to be able to send to a gid rather than to a rank. The reason is that the QuiesceDB is designed to run on any daemon, including standby MDSes, which can't be messaged through a rank.
The messenger should be implementing the retry functionality you refer to, so it should apply here as well.
Updated by Leonid Usov almost 2 years ago
After looking at this with Greg online, we have some more strange findings.
The system correctly followed the reconnect path, but it appears that it failed to reconnect:
533cc00 unknown :-1 s=READY pgs=1305 cs=82 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._fault initiating reconnect
2024-05-17T07:14:17.294+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> [v2:172.21.15.59:6836/1004932209,v1:172.21.15.59:6837/1004932209] conn(0x55fa95282000 0x55fa9533cc00 unknown :-1 s=BANNER_CONNECTING pgs=1305 cs=83 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=3 required=0
2024-05-17T07:14:17.294+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> conn(0x55fa95798900 0x55fa95564b00 unknown :-1 s=NONE pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0).accept
2024-05-17T07:14:17.294+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> conn(0x55fa95798900 0x55fa95564b00 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=3 required=0
2024-05-17T07:14:17.295+0000 7f145c922640 10 mds.d parse_caps: parsing auth_cap_str='allow'
2024-05-17T07:14:17.295+0000 7f145c922640 1 -- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> conn(0x55fa95798900 msgr2=0x55fa95564b00 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 38
2024-05-17T07:14:17.295+0000 7f145c922640 1 -- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> conn(0x55fa95798900 msgr2=0x55fa95564b00 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2024-05-17T07:14:17.295+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> conn(0x55fa95798900 0x55fa95564b00 crc :-1 s=AUTH_ACCEPTING_SIGN pgs=0 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1
2024-05-17T07:14:17.295+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> conn(0x55fa95798900 0x55fa95564b00 crc :-1 s=AUTH_ACCEPTING_SIGN pgs=0 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).stop
This seems to be the reason the message was lost forever.
Updated by Leonid Usov almost 2 years ago
OK, I followed it back to the `_fault` method in ProtocolV2.cc.
2024-05-17T07:14:17.295+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> conn(0x55fa95798900 0x55fa95564b00 crc :-1 s=AUTH_ACCEPTING_SIGN pgs=0 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1
2024-05-17T07:14:17.295+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> conn(0x55fa95798900 0x55fa95564b00 crc :-1 s=AUTH_ACCEPTING_SIGN pgs=0 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).stop
So, we got an error reading from a connected socket while in read_bulk (no clue why), and we return back up the call stack:
read_bulk → read_until → handle_read_frame_preamble_main
in the read frame preamble we hit this `if`:
if (r < 0) {
  ldout(cct, 1) << __func__ << " read frame preamble failed r=" << r
                << dendl;
  return _fault();
}
We must be in the `_fault` method now, and given that our connection is NOT lossy, the only other place where `stop` may be called is this one:
// requeue sent items
requeue_sent();

if (out_queue.empty() && state >= START_ACCEPT &&
    state <= SESSION_ACCEPTING && !replacing) {
  ldout(cct, 2) << __func__ << " with nothing to send and in the half "
                << " accept state just closed" << dendl;
  connection->write_lock.unlock();
  stop();
  connection->dispatch_queue->queue_reset(connection);
  return nullptr;
}
We don't see the ldout(cct, 2) because our log level is 1, but we do see the stop log. This means our out_queue was empty, but we have just called the requeue_sent method when we initiated the reconnect, which must have moved our message from the `sent` queue to the `out_queue`.
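To make the contradiction explicit, here is a toy model of the two queues being reasoned about. It is a deliberately simplified sketch (the names mirror ProtocolV2 but the types and logic are illustrative, not the messenger implementation): on a fault of a lossless connection, `requeue_sent()` moves unacked messages back to the out_queue, so the "nothing to send" early-stop branch should not fire right after it.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Toy model of the ProtocolV2 send queues (illustrative only).
struct ConnModel {
  std::deque<std::string> out_queue;  // not yet written to the socket
  std::deque<std::string> sent;       // written but not yet acked by the peer

  // Writing a message moves it to the unacked 'sent' list.
  void write_message(const std::string& m) { sent.push_back(m); }

  // On a fault of a lossless connection, unacked messages are put back
  // at the front of the out_queue so the reconnect retransmits them.
  void requeue_sent() {
    while (!sent.empty()) {
      out_queue.push_front(sent.back());
      sent.pop_back();
    }
  }

  // The half-accept early-stop branch only fires with nothing to send.
  bool would_stop_in_half_accept() const { return out_queue.empty(); }
};
```

After `write_message` followed by `requeue_sent`, the out_queue is non-empty, so in this model the stop branch cannot be taken; observing the `stop` log therefore implies something cleared the out_queue in between, which is exactly the puzzle.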
Searching for the places where the out_queue could have been cleared: that happens only in `discard_out_queue`, which is called from just two functions:
void ProtocolV2::reset_session() {
ldout(cct, 1) << __func__ << dendl;
...
void ProtocolV2::stop() {
ldout(cct, 1) << __func__ << dendl;
...
Both would appear in the logs, since they log at ms level 1.
I'm not sure what's happening here :/
Updated by Venky Shankar over 1 year ago
I discussed this with @Dhairya Parmar yesterday. This issue needs to be analyzed from the messenger perspective first, before adding code to handle lost agent ack messages in the leader quiesce agent. Maybe also audit the quiesce code to look out for possible misuse of the message-sending APIs.
Updated by Patrick Donnelly over 1 year ago
- Is duplicate of Bug #67406: tasks.daemonwatchdog.daemon_watchdog:quiescer.fs.[cephfs] failed added