Enhancement #66107
QuiesceDB: improve resiliency to lost agent acks
Status: Closed
Description
Investigating the case of a timed-out quiesce at https://pulpito.ceph.com/pdonnell-2024-05-17_04:08:01-fs-wip-pdonnell-testing-20240517.012301-debug-distro-default-smithi/7710717/
looking for quiescer errors:
$ grep "ERROR:tasks.quiescer" */teuthology.log ... 7710717/teuthology.log:2024-05-17T07:15:47.316 ERROR:tasks.quiescer.fs.[cephfs]:Couldn't quiesce root with rc: 110 (ETIMEDOUT), stdout: ...
checking the outstanding ops:
$ grep "Outstanding op" 7710717/teuthology.log
2024-05-17T07:15:06.105 DEBUG:tasks.quiescer.fs.[cephfs]:Outstanding op at rank 0 (c) for a6585db1-1: 'client_request(client.4680:82962 create owner_uid=1000, owner_gid=1316 #0x10000003654/atomic.h 2024-05-17T07:14:18.439403+0000 caller_uid=1000, caller_gid=1316{6,36,1000,1316,})'
There are no quiesce_inode or quiesce_path operations outstanding, which points at a potential ack loss case. Let's look at the traces for acks from the quiesce db leader. Checking the effective quiesce cluster just before the failure in the teuthology log
2024-05-17T07:14:16.854 INFO:teuthology.orchestra.run.smithi059.stdout:{"epoch":19, ... ,"qdb_leader":4294,"qdb_cluster":[4232,4270,4273,4282,4294]}
leader's logs regarding the set in question
2024-05-17T07:14:19.891+0000 7f6cf4800640 20 quiesce.mgr.4294 <check_peer_reports> [a6585db1@133,file:/] up_to_date_peers: 5 min_reported_state: QS_QUIESCING max_reported_state: QS_QUIESCED peer_acks: {{QS_QUIESCED=4232,(19:133),QS_QUIESCED=4273,(19:133),QS_QUIESCED=4282,(19:133),QS_QUIESCED=4294,(19:133)}}
2024-05-17T07:14:19.891+0000 7f6cf4800640 20 quiesce.mgr.4294 <leader_upkeep_set> [a6585db1@133,file:/] min reported state for all (5) peers: QS_QUIESCING. Effective state: QS_QUIESCING
2024-05-17T07:15:02.293+0000 7f6cf4800640 20 quiesce.mgr.4294 <check_peer_reports> [a6585db1@133,file:/] up_to_date_peers: 5 min_reported_state: QS_QUIESCING max_reported_state: QS_QUIESCED peer_acks: {{QS_QUIESCED=4232,(19:133),QS_QUIESCED=4273,(19:133),QS_QUIESCED=4282,(19:133),QS_QUIESCED=4294,(19:133)}}
We see that the only missing agent ack is from the gid 4270. Checking its logs
2024-05-17T07:14:17.293+0000 7f145911b640 10 quiesce.mds.3 <quiesce_dispatch> got q-db[v:(19:133) sets:1/0] from 4294
2024-05-17T07:14:17.293+0000 7f145090a640 20 quiesce.mgr.4270 <quiesce_db_thread_main> notifying agent with db version (19:133)
2024-05-17T07:14:17.293+0000 7f145090a640 20 quiesce.agt <db_update> got a db update version (19:133) with 1 roots
2024-05-17T07:14:17.293+0000 7f145090a640 20 quiesce.mgr.4270 <quiesce_db_thread_main> synchronous agent ack: q-map[v:(19:133) roots:0/0]
2024-05-17T07:14:17.293+0000 7f145090a640 10 quiesce.mds.3 <operator()> sending ack q-map[v:(19:133) roots:0/0] to the leader 4294
2024-05-17T07:14:17.293+0000 7f145110b640 20 quiesce.agt <agent_thread_main> old = (19:132), current = (19:133)
2024-05-17T07:14:17.293+0000 7f145110b640 10 quiesce.mds.3 <operator()> submit_request: value:file:/
2024-05-17T07:14:17.293+0000 7f145110b640 20 quiesce.agt <operator()> completing request (rc=0) for 'file:/'
2024-05-17T07:14:17.293+0000 7f145110b640 20 quiesce.agt <set_upkeep_needed> current = (19:133), pending = (0:0)
2024-05-17T07:14:17.293+0000 7f145110b640 20 quiesce.agt <operator()> done with submit callback for 'file:/'
2024-05-17T07:14:17.293+0000 7f145110b640 10 quiesce.agt <agent_thread_main> got request handle < mds.3:352> for 'file:/'
2024-05-17T07:14:17.293+0000 7f145110b640 20 quiesce.agt <agent_thread_main> asyncrhonous ack for a new version: q-map[v:(19:133) roots:0/0]
2024-05-17T07:14:17.293+0000 7f145110b640 10 quiesce.mds.3 <operator()> sending ack q-map[v:(19:133) roots:0/0] to the leader 4294
2024-05-17T07:14:17.294+0000 7f145110b640 20 quiesce.agt <agent_thread_main> old = (0:0), current = (19:133)
2024-05-17T07:14:17.294+0000 7f145110b640 20 quiesce.agt <agent_thread_main> asyncrhonous ack for a new version: q-map[v:(19:133) roots:1/0]
2024-05-17T07:14:17.294+0000 7f145110b640 10 quiesce.mds.3 <operator()> sending ack q-map[v:(19:133) roots:1/0] to the leader 4294
Looking at the last line here, we see that the agent had submitted an asynchronous ack with 1 active root. The fact that the agent responds with an ack here means that the state of the active root is different from the state in the db version 19:133. So, it's the ack that we're missing. Where did it go? Let's review the messenger traces:
2024-05-17T07:14:17.294+0000 7f145110b640 10 quiesce.mds.3 <operator()> sending ack q-map[v:(19:133) roots:1/0] to the leader 4294
2024-05-17T07:14:17.294+0000 7f145110b640 1 -- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] send_to--> mds [v2:172.21.15.59:6836/1004932209,v1:172.21.15.59:6837/1004932209] -- -- ?+0 0x55fa959d3080
2024-05-17T07:14:17.294+0000 7f145110b640 1 -- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] --> [v2:172.21.15.59:6836/1004932209,v1:172.21.15.59:6837/1004932209] -- -- 0x55fa959d3080 con 0x55fa95282000
2024-05-17T07:14:17.294+0000 7f145c922640 0 -- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> [v2:172.21.15.59:6836/1004932209,v1:172.21.15.59:6837/1004932209] conn(0x55fa95282000 msgr2=0x55fa9533cc00 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send injecting socket failure
2024-05-17T07:14:17.294+0000 7f145c922640 1 -- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> [v2:172.21.15.59:6836/1004932209,v1:172.21.15.59:6837/1004932209] conn(0x55fa95282000 msgr2=0x55fa9533cc00 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0)._try_send send error: (32) Broken pipe
2024-05-17T07:14:17.294+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> [v2:172.21.15.59:6836/1004932209,v1:172.21.15.59:6837/1004932209] conn(0x55fa95282000 0x55fa9533cc00 crc :-1 s=READY pgs=1305 cs=82 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_message error sending 0x55fa959d3080, (32) Broken pipe
2024-05-17T07:14:17.294+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> [v2:172.21.15.59:6836/1004932209,v1:172.21.15.59:6837/1004932209] conn(0x55fa95282000 0x55fa9533cc00 crc :-1 s=READY pgs=1305 cs=82 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_event send msg failed
2024-05-17T07:14:17.294+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> [v2:172.21.15.59:6836/1004932209,v1:172.21.15.59:6837/1004932209] conn(0x55fa95282000 0x55fa9533cc00 crc :-1 s=READY pgs=1305 cs=82 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).write_event send msg failed
2024-05-17T07:14:17.294+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> [v2:172.21.15.59:6836/1004932209,v1:172.21.15.59:6837/1004932209] conn(0x55fa95282000 0x55fa9533cc00 unknown :-1 s=READY pgs=1305 cs=82 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._fault initiating reconnect
2024-05-17T07:14:17.294+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> [v2:172.21.15.59:6836/1004932209,v1:172.21.15.59:6837/1004932209] conn(0x55fa95282000 0x55fa9533cc00 unknown :-1 s=BANNER_CONNECTING pgs=1305 cs=83 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=3 required=0
2024-05-17T07:14:17.294+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> conn(0x55fa95798900 0x55fa95564b00 unknown :-1 s=NONE pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0).accept
2024-05-17T07:14:17.294+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> conn(0x55fa95798900 0x55fa95564b00 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=3 required=0
OK, so here's the culprit: this message was dropped by the socket failure injection logic.
What could quiesce db do to overcome this? There are two things to keep in mind.
1. quiesce agent update optimization
The quiesce agent won't be notified of the same quiesce db version twice. With this in place, when the quiesce db re-transmits the same db version, the agent won't get notified and, as a result, won't send the diff map back.
It seems that this optimization is doing more harm than good, so we should consider relaxing it, at least for the case where the leader pushes the same version despite knowing that we've already acked it once.
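The relaxed rule could be sketched roughly as follows. This is a minimal illustrative model, not the Ceph code: the names (`VersionGuard`, `should_notify`, the `is_retransmission` flag) are hypothetical, and it only demonstrates the idea of re-notifying the agent when the leader re-pushes a version it should already have seen us ack.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical sketch of a version-dedup guard for the quiesce agent.
struct DbVersion {
  uint64_t epoch = 0;
  uint64_t set_version = 0;
  bool operator==(const DbVersion& o) const {
    return epoch == o.epoch && set_version == o.set_version;
  }
};

class VersionGuard {
  std::optional<DbVersion> last_notified;
 public:
  // Returns true if the agent should (re)process this db version.
  // Unlike a strict "notify once per version" rule, an explicit
  // retransmission of an already-seen version still triggers a
  // re-notification (and hence a re-ack), so a lost ack can be
  // recovered on the leader's next push of the same version.
  bool should_notify(const DbVersion& v, bool is_retransmission) {
    if (!last_notified || !(*last_notified == v)) {
      last_notified = v;
      return true;
    }
    // Same version again: only re-notify when the leader re-sent it,
    // implying it has not recorded our earlier ack.
    return is_retransmission;
  }
};
```

With the strict rule, the second delivery of `(19:133)` would be silently swallowed; here it produces a fresh ack whenever the leader signals a retransmission.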
2. quiesce manager active re-queries
Today this happens no more often than once a second, but it needs a trigger to get there. In this particular case, there was an opportunity:
2024-05-17T07:14:19.891+0000 7f6cf4800640 20 quiesce.mgr.4294 <leader_upkeep_set> [a6585db1@133,file:/] min reported state for all (5) peers: QS_QUIESCING. Effective state: QS_QUIESCING
2024-05-17T07:14:19.891+0000 7f6cf4800640 20 quiesce.mgr.4294 <leader_upkeep_db> updating peer 4232 with 0 sets modified in db version range (133..133]
2024-05-17T07:14:19.891+0000 7f6cf4800640 10 quiesce.mds.0 <operator()> sending listing q-db[v:(19:133) sets:0/0] to the peer 4232
2024-05-17T07:14:19.892+0000 7f6cf4800640 20 quiesce.mgr.4294 <leader_upkeep_db> updating peer 4270 with 0 sets modified in db version range (133..133]
2024-05-17T07:14:19.892+0000 7f6cf4800640 10 quiesce.mds.0 <operator()> sending listing q-db[v:(19:133) sets:0/0] to the peer 4270
2024-05-17T07:14:19.892+0000 7f6cf4800640 20 quiesce.mgr.4294 <leader_upkeep_db> updating peer 4273 with 0 sets modified in db version range (133..133]
2024-05-17T07:14:19.892+0000 7f6cf4800640 10 quiesce.mds.0 <operator()> sending listing q-db[v:(19:133) sets:0/0] to the peer 4273
2024-05-17T07:14:19.892+0000 7f6cf4800640 20 quiesce.mgr.4294 <leader_upkeep_db> updating peer 4282 with 0 sets modified in db version range (133..133]
2024-05-17T07:14:19.893+0000 7f6cf4800640 10 quiesce.mds.0 <operator()> sending listing q-db[v:(19:133) sets:0/0] to the peer 4282
We see that the leader re-sent the same version because it had an internal update (its own quiesce ack) and the 1-second timeout since the last communication had passed.
We need to improve this logic to let the qdb manager re-visit all active sets and resend the active set versions to any peers that haven't been heard from for over a second.
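The proposed upkeep rule could look something like the sketch below. Again, this is only an illustration under assumptions: `PeerTracker`, `on_ack`, and `stale_peers` are hypothetical names, and the real qdb manager would key this off its peer ack bookkeeping rather than a standalone map.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <map>
#include <vector>

using Clock = std::chrono::steady_clock;
using mds_gid_t = uint64_t;

// Hypothetical sketch: the leader tracks when each peer last acked and,
// during upkeep, re-sends the active set versions to every peer whose
// silence exceeds the limit (e.g. one second), covering lost acks.
class PeerTracker {
  std::map<mds_gid_t, Clock::time_point> last_heard;
 public:
  void on_ack(mds_gid_t peer, Clock::time_point now) {
    last_heard[peer] = now;
  }
  // Peers that should receive a retransmission of the current db version.
  std::vector<mds_gid_t> stale_peers(Clock::time_point now,
                                     Clock::duration silence_limit) const {
    std::vector<mds_gid_t> out;
    for (auto& [peer, t] : last_heard) {
      if (now - t > silence_limit) {
        out.push_back(peer);
      }
    }
    return out;
  }
};
```

In the scenario above, gid 4270's ack was lost at 07:14:17, so by 07:14:19 it would show up as stale and get the set version pushed again, prompting a fresh agent ack.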
Updated by Venky Shankar almost 2 years ago
- Category set to Correctness/Safety
- Assignee set to Dhairya Parmar
- Target version set to v20.0.0
- Backport set to squid
Updated by Greg Farnum almost 2 years ago
@Leonid Usov why is the message getting dropped? MDS connections are of type "lossless_peer" so I'd expect the message to be resent when the connection is reestablished.
I heard something about not using the normal messaging mechanisms in order to broadcast to standbys and I wonder if there's a configuration issue there or something.
Updated by Leonid Usov almost 2 years ago
@Greg Farnum here's how the message is sent, the code is in MDSRankQuiesce.cc:
std::lock_guard guard(mds_lock);
if (mdsmap->get_state_gid(membership.leader) == MDSMap::STATE_NULL) {
  dout(5) << "couldn't find the leader " << membership.leader << " in the map" << dendl;
  return -ENOENT;
}
auto addrs = mdsmap->get_info_gid(membership.leader).addrs;
auto ack_msg = make_message<MMDSQuiesceDbAck>();
dout(10) << "sending ack " << ack << " to the leader " << membership.leader << dendl;
ack_msg->encode_payload_from({me, std::move(ack)});
return send_message_mds(ack_msg, addrs);
The overload of send_message_mds that's used here is just one "layer" below the one used elsewhere:
int MDSRank::send_message_mds(const ref_t<Message>& m, mds_rank_t mds)
{
  if (!mdsmap->is_up(mds)) {
    dout(10) << "send_message_mds mds." << mds << " not up, dropping " << *m << dendl;
    return ENOENT;
  } else if (mdsmap->is_bootstrapping(mds)) {
    dout(5) << __func__ << "mds." << mds << " is bootstrapping, deferring " << *m << dendl;
    wait_for_bootstrapped_peer(mds, new C_MDS_RetrySendMessageMDS(this, mds, m));
    return 0;
  }

  // send mdsmap first?
  auto addrs = mdsmap->get_addrs(mds);
  if (mds != whoami && peer_mdsmap_epoch[mds] < mdsmap->get_epoch()) {
    auto _m = make_message<MMDSMap>(monc->get_fsid(), *mdsmap);
    send_message_mds(_m, addrs);
    peer_mdsmap_epoch[mds] = mdsmap->get_epoch();
  }

  // send message
  return send_message_mds(m, addrs);
}
I had to use the lower-level version to be able to send to a gid rather than to a rank. The reason is that the QuiesceDB is designed to run on any daemon, including standby MDSes, which can't be messaged through a rank.
The messenger should be implementing the retry functionality you refer to, so it should apply here as well.
Updated by Leonid Usov almost 2 years ago
After looking at this with Greg online, we have some more strange findings.
The system correctly followed the reconnect path, but it appears that it failed to reconnect:
533cc00 unknown :-1 s=READY pgs=1305 cs=82 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._fault initiating reconnect
2024-05-17T07:14:17.294+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> [v2:172.21.15.59:6836/1004932209,v1:172.21.15.59:6837/1004932209] conn(0x55fa95282000 0x55fa9533cc00 unknown :-1 s=BANNER_CONNECTING pgs=1305 cs=83 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=3 required=0
2024-05-17T07:14:17.294+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> conn(0x55fa95798900 0x55fa95564b00 unknown :-1 s=NONE pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0).accept
2024-05-17T07:14:17.294+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> conn(0x55fa95798900 0x55fa95564b00 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=3 required=0
2024-05-17T07:14:17.295+0000 7f145c922640 10 mds.d parse_caps: parsing auth_cap_str='allow'
2024-05-17T07:14:17.295+0000 7f145c922640 1 -- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> conn(0x55fa95798900 msgr2=0x55fa95564b00 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 38
2024-05-17T07:14:17.295+0000 7f145c922640 1 -- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> conn(0x55fa95798900 msgr2=0x55fa95564b00 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2024-05-17T07:14:17.295+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> conn(0x55fa95798900 0x55fa95564b00 crc :-1 s=AUTH_ACCEPTING_SIGN pgs=0 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1
2024-05-17T07:14:17.295+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> conn(0x55fa95798900 0x55fa95564b00 crc :-1 s=AUTH_ACCEPTING_SIGN pgs=0 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).stop
This seems to be the reason the message was lost forever.
Updated by Leonid Usov almost 2 years ago
OK, I followed it back to the `_fault` method in ProtocolV2.cc.
2024-05-17T07:14:17.295+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> conn(0x55fa95798900 0x55fa95564b00 crc :-1 s=AUTH_ACCEPTING_SIGN pgs=0 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1
2024-05-17T07:14:17.295+0000 7f145c922640 1 --2- [v2:172.21.15.164:6835/4284393260,v1:172.21.15.164:6838/4284393260] >> conn(0x55fa95798900 0x55fa95564b00 crc :-1 s=AUTH_ACCEPTING_SIGN pgs=0 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).stop
So, we got an error reading from a connected socket while in read_bulk (no clue why), and we return back up the call stack:
read_bulk → read_until → handle_read_frame_preamble_main
in the read frame preamble we hit this `if`:
if (r < 0) {
  ldout(cct, 1) << __func__ << " read frame preamble failed r=" << r
                << dendl;
  return _fault();
}
We must be in the `_fault` method now, and given that our connection is NOT lossy, the only other place where `stop` may be called is this one:
// requeue sent items
requeue_sent();

if (out_queue.empty() && state >= START_ACCEPT &&
    state <= SESSION_ACCEPTING && !replacing) {
  ldout(cct, 2) << __func__ << " with nothing to send and in the half "
                << " accept state just closed" << dendl;
  connection->write_lock.unlock();
  stop();
  connection->dispatch_queue->queue_reset(connection);
  return nullptr;
}
We don't see the ldout(cct, 2) because our log level is 1, but we do see the stop log. This means our out_queue was empty, but we have just called the requeue_sent method when we initiated the reconnect, which must have moved our message from the `sent` queue to the `out_queue`.
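To make the contradiction explicit, here is a toy model of the two queues being reasoned about. It is a deliberately simplified sketch (the names mirror ProtocolV2 but the types and logic are illustrative, not the messenger implementation): on a fault of a lossless connection, `requeue_sent()` moves unacked messages back to the out_queue, so the "nothing to send" early-stop branch should not fire right after it.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Toy model of the ProtocolV2 send queues (illustrative only).
struct ConnModel {
  std::deque<std::string> out_queue;  // not yet written to the socket
  std::deque<std::string> sent;       // written but not yet acked by the peer

  // Writing a message moves it to the unacked 'sent' list.
  void write_message(const std::string& m) { sent.push_back(m); }

  // On a fault of a lossless connection, unacked messages are put back
  // at the front of the out_queue so the reconnect retransmits them.
  void requeue_sent() {
    while (!sent.empty()) {
      out_queue.push_front(sent.back());
      sent.pop_back();
    }
  }

  // The half-accept early-stop branch only fires with nothing to send.
  bool would_stop_in_half_accept() const { return out_queue.empty(); }
};
```

After `write_message` followed by `requeue_sent`, the out_queue is non-empty, so in this model the stop branch cannot be taken; observing the `stop` log therefore implies something cleared the out_queue in between, which is exactly the puzzle.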
Searching for the places where the out_queue could have been cleared: that happens only in `discard_out_queue`, which is called from just two functions:
void ProtocolV2::reset_session() {
ldout(cct, 1) << __func__ << dendl;
...
void ProtocolV2::stop() {
ldout(cct, 1) << __func__ << dendl;
...
Both would appear in the logs, since they log at ms level 1.
I'm not sure what's happening here :/
Updated by Venky Shankar over 1 year ago
I discussed this with @Dhairya Parmar yesterday. This issue needs to be analyzed from the messenger perspective first, before adding code to handle lost agent ack messages in the leader quiesce agent. Maybe also audit the quiesce code to look out for possible misuse of the message-sending APIs.
Updated by Patrick Donnelly over 1 year ago
- Is duplicate of Bug #67406: tasks.daemonwatchdog.daemon_watchdog:quiescer.fs.[cephfs] failed added