Complete the fix for losing attribute write-out at node shutdown#1685
kgaillot merged 6 commits into ClusterLabs:2.0
Conversation
464d481 didn't do what it thought it did
Makes more sense, and allows for future methods to poke at them. This would behave differently from the previous code if a second election object is created. However, that has never been a targeted use case, and the new behavior would make more sense anyway.
Hello @HideoYamauchi, in case you are using commit 435b772, I wanted to let you know about this issue. This PR is testing well, so I plan to merge it soon.
Hi Ken,
Okay, all right! Many thanks,
```c
 * would be to tie the election structure with the peer caches, which
 * would allow us to clear the dampening when the previous winner
 * leaves (and would allow other improvements as well).
 */
```
Hi Ken,
I'm trying to understand this part of code and how a reported hiccup/race condition is relevant to the comment here.
Node1 was lost. While fencing targeting node1 was in progress, the node came back:
2021-09-28T13:39:47+0000 node2 corosync[18758]: [TOTEM ] A new membership (10.42.34.12:604) was formed. Members joined: 1
Apparently attrd on node1 was still running and rejoined the attrd CPG group:
2021-09-28T13:39:47+0000 node2 pacemaker-attrd[18809]: notice: Node node1 state is now member
And while a new election was on-going between attrd daemons:
Sep 28 13:39:48 node2 pacemaker-attrd [18809] (election_count_vote) info: election-attrd round 1 (owner node ID 1) pass: vote from node1 (Uptime)
, node1 got lost again almost immediately, presumably around 13:39:49 given the 20s token timeout:
2021-09-28T13:40:09+0000 node2 corosync[18758]: [TOTEM ] A processor failed, forming new configuration
Then the node was declared lost again:
2021-09-28T13:40:33+0000 node2 corosync[18758]: [TOTEM ] A new membership (10.42.34.63:608) was formed. Members left: 1
2021-09-28T13:40:33+0000 node2 pacemaker-attrd[18809]: notice: Node node1 state is now lost
But the existing election timer of attrd wasn't canceled until it timed out:
Sep 28 13:41:48 node2 pacemaker-attrd [18809] (election_timer_cb) info: election-attrd timed out, declaring local node as winner
Sep 28 13:41:48 node2 pacemaker-attrd [18809] (attrd_declare_winner) notice: Recorded local node as attribute writer (was unset)
Apparently during the 120s between 13:39:48 and 13:41:48, there was no attrd writer. The intermediate updates of node attributes didn't get reflected into the cib.
When the peer is lost again, should the existing election be canceled/reset/restarted anyway for example in attrd_elections.c: attrd_remove_voter()?
My gut feeling is that there is a reason we don't start a new election every time any peer leaves, but I don't remember why. Maybe something to do with a flapping network.
The race condition described in the comment is when a node wins an election, then leaves the cluster, and a new election is called, all within 2 seconds (the dampening time). In that case, the other nodes will ignore the new election since they lost recently, even though the node they lost to is no longer part of the cluster.
One complication is that when an election is won, only the node that wins knows who the winner is. Losing nodes simply do nothing, and only know that they lost to some particular nodes, not which one eventually won. That prevents us from ignoring the dampening if the last winner is no longer in the cluster, which would be the ideal solution. (I now realize the comment is wrong that integrating with the peer cache would let us do that -- we still wouldn't know the identity of the last winner.)
Attrd uses autoreap for the peer cache and doesn't track the known peer cache, so we can't use peer flags to record whether we won or lost the last election round against each peer. If we could, I was thinking we might use that to ignore dampening when any node we lost to is no longer in the cluster (though I'm not sure that would cover every possible situation). This might still work if attrd maintained a "lost peer" cache with each node's name and win/loss.
Alternatively maybe the winner could announce itself with a broadcast, then the losing nodes would know the last winner. But that would still leave timing issues, if the winner leaves the cluster before it is able to announce itself, or another node processes a vote before getting the broadcast.
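For context, the dampening behavior under discussion amounts to roughly the following. This is a minimal sketch with made-up names and fields, not the actual libcrmcluster internals:

```c
#include <stdbool.h>
#include <time.h>

/* Illustrative dampening window (the 2 seconds mentioned above) */
#define DAMPENING_SECS 2

/* Illustrative per-election state: only *when* we last lost is
 * recorded, never *who* eventually won. */
struct election_sketch {
    time_t last_lost;   /* when this node last lost a round; 0 if never */
};

/* Should a freshly received vote be ignored because we lost recently?
 * The race discussed above: the node we lost to may have since left
 * the cluster, but losers never learn the winner's identity, so there
 * is no way to check for that here. */
static bool vote_ignored_by_dampening(const struct election_sketch *e,
                                      time_t now)
{
    return (e->last_lost != 0) && (now - e->last_lost < DAMPENING_SECS);
}
```

The limitation Ken notes falls directly out of this shape: the state records when we lost, but not to whom, so the losing side cannot special-case "the previous winner already left."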
```c
election_timeout_stop(e);

/* Start a new election by voting down this, and other, peers */
e->state = election_start;
```
Indeed there are complicated situations that need to be considered carefully...
According to the message, we didn't lose in this case though:
Sep 28 13:39:48 node2 pacemaker-attrd [18809] (election_count_vote) info: election-attrd round 1 (owner node ID 1) pass: vote from node1 (Uptime)
Instead we ran into the logic here and started a new election trying to vote down the peer. I assume right when we were waiting for a no-vote from the peer, the peer got lost. We didn't give up waiting, but of course never got to receive another vote to run election_check() again or react to the membership change until the long 120s timer popped.
Ah right, the dampening path doesn't show that log message ...
For this situation, election_count_vote() returns election_start, and attrd calls election_vote() to start a new election (which stops the previous election's timer). Unfortunately the "Started ..." log is debug level so it may not be possible to confirm whether that happened in this case.
election_remove() takes the leaving node out of the election "voted" hash table, and election_check() compares the size of the vote table and the number of active peers in the peer cache. The question is whether one or the other of those numbers can get out of sync with reality, or whether election_check() was not called when it should.
attrd only calls election_check() when it gets a vote from a peer, so maybe the issue is that all the other peers had already voted when the one peer left. Then we would never check the election. Maybe we need to call election_check() when a peer leaves, if there's an election in progress.
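The scenario above can be modeled in miniature. This is a hypothetical sketch with invented names (the real logic is election_check() in lib/cluster/election.c and its attrd callers), assuming the win condition is simply "every active peer's vote has been recorded":

```c
#include <stdbool.h>

/* Simplified election bookkeeping: a counter standing in for the
 * "voted" hash table, and one for active members of the peer cache. */
struct election_model {
    unsigned votes_recorded;
    unsigned active_peers;
    bool won;
};

/* Stand-in for election_check(): declare a win once votes have been
 * recorded for all active peers. */
static void check_election(struct election_model *e)
{
    if (e->votes_recorded >= e->active_peers) {
        e->won = true;
    }
}

/* The hang: in a 2-node cluster the local vote arrives first
 * (1 vote, 2 active peers -> no win). If the peer then leaves,
 * the peer cache shrinks, but nothing re-runs the check, so the
 * election sits until the long timer pops. Re-checking from the
 * peer-departure path, as suggested above, closes the gap. */
static void on_peer_left(struct election_model *e)
{
    e->active_peers--;
    check_election(e);
}
```

With this model, the 2-node case described in the logs plays out exactly: the local vote alone never satisfies the check, and only a re-check on peer departure (rather than on vote arrival) resolves the election before the timer expires.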
> Ah right, the dampening path doesn't show that log message ...
>
> For this situation, election_count_vote() returns election_start, and attrd calls election_vote() to start a new election (which stops the previous election's timer). Unfortunately the "Started ..." log is debug level so it may not be possible to confirm whether that happened in this case.

AFAICT from the code, it probably is the only possibility?

> election_remove() takes the leaving node out of the election "voted" hash table, and election_check() compares the size of the vote table and the number of active peers in the peer cache. The question is whether one or the other of those numbers can get out of sync with reality, or whether election_check() was not called when it should.
>
> attrd only calls election_check() when it gets a vote from a peer,

IIUC attrd_handle_election_op() also handles votes from the local node, and election_count_vote() records a local vote if the local node itself is the election owner:
https://github.com/ClusterLabs/pacemaker/blob/master/lib/cluster/election.c#L576-L591

> so maybe the issue is that all the other peers had already voted when the one peer left. Then we would never check the election.

I think so. Given that this is a 2-node cluster, I assume it received and handled a local vote before the peer left.

> Maybe we need to call election_check() when a peer leaves, if there's an election in progress.

Sounds sensible to me :-) I'll give it a test, although it might not be easy to produce the exact race condition ...
> > For this situation, election_count_vote() returns election_start, and attrd calls election_vote() to start a new election (which stops the previous election's timer). Unfortunately the "Started ..." log is debug level so it may not be possible to confirm whether that happened in this case.
>
> AFAICT from the code, it probably is the only possibility?

As the saying goes, "trust but verify" ;)

> > so maybe the issue is that all the other peers had already voted when the one peer left. Then we would never check the election.
>
> I think so. Given that this is a 2-node cluster, I assume it received and handled a local vote before the peer left.

Ah, that makes sense that it would be much more likely in a 2-node cluster.

> > Maybe we need to call election_check() when a peer leaves, if there's an election in progress.
>
> Sounds sensible to me :-) I'll give it a test although it might not be easy to produce the exact race condition ...

Maybe a gdb breakpoint before the peer can vote
Good idea! I set a breakpoint right before the call to send_no_vote() and managed to reproduce the race. This fix seems to work well:
#2548
Commit 435b772 ironically made a second avenue for the problem it fixed more likely, by making it easier to hit a pre-existing timing issue in the libcrmcluster election code. This fixes that second avenue.