Complete the fix for losing attribute write-out at node shutdown#1685
kgaillot merged 6 commits into ClusterLabs:2.0
Conversation
464d481 didn't do what it thought it did
Makes more sense, and allows for future methods to poke at them. This would behave differently from the previous code if a second election object is created. However, that has never been a targeted use case, and the new behavior would make more sense anyway.
Hello @HideoYamauchi, in case you are using commit 435b772, I wanted to let you know about this issue. This PR is testing well, so I plan to merge it soon.
Hi Ken,
Okay, all right! Many thanks,
```c
 * would be to tie the election structure with the peer caches, which
 * would allow us to clear the dampening when the previous winner
 * leaves (and would allow other improvements as well).
 */
```
Hi Ken,
I'm trying to understand this part of code and how a reported hiccup/race condition is relevant to the comment here.
Node1 was lost. While fencing targeting node1 was in progress, the node came back:
2021-09-28T13:39:47+0000 node2 corosync[18758]: [TOTEM ] A new membership (10.42.34.12:604) was formed. Members joined: 1
Apparently attrd on node1 was still running and rejoined the attrd CPG group:
2021-09-28T13:39:47+0000 node2 pacemaker-attrd[18809]: notice: Node node1 state is now member
And while a new election was on-going between attrd daemons:
Sep 28 13:39:48 node2 pacemaker-attrd [18809] (election_count_vote) info: election-attrd round 1 (owner node ID 1) pass: vote from node1 (Uptime)
, node1 got lost again almost immediately, presumably around 13:39:49 given the 20s token timeout:
2021-09-28T13:40:09+0000 node2 corosync[18758]: [TOTEM ] A processor failed, forming new configuration
Then the node was declared lost again:
2021-09-28T13:40:33+0000 node2 corosync[18758]: [TOTEM ] A new membership (10.42.34.63:608) was formed. Members left: 1
2021-09-28T13:40:33+0000 node2 pacemaker-attrd[18809]: notice: Node node1 state is now lost
But the existing election timer of attrd wasn't canceled until it timed out:
Sep 28 13:41:48 node2 pacemaker-attrd [18809] (election_timer_cb) info: election-attrd timed out, declaring local node as winner
Sep 28 13:41:48 node2 pacemaker-attrd [18809] (attrd_declare_winner) notice: Recorded local node as attribute writer (was unset)
Apparently during the 120s between 13:39:48 and 13:41:48, there was no attrd writer. The intermediate updates of node attributes didn't get reflected into the cib.
When the peer is lost again, should the existing election be canceled/reset/restarted anyway for example in attrd_elections.c: attrd_remove_voter()?
My gut feeling is that there is a reason we don't start a new election every time any peer leaves, but I don't remember why. Maybe something to do with a flapping network.
The race condition described in the comment is when a node wins an election, then leaves the cluster, and a new election is called, all within 2 seconds (the dampening time). In that case, the other nodes will ignore the new election since they lost recently, even though the node they lost to is no longer part of the cluster.
One complication is that when an election is won, only the node that wins knows who the winner is. Losing nodes simply do nothing, and only know that they lost to some particular nodes, not which one eventually won. That prevents us from ignoring the dampening if the last winner is no longer in the cluster, which would be the ideal solution. (I now realize the comment is wrong that integrating with the peer cache would let us do that -- we still wouldn't know the identity of the last winner.)
Attrd uses autoreap for the peer cache and doesn't track the known peer cache, so we can't use peer flags to record whether we won or lost the last election round against each peer. If we could, I was thinking we might use that to ignore dampening when any node we lost to is no longer in the cluster (though I'm not sure that would cover every possible situation). This might still work if attrd maintained a "lost peer" cache with each node's name and win/loss.
Alternatively maybe the winner could announce itself with a broadcast, then the losing nodes would know the last winner. But that would still leave timing issues, if the winner leaves the cluster before it is able to announce itself, or another node processes a vote before getting the broadcast.
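For context, the dampening behavior under discussion amounts to roughly the following. This is a minimal sketch with made-up names and fields, not the actual libcrmcluster internals:

```c
#include <stdbool.h>
#include <time.h>

/* Illustrative dampening window (the 2 seconds mentioned above) */
#define DAMPENING_SECS 2

/* Illustrative per-election state: only *when* we last lost is
 * recorded, never *who* eventually won. */
struct election_sketch {
    time_t last_lost;   /* when this node last lost a round; 0 if never */
};

/* Should a freshly received vote be ignored because we lost recently?
 * The race discussed above: the node we lost to may have since left
 * the cluster, but losers never learn the winner's identity, so there
 * is no way to check for that here. */
static bool vote_ignored_by_dampening(const struct election_sketch *e,
                                      time_t now)
{
    return (e->last_lost != 0) && (now - e->last_lost < DAMPENING_SECS);
}
```

The limitation Ken notes falls directly out of this shape: the state records when we lost, but not to whom, so the losing side cannot special-case "the previous winner already left."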
```c
election_timeout_stop(e);

/* Start a new election by voting down this, and other, peers */
e->state = election_start;
```
Indeed there are complicated situations that need to be considered carefully...
According to the message, we didn't lose in this case though:
Sep 28 13:39:48 node2 pacemaker-attrd [18809] (election_count_vote) info: election-attrd round 1 (owner node ID 1) pass: vote from node1 (Uptime)
Instead we ran into the logic here and started a new election trying to vote down the peer. I assume right when we were waiting for a no-vote from the peer, the peer got lost. We didn't give up waiting, but of course never got to receive another vote to run election_check() again or react to the membership change until the long 120s timer popped.
Ah right, the dampening path doesn't show that log message ...
For this situation, election_count_vote() returns election_start, and attrd calls election_vote() to start a new election (which stops the previous election's timer). Unfortunately the "Started ..." log is debug level so it may not be possible to confirm whether that happened in this case.
election_remove() takes the leaving node out of the election "voted" hash table, and election_check() compares the size of the vote table and the number of active peers in the peer cache. The question is whether one or the other of those numbers can get out of sync with reality, or whether election_check() was not called when it should.
attrd only calls election_check() when it gets a vote from a peer, so maybe the issue is that all the other peers had already voted when the one peer left. Then we would never check the election. Maybe we need to call election_check() when a peer leaves, if there's an election in progress.
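The scenario above can be modeled in miniature. This is a hypothetical sketch with invented names (the real logic is election_check() in lib/cluster/election.c and its attrd callers), assuming the win condition is simply "every active peer's vote has been recorded":

```c
#include <stdbool.h>

/* Simplified election bookkeeping: a counter standing in for the
 * "voted" hash table, and one for active members of the peer cache. */
struct election_model {
    unsigned votes_recorded;
    unsigned active_peers;
    bool won;
};

/* Stand-in for election_check(): declare a win once votes have been
 * recorded for all active peers. */
static void check_election(struct election_model *e)
{
    if (e->votes_recorded >= e->active_peers) {
        e->won = true;
    }
}

/* The hang: in a 2-node cluster the local vote arrives first
 * (1 vote, 2 active peers -> no win). If the peer then leaves,
 * the peer cache shrinks, but nothing re-runs the check, so the
 * election sits until the long timer pops. Re-checking from the
 * peer-departure path, as suggested above, closes the gap. */
static void on_peer_left(struct election_model *e)
{
    e->active_peers--;
    check_election(e);
}
```

With this model, the 2-node case described in the logs plays out exactly: the local vote alone never satisfies the check, and only a re-check on peer departure (rather than on vote arrival) resolves the election before the timer expires.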
> Ah right, the dampening path doesn't show that log message ...
>
> For this situation, election_count_vote() returns election_start, and attrd calls election_vote() to start a new election (which stops the previous election's timer). Unfortunately the "Started ..." log is debug level so it may not be possible to confirm whether that happened in this case.

AFAICT from the code, it probably is the only possibility?

> election_remove() takes the leaving node out of the election "voted" hash table, and election_check() compares the size of the vote table and the number of active peers in the peer cache. The question is whether one or the other of those numbers can get out of sync with reality, or whether election_check() was not called when it should.
>
> attrd only calls election_check() when it gets a vote from a peer,

IIUC attrd_handle_election_op() also handles votes from the local node, and election_count_vote() records a local vote if the local node itself is the election owner:
https://github.com/ClusterLabs/pacemaker/blob/master/lib/cluster/election.c#L576-L591

> so maybe the issue is that all the other peers had already voted when the one peer left. Then we would never check the election.

I think so. Given that this is a 2-node cluster, I assume it received and handled a local vote before the peer left.

> Maybe we need to call election_check() when a peer leaves, if there's an election in progress.

Sounds sensible to me :-) I'll give it a test, although it might not be easy to produce the exact race condition ...
> > For this situation, election_count_vote() returns election_start, and attrd calls election_vote() to start a new election (which stops the previous election's timer). Unfortunately the "Started ..." log is debug level so it may not be possible to confirm whether that happened in this case.
>
> AFAICT from the code, it probably is the only possibility?

As the saying goes, "trust but verify" ;)

> > so maybe the issue is that all the other peers had already voted when the one peer left. Then we would never check the election.
>
> I think so. Given that this is a 2-node cluster, I assume it received and handled a local vote before the peer left.

Ah, that makes sense that it would be much more likely in a 2-node cluster.

> > Maybe we need to call election_check() when a peer leaves, if there's an election in progress.
>
> Sounds sensible to me :-) I'll give it a test although it might not be easy to produce the exact race condition ...

Maybe a gdb breakpoint before the peer can vote
Good idea! I set a breakpoint right before the call to send_no_vote() and managed to reproduce the race. This fix seems to work well:
#2548
Commit 435b772 ironically made a second avenue for the problem it fixed more likely, by making it easier to hit a pre-existing timing issue in the libcrmcluster election code. This fixes that second avenue.