Fix: controld: ensure existing node attributes get written back into CIB after daemons failed and got respawned by gao-yan · Pull Request #1693 · ClusterLabs/pacemaker

gao-yan · 2019-02-01T15:20:05Z

If a pacemaker daemon exits unexpectedly, that daemon, and possibly
other daemons including controld, will be respawned. Previously,
depending on which daemon had failed, if controld was respawned, either
alone or together with other daemons, the existing attributes of this
node might not be written back into the CIB.

This was found through the CTS test "Reattach" randomly failing because
the node attribute "connected" was lost from the CIB right after the
test "ComponentFail", for example when controld or execd was the failed
component.


kgaillot · 2019-02-01T15:55:33Z

There were some very recent fixes in this area; see #1685 and #1686. Did the problem occur after those were applied?

I'd expect this to be an attrd writer issue, but if attrd didn't respawn that seems unlikely, so I'm not sure what's going wrong.

kgaillot · 2019-02-01T16:12:37Z

Actually, now I remember running into this issue myself. I thought I traced it as expected behavior in certain situations, but I don't remember the details. Do you know what sequence of events causes this?

I'm not sure if this was for the same problem I'm remembering or not, but there's this note for attrd_erase_attrs():

 * \todo If pacemaker-attrd respawns after crashing (see PCMK_respawned),
 *       ideally we'd skip this and sync our attributes from the writer.
 *       However, currently we reject any values for us that the writer has, in
 *       attrd_peer_update().

gao-yan · 2019-02-01T16:26:22Z

This issue doesn't occur if attrd gets the chance to be respawned.

For example, if only controld gets respawned, the node will be deleted from the CIB status section, while attrd won't be aware that the attributes have disappeared from the CIB.

gao-yan · 2019-02-01T16:32:10Z

AFAICS, this issue occurs in any of these cases:

- controld fails and gets respawned.
- execd fails; execd and controld get respawned.
- schedulerd on the DC fails; schedulerd and controld get respawned.

So basically it occurs when controld gets respawned while attrd keeps running.
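
To make the divergence concrete, here is a minimal toy model in Python. It is not Pacemaker code: the class `ToyNode` and its methods are hypothetical names, and the only behavior modeled is the one described in this thread, namely that a respawned controld removes the node's entries from the CIB status section while the still-running attrd keeps its in-memory values and is not told.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Toy model of one node's transient attributes; not Pacemaker code."""
    attrd_memory: dict = field(default_factory=dict)  # what attrd believes is set
    cib_status: dict = field(default_factory=dict)    # what the CIB status section holds

    def attrd_update(self, name, value):
        # attrd records the value in memory and writes it to the CIB.
        self.attrd_memory[name] = value
        self.cib_status[name] = value

    def controld_respawned(self):
        # Assumption from the thread: the respawned controller deletes the node
        # from the CIB status section, and attrd is not notified.
        self.cib_status.clear()

    def missing_from_cib(self):
        # Attributes attrd still holds that are no longer present in the CIB.
        return sorted(set(self.attrd_memory) - set(self.cib_status))


node = ToyNode()
node.attrd_update("connected", "1")
node.controld_respawned()          # attrd keeps running throughout
print(node.missing_from_cib())     # ['connected']
```

In this model nothing ever prompts attrd to rewrite its values, which matches the observed loss of "connected" after "ComponentFail".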

kgaillot · 2019-02-01T16:44:08Z

Ahhh of course, this is https://bugs.clusterlabs.org/show_bug.cgi?id=5351 . Let's coordinate that discussion with this one.

We actually used to do what you're proposing, but it was removed in fe44f40 for scalability reasons. Unfortunately that did not foresee corner cases like CLBZ#5329, #5351, and #5375. I'm reluctant to revert that commit because of the scalability issue, but we might have to, at least until we come up with a better solution.

My inclination (as described in #5351) is to put clearing transient attributes from the CIB in the hands of attrd rather than the controller. @HideoYamauchi is looking into how difficult that might be. I suspect we may still want the controller to initiate the clearing, but via an attrd op rather than directly, so attrd always keeps the CIB in sync with what's in its memory.
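
As a rough illustration of that idea, the following Python sketch contrasts the two paths. It is only a toy model under stated assumptions: `SketchAttrd`, `clear_transient`, and the two `controller_*` functions are hypothetical names, not Pacemaker APIs, and the real attrd operation may look quite different.

```python
class SketchAttrd:
    """Toy attribute manager that owns both its memory and the CIB copy."""

    def __init__(self):
        self.memory = {}  # attrd's in-memory attribute values
        self.cib = {}     # transient attributes as stored in the CIB

    def update(self, name, value):
        self.memory[name] = value
        self.cib[name] = value

    def clear_transient(self):
        # Hypothetical attrd op: clear memory and CIB together.
        self.memory.clear()
        self.cib.clear()

    def in_sync(self):
        return self.memory == self.cib


def controller_clears_directly(attrd):
    # Models the behavior discussed above: the controller edits the CIB itself.
    attrd.cib.clear()


def controller_clears_via_attrd(attrd):
    # Proposed direction: the controller initiates, attrd performs the clearing.
    attrd.clear_transient()


direct = SketchAttrd()
direct.update("connected", "1")
controller_clears_directly(direct)

via_op = SketchAttrd()
via_op.update("connected", "1")
controller_clears_via_attrd(via_op)

print(direct.in_sync(), via_op.in_sync())  # False True
```

The design point is that only one component ever modifies the CIB copy, so its in-memory view cannot silently drift from what is stored.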

gao-yan · 2019-02-01T17:26:35Z

Good thinking. At the very least, controld probably shouldn't directly clear the transient attributes while the node is still unclean, I assume.

Since @HideoYamauchi is already working on this, I'll patiently wait :-)

HideoYamauchi · 2019-02-01T22:33:54Z

Hi Ken, Hi Yan,

From now on I will work on this problem seriously.
Please wait a little more.

Many thanks,
Hideo Yamauchi.

gao-yan closed this Feb 1, 2019

aleksei-burlakov mentioned this pull request Feb 12, 2020

(WIP) Make controller go through attribute manager to clear transient attributes. #1699

Closed

gao-yan wants to merge 1 commit into ClusterLabs:2.0 from gao-yan:controld-respawned-refresh-attrd
