Skip to content

Fix: controld: ensure existing node attributes get written back into CIB after daemons failed and got respawned#1693

Closed
gao-yan wants to merge 1 commit intoClusterLabs:2.0from
gao-yan:controld-respawned-refresh-attrd
Closed

Fix: controld: ensure existing node attributes get written back into CIB after daemons failed and got respawned#1693
gao-yan wants to merge 1 commit intoClusterLabs:2.0from
gao-yan:controld-respawned-refresh-attrd

Conversation

@gao-yan
Copy link
Copy Markdown
Member

@gao-yan gao-yan commented Feb 1, 2019

If a pacemaker daemon unexpectedly exits, the daemon and probably also
other daemons including controld will try to get respawned. Previously
depending on which daemon had failed, if controld got respawned either
itself or together with some other daemons, existing attributes of this
node might not get written back into CIB again.

It was found by the cts test "Reattach" randomly failing due to loss of
the node attribute "connected" in the CIB right after the test
"ComponentFail", for example if controld or execd was the failed
component.

…CIB after daemons failed and got respawned

If a pacemaker daemon unexpectedly exits, the daemon and probably also
other daemons including controld will try to get respawned. Previously
depending on which daemon had failed, if controld got respawned either
itself or together with some other daemons, existing attributes of this
node might not get written back into CIB again.

It was found by the cts test "Reattach" randomly failing due to loss of
the node attribute "connected" in the CIB right after the test
"ComponentFail", for example if controld or execd was the failed
component.
@kgaillot
Copy link
Copy Markdown

kgaillot commented Feb 1, 2019

There were some very recent fixes in this area - see #1685 and #1686 . Did the problem occur after those were applied?

I'd expect this to be an attrd writer issue, but if attrd didn't respawn that seems unlikely, so I'm not sure what's going wrong.

@kgaillot
Copy link
Copy Markdown

kgaillot commented Feb 1, 2019

Actually, now I remember running into this issue myself. I thought I traced it as expected behavior in certain situations, but I don't remember the details. Do you know what sequence of events causes this?

I'm not sure if this was for the same problem I'm remembering or not, but there's this note for attrd_erase_attrs():

 * \todo If pacemaker-attrd respawns after crashing (see PCMK_respawned),
 *       ideally we'd skip this and sync our attributes from the writer.
 *       However, currently we reject any values for us that the writer has, in
 *       attrd_peer_update().

@gao-yan
Copy link
Copy Markdown
Member Author

gao-yan commented Feb 1, 2019

This issue doesn't occur if attrd gets the chance to be respawned.

For example if only controld gets respawned, the node will be deleted from cib status. While attrd won't be aware that the attributes have disappeared from the CIB.

@gao-yan
Copy link
Copy Markdown
Member Author

gao-yan commented Feb 1, 2019

AFAICS, this issue occurs when:

  1. controld fails and gets respawned.
  2. execd fails. execd and controld get respawned.
  3. schedulerd on DC fails. schedulerd and controld get respawned.

So basically it occurs when controld gets respawned while attrd keeps running.

@kgaillot
Copy link
Copy Markdown

kgaillot commented Feb 1, 2019

Ahhh of course, this is https://bugs.clusterlabs.org/show_bug.cgi?id=5351 . Let's coordinate that discussion with this one.

We actually used to do what you're proposing, but it was removed in fe44f40 for scalability reasons. Unfortunately that did not foresee corner cases like CLBZ#5329, #5351, and #5375. I'm reluctant to revert that commit for the scalability issue, but we might have to, at least until we come up with a better solution.

My inclination (as described in #5351) is to put clearing transient attributes from the CIB in the hands of attrd rather than the controller. @HideoYamauchi is looking into how difficult that might be. I suspect we may still want the controller to initiate the clearing, but via an attrd op rather than directly, so attrd always keeps the CIB in sync with what's in its memory.

@gao-yan
Copy link
Copy Markdown
Member Author

gao-yan commented Feb 1, 2019

Good thinking. Probably at least controld shouldn't directly clear the transient attributes when the node is still unclean yet I assume.

Since @HideoYamauchi is already working on this. I'll patiently wait :-)

@gao-yan gao-yan closed this Feb 1, 2019
@HideoYamauchi
Copy link
Copy Markdown
Contributor

Hi Ken, Hi Yan,

From now on I will work on this problem seriously.
Please wait a little more.

Many thanks,
Hideo Yamauchi.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants