Fix: controld: ensure existing node attributes get written back into CIB after daemons failed and got respawned #1693
If a pacemaker daemon exits unexpectedly, that daemon and possibly other daemons, including controld, get respawned. Previously, depending on which daemon had failed, if controld was respawned either by itself or together with other daemons, existing attributes of this node might not be written back into the CIB. This was found via the CTS test "Reattach", which randomly failed due to loss of the node attribute "connected" in the CIB right after the test "ComponentFail", for example when controld or execd was the failed component.
|
Actually, now I remember running into this issue myself. I thought I traced it as expected behavior in certain situations, but I don't remember the details. Do you know what sequence of events causes this? I'm not sure if this was for the same problem I'm remembering or not, but there's this note for attrd_erase_attrs(): |
|
This issue doesn't occur if attrd gets the chance to be respawned. But if, for example, only controld gets respawned, the node will be deleted from the CIB status section, while attrd won't be aware that the attributes have disappeared from the CIB. |
|
AFAICS, this issue occurs when:
So basically it occurs when controld gets respawned while attrd keeps running. |
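
To make the sequence above concrete, here is a toy model in Python. It is only a sketch: `ToyCib`, `ToyAttrd`, and `respawn_controld` are hypothetical names, not Pacemaker code, and it assumes that attrd only writes values that differ from what it last wrote, which is one way attrd could remain unaware that the CIB lost them.

```python
# Toy model only: every class and function here is hypothetical and does
# not mirror Pacemaker's real implementation.

class ToyCib:
    """Stands in for the status section of the CIB."""

    def __init__(self):
        self.status = {}  # node name -> {attribute name: value}

    def erase_node_status(self, node):
        self.status.pop(node, None)


class ToyAttrd:
    """Stands in for attrd: holds attributes in memory, writes them to the CIB."""

    def __init__(self, cib):
        self.cib = cib
        self.memory = {}        # (node, attribute) -> value
        self.last_written = {}  # what attrd believes is already in the CIB

    def update(self, node, name, value):
        self.memory[(node, name)] = value
        self.write_changed()

    def write_changed(self):
        # Assumption: only values that differ from the last write are sent.
        for key, value in self.memory.items():
            if self.last_written.get(key) != value:
                node, name = key
                self.cib.status.setdefault(node, {})[name] = value
                self.last_written[key] = value

    def write_all(self):
        # Forget what was written so that everything is sent again.
        self.last_written.clear()
        self.write_changed()


def respawn_controld(cib, node):
    # In this model, a respawned controld erases the node's status entry.
    cib.erase_node_status(node)


cib = ToyCib()
attrd = ToyAttrd(cib)
attrd.update("node1", "connected", "1")

respawn_controld(cib, "node1")  # controld respawns while attrd keeps running
attrd.write_changed()           # attrd sees nothing new to write
lost = "node1" not in cib.status

attrd.write_all()               # an explicit write-back restores the attribute
restored = cib.status["node1"]["connected"]
```

In this model the attribute stays lost until something asks attrd to write everything back, which is the kind of write-back this PR is about.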
|
Ahhh of course, this is https://bugs.clusterlabs.org/show_bug.cgi?id=5351 . Let's coordinate that discussion with this one. We actually used to do what you're proposing, but it was removed in fe44f40 for scalability reasons. Unfortunately that did not foresee corner cases like CLBZ#5329, #5351, and #5375. I'm reluctant to revert that commit for the scalability issue, but we might have to, at least until we come up with a better solution. My inclination (as described in #5351) is to put clearing transient attributes from the CIB in the hands of attrd rather than the controller. @HideoYamauchi is looking into how difficult that might be. I suspect we may still want the controller to initiate the clearing, but via an attrd op rather than directly, so attrd always keeps the CIB in sync with what's in its memory. |
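
A rough sketch of the idea of routing the clearing through attrd, again as a toy model in Python. `ToyAttrd.clear_node` and `controller_clear_transient` are hypothetical names used for illustration, not an existing attrd operation or controller function.

```python
# Toy model only: names are hypothetical and not taken from Pacemaker.

class ToyCib:
    def __init__(self):
        self.status = {}  # node name -> {attribute name: value}


class ToyAttrd:
    def __init__(self, cib):
        self.cib = cib
        self.memory = {}  # node name -> {attribute name: value}

    def update(self, node, name, value):
        self.memory.setdefault(node, {})[name] = value
        self.cib.status.setdefault(node, {})[name] = value

    def clear_node(self, node):
        # The clearing op changes attrd's memory and the CIB together,
        # so the two cannot drift apart.
        self.memory.pop(node, None)
        self.cib.status.pop(node, None)

    def in_sync(self):
        return self.memory == self.cib.status


def controller_clear_transient(attrd, node):
    # The controller initiates the clearing but never touches the CIB itself.
    attrd.clear_node(node)


cib = ToyCib()
attrd = ToyAttrd(cib)
attrd.update("node1", "connected", "1")

controller_clear_transient(attrd, "node1")
synced_after_clear = attrd.in_sync()

attrd.update("node1", "connected", "1")
synced_after_update = attrd.in_sync()
```

The point of the design in this sketch is that there is a single writer for transient attributes, so attrd's memory and the CIB are changed in the same step.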
|
Good thinking. I assume that, at the very least, controld shouldn't directly clear the transient attributes while the node is still unclean. Since @HideoYamauchi is already working on this, I'll patiently wait :-) |
|
Hi Ken, Hi Yan, From now on I will work on this problem seriously. Many thanks, |