Subdaemon heartbeat with modified libqb (async API for connect) #2588

Merged
kgaillot merged 3 commits into ClusterLabs:main from wenningerk:subdaemon_heartbeat_mod_libqb on Jan 19, 2022

Conversation

@wenningerk
Contributor

@wenningerk wenningerk commented Dec 9, 2021

Keep pacemakerd tracking subdaemons for liveness, currently via qb-ipc-connect and the packets exchanged for authentication.
With current libqb, qb-ipc-connect blocks for an indefinite time if the subdaemon is unresponsive (e.g. SIGSTOPped or stuck in a busy mainloop).
Hence there is an experimental API extension of libqb (ClusterLabs/libqb#450) to be able to deal with that without needing ugly workarounds.
This is also the reason why CI is expected to fail at this point: upstream libqb master is still missing the API extension.

@kgaillot kgaillot left a comment

This approach seems reasonable to me

@wenningerk wenningerk force-pushed the subdaemon_heartbeat_mod_libqb branch 3 times, most recently from bd94be0 to f07e7b3 on January 14, 2022
@kgaillot kgaillot left a comment

Looks good. Are you planning to keep the old code if the new libqb API isn't available?

@wenningerk
Contributor Author

The last push also solves the issue of a hanging shutdown when there were subdaemons that weren't observed as children of pacemakerd (via signal).
I had experienced a similar hang with the current main version, where I suppose it was just hanging on the libqb API.

@wenningerk
Contributor Author

The build issue is a repository issue with Tumbleweed.

@kgaillot

> Last push as well solves the issue with a hanging shutdown once there were subdaemons that weren't observed as children of pacemakerd (signal). I had experienced a similar hang with current main version where I suppose it was just hanging on the libqb-API.

Do you know if that hang was a regression in a released version?

@kgaillot kgaillot left a comment

I think this is a good approach; we just need a fallback for when the libqb API isn't available.

```c
pcmk_children[next_child].name,
(long long) PCMK__SPECIAL_PID_AS_0(
    pcmk_children[next_child].pid),
(rc == pcmk_rc_ipc_pid_only)? " as IPC server" : "");
```

Now that we don't fall through we don't need this check or the similar one below

Contributor Author

oops

@wenningerk
Contributor Author

> Last push as well solves the issue with a hanging shutdown once there were subdaemons that weren't observed as children of pacemakerd (signal). I had experienced a similar hang with current main version where I suppose it was just hanging on the libqb-API.
>
> Do you know if that hang was a regression in a released version?

No, I had never tested much with pre-existing daemons.
Maybe it even works if you give it more time, and the issue only arises with sbd enabled.
Now that I know it isn't the same thing as with my code, I should look at it once more.

@wenningerk wenningerk force-pushed the subdaemon_heartbeat_mod_libqb branch from f07e7b3 to a84bd9b on January 17, 2022
@wenningerk wenningerk changed the title from "[WIP] Subdaemon heartbeat with modified libqb (async API for connect)" to "Subdaemon heartbeat with modified libqb (async API for connect)" on Jan 18, 2022
configure.ac Outdated

```diff
 dnl libqb 2.0.2+ (2020-10)
 AC_CHECK_FUNCS(qb_ipcc_auth_get,
-    AC_DEFINE(HAVE_IPCC_AUTH_GET, 1,
+    AC_DEFINE(HAVE_QB_IPCC_AUTH_GET, 1,
```
@kgaillot kgaillot Jan 19, 2022

Actually (thankfully) this shouldn't be necessary. AC_CHECK_FUNCS() will already define that, so we were just unnecessarily defining the alternate name. We can just drop the second argument (i.e. the AC_DEFINE) altogether.
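For reference, a sketch of the simplified check this suggests (assuming the function name from the diff above): `AC_CHECK_FUNCS` already defines `HAVE_QB_IPCC_AUTH_GET` on its own (the `HAVE_` prefix plus the upper-cased function name) when the function is found, so the explicit `AC_DEFINE` second argument can simply be dropped.

```m4
dnl Sketch only: AC_CHECK_FUNCS defines HAVE_QB_IPCC_AUTH_GET
dnl automatically when qb_ipcc_auth_get is found, so no explicit
dnl AC_DEFINE is needed.
dnl libqb 2.0.2+ (2020-10)
AC_CHECK_FUNCS([qb_ipcc_auth_get])
```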

Contributor Author

Then let's make it consistent, if we are sure it works the same on all platforms/versions we support.

@wenningerk wenningerk force-pushed the subdaemon_heartbeat_mod_libqb branch from a84bd9b to f62b28f on January 19, 2022
@wenningerk wenningerk force-pushed the subdaemon_heartbeat_mod_libqb branch from f62b28f to 8e8a4a3 on January 19, 2022
@kgaillot kgaillot merged commit 2c937a4 into ClusterLabs:main Jan 19, 2022
```c
(long long) PCMK__SPECIAL_PID_AS_0(
    pcmk_children[next_child].pid),
pcmk_children[next_child].check_count);
stop_child(&pcmk_children[next_child], SIGKILL);
```
Member

In public clouds, it nowadays happens more often than before that subdaemons are unresponsive to IPC and get respawned.

As we know, if it's the controller that respawns, the node will lose all its transient attributes in the CIB status section without them being written again. Not only will resources that rely on those attributes be impacted, but the missing internal attribute #feature-set will also result in a confusing MIXED-VERSION condition being shown by interfaces like crm_mon.

So far, PCMK_fail_fast=yes is probably the only workaround to get the situation back to sanity, but of course at the cost of a node reboot.

While we've been trying to address this with ideas like #1699, I'm not sure whether it would make sense to increase the tolerance here, such as raising PCMK_PROCESS_CHECK_RETRIES or making it configurable. Or should we say that 5 failures in a row are bad enough anyway to trigger a recovery?

Contributor Author

Sorry, I may be missing the reason for your comment here.
Previously, IPC wasn't checked on a periodic basis for all subdaemons.
The numbers are somewhat arbitrary: 1s is about the lowest retry interval that makes sense, and failing after 5 retries was an attempt to keep it as reactive as before in the cases where IPC was already being checked.

Member

Nothing is wrong with the changes in this PR. Just bringing up the topic in this context :-)


Coincidentally I recently created https://projects.clusterlabs.org/T950 regarding this code, but it's not related unless you've only seen issues at cluster shutdown.

https://projects.clusterlabs.org/T73 is not directly related either but could affect the timing.

There is a 1s delay between checks of all subdaemons, so if they're all up, that's at least 6s between checks for any one subdaemon. 5 tries (30s) does seem plenty of time, so I wouldn't want to raise that. If a cloud host can't get enough cycles in 30s to respond to a check, it's probably unsuitable as an HA node.

Member

Thanks for the info and opinion. I agree.


@wenningerk Any idea about how pacemaker-based processes the ping request from other sub-daemons?
Thanks.

Contributor Author

I don't know/remember at that level of detail, so give me a bit to check. Maybe somebody who knows the answers off the top of their head can jump in.

Member

AFAICS, regardless of how a system runs into a difficult situation, the situation may be reflected by two factors that affect pacemakerd's liveness checks on subdaemons:

  1. How long the situation lasts
  2. How difficult the situation is

Regarding factor 1), as also mentioned above, there's at least 6s between checks for a specific subdaemon, so 5 tries (30s) offer a certain degree of tolerance.

Regarding factor 2), it surely affects how soon pacemakerd can receive a response from a subdaemon in a single liveness check. So far that is restricted by the 2s poll timeout, though:

```c
poll_rc = poll(&pollfd, 1, 2000);
```

This seems a little too sensitive in practice. We've seen user cases where it often timed out right before the subdaemon was about to send a response.

Such liveness-check timeouts are kind of premature IMO. They almost always appear while there are not yet any real timeouts/failures on the production IPC between pacemaker daemons. AFAICS, for the production IPC, timeouts are usually longer than 5s, for example in the typical cases with crm_ipc_send().

I wonder whether it would make sense to increase the tolerance of this specific liveness-check IPC. There have been experiments in users' clusters where an enlarged poll timeout definitely increased the tolerance. The question is whether it would make sense to generally enlarge the hardcoded poll timeout, or even make it configurable.

Contributor

I would prefer to not make this configurable. We already have so many tuneables that it can be pretty hard to keep track of it all.

I would be okay with increasing this timeout, because I've also seen the "Subdaemon is unresponsive to IPC" message more than I'd like, when what's really happening is the daemon is just bogged down dealing with too many messages. In general, this whole area seems more problematic than I wish it were. Daemons get bogged down too easily, and we give up on them too easily. As the number of resources increases, the problems just get worse.

On the other hand, this is called by child_liveness which in turn is called in a loop by find_and_track_existing_processes. I worry that if we increase the timeout here, one slow daemon is going to have ripple effects for other daemons later on in the loop. Ken had mentioned at one point making this truly async, which I think would be a good thing to do along with increasing the timeout, but obviously that is going to be a more difficult patch.

Finally, I've been looking at this code a little bit recently, and our process management code is pretty gross. It'd be great if someone with a really good understanding of this stuff could clean it up, suggest a library we could be using, etc.

Contributor Author

I'm not saying relaxing the timeout a bit wouldn't make sense, but there is actually an idea behind making this trigger a little earlier than the actual IPC timeouts: this way we can see how often we run into situations where we get close to an IPC timeout. And action is only taken anyway if that happens 5 times in a row.

@HideoYamauchi
Contributor

Hi @wenningerk,
Hi All,

So far, we have not received any reports of similar issues from the Japanese community or from our users when monitoring subdaemons.

I agree with Klaus.
If time permits, we should look into more situations where this problem actually occurs.

Best Regards,
Hideo Yamauchi.

gao-yan added a commit to gao-yan/pacemaker that referenced this pull request Jun 24, 2025
…on sub-daemons

If a node is temporarily running into a difficult situation, for the
liveness checks on sub-daemons by pacemakerd, the 5
`PCMK_PROCESS_CHECK_RETRIES` offer a certain degree of tolerance. But in
practice with the current 2s of poll() timeout, if the situation is a
little too difficult, the liveness checks seem to run into 5 failures in
a row rather easily while the situation still seems acceptable.

This commit offers a longer timeout of 5s for a sub-daemon to respond to
a liveness check.

See discussion at:
ClusterLabs#2588 (comment)
gao-yan added a commit to gao-yan/pacemaker that referenced this pull request Jul 24, 2025 (same commit message as above)