Subdaemon heartbeat with modified libqb (async API for connect) #2588

Merged
kgaillot merged 3 commits into ClusterLabs:main from wenningerk:subdaemon_heartbeat_mod_libqb on Jan 19, 2022

Conversation

@wenningerk
Contributor

@wenningerk wenningerk commented Dec 9, 2021

Keep pacemakerd tracking subdaemons for liveness, currently via qb-ipc-connect and the packets exchanged for authentication.
With current libqb, qb-ipc-connect blocks for an indefinite time if the subdaemon is unresponsive (e.g. SIGSTOPped or stuck in a busy mainloop).
Hence there is an experimental API extension of libqb (ClusterLabs/libqb#450) to be able to deal with that without needing ugly workarounds.
This is also the reason why CI is expected to fail at this point: upstream libqb master is still missing the API extension.

@kgaillot kgaillot left a comment

This approach seems reasonable to me

@wenningerk wenningerk force-pushed the subdaemon_heartbeat_mod_libqb branch 3 times, most recently from bd94be0 to f07e7b3 on January 14, 2022
@kgaillot kgaillot left a comment

Looks good. Are you planning to keep the old code if the new libqb API isn't available?

@wenningerk
Contributor Author

The last push also solves the issue of a hanging shutdown when there were subdaemons that weren't observed as children of pacemakerd (via signal).
I had experienced a similar hang with the current main version, where I suppose it was just hanging on the libqb API.

@wenningerk
Contributor Author

The build issue is a repository issue with Tumbleweed.

@kgaillot

> Last push as well solves the issue with a hanging shutdown once there were subdaemons that weren't observed as children of pacemakerd (signal). I had experienced a similar hang with current main version where I suppose it was just hanging on the libqb-API.

Do you know if that hang was a regression in a released version?

@kgaillot kgaillot left a comment

I think this is a good approach; we just need a fallback for when the libqb API isn't available.

```c
pcmk_children[next_child].name,
(long long) PCMK__SPECIAL_PID_AS_0(
    pcmk_children[next_child].pid),
(rc == pcmk_rc_ipc_pid_only)? " as IPC server" : "");
```

Now that we don't fall through we don't need this check or the similar one below

Contributor Author

oops

@wenningerk
Contributor Author

> Last push as well solves the issue with a hanging shutdown once there were subdaemons that weren't observed as children of pacemakerd (signal). I had experienced a similar hang with current main version where I suppose it was just hanging on the libqb-API.
>
> Do you know if that hang was a regression in a released version?

No, I had never tested much with pre-existing daemons.
Maybe it even works if you give it more time, and the issue only arises with sbd enabled.
Now that I know it isn't the same thing as with my code, I should look at it once more.

@wenningerk wenningerk force-pushed the subdaemon_heartbeat_mod_libqb branch from f07e7b3 to a84bd9b on January 17, 2022
@wenningerk wenningerk changed the title from "[WIP] Subdaemon heartbeat with modified libqb (async API for connect)" to "Subdaemon heartbeat with modified libqb (async API for connect)" on Jan 18, 2022
configure.ac Outdated

```diff
 dnl libqb 2.0.2+ (2020-10)
 AC_CHECK_FUNCS(qb_ipcc_auth_get,
-    AC_DEFINE(HAVE_IPCC_AUTH_GET, 1,
+    AC_DEFINE(HAVE_QB_IPCC_AUTH_GET, 1,
```
@kgaillot kgaillot Jan 19, 2022

Actually (thankfully) this shouldn't be necessary. AC_CHECK_FUNCS() will already define that, so we were just unnecessarily defining the alternate name. We can just drop the second argument (i.e. the AC_DEFINE) altogether.
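For reference, a sketch of the simplified check this suggests (assuming the function name from the diff above): `AC_CHECK_FUNCS` already defines `HAVE_QB_IPCC_AUTH_GET` on its own (the `HAVE_` prefix plus the upper-cased function name) when the function is found, so the explicit `AC_DEFINE` second argument can simply be dropped.

```m4
dnl Sketch only: AC_CHECK_FUNCS defines HAVE_QB_IPCC_AUTH_GET
dnl automatically when qb_ipcc_auth_get is found, so no explicit
dnl AC_DEFINE is needed.
dnl libqb 2.0.2+ (2020-10)
AC_CHECK_FUNCS([qb_ipcc_auth_get])
```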

Contributor Author

Then let's make it consistent, if we are sure it works the same on all platforms/versions we support.

@wenningerk wenningerk force-pushed the subdaemon_heartbeat_mod_libqb branch from a84bd9b to f62b28f on January 19, 2022
@wenningerk wenningerk force-pushed the subdaemon_heartbeat_mod_libqb branch from f62b28f to 8e8a4a3 on January 19, 2022
@kgaillot kgaillot merged commit 2c937a4 into ClusterLabs:main Jan 19, 2022
```c
(long long) PCMK__SPECIAL_PID_AS_0(
    pcmk_children[next_child].pid),
pcmk_children[next_child].check_count);
stop_child(&pcmk_children[next_child], SIGKILL);
```
Member

In public clouds, it nowadays happens more often than before that subdaemons are unresponsive to IPC and get respawned.

As we know, if it's the controller that respawns, the node will lose all its transient attributes in the CIB status section without them being written again. Not only will resources that rely on those attributes be impacted, but the missing internal attribute #feature-set will also result in a confusing MIXED-VERSION condition being shown by interfaces like crm_mon.

So far, PCMK_fail_fast=yes is probably the only workaround to get the situation back to sanity, but of course at the cost of a node reboot.

While we've been trying to address this with ideas like #1699, I'm not sure whether it would make sense to increase the tolerance here, such as raising PCMK_PROCESS_CHECK_RETRIES or making it configurable. Or should we say that 5 failures in a row are bad enough anyway to trigger a recovery?

Contributor Author

Sorry, I may be missing the reason for your comment here.
Previously, IPC wasn't checked on a periodic basis for all subdaemons.
The numbers are somewhat arbitrary: 1s is about the lowest retry interval that makes sense, and failing after 5 retries was an attempt to keep it as reactive as before in the cases where IPC was already being checked.

Member

Nothing is wrong with the changes in this PR. Just bringing up the topic in this context :-)


Coincidentally I recently created https://projects.clusterlabs.org/T950 regarding this code, but it's not related unless you've only seen issues at cluster shutdown.

https://projects.clusterlabs.org/T73 is not directly related either but could affect the timing.

There is a 1s delay between checks of all subdaemons, so if they're all up, that's at least 6s between checks for any one subdaemon. 5 tries (30s) does seem plenty of time, so I wouldn't want to raise that. If a cloud host can't get enough cycles in 30s to respond to a check, it's probably unsuitable as an HA node.

Member

Thanks for the info and opinion. I agree.


@wenningerk Any idea about how pacemaker-based processes the ping request from other sub-daemons?
Thanks.

Contributor Author

I don't know/remember at that level of detail, so give me a bit to check. Maybe somebody who knows the answers off the top of their head can jump in.

Member

AFAICS, regardless of how a system runs into a difficult situation, the situation may be reflected by two factors that affect pacemakerd's liveness checks on subdaemons:

  1. How long the situation lasts
  2. How difficult the situation is

Regarding factor 1), as also mentioned above, there's at least 6s between checks for a specific subdaemon, so 5 tries (30s) offer a certain degree of tolerance.

Regarding factor 2), it surely affects how soon pacemakerd can receive a response from a subdaemon in a single liveness check. So far that is restricted by the 2s poll timeout, though:

```c
poll_rc = poll(&pollfd, 1, 2000);
```

This seems a little too sensitive in practice. We've seen user cases where it often timed out right before the subdaemon was about to send a response.

Such liveness-check timeouts are kind of premature IMO. They almost always appear while there are not yet any real timeouts/failures on the production IPC between pacemaker daemons. AFAICS, for the production IPC, timeouts are usually longer than 5s, for example in the typical cases with crm_ipc_send().

I wonder whether it would make sense to increase the tolerance of this specific liveness-check IPC. There have been experiments in users' clusters where an enlarged poll timeout definitely increased the tolerance. The question is whether it would make sense to generally enlarge the hardcoded poll timeout, or even make it configurable.

Contributor

I would prefer to not make this configurable. We already have so many tuneables that it can be pretty hard to keep track of it all.

I would be okay with increasing this timeout, because I've also seen the "Subdaemon is unresponsive to IPC" message more than I'd like, when what's really happening is the daemon is just bogged down dealing with too many messages. In general, this whole area seems more problematic than I wish it were. Daemons get bogged down too easily, and we give up on them too easily. As the number of resources increases, the problems just get worse.

On the other hand, this is called by child_liveness which in turn is called in a loop by find_and_track_existing_processes. I worry that if we increase the timeout here, one slow daemon is going to have ripple effects for other daemons later on in the loop. Ken had mentioned at one point making this truly async, which I think would be a good thing to do along with increasing the timeout, but obviously that is going to be a more difficult patch.

Finally, I've been looking at this code a little bit recently, and our process management code is pretty gross. It'd be great if someone with a really good understanding of this stuff could clean it up, suggest a library we could be using, etc.

Contributor Author

I'm not saying relaxing the timeout a bit wouldn't make sense, but there is actually an idea behind making this trigger a little earlier than the actual IPC timeouts: this way we can see how often we run into situations where we get close to an IPC timeout. And action is only taken anyway if that happens 5 times in a row.

@HideoYamauchi
Contributor

Hi @wenningerk,
Hi All,

So far, we have not received any reports of similar issues from the Japanese community or from our users when monitoring subdaemons.

I agree with Klaus.
If time permits, we should look into more situations where this problem actually occurs.

Best Regards,
Hideo Yamauchi.

gao-yan added a commit to gao-yan/pacemaker that referenced this pull request Jun 24, 2025
…on sub-daemons

If a node is temporarily running into a difficult situation, for the
liveness checks on sub-daemons by pacemakerd, the 5
`PCMK_PROCESS_CHECK_RETRIES` offer a certain degree of tolerance. But in
practice with the current 2s of poll() timeout, if the situation is a
little too difficult, the liveness checks seem to run into 5 failures in
a row rather easily while the situation still seems acceptable.

This commit offers a longer timeout of 5s for a sub-daemon to respond to
a liveness check.

See discussion at:
ClusterLabs#2588 (comment)
gao-yan added a commit to gao-yan/pacemaker that referenced this pull request Jul 24, 2025 (same commit message as above)