Subdaemon heartbeat with modified libqb (async API for connect)#2588
kgaillot merged 3 commits into ClusterLabs:main
Conversation
kgaillot
left a comment
This approach seems reasonable to me
Force-pushed from bd94be0 to f07e7b3
kgaillot
left a comment
Looks good. Are you planning to keep the old code if the new libqb API isn't available?
The last push also solves the issue of a hanging shutdown when there were subdaemons that weren't observed as children of pacemakerd (signal).
The build issue is a repository issue with Tumbleweed.
Do you know if that hang was a regression in a released version?
kgaillot
left a comment
I think this is a good approach, we just need a fallback for when the libqb API isn't available
```c
pcmk_children[next_child].name,
(long long) PCMK__SPECIAL_PID_AS_0(
    pcmk_children[next_child].pid),
(rc == pcmk_rc_ipc_pid_only)? " as IPC server" : "");
```
Now that we don't fall through, we don't need this check or the similar one below.
No. I never tested much with pre-existing daemons.
Force-pushed from f07e7b3 to a84bd9b
configure.ac (outdated)
```diff
 dnl libqb 2.0.2+ (2020-10)
 AC_CHECK_FUNCS(qb_ipcc_auth_get,
-              AC_DEFINE(HAVE_IPCC_AUTH_GET, 1,
+              AC_DEFINE(HAVE_QB_IPCC_AUTH_GET, 1,
```
Actually (thankfully) this shouldn't be necessary. AC_CHECK_FUNCS() will already define that, so we were just unnecessarily defining the alternate name. We can drop the second argument (i.e. the AC_DEFINE) altogether.
Then let's make it consistent, provided we're sure it works the same on all platforms/versions we support.
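For reference, a minimal configure.ac sketch of the simplification discussed above (illustrative, not the exact final patch): AC_CHECK_FUNCS() already emits a HAVE_&lt;FUNCTION&gt; define in config.h for each function it finds, so the explicit AC_DEFINE is redundant.

```m4
dnl Sketch: AC_CHECK_FUNCS alone already defines HAVE_QB_IPCC_AUTH_GET
dnl in config.h when qb_ipcc_auth_get is found, so no explicit
dnl AC_DEFINE is needed.
dnl libqb 2.0.2+ (2020-10)
AC_CHECK_FUNCS([qb_ipcc_auth_get])
```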
Force-pushed from a84bd9b to f62b28f
Remove superfluous AC_DEFINE calls - one of them with a typo
Force-pushed from f62b28f to 8e8a4a3
```c
(long long) PCMK__SPECIAL_PID_AS_0(
    pcmk_children[next_child].pid),
pcmk_children[next_child].check_count);
stop_child(&pcmk_children[next_child], SIGKILL);
```
In public clouds nowadays, it happens more often than before that sub-daemons are unresponsive to IPC and get respawned.
As we know, if it's the controller that respawns, the node will lose all its transient attributes in the CIB status without them being written again. Not only will resources that rely on those attributes be impacted, but the missing internal #feature-set attribute will also result in a confusing MIXED-VERSION condition being shown by interfaces like crm_mon.
So far, PCMK_fail_fast=yes is probably the only workaround to bring the situation back to sanity, but of course at the cost of a node reboot.
While we've been trying to address this with ideas like #1699, I'm not sure whether it would make sense to increase the tolerance here, for example by raising PCMK_PROCESS_CHECK_RETRIES or making it configurable... Or should we say that 5 failures in a row are bad enough anyway to trigger a recovery?
Sorry, I may be missing the reason for your comment here.
Previously, IPC wasn't checked on a periodic basis for all subdaemons.
The numbers are somewhat arbitrary: 1s is roughly the lowest retry interval that makes sense, and failing after 5 retries was an attempt to stay as reactive as before for the cases where IPC was already being checked.
Nothing is wrong with the changes in this PR. I'm just bringing up the topic in this context :-)
Coincidentally I recently created https://projects.clusterlabs.org/T950 regarding this code, but it's not related unless you've only seen issues at cluster shutdown.
https://projects.clusterlabs.org/T73 is not directly related either but could affect the timing.
There is a 1s delay between checks of all subdaemons, so if they're all up, that's at least 6s between checks for any one subdaemon. 5 tries (30s) does seem plenty of time, so I wouldn't want to raise that. If a cloud host can't get enough cycles in 30s to respond to a check, it's probably unsuitable as an HA node.
Thanks for the info and opinion. I agree.
@wenningerk Any idea about how pacemaker-based processes the ping request from other sub-daemons?
Thanks.
I don't know/remember in that detail, so give me a bit to check. Maybe somebody who knows the answer off the top of their head can jump in.
AFAICS, regardless of how a system runs into a difficult situation, the situation can be reflected by two factors that affect pacemakerd's liveness checks on sub-daemons:
- How long the situation lasts
- How difficult the situation is
Regarding factor 1), as also mentioned above, there's at least 6s between checks for a specific sub-daemon, so 5 tries (30s) offer a certain degree of tolerance.
Regarding factor 2), for a single liveness check, it surely affects how soon pacemakerd can receive a response from a sub-daemon. So far that is restricted by the 2s poll() timeout though:
pacemaker/lib/common/ipc_client.c, line 1578 in fba6e95
It seems a little too sensitive in practice. We've seen user cases where it often timed out right before the sub-daemon was about to send a response.
Such liveness-check timeouts are kind of premature IMO: they almost always appear while no real timeouts/failures have occurred yet on the production IPC between pacemaker daemons. AFAICS, for the production IPC, timeouts are usually longer than 5s, for example in the typical cases with crm_ipc_send().
I wonder whether it would make sense to increase the tolerance on this specific liveness-check IPC. Experiments in users' clusters have shown that an enlarged poll() timeout definitely increases the tolerance. The question is whether it would make sense to generally enlarge the hardcoded poll timeout, or even make it configurable.
I would prefer to not make this configurable. We already have so many tuneables that it can be pretty hard to keep track of it all.
I would be okay with increasing this timeout, because I've also seen the "Subdaemon is unresponsive to IPC" message more than I'd like, when what's really happening is the daemon is just bogged down dealing with too many messages. In general, this whole area seems more problematic than I wish it were. Daemons get bogged down too easily, and we give up on them too easily. As the number of resources increases, the problems just get worse.
On the other hand, this is called by child_liveness which in turn is called in a loop by find_and_track_existing_processes. I worry that if we increase the timeout here, one slow daemon is going to have ripple effects for other daemons later on in the loop. Ken had mentioned at one point making this truly async, which I think would be a good thing to do along with increasing the timeout, but obviously that is going to be a more difficult patch.
Finally, I've been looking at this code a little bit recently, and our process management code is pretty gross. It'd be great if someone with a really good understanding of this stuff could clean it up, suggest a library we could be using, etc.
I'm not saying that relaxing the timeout a bit wouldn't make sense, but there is actually an idea behind making this trigger a little earlier than the actual IPC timeouts: this way we can see how often we're getting close to IPC timing out. And actions are only taken anyway if that happens 5 times in a row.
Hi @wenningerk,
So far, we have not received any reports of similar issues from the Japanese community or from our users when monitoring subdaemons. I agree with Klaus.
Best regards,
…on sub-daemons

If a node is temporarily running into a difficult situation, the 5 `PCMK_PROCESS_CHECK_RETRIES` offer a certain degree of tolerance for pacemakerd's liveness checks on sub-daemons. But in practice, with the current 2s poll() timeout, if the situation is a little too difficult, the liveness checks run into 5 failures in a row rather easily, even while the situation still seems acceptable.

This commit offers a longer timeout of 5s for a sub-daemon to respond to a liveness check.

See discussion at: ClusterLabs#2588 (comment)
Keep pacemakerd tracking subdaemons for liveness, via qb-ipc-connect and the packets exchanged for authentication as of now.
qb-ipc-connect in current libqb blocks for an indefinite time if the subdaemon is unresponsive (e.g. stopped via SIGSTOP, or busy in its mainloop).
Thus there is an experimental API extension of libqb (ClusterLabs/libqb#450) to deal with that without needing ugly workarounds.
This is also the reason why CI is expected to fail at this point: upstream master of libqb is missing the API extension.