Flush namespace sinks' residuals before destroying namespace#3711
Flush namespace sinks' residuals before destroying namespace#3711rhc54 merged 2 commits intoopenpmix:masterfrom
Conversation
|
Afraid I don't understand - how is this new function any different than the one right above it ( Just trying to understand why we would want to flush the residuals twice after this change, and if perhaps this isn't really the source of the hang you observed. |
|
Yeah, it's about the timing. The namespace gets destroyed in The changed function signature lets us flush only the specific sinks' residuals, so that any continuing namespaces' outputs don't risk getting mangled by printing unfinished lines. |
|
So you are flushing the residuals relating to a specific nspace when that nspace is deregistered. At server finalize, we flush all remaining residuals from the global list. This implies that some namespaces were not deregistered prior to finalizing the server. Is that what you are seeing? Or is the residual list empty? I'm wondering if the complete fix would include replacing the residual flush in server finalize with a pass through the list of registered namespaces to deregister those that remain when we finalize the server? This would ensure we cleaned up everything associated with the namespace, including the residuals. In fact, we do that just a few lines below: PMIX_LIST_FOREACH (ns, &pmix_globals.nspaces, pmix_namespace_t) {
/* ensure that we do the specified cleanup - if this is an
* abnormal termination, then the nspace object may not be
* at zero refcount */
pmix_execute_epilog(&ns->epilog);
}Perhaps we need to revise that loop a little bit to basically call "deregister_nspace" - that would cover the epilog as well as all the other required cleanup. I know we are finalizing, but right now it feels like we are forgetting that the host may continue on without us, meaning that the library needs to fully cleanup. Not saying you (or this PR) have to do all that - just wanting to see if we completely grok the problem. |
|
All namespaces are deregistered before finalizing (to my knowledge, w/ my tests, etc etc), that's not the problem. The problem is that when the namespace is deregistered, it just needs a tiny bit of extra cleanup. The namespace currently cleans up any of its own Any residuals with dangling pointers end up sitting in the residuals list 'forever' (because that namespace is finalized, so it won't get any more iof messages that would finish a residual line). The bug manifests in a deadlock during finalize only because that's when the server finally tries to clean up those residuals, but the actual bug is during the namespace deregister step. |
|
The safest approach would probably be to put these lines: Inside of the I just didn't want to go verify that it would be safe to access the residuals list and do IO at every existing location where sinks can be destroyed. |
I just realized what you were asking here. I think the residual list might not be empty even if all namespaces are deregistered, because there could be residuals for the |
Thanks - that indeed was my question.
Scanning the code, there currently is only one place where sinks are released - and that is in the However, I'm willing to take this as-is and make the change later. While I agree that PRRTE will deregister known namespaces (i.e., namespaces that are directly "registered" by its host), there are other mechanisms by which namespace objects get added to the server (not part of the minimal "mpirun" operation). The current server finalize doesn't ensure everything associated with any lingering namespace objects gets properly cleaned up, and that should also be fixed. Revising this in that second stage would be fine with me - make sense? |
cbcd4e4 to
afa8bba
Compare
Signed-off-by: Matthew Whitlock <mwhitlo@sandia.gov>
As per the discussion on the PR, move the sink cleanup calls to the sink destructor. Signed-off-by: Ralph Castain <rhc@pmix.org>
|
I went ahead and added it here. Had to add a little protection because clients don't have residuals (only servers do), but this better encapsulates the cleanup. |
I was getting intermittent hangs during PMIx_server_finalize when running MPI jobs with the
--output file=<filename>option. The problem is that sinks get destroyed when the namespace is destroyed, but any residuals on those sinks keep a pointer to thepmix_iof_write_event_tobject in those sinks. Flushing the residuals during finalize causes PMIx to add invalid events to the event base, which then hangs when being closed by libevent.