ch4/coll: Fix reduce composition alpha with non-zero root#6543
ch4/coll: Fix reduce composition alpha with non-zero root#6543raffenet merged 1 commit intopmodels:mainfrom
Conversation
|
test:mpich/ch4/most |
|
|
|
test:mpich/ch4/most |
|
I was trying to enhance the test to cover this bug, with following patch: But strangely, the enhanced test passes even without your fix. It appears that the posix release gather algorithm works even with the wrong usage of |
|
test:mpich/ch4/most |
I was wondering why |
Both |
|
@hzhou did you figure out a consistent test config for this one? |
|
I haven't got time yet. I'll approve the PR and add the test as future PRs. |
hzhou
left a comment
There was a problem hiding this comment.
LGTM
TODO: add test as future PRs
We need to handle the case where a non-zero root uses MPI_IN_PLACE. Otherwise we could try reading from a bad address and crash. Fixes pmodels#6540. NOTE: For single node reduce operation with non-zero root, this composition incurs an extra copy from rank 0->root.
| } | ||
|
|
||
| /* non-zero root needs to send from recvbuf if using MPI_IN_PLACE */ | ||
| intra_sendbuf = (sendbuf == MPI_IN_PLACE && root != 0) ? recvbuf : sendbuf; |
There was a problem hiding this comment.
I found why the release_gather code was immune to the bug. It is due to
mpich/src/mpid/ch4/shm/posix/release_gather/nb_reduce_release_gather.h
Lines 397 to 399 in 8451885
In general, are we allowed to have sendbuf and recvbuf overlap? Apparently all our reduce algorithms works with this replacement, but I wonder whether this is a potential hazard.
Pull Request Description
We need to handle the case where a non-zero root uses
MPI_IN_PLACE. Otherwise we could try reading from a bad address and
crash. Fixes #6540.
NOTE: For single node reduce operation with non-zero root, this
composition incurs an unnecessary extra copy.
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short descriptionCommit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your companies PR approval manager.