Fix issues in reductions and thread predicates#470
Conversation
When TIDx/y/z are predicated, set the TIDx/y/z template flags as false Closes #367
csarofeen
left a comment
There was a problem hiding this comment.
This looks good to me. Do we assert somewhere if someone attempts to use 2 grid reductions in the same kernel? We actually may want to do this at some point and just want to make sure until then we throw a hard error.
Can you also please update the issues with the code that is now generated?
Thanks!
|
Added an issue of detecting multiple grid reductions (#475). |
|
Updated issue #367 with the generated kernel. |
|
Updated issue #468 with the generated kernel. |
| } | ||
| } | ||
|
|
||
| std::string generateGridReduceTemplateFlags( |
There was a problem hiding this comment.
Please place this local helper in a anonymous namespace
There was a problem hiding this comment.
It's already in a class defined in an anonymous namespace.
There was a problem hiding this comment.
Ah, right. One reason I don't like anonymous namespaces, you have to scroll around a lot to find them :(
There was a problem hiding this comment.
Yeah, that's a downside.
| ReductionOp* reduction_op_ = nullptr; | ||
| Allocate* reduction_buffer_ = nullptr; | ||
| Allocate* sync_buffer_ = nullptr; | ||
| ParallelTypeBitmap thread_pred_; |
There was a problem hiding this comment.
Why can't we use the "normal" predicate (Expr::predicate_) instead of this?
There was a problem hiding this comment.
We could. We would need to make some changes to gridReduce. In particular, indexing the work buffer written by each thread block needs some non-significant change.
However, gridReduce does have template bool parameters exactly for predicating based on block and thread indices being zero, so using those template flags should make more sense. The normal predicate is not for thread/block indices, so it can't be mapped to the template flags, and that's why we need to separate them for gridReduce. Note that for other expressions, we just combine them by &&.
There was a problem hiding this comment.
Added a comment to the code itself too.
Thread predicates are ignored when calling
blockReduceandgridReduce. See #468 for a reproducer of the problem when it is ignored forblockReduce. See #367 for a reproducer ofgridReduce. This PR also adds the reproducers as new tests.For
gridReduce, the best way to apply thread predicates is, IMO, to set theTIDx/y/ztemplate parameters as false. An assumption I have is that the thread predicate of a GridReduction must not includeBIDx/y/zsince we don't allow multiple calls togridReducein a single kernel.To pass around the predicate info until the CUDA code for calling
gridReduceis generated, a new field of typeParallelTypeBitmapis added tokir::GridReduction. To useParalleTypeBitmapfrom kernel_ir.h, I also extracted the class from lower_utils.h into its own header file.For
blockReduce, the change is much more trivial (IndexLowering::visit(kir::ReductionOp*).Fixes #367
Fixes #468