Kernel IR: Splitting CUDA codegen from IrPrinter (#379)
Conversation
```cpp
if ((((((blockIdx.x * 1) + (1 - 1)) * 128) + threadIdx.x) < T0.size[0])) {
  for(size_t i6 = 0; i6 < 1; ++i6) {
    T2[i6]
      = T0[((((blockIdx.x * 1) + i6) * 128) + threadIdx.x)] * T1[((((blockIdx.x * 1) + i6) * 128) + threadIdx.x)];
    T3[((((blockIdx.x * 1) + i6) * 128) + threadIdx.x)]
      = T2[i6] * T0[((((blockIdx.x * 1) + i6) * 128) + threadIdx.x)];
```
I like collapsing the parentheses, but I'd prefer to have the operators on separate lines for readability:

```cpp
T2[] = T0[]
     * T1[]
```

For reads and writes, I'm fine with this: `T2[] = T0[]`
I agree. The formatting is not final; I was planning to revisit it in a follow-up PR to keep the changes a bit smaller (and also because we have opportunities to improve the formatting while simplifying the codegen code itself).
But if this is something we don't want to wait for, I'd be happy to update this PR.
I'm just adding my 2 cents on the kernel formatting. 😄
I also noticed that the for-loop is redundant, since it is only run once.
> I also noticed that the for-loop is redundant, since it is only run once.
Yep. That's a completely different beast altogether though. We're not doing any low-level optimizations today (but we could, and probably should - another reason to have a standalone kernel IR)
```cpp
// Predicate map
// TODO(kir): consider a simpler, kernel IR based version
ThreadPredicateMap predicate_map_;
```
IIRC, the only reason we need to keep this mapping is for code generation of broadcastOp. A device function, blockBroadcast, must be used when broadcasting thread-parallelized dimensions. Whether we should call that function is currently determined only at codegen time, but I really think this should be captured when lowering to KIR. One idea may be to have a BlockBroadcast KIR node and generate an instance of that node when FIR is lowered to KIR.
> One idea may be to have BlockBroadcast KIR node and generate that KIR node instance when FIR is lowered to KIR.
I really like this idea. In general, I think this is the right pattern for conditional code generation: generate the intended operations during lowering rather than deciding what to print at the last minute.
naoyam left a comment
Looks good. Left a comment on ThreadPredicateMap.
One of the main goals of having a dedicated kernel IR was separation of concerns: simpler, smaller components that each do one thing, instead of monolithic implementations.
This PR is a significant step in that direction: the CUDA code generation is now separate from the IrPrinter.