Un-prioritized list of things that generally should be done:
Support inter-block reductions (via a template-function approach similar to the intra-block one).
Reduction to a scalar does not work, as we no longer have a tensor axis. We need to figure out how to fix this; we likely want to implement a zero-dim tensor that wraps a single scalar (this is how PyTorch does it).
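As a rough illustration of the zero-dim idea (not nvFuser code): NumPy already models a full reduction this way, where the result is a rank-0 array that still carries tensor metadata (`dtype`, `ndim`, `shape`) but holds a single scalar value.

```python
import numpy as np

# Reducing over every axis yields a zero-dim result: still a "tensor"
# with dtype/ndim/shape, but its shape is () and it wraps one scalar.
t = np.arange(6.0).reshape(2, 3)
zd = np.asarray(t.sum())  # zero-dim view of the fully reduced value

print(zd.ndim, zd.shape, float(zd))  # 0 () 15.0
```

A zero-dim TensorView in the fuser could behave analogously: the reduction output keeps its tensor identity, just with an empty domain.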
Add a fusion printer that only prints the math exprs reachable from outputs. Rework the ir_printer class.
SetRandom on fusion is likely unnecessary; let's see if we can pull it out of so much of the logic in the codebase.
Remove the TensorView Reorder code and use the tensor domain instead.
Cross-thread reduction: predicate blocks of code not using all threads (i.e. code downstream of the reduction).
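A host-side sketch of why the predication is needed (the names here are illustrative, not nvFuser APIs): after a cross-thread reduction, only one thread in the block holds the valid result, so downstream expressions must be guarded so the other threads do not execute them.

```python
# Simulate one thread block: all threads feed the reduction, but code
# downstream of it is predicated so only thread 0 runs it.
def simulate_block(values):
    outputs = {}
    total = sum(values)  # reduction phase: every thread participates
    for tid in range(len(values)):
        if tid == 0:                    # predicate: threadIdx.x == 0
            outputs[tid] = total * 2.0  # downstream computation
    return outputs

print(simulate_block([1.0, 2.0, 3.0, 4.0]))  # {0: 20.0}
```

Without the `tid == 0` guard, every thread would redundantly (and, with side effects, incorrectly) execute the post-reduction code.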
Remove active view from lower2device
Move logic out of lower2device, so lower2device is just a wrapper around lowering passes
Reduction ops can only be run on tensor views; we should restrict the IR node to TensorView inputs/outputs.
Rework predicates: per loop, include thread guards. If a thread dim doesn't participate, predicate out those threads (i.e. on threadIdx.y > 0). This can be done at the highest for-loop that doesn't use that thread dim.
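A small sketch of the "highest for-loop" placement rule; the loop-nest representation and dim names below are hypothetical, not the fuser's actual data structures.

```python
# loops: outermost-first list, each entry is the set of thread dims
# used at that loop or anywhere below it. The guard for an unused dim
# (e.g. threadIdx.y == 0) belongs at the outermost loop from which the
# dim is never used again.
def guard_insertion_point(loops, dim):
    for i, used in enumerate(loops):
        if dim not in used:
            return i  # outermost loop not using `dim` from here down
    return None       # dim participates everywhere: no guard needed

# TIDy is bound only at the outermost loop; from loop 1 down it is
# unused, so the `threadIdx.y == 0` guard goes at loop index 1.
nest = [{"TIDx", "TIDy"}, {"TIDx"}, {"TIDx"}]
print(guard_insertion_point(nest, "TIDy"))  # 1
print(guard_insertion_point(nest, "TIDx"))  # None
```

Hoisting the guard to the highest such loop avoids re-evaluating the predicate in every inner iteration.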
Move predicate logic (besides what is required for unrolling) out of unrolling.
Get external compilation working with torchlib, like we do with test_gpu. I'd like to be able to create tutorials that can be individually compiled and run.
TensorDomain::rootDomain(), TensorDomain::rfactorDomain(), and TensorDomain::domain() (#1396)