Adding a Reduction Heuristic Scheduler to match the PyTorch ATen TensorIterator #116

@kevinstephano

Description

🚀 Feature

The goal is to make a "magic scheduler" that takes an algorithm containing a reduction op and applies a TensorExpression schedule to match the performance of PyTorch's ATen TensorIterator.

My plan and heuristic are shown in this (NVIDIA Internal) document: https://docs.google.com/document/d/15b8JSnLYu9PIGwEltPXeKOoX5XR_EE_RMjPkRtQ8EHo/

Work is happening on this branch:
https://github.com/csarofeen/pytorch/tree/20_6_11_devel_redsched

Evaluation is happening with this code base:
https://github.com/kevinstephano/codegen_perf

Plan

Part 1: Basic assumptions — 2D tensors, scheduling only a reduction

  • Get simple schedule up and running in a test
  • Reverse engineer the ATen heuristic
  • Implement ATen's schedule
    • Get a scheduler file and function stubbed out
    • Add code to calculate BIDx, BIDy, TIDx, and TIDy splits
    • Fix scheduling that needs to split based on remaining size
    • Modify my codegen_perf code to compare against ATen
    • Fix errors in handling differences for fastest-dimension reductions
    • Add cross-block reductions
    • Fix reductions not on the fastest dimension
    • Fix outer-dimension reductions with Vectorize; performance is currently off
    • Add FP16 support; write a test first to capture the behavior
  • Perf test schedule
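To make the BIDx/BIDy/TIDx/TIDy split step concrete, here is a minimal Python sketch of a TensorIterator-style launch heuristic for a 2D inner-dimension reduction. The function name, constants, and the exact split policy are illustrative assumptions for this issue, not ATen's actual implementation.

```python
def next_pow2(n):
    """Smallest power of two >= n."""
    p = 1
    while p < n:
        p *= 2
    return p

def reduction_launch_params(num_rows, red_size, max_threads=512, warp_size=32):
    """Hypothetical TIDx/TIDy/BIDx/BIDy split for reducing the fast (inner) dim.

    Illustrative only: real heuristics also weigh occupancy, vectorization,
    and cross-block reduction cost.
    """
    # Put threads along the reduction dimension first, up to one warp,
    # so consecutive threads read consecutive elements (coalesced loads).
    tid_x = min(next_pow2(red_size), warp_size)
    # Spend the remaining threads of the block on independent rows.
    tid_y = max_threads // tid_x
    # One y-block per chunk of rows.
    bid_y = (num_rows + tid_y - 1) // tid_y
    # Large reductions also split across blocks in x, which then requires
    # a cross-block reduction pass to combine partial results.
    bid_x = min((red_size + tid_x - 1) // tid_x, 65535)
    return tid_x, tid_y, bid_x, bid_y
```

For example, reducing 4096 elements in each of 1024 rows yields a 32x16 thread block, with the 4096-wide reduction split across 128 x-blocks whose partial results must be combined cross-block, matching the "Add cross-block reductions" item above.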

Part 2: Start addressing assumptions

  • Generalize to more than 2D tensors
  • Make the schedule usable in the presence of other fusion ops. Should scheduling proceed from the bottom up?
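One common way to generalize beyond 2D (an assumption on my part, not a decision recorded in this issue) is to collapse contiguous dimensions into the two groups the 2D heuristic already understands: iteration dims become "rows" and reduction dims become a single reduction extent. A sketch, restricted to the easy contiguous case where the reduced axes are innermost:

```python
from math import prod

def collapse_to_2d(shape, reduce_axes):
    """Merge contiguous dims so an N-D inner reduction looks 2D.

    Illustrative sketch: assumes a contiguous tensor whose reduced axes
    are the innermost ones, so both groups stay contiguous in memory.
    General strides/axis orders need per-dim coalescing instead.
    """
    ndim = len(shape)
    assert set(reduce_axes) == set(range(ndim - len(reduce_axes), ndim)), \
        "sketch assumes innermost reduction axes"
    num_rows = prod(shape[: ndim - len(reduce_axes)])
    red_size = prod(shape[ndim - len(reduce_axes):])
    return num_rows, red_size
```

With this, an (8, 16, 32) tensor reduced over its last axis is scheduled as a 128x32 2D reduction, and the existing 2D heuristic applies unchanged.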
