[RFC][Discussion] Tile-based Hidet Script Frontend #330

@Aalanli

The current kernel-level interface, Hidet Script, is similar to CUDA and consequently suffers from poor code reuse, poor maintainability, and brittleness. I feel that we need to adopt a tile-based approach to kernel programming, similar to Triton, in order to speed up development, eliminate redundant code, and enable other optimizations.

I think we first need to establish the scope of the implementation, in particular the balance of developer effort versus compiler effort. There are two options that I think could be valuable.

  1. Adopt Triton's single-threaded execution semantics at the block level. That is, blocks are explicitly parallel, but within each block we only use operators with single-threaded semantics that operate on tiles, for example load, store, elementwise, dot, reduce, and scan. The developer specifies the concrete layout of each tile and its memory type (global, shared, or registers), while the compiler takes care of inserting barriers, allocating shared memory, and distributing work across the thread block. Some advantages of this are better code readability, the ability to create libraries, and potential vectorized prologue fusion.
  2. Erase the user-specified layouts to arrive at Triton's model, where the compiler determines the layouts of tiles through heuristics.
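
To make option 1 more concrete, here is a minimal sketch in plain Python of what "the developer specifies layout and memory type, the compiler inserts barriers and allocates shared memory" could look like. Everything in it is hypothetical and for illustration only: `Tile`, `Op`, `insert_barriers`, `shared_memory_bytes`, and the layout tags are made-up names, not existing Hidet or Triton APIs. The barrier pass only handles read-after-write hazards on shared tiles, which is a deliberate simplification.

```python
from dataclasses import dataclass
from math import prod
from typing import Optional, Tuple


@dataclass(frozen=True)
class Tile:
    """A tile as the developer would declare it (hypothetical)."""
    name: str
    shape: Tuple[int, ...]
    scope: str   # "global", "shared", or "register"
    layout: str  # user-specified layout tag, e.g. "row_major"


@dataclass
class Op:
    """A block-level operator with single-threaded semantics (hypothetical)."""
    kind: str                      # "load", "store", "dot", "barrier", ...
    inputs: Tuple[Tile, ...] = ()
    output: Optional[Tile] = None


def insert_barriers(ops):
    """Insert a barrier before any op that reads a shared tile
    written since the last barrier (read-after-write hazards only)."""
    result = []
    dirty = set()  # names of shared tiles written since the last barrier
    for op in ops:
        if any(t.scope == "shared" and t.name in dirty for t in op.inputs):
            result.append(Op("barrier"))
            dirty.clear()
        result.append(op)
        if op.output is not None and op.output.scope == "shared":
            dirty.add(op.output.name)
    return result


def shared_memory_bytes(ops, elem_bytes=2):
    """Total shared memory needed, assuming no buffer reuse and a
    single element size (2 bytes, i.e. fp16, by default)."""
    tiles = {}
    for op in ops:
        for t in op.inputs + ((op.output,) if op.output else ()):
            if t.scope == "shared":
                tiles[t.name] = t
    return sum(prod(t.shape) * elem_bytes for t in tiles.values())


# One step of a tiled matmul, written without any explicit synchronization.
a_gmem = Tile("a_gmem", (64, 32), "global", "row_major")
b_gmem = Tile("b_gmem", (32, 64), "global", "row_major")
a_smem = Tile("a_smem", (64, 32), "shared", "row_major")
b_smem = Tile("b_smem", (32, 64), "shared", "column_major")
acc = Tile("acc", (64, 64), "register", "mma_fragment")
c_gmem = Tile("c_gmem", (64, 64), "global", "row_major")

program = [
    Op("load", (a_gmem,), a_smem),
    Op("load", (b_gmem,), b_smem),
    Op("dot", (a_smem, b_smem), acc),
    Op("store", (acc,), c_gmem),
]

lowered = insert_barriers(program)
kinds = [op.kind for op in lowered]
print(kinds)                         # barrier appears between the loads and the dot
print(shared_memory_bytes(program))  # bytes for a_smem + b_smem
```

Under option 2, the `scope` and `layout` fields would no longer be written by the developer; a compiler pass would have to choose them heuristically before passes like the ones above could run.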

In terms of vectorized prologue fusion, if we can determine the alignment of the prologue operator, then we can determine the minimal vectorizable load. I believe this algorithm is very similar to the one Triton uses to figure out the layouts of intermediate tiles, but I could be oversimplifying the problem. Of course, there should be a cost model to decide whether to vectorize the load, to not fuse at all, or to use something like cp_async.
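
As a rough illustration of the alignment reasoning, the sketch below picks the widest vector load that stays aligned, given what is known about a fused prologue's addresses. It is one possible reading of the idea, not an existing Hidet or Triton algorithm: the function name, its parameters, and the 16-byte cap (matching 128-bit vector loads on CUDA GPUs) are assumptions, and the cost model mentioned above is not modeled at all.

```python
def max_vector_width(base_align_bytes, offset_divisor_elems,
                     contiguous_elems, elem_bytes, max_vector_bytes=16):
    """Return the largest power-of-two number of elements that can be
    loaded with one aligned vector access (hypothetical helper).

    base_align_bytes:     known alignment of the base pointer, in bytes
    offset_divisor_elems: a number known to divide every access's start
                          offset, in elements (after the fused prologue)
    contiguous_elems:     number of contiguous elements each access covers
    elem_bytes:           size of one element, in bytes
    """
    width = max_vector_bytes // elem_bytes
    while width > 1:
        vec_bytes = width * elem_bytes
        aligned = (base_align_bytes % vec_bytes == 0
                   and (offset_divisor_elems * elem_bytes) % vec_bytes == 0)
        if aligned and contiguous_elems % width == 0:
            return width
        width //= 2
    return 1  # fall back to scalar loads


# fp16 input, 16-byte aligned base, every access starts at a multiple of 8.
print(max_vector_width(16, 8, 8, 2))  # 8 elements, i.e. one 16-byte load

# Same input, but the prologue is a slice that shifts offsets by 2 elements.
print(max_vector_width(16, 2, 8, 2))  # 2 elements, i.e. 4-byte loads

# fp32 input where nothing is known about the start offsets.
print(max_vector_width(16, 1, 4, 4))  # 1 element, scalar loads only
```

A cost model would then sit on top of a query like this, for example choosing not to fuse when the achievable width drops to 1.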

I am thinking that an implementation would roughly require its own IR, which would be lowered into C++ or Hidet Script.

Finally, I think we should also consider whether this is necessary at all. Do we want to make a Triton clone, or something unique? Or would it be a better use of our time to target Triton IR instead? I think the challenge would be to find a good balance between user and compiler effort, or between imperative and declarative knowledge. I believe Triton achieves a good balance; however, its programming model is not flexible enough to express something like FFT. In the end, the problems that tile-based approaches seek to address are (imo) tangential to the problem of heuristically determining which strategy to take (layout A vs. layout B, tile sizes, etc.), and excessive code duplication in kernel programs. Is another method possible, instead of tile-based approaches?

Labels: enhancement, rfc
