[Instruction][tcgen05] Add mma instruction #55
Conversation
Signed-off-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
Pull Request Overview
This PR adds support for the tcgen05.mma instruction for fifth-generation tensor cores (TCGen05), implementing matrix multiplication acceleration with initial support for the fp16-fp16-fp32 and fp8-fp8-fp32 operand combinations. The implementation includes layout inference, code generation, and testing infrastructure.
Key changes include:
- Implementation of tcgen05.mma instruction with shared memory and tensor memory operand support
- Addition of permute_shared instruction for tensor dimension reordering
- Refactoring of Tilus Script method handling to improve extensibility for Register/Shared/Global tensors
Reviewed Changes
Copilot reviewed 49 out of 50 changed files in this pull request and generated 5 comments.
Summary per file:
| File | Description |
|---|---|
| tests/instructions/test_tcgen05_mma.py | New test suite for tcgen05.mma instruction with various operand type combinations |
| python/tilus/lang/transpiler.py | Refactored tensor method handling to use new extensible method system |
| python/tilus/lang/methods/ | New method handling infrastructure for different tensor types |
| python/tilus/ir/instructions/cuda/tcgen05.py | Added Tcgen05MmaSSInst and Tcgen05MmaTSInst instruction definitions |
| python/tilus/backends/emitters/cuda/tcgen05/mma.py | Comprehensive mma instruction code generation with layout validation |
| python/tilus/ir/layout/cuda/tcgen05/smem.py | Refactored swizzle mode handling and layout generation |
```python
# class Tcgen05SwizzleMode(Enum):
#     """TCGen05 swizzle modes corresponding to cute Swizzle parameters"""
#
#     NO_SWIZZLE = (0, 0, 0)    # No swizzling or Interleaved
#     B32_SWIZZLE = (1, 4, 3)   # 32B Swizzling: Swizzle<1, 4, 3>
#     B64_SWIZZLE = (2, 4, 3)   # 64B Swizzling: Swizzle<2, 4, 3>
#     B128_SWIZZLE = (3, 4, 3)  # 128B Swizzling: Swizzle<3, 4, 3>
#
#     def encode(self) -> int:
#         # see https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-shared-memory-desc-layout
#         return {
#             Tcgen05SwizzleMode.NO_SWIZZLE: 0,
#             Tcgen05SwizzleMode.B32_SWIZZLE: 6,
#             Tcgen05SwizzleMode.B64_SWIZZLE: 4,
#             Tcgen05SwizzleMode.B128_SWIZZLE: 2,
#         }[self]
#
#     @property
#     def bbits(self) -> int:
#         return self.value[0]
#
#     @property
#     def mbase(self) -> int:
#         return self.value[1]
#
#     @property
#     def sshift(self) -> int:
#         return self.value[2]
#
#     def as_cute_swizzle(self) -> CuteSwizzle:
#         bbits, mbase, sshift = self.value
#         return CuteSwizzle(bbits=bbits, mbase=mbase, sshift=sshift)
```
Copilot AI commented on Oct 7, 2025:
Large blocks of commented-out code should be removed rather than left in the codebase. This creates confusion and makes the code harder to maintain.
```python
class RegisterTensorWithMethods(RegisterTensor):
    def __init__(self, tensor: RegisterTensor, builder: StmtBuilder):
```
Copilot AI commented on Oct 7, 2025:
Missing call to super().__init__() in the RegisterTensorWithMethods constructor. This could lead to incomplete initialization of the parent class.
Suggested change:
```python
def __init__(self, tensor: RegisterTensor, builder: StmtBuilder):
    super().__init__()
```
```python
@dataclass
class SharedMatrixDescriptor:
```
Copilot AI commented on Oct 7, 2025:
The SharedMatrixDescriptor class has extensive documentation in its docstring table but lacks docstring descriptions for individual parameters (addr, lbo, sbo, etc.). Consider adding parameter documentation for better API clarity.
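A minimal sketch of the kind of per-parameter documentation being requested. The field names (addr, lbo, sbo) come from the review comment; the descriptions paraphrase the PTX shared-memory descriptor layout and are an assumption, and the actual class in the PR likely has additional fields (e.g. swizzle mode) not shown here:

```python
from dataclasses import dataclass


@dataclass
class SharedMatrixDescriptor:
    """Shared-memory matrix descriptor for tcgen05 (sketch only).

    Parameters
    ----------
    addr:
        Start address of the matrix in shared memory, in the encoded
        form expected by the PTX descriptor.
    lbo:
        Leading-dimension byte offset (the LBO field of the PTX
        shared-memory descriptor layout).
    sbo:
        Stride-dimension byte offset (the SBO field of the PTX
        shared-memory descriptor layout).
    """

    addr: int
    lbo: int
    sbo: int
```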
```diff
 assert tmem_tensor.shape[1] * tmem_tensor.dtype.nbits % 32 == 0
 num_columns = tmem_tensor.shape[1] * tmem_tensor.dtype.nbits // 32
-assert num_columns % 32 == 0 and 32 <= num_columns <= 512
+assert num_columns % 32 == 0 and 32 <= num_columns <= 512, num_columns
```
Copilot AI commented on Oct 7, 2025:
The assertion error message is unclear. Consider providing a more descriptive error message explaining what num_columns represents and why it must meet these constraints.
Suggested change:
```python
assert num_columns % 32 == 0 and 32 <= num_columns <= 512, (
    f"num_columns (number of 32-bit columns in TMemoryTensor) must be a multiple of 32 and in [32, 512], got {num_columns}"
)
```
```python
    swizzles.append(Swizzle(c=1, d=log2(128), r=log2(1024)))
elif swizzle == TensorMapSwizzle.B64:
    swizzles.append(Swizzle(c=2, d=log2(16), r=log2(128)))
    swizzles.append(Swizzle(c=2, d=log2(128), r=log2(1024)))
elif swizzle == TensorMapSwizzle.B128:
    swizzles.append(Swizzle(c=3, d=log2(16), r=log2(128)))
    swizzles.append(Swizzle(c=3, d=log2(128), r=log2(1024)))
elif swizzle == TensorMapSwizzle.B128_ATOM_32B:
    swizzles.append(Swizzle(c=3, d=log2(32), r=log2(256)))
    swizzles.append(Swizzle(c=3, d=log2(256), r=log2(2048)))
elif swizzle == TensorMapSwizzle.B128_ATOM_32B_FLIP_8B:
    swizzles.append(Swizzle(c=3, d=log2(32), r=log2(256)))
    swizzles.append(Swizzle(c=1, d=log2(8), r=log2(256)))
    swizzles.append(Swizzle(c=3, d=log2(256), r=log2(2048)))
    swizzles.append(Swizzle(c=1, d=log2(64), r=log2(512)))
elif swizzle == TensorMapSwizzle.B128_ATOM_64B:
    swizzles.append(Swizzle(c=3, d=log2(64), r=log2(512)))
    swizzles.append(Swizzle(c=3, d=log2(512), r=log2(4096)))
```
Copilot AI commented on Oct 7, 2025:
The magic numbers (128, 1024, 256, 2048, etc.) should be replaced with named constants to improve code readability and maintainability.
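One hypothetical way to address this is a table-driven rewrite, so each mode's (c, base-bytes, repeat-bytes) atoms appear exactly once under a name. This is a sketch, not the PR's code: the Swizzle, TensorMapSwizzle, log2, swizzle, and swizzles names are assumed from the diff above, and only the entries visible in this excerpt are included.

```python
# Hypothetical table-driven rewrite of the branch ladder above.
# Each entry lists (c, base_bytes, repeat_bytes) swizzle atoms; the byte
# spans are the literals from the diff, now collected in one place.
SWIZZLE_ATOMS: dict[TensorMapSwizzle, list[tuple[int, int, int]]] = {
    TensorMapSwizzle.B64: [(2, 16, 128), (2, 128, 1024)],
    TensorMapSwizzle.B128: [(3, 16, 128), (3, 128, 1024)],
    TensorMapSwizzle.B128_ATOM_32B: [(3, 32, 256), (3, 256, 2048)],
    TensorMapSwizzle.B128_ATOM_32B_FLIP_8B: [
        (3, 32, 256), (1, 8, 256), (3, 256, 2048), (1, 64, 512),
    ],
    TensorMapSwizzle.B128_ATOM_64B: [(3, 64, 512), (3, 512, 4096)],
}

for c, base_bytes, repeat_bytes in SWIZZLE_ATOMS[swizzle]:
    swizzles.append(Swizzle(c=c, d=log2(base_bytes), r=log2(repeat_bytes)))
```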
This PR adds the tcgen05.mma instruction.
The current support is limited: only the fp16-fp16-fp32 and fp8-fp8-fp32 cases for the (a, b, c) dtypes have been tested. Block scaling is not added yet.
Minor changes:
- Added the `permute_shared` instruction.
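As a general illustration of dimension reordering (this is not Tilus code; the actual `permute_shared` API is not shown in this excerpt), permuting a strided tensor view only reorders its shape/stride metadata, with no data movement:

```python
# General illustration: a permutation of a strided view reorders the
# (shape, stride) pairs; the underlying buffer is untouched.
def permute_view(shape, strides, perm):
    assert sorted(perm) == list(range(len(shape)))
    return tuple(shape[i] for i in perm), tuple(strides[i] for i in perm)


# A row-major 4x8 matrix with dims swapped becomes an 8x4 view of the
# same buffer whose strides now run column-major.
shape, strides = permute_view((4, 8), (8, 1), (1, 0))
assert (shape, strides) == ((8, 4), (1, 8))
```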