Conversation

yaoyaoding (Member) commented Oct 7, 2025

This PR adds the tcgen05.mma instruction.

The current support is limited: only the fp16-fp16-fp32 and fp8-fp8-fp32 cases for the (a, b, c) dtypes have been tested. Block-scale support has not been added yet.

Minor changes:

  1. Add the permute_shared instruction.
  2. Refactor how Tilus Script handles method calls on Register/Shared tensors to make it more extensible.

Signed-off-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
yaoyaoding mentioned this pull request on Oct 7, 2025
yaoyaoding requested a review from Copilot on October 7, 2025 at 18:33

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds support for the tcgen05.mma instruction for the TCGen05 microarchitecture, implementing matrix multiplication acceleration with initial support for fp16-fp16-fp32 and fp8-fp8-fp32 operand combinations. The implementation includes comprehensive layout inference, code generation, and testing infrastructure.

Key changes include:

  • Implementation of tcgen05.mma instruction with shared memory and tensor memory operand support
  • Addition of permute_shared instruction for tensor dimension reordering
  • Refactoring of Tilus Script method handling to improve extensibility for Register/Shared/Global tensors

Reviewed Changes

Copilot reviewed 49 out of 50 changed files in this pull request and generated 5 comments.

Reviewed files:

  tests/instructions/test_tcgen05_mma.py: New test suite for the tcgen05.mma instruction with various operand type combinations
  python/tilus/lang/transpiler.py: Refactored tensor method handling to use the new extensible method system
  python/tilus/lang/methods/: New method-handling infrastructure for different tensor types
  python/tilus/ir/instructions/cuda/tcgen05.py: Added Tcgen05MmaSSInst and Tcgen05MmaTSInst instruction definitions
  python/tilus/backends/emitters/cuda/tcgen05/mma.py: Comprehensive mma instruction code generation with layout validation
  python/tilus/ir/layout/cuda/tcgen05/smem.py: Refactored swizzle mode handling and layout generation


Comment on lines +30 to +63
# class Tcgen05SwizzleMode(Enum):
# """TCGen05 swizzle modes corresponding to cute Swizzle parameters"""

# NO_SWIZZLE = (0, 0, 0) # No swizzling or Interleaved
# B32_SWIZZLE = (1, 4, 3) # 32B Swizzling: Swizzle<1, 4, 3>
# B64_SWIZZLE = (2, 4, 3) # 64B Swizzling: Swizzle<2, 4, 3>
# B128_SWIZZLE = (3, 4, 3) # 128B Swizzling: Swizzle<3, 4, 3>

# def encode(self) -> int:
# # see https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-shared-memory-desc-layout
# return {
# Tcgen05SwizzleMode.NO_SWIZZLE: 0,
# Tcgen05SwizzleMode.B32_SWIZZLE: 6,
# Tcgen05SwizzleMode.B64_SWIZZLE: 4,
# Tcgen05SwizzleMode.B128_SWIZZLE: 2,
# }[self]

# @property
# def bbits(self) -> int:
# return self.value[0]

# @property
# def mbase(self) -> int:
# return self.value[1]

# @property
# def sshift(self) -> int:
# return self.value[2]

# def as_cute_swizzle(self) -> CuteSwizzle:
# bbits, mbase, sshift = self.value
# return CuteSwizzle(bbits=bbits, mbase=mbase, sshift=sshift)


Copilot AI commented Oct 7, 2025
Large blocks of commented-out code should be removed rather than left in the codebase. This creates confusion and makes the code harder to maintain.



class RegisterTensorWithMethods(RegisterTensor):
def __init__(self, tensor: RegisterTensor, builder: StmtBuilder):
Copilot AI commented Oct 7, 2025

Missing call to super().__init__() in the RegisterTensorWithMethods constructor. This could lead to incomplete initialization of the parent class.

Suggested change:

    def __init__(self, tensor: RegisterTensor, builder: StmtBuilder):
        super().__init__()
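The failure mode this suggestion guards against can be shown with a minimal sketch. The classes below are hypothetical stand-ins, not the Tilus types: a subclass that skips `super().__init__()` never runs the parent's constructor, so any state the parent sets up is simply missing.

```python
# Hypothetical sketch: why skipping super().__init__() leaves parent state unset.

class Tensor:
    def __init__(self) -> None:
        self._attrs: dict = {}  # state the parent class manages


class BrokenWrapper(Tensor):
    def __init__(self, wrapped: Tensor) -> None:
        self.wrapped = wrapped  # forgot super().__init__()


class FixedWrapper(Tensor):
    def __init__(self, wrapped: Tensor) -> None:
        super().__init__()      # parent state is initialized first
        self.wrapped = wrapped


broken = BrokenWrapper(Tensor())
fixed = FixedWrapper(Tensor())
print(hasattr(broken, "_attrs"))  # False: Tensor.__init__ never ran
print(hasattr(fixed, "_attrs"))   # True
```

Any parent method that later touches `self._attrs` would raise AttributeError on the broken wrapper, which is the "incomplete initialization" the review comment warns about.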

Comment on lines +28 to +29
@dataclass
class SharedMatrixDescriptor:
Copilot AI commented Oct 7, 2025
The SharedMatrixDescriptor class has extensive documentation in its docstring table but lacks docstring descriptions for individual parameters (addr, lbo, sbo, etc.). Consider adding parameter documentation for better API clarity.

assert tmem_tensor.shape[1] * tmem_tensor.dtype.nbits % 32 == 0
num_columns = tmem_tensor.shape[1] * tmem_tensor.dtype.nbits // 32
assert num_columns % 32 == 0 and 32 <= num_columns <= 512
assert num_columns % 32 == 0 and 32 <= num_columns <= 512, num_columns
Copilot AI commented Oct 7, 2025
The assertion error message is unclear. Consider providing a more descriptive error message explaining what num_columns represents and why it must meet these constraints.

Suggested change:

    assert num_columns % 32 == 0 and 32 <= num_columns <= 512, (
        f"num_columns (number of 32-bit columns in TMemoryTensor) must be a multiple of 32 and in [32, 512], got {num_columns}"
    )
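The constraint being asserted can be packaged as a small standalone check. This is a hypothetical helper mirroring the snippet above, not the Tilus implementation; `tmem_num_columns` and its parameter names are assumptions. It derives the number of 32-bit tensor-memory columns from a tensor's second dimension and its dtype width in bits.

```python
# Hypothetical helper mirroring the assertion above: compute and validate the
# number of 32-bit tensor-memory columns for a (rows, ncols) tensor.

def tmem_num_columns(ncols: int, dtype_nbits: int) -> int:
    total_bits = ncols * dtype_nbits
    assert total_bits % 32 == 0, (
        f"row bits ({total_bits}) must pack into whole 32-bit columns"
    )
    num_columns = total_bits // 32
    assert num_columns % 32 == 0 and 32 <= num_columns <= 512, (
        f"num_columns must be a multiple of 32 in [32, 512], got {num_columns}"
    )
    return num_columns


print(tmem_num_columns(128, 16))  # 128 fp16 elements per row -> 64 columns
```

A descriptive message like this makes the failure self-explanatory when an unsupported tensor shape reaches the emitter, which is exactly what the review comment asks for.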

Comment on lines +126 to +137
swizzles.append(Swizzle(c=1, d=log2(128), r=log2(1024)))
elif swizzle == TensorMapSwizzle.B64:
swizzles.append(Swizzle(c=2, d=log2(16), r=log2(128)))
swizzles.append(Swizzle(c=2, d=log2(128), r=log2(1024)))
elif swizzle == TensorMapSwizzle.B128:
swizzles.append(Swizzle(c=3, d=log2(16), r=log2(128)))
swizzles.append(Swizzle(c=3, d=log2(128), r=log2(1024)))
elif swizzle == TensorMapSwizzle.B128_ATOM_32B:
swizzles.append(Swizzle(c=3, d=log2(32), r=log2(256)))
swizzles.append(Swizzle(c=3, d=log2(256), r=log2(2048)))
elif swizzle == TensorMapSwizzle.B128_ATOM_32B_FLIP_8B:
swizzles.append(Swizzle(c=3, d=log2(32), r=log2(256)))
swizzles.append(Swizzle(c=1, d=log2(8), r=log2(256)))
swizzles.append(Swizzle(c=3, d=log2(256), r=log2(2048)))
swizzles.append(Swizzle(c=1, d=log2(64), r=log2(512)))
elif swizzle == TensorMapSwizzle.B128_ATOM_64B:
swizzles.append(Swizzle(c=3, d=log2(64), r=log2(512)))
swizzles.append(Swizzle(c=3, d=log2(512), r=log2(4096)))
Copilot AI commented Oct 7, 2025
The magic numbers (128, 1024, 256, 2048, etc.) should be replaced with named constants to improve code readability and maintainability.
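One way the suggested cleanup could look is sketched below. This is a hypothetical refactor, not the Tilus code: the constant names, the `Swizzle` dataclass fields, and the integer-log2 helper are all assumptions, chosen only to show how naming the recurring byte widths makes the swizzle table self-documenting.

```python
# Hypothetical sketch of the named-constants refactor the comment suggests.
from dataclasses import dataclass


def ilog2(x: int) -> int:
    """Exact log2 for powers of two (avoids float math.log2)."""
    assert x > 0 and x & (x - 1) == 0, "expected a power of two"
    return x.bit_length() - 1


ATOM_BYTES = 16          # assumed width of one swizzle atom
B128_BYTES = 128         # 128-byte swizzle period
B128_ROW_BYTES = 1024    # assumed span of one fully 128B-swizzled row


@dataclass
class Swizzle:
    c: int  # number of swizzled bits
    d: int  # log2 of the base offset in bytes
    r: int  # log2 of the repeat period in bytes


# The B128 branch from the snippet above, with literals replaced by names:
b128_swizzles = [
    Swizzle(c=3, d=ilog2(ATOM_BYTES), r=ilog2(B128_BYTES)),
    Swizzle(c=3, d=ilog2(B128_BYTES), r=ilog2(B128_ROW_BYTES)),
]
print(b128_swizzles[0])  # Swizzle(c=3, d=4, r=7)
```

The other branches (B32, B64, the ATOM_32B variants) would follow the same pattern, each literal traded for a constant whose name records what the byte width means.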

yaoyaoding merged commit 9d216e8 into main on Oct 7, 2025 (9 checks passed)
yaoyaoding deleted the support-tcgen05 branch on October 7, 2025 at 19:10
2 participants