Conversation

yaoyaoding (Member) commented Oct 7, 2025

This PR adds the tcgen05.mma instruction.

The current support is limited: only the fp16-fp16-fp32 and fp8-fp8-fp32 cases for the (a, b, c) dtypes have been tested. Block-scale support has not been added yet.

Minor changes:

  1. Add the permute_shared instruction.
  2. Refactor how Tilus Script handles method calls on Register/Shared tensors to make it more extensible.

Signed-off-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
yaoyaoding mentioned this pull request on Oct 7, 2025
yaoyaoding requested a review from Copilot on October 7, 2025 at 18:33

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds support for the tcgen05.mma instruction for the TCGen05 microarchitecture, implementing matrix multiplication acceleration with initial support for fp16-fp16-fp32 and fp8-fp8-fp32 operand combinations. The implementation includes comprehensive layout inference, code generation, and testing infrastructure.

Key changes include:

  • Implementation of tcgen05.mma instruction with shared memory and tensor memory operand support
  • Addition of permute_shared instruction for tensor dimension reordering
  • Refactoring of Tilus Script method handling to improve extensibility for Register/Shared/Global tensors

Reviewed Changes

Copilot reviewed 49 out of 50 changed files in this pull request and generated 5 comments.

Reviewed files:

  tests/instructions/test_tcgen05_mma.py: New test suite for the tcgen05.mma instruction with various operand type combinations
  python/tilus/lang/transpiler.py: Refactored tensor method handling to use the new extensible method system
  python/tilus/lang/methods/: New method-handling infrastructure for different tensor types
  python/tilus/ir/instructions/cuda/tcgen05.py: Added Tcgen05MmaSSInst and Tcgen05MmaTSInst instruction definitions
  python/tilus/backends/emitters/cuda/tcgen05/mma.py: Comprehensive mma instruction code generation with layout validation
  python/tilus/ir/layout/cuda/tcgen05/smem.py: Refactored swizzle mode handling and layout generation


Comment on lines +30 to +63
# class Tcgen05SwizzleMode(Enum):
# """TCGen05 swizzle modes corresponding to cute Swizzle parameters"""

# NO_SWIZZLE = (0, 0, 0) # No swizzling or Interleaved
# B32_SWIZZLE = (1, 4, 3) # 32B Swizzling: Swizzle<1, 4, 3>
# B64_SWIZZLE = (2, 4, 3) # 64B Swizzling: Swizzle<2, 4, 3>
# B128_SWIZZLE = (3, 4, 3) # 128B Swizzling: Swizzle<3, 4, 3>

# def encode(self) -> int:
# # see https://docs.nvidia.com/cuda/parallel-thread-execution/#tcgen05-shared-memory-desc-layout
# return {
# Tcgen05SwizzleMode.NO_SWIZZLE: 0,
# Tcgen05SwizzleMode.B32_SWIZZLE: 6,
# Tcgen05SwizzleMode.B64_SWIZZLE: 4,
# Tcgen05SwizzleMode.B128_SWIZZLE: 2,
# }[self]

# @property
# def bbits(self) -> int:
# return self.value[0]

# @property
# def mbase(self) -> int:
# return self.value[1]

# @property
# def sshift(self) -> int:
# return self.value[2]

# def as_cute_swizzle(self) -> CuteSwizzle:
# bbits, mbase, sshift = self.value
# return CuteSwizzle(bbits=bbits, mbase=mbase, sshift=sshift)


Copilot AI commented Oct 7, 2025
Large blocks of commented-out code should be removed rather than left in the codebase. This creates confusion and makes the code harder to maintain.



class RegisterTensorWithMethods(RegisterTensor):
def __init__(self, tensor: RegisterTensor, builder: StmtBuilder):
Copilot AI commented Oct 7, 2025

Missing call to super().__init__() in the RegisterTensorWithMethods constructor. This could lead to incomplete initialization of the parent class.

Suggested change:

    def __init__(self, tensor: RegisterTensor, builder: StmtBuilder):
        super().__init__()
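The failure mode this suggestion guards against can be shown with a minimal sketch. The classes below are hypothetical stand-ins, not the Tilus types: a subclass that skips `super().__init__()` never runs the parent's constructor, so any state the parent sets up is simply missing.

```python
# Hypothetical sketch: why skipping super().__init__() leaves parent state unset.

class Tensor:
    def __init__(self) -> None:
        self._attrs: dict = {}  # state the parent class manages


class BrokenWrapper(Tensor):
    def __init__(self, wrapped: Tensor) -> None:
        self.wrapped = wrapped  # forgot super().__init__()


class FixedWrapper(Tensor):
    def __init__(self, wrapped: Tensor) -> None:
        super().__init__()      # parent state is initialized first
        self.wrapped = wrapped


broken = BrokenWrapper(Tensor())
fixed = FixedWrapper(Tensor())
print(hasattr(broken, "_attrs"))  # False: Tensor.__init__ never ran
print(hasattr(fixed, "_attrs"))   # True
```

Any parent method that later touches `self._attrs` would raise AttributeError on the broken wrapper, which is the "incomplete initialization" the review comment warns about.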

Comment on lines +28 to +29
@dataclass
class SharedMatrixDescriptor:
Copilot AI commented Oct 7, 2025
The SharedMatrixDescriptor class has extensive documentation in its docstring table but lacks docstring descriptions for individual parameters (addr, lbo, sbo, etc.). Consider adding parameter documentation for better API clarity.

assert tmem_tensor.shape[1] * tmem_tensor.dtype.nbits % 32 == 0
num_columns = tmem_tensor.shape[1] * tmem_tensor.dtype.nbits // 32
assert num_columns % 32 == 0 and 32 <= num_columns <= 512
assert num_columns % 32 == 0 and 32 <= num_columns <= 512, num_columns
Copilot AI commented Oct 7, 2025
The assertion error message is unclear. Consider providing a more descriptive error message explaining what num_columns represents and why it must meet these constraints.

Suggested change:

    assert num_columns % 32 == 0 and 32 <= num_columns <= 512, (
        f"num_columns (number of 32-bit columns in TMemoryTensor) must be a multiple of 32 and in [32, 512], got {num_columns}"
    )
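The constraint being asserted can be packaged as a small standalone check. This is a hypothetical helper mirroring the snippet above, not the Tilus implementation; `tmem_num_columns` and its parameter names are assumptions. It derives the number of 32-bit tensor-memory columns from a tensor's second dimension and its dtype width in bits.

```python
# Hypothetical helper mirroring the assertion above: compute and validate the
# number of 32-bit tensor-memory columns for a (rows, ncols) tensor.

def tmem_num_columns(ncols: int, dtype_nbits: int) -> int:
    total_bits = ncols * dtype_nbits
    assert total_bits % 32 == 0, (
        f"row bits ({total_bits}) must pack into whole 32-bit columns"
    )
    num_columns = total_bits // 32
    assert num_columns % 32 == 0 and 32 <= num_columns <= 512, (
        f"num_columns must be a multiple of 32 in [32, 512], got {num_columns}"
    )
    return num_columns


print(tmem_num_columns(128, 16))  # 128 fp16 elements per row -> 64 columns
```

A descriptive message like this makes the failure self-explanatory when an unsupported tensor shape reaches the emitter, which is exactly what the review comment asks for.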

Comment on lines +126 to +137
swizzles.append(Swizzle(c=1, d=log2(128), r=log2(1024)))
elif swizzle == TensorMapSwizzle.B64:
swizzles.append(Swizzle(c=2, d=log2(16), r=log2(128)))
swizzles.append(Swizzle(c=2, d=log2(128), r=log2(1024)))
elif swizzle == TensorMapSwizzle.B128:
swizzles.append(Swizzle(c=3, d=log2(16), r=log2(128)))
swizzles.append(Swizzle(c=3, d=log2(128), r=log2(1024)))
elif swizzle == TensorMapSwizzle.B128_ATOM_32B:
swizzles.append(Swizzle(c=3, d=log2(32), r=log2(256)))
swizzles.append(Swizzle(c=3, d=log2(256), r=log2(2048)))
elif swizzle == TensorMapSwizzle.B128_ATOM_32B_FLIP_8B:
swizzles.append(Swizzle(c=3, d=log2(32), r=log2(256)))
swizzles.append(Swizzle(c=1, d=log2(8), r=log2(256)))
swizzles.append(Swizzle(c=3, d=log2(256), r=log2(2048)))
swizzles.append(Swizzle(c=1, d=log2(64), r=log2(512)))
elif swizzle == TensorMapSwizzle.B128_ATOM_64B:
swizzles.append(Swizzle(c=3, d=log2(64), r=log2(512)))
swizzles.append(Swizzle(c=3, d=log2(512), r=log2(4096)))
Copilot AI commented Oct 7, 2025
The magic numbers (128, 1024, 256, 2048, etc.) should be replaced with named constants to improve code readability and maintainability.
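One way the suggested cleanup could look is sketched below. This is a hypothetical refactor, not the Tilus code: the constant names, the `Swizzle` dataclass fields, and the integer-log2 helper are all assumptions, chosen only to show how naming the recurring byte widths makes the swizzle table self-documenting.

```python
# Hypothetical sketch of the named-constants refactor the comment suggests.
from dataclasses import dataclass


def ilog2(x: int) -> int:
    """Exact log2 for powers of two (avoids float math.log2)."""
    assert x > 0 and x & (x - 1) == 0, "expected a power of two"
    return x.bit_length() - 1


ATOM_BYTES = 16          # assumed width of one swizzle atom
B128_BYTES = 128         # 128-byte swizzle period
B128_ROW_BYTES = 1024    # assumed span of one fully 128B-swizzled row


@dataclass
class Swizzle:
    c: int  # number of swizzled bits
    d: int  # log2 of the base offset in bytes
    r: int  # log2 of the repeat period in bytes


# The B128 branch from the snippet above, with literals replaced by names:
b128_swizzles = [
    Swizzle(c=3, d=ilog2(ATOM_BYTES), r=ilog2(B128_BYTES)),
    Swizzle(c=3, d=ilog2(B128_BYTES), r=ilog2(B128_ROW_BYTES)),
]
print(b128_swizzles[0])  # Swizzle(c=3, d=4, r=7)
```

The other branches (B32, B64, the ATOM_32B variants) would follow the same pattern, each literal traded for a constant whose name records what the byte width means.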

yaoyaoding merged commit 9d216e8 into main on Oct 7, 2025 (9 checks passed)
yaoyaoding deleted the support-tcgen05 branch on October 7, 2025 at 19:10
2 participants