[Issue]: GPU Core dump when running CK-W8A8GEMM Kernel on GPU ID 1,2,3,4,5,6,7 #89

@tjtanaa

Problem Description

Running the kernel on input tensors that reside on a GPU with a non-zero ID (e.g. 1, 2, 3, 4, 5, 6, or 7) fails with the following error:

Memory access fault by GPU node-2 (Agent handle: 0x9b15d70) on address 0x7ee42d200000. Reason: Unknown.
tensor(False, device='cuda:1')
GPU core dump created: gpucore.10171
Aborted
root@tw024:/app# python ex.py 
Memory access fault by GPU node-2 (Agent handle: 0xa5f71a0) on address 0x7f532b800000. Reason: Unknown.
GPU core dump created: gpucore.10255
Aborted
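
To compare faults across runs (the two runs above report the same GPU node but different agent handles and addresses), the fault lines can be pulled out of the output with a small parser. This is only a sketch: the line format is inferred from the two fault lines above and nothing else, and `FAULT_RE` and `parse_faults` are hypothetical names, not part of ROCm or aiter.

```python
import re

# Pattern inferred from the fault lines in this report (an assumption,
# not a documented ROCm log format).
FAULT_RE = re.compile(
    r"Memory access fault by GPU node-(?P<node>\d+) "
    r"\(Agent handle: (?P<agent>0x[0-9a-fA-F]+)\) "
    r"on address (?P<address>0x[0-9a-fA-F]+)\. "
    r"Reason: (?P<reason>.+?)\.?$"
)


def parse_faults(log_text):
    """Return one dict per memory-access-fault line found in log_text."""
    faults = []
    for line in log_text.splitlines():
        m = FAULT_RE.match(line.strip())
        if m:
            faults.append({
                "node": int(m["node"]),
                "agent": m["agent"],
                "address": int(m["address"], 16),  # hex string to int
                "reason": m["reason"],
            })
    return faults
```

Lines that do not match the pattern (such as `Aborted` or the shell prompt) are skipped.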

Operating System

Ubuntu 22.04.4 LTS (Jammy Jellyfish)

CPU

AMD EPYC 9654 96-Core Processor

GPU

AMD Instinct MI300X

ROCm Version

ROCm 6.3.1

ROCm Component

composable_kernel

Steps to Reproduce

  1. Install aiter from the main branch.
  2. Run the following script:
from aiter.ops.gemm_op_a8w8 import gemm_a8w8_CK

import torch

# (M, N, K) problem sizes
SIZE_LIST = [
    (3840, 16384, 16384),
    (56, 8192, 7392),
]


def main():
    for M, N, K in SIZE_LIST:
        # All inputs live on GPU 1, a non-zero GPU ID.
        # torch.rand samples from [0, 1), so the int8 cast yields all zeros.
        A = torch.rand(size=(M, K), device="cuda:1").to(torch.int8)
        B = torch.rand(size=(K, N), device="cuda:1").to(torch.int8)
        scale_a = torch.ones((M, 1), device="cuda:1").to(torch.int32)
        scale_b = torch.ones((N, 1), device="cuda:1").to(torch.int32)
        # The memory access fault is raised when running this call.
        result = gemm_a8w8_CK(A, B.t(), scale_a, scale_b, dtype=torch.bfloat16)


if __name__ == "__main__":
    main()
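
One possible cause, stated here only as an unverified hypothesis, is that the kernel is launched on the current device (GPU 0 by default) while the tensors live on GPU 1. If that is the case, making the tensors' device current around the call might work around the fault. The sketch below shows the idea using `torch.cuda.device`, which is a real PyTorch context manager; `device_guard` and `call_on_tensor_device` are hypothetical helper names and not part of aiter.

```python
from contextlib import nullcontext


def device_guard(tensor):
    """Return a context manager that makes the tensor's GPU the current device.

    Inputs that are not GPU tensors get a no-op context instead.
    """
    if getattr(tensor, "is_cuda", False):
        import torch  # imported lazily; only needed for GPU tensors

        return torch.cuda.device(tensor.device)
    return nullcontext()


def call_on_tensor_device(fn, first, *args, **kwargs):
    """Call fn(first, *args, **kwargs) with first's device set as current."""
    with device_guard(first):
        return fn(first, *args, **kwargs)
```

In the script above, the call would become `result = call_on_tensor_device(gemm_a8w8_CK, A, B.t(), scale_a, scale_b, dtype=torch.bfloat16)`. Whether this actually avoids the core dump has not been tested here.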

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
