For FP8 on Blackwell GPUs (e.g., B200), DeepGEMM has an accuracy issue, and it is enabled by default. The unit test for FP8 block DeepGEMM on Blackwell also fails.
test_deep_gemm_blackwell (__main__.TestDeepGemmBlackwell.test_deep_gemm_blackwell) ... [Test Method] test_deep_gemm_blackwell
test_deep_gemm_blackwell (__main__.TestDeepGemmBlackwell.test_deep_gemm_blackwell) (M=64, NKs=(2112, 7168), block_size=[128, 128], out_dtype=torch.bfloat16, seed=0) ... FAIL
test_deep_gemm_blackwell (__main__.TestDeepGemmBlackwell.test_deep_gemm_blackwell) (M=64, NKs=(1536, 7168), block_size=[128, 128], out_dtype=torch.bfloat16, seed=0) ... FAIL
test_deep_gemm_blackwell (__main__.TestDeepGemmBlackwell.test_deep_gemm_blackwell) (M=128, NKs=(2112, 7168), block_size=[128, 128], out_dtype=torch.bfloat16, seed=0) ... FAIL
test_deep_gemm_blackwell (__main__.TestDeepGemmBlackwell.test_deep_gemm_blackwell) (M=128, NKs=(1536, 7168), block_size=[128, 128], out_dtype=torch.bfloat16, seed=0) ... FAIL
test_deep_gemm_blackwell (__main__.TestDeepGemmBlackwell.test_deep_gemm_blackwell) (M=512, NKs=(2112, 7168), block_size=[128, 128], out_dtype=torch.bfloat16, seed=0) ... FAIL
test_deep_gemm_blackwell (__main__.TestDeepGemmBlackwell.test_deep_gemm_blackwell) (M=512, NKs=(1536, 7168), block_size=[128, 128], out_dtype=torch.bfloat16, seed=0) ... FAIL
test_deep_gemm_blackwell (__main__.TestDeepGemmBlackwell.test_deep_gemm_blackwell) (M=1024, NKs=(2112, 7168), block_size=[128, 128], out_dtype=torch.bfloat16, seed=0) ... FAIL
test_deep_gemm_blackwell (__main__.TestDeepGemmBlackwell.test_deep_gemm_blackwell) (M=1024, NKs=(1536, 7168), block_size=[128, 128], out_dtype=torch.bfloat16, seed=0) ... FAIL
test_deep_gemm_blackwell (__main__.TestDeepGemmBlackwell.test_deep_gemm_blackwell) (M=4096, NKs=(2112, 7168), block_size=[128, 128], out_dtype=torch.bfloat16, seed=0) ... FAIL
test_deep_gemm_blackwell (__main__.TestDeepGemmBlackwell.test_deep_gemm_blackwell) (M=4096, NKs=(1536, 7168), block_size=[128, 128], out_dtype=torch.bfloat16, seed=0) ... FAIL
======================================================================
FAIL: test_deep_gemm_blackwell (__main__.TestDeepGemmBlackwell.test_deep_gemm_blackwell) (M=64, NKs=(2112, 7168), block_size=[128, 128], out_dtype=torch.bfloat16, seed=0)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/scratch.lsam_gpu/bench-b200/repo-sglang/python/sglang/test/test_block_fp8_deep_gemm_blackwell.py", line 247, in test_deep_gemm_blackwell
    self._test_deep_gemm_blackwell(*params)
  File "/home/scratch.lsam_gpu/bench-b200/repo-sglang/python/sglang/test/test_block_fp8_deep_gemm_blackwell.py", line 230, in _test_deep_gemm_blackwell
    torch.testing.assert_close(out, ref_out, atol=1e-1, rtol=1e-2)
  File "/home/scratch.lsam_gpu/miniconda3/envs/b200/lib/python3.12/site-packages/torch/testing/_comparison.py", line 1587, in assert_close
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!
Mismatched elements: 43956 / 135168 (32.5%)
Greatest absolute difference: 0.5625 at index (18, 878) (up to 0.1 allowed)
Greatest relative difference: 1020.0 at index (28, 1466) (up to 0.01 allowed)
======================================================================
FAIL: test_deep_gemm_blackwell (__main__.TestDeepGemmBlackwell.test_deep_gemm_blackwell) (M=64, NKs=(1536, 7168), block_size=[128, 128], out_dtype=torch.bfloat16, seed=0)
Describe the bug
For FP8 on Blackwell GPUs (e.g., B200), DeepGEMM has an accuracy issue, and it is enabled by default. The unit test for FP8 block DeepGEMM on Blackwell also fails.
Reproduction
Unit test command:
python3 sglang/python/sglang/test/test_block_fp8_deep_gemm_blackwell.py
Output
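For anyone reading the failure summary in the test log, here is a minimal pure-Python sketch of the element-wise rule that `torch.testing.assert_close` documents: `|actual - expected| <= atol + rtol * |expected|`. The helper name `mismatch_summary` and the sample values are hypothetical and chosen only for illustration; they are not taken from the failing test. The sketch also skips elements whose reference value is exactly zero when computing the relative difference, which is a simplification.

```python
# Sketch of the closeness rule documented for torch.testing.assert_close:
#   |actual - expected| <= atol + rtol * |expected|
# Pure Python; the helper name and the sample values below are illustrative only.

def mismatch_summary(actual, expected, atol=1e-1, rtol=1e-2):
    """Return (mismatch count, greatest abs diff, greatest rel diff)."""
    mismatched = 0
    max_abs = 0.0
    max_rel = 0.0
    for a, e in zip(actual, expected):
        abs_diff = abs(a - e)
        # An element fails when its absolute error exceeds the combined tolerance.
        if abs_diff > atol + rtol * abs(e):
            mismatched += 1
        max_abs = max(max_abs, abs_diff)
        # Simplification: skip the relative difference when the reference is zero.
        if e != 0:
            max_rel = max(max_rel, abs_diff / abs(e))
    return mismatched, max_abs, max_rel


# Hypothetical values: the second reference element is close to zero.
expected = [10.0, 0.0005, -3.0]
actual = [10.15, 0.5105, -3.05]

count, max_abs, max_rel = mismatch_summary(actual, expected)
print(f"mismatched={count}, max_abs={max_abs:.4f}, max_rel={max_rel:.1f}")
```

With a near-zero reference value, a moderate absolute error already yields a very large relative difference, so the mismatched fraction and the greatest absolute difference in the log are probably the more telling figures.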
Environment
Blackwell GPU (e.g., B200)
SGLang main branch