[quant kernel] sgl-kernel support per_tensor_quant fp8 #3786
Merged
hebiao064 reviewed Mar 6, 2025
This PR adds a per-tensor FP8 quantization kernel to sgl-kernel, replacing vLLM's ops.scaled_fp8_quant and removing a dependency on a vLLM kernel; the reduction uses cub::Reduce. The new kernel is 10%-60% faster on H100 and 20%-100% faster on H200. It also fixes a launch failure hit when the number of tokens is very large:

Program hit cudaErrorInvalidConfiguration (error 9) due to "invalid configuration argument" on CUDA API call to cudaLaunchKernel.
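For context, per-tensor FP8 quantization computes one scale from the global absolute maximum of the tensor and divides every element by it. The sketch below is a minimal PyTorch reference for those semantics only; it is not the fused CUDA kernel this PR adds (which does the abs-max reduction and the cast on-device), and the function name and the zero-input guard are illustrative assumptions.

```python
import torch

# Largest finite value representable in float8_e4m3fn (448.0).
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def per_tensor_quant_fp8_ref(x: torch.Tensor):
    """Reference semantics: one scale shared by the whole tensor.

    scale = max(|x|) / FP8_MAX, then q = clamp(x / scale) cast to FP8.
    Hypothetical helper for illustration, not the sgl-kernel API.
    """
    amax = x.abs().max().to(torch.float32)
    # Guard against an all-zero tensor so the division below is defined.
    scale = torch.clamp(amax / FP8_MAX, min=torch.finfo(torch.float32).tiny)
    q = (x.to(torch.float32) / scale).clamp(-FP8_MAX, FP8_MAX)
    return q.to(torch.float8_e4m3fn), scale
```

Dequantization is `q.float() * scale`; "per-tensor" (as opposed to per-token or per-block) means this single scale covers every element.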
H100:

✅ All implementations match

per-tensor-quant-fp8-performance:

| batch_size | seq_len | VLLM | SGL Kernel |
| ---: | ---: | ---: | ---: |
| 16 | 64 | 37.919998 | 27.840000 |
| 16 | 128 | 61.888002 | 40.959999 |
| 16 | 256 | 121.023998 | 74.400000 |
| 16 | 512 | 233.967990 | 148.079991 |
| 16 | 1024 | 444.415987 | 273.887992 |
| 16 | 2048 | 862.720013 | 523.808002 |
| 32 | 64 | 62.080000 | 41.152000 |
| 32 | 128 | 120.768003 | 74.239999 |
| 32 | 256 | 233.855993 | 148.240000 |
| 32 | 512 | 444.512010 | 273.983985 |
| 32 | 1024 | 862.768054 | 523.775995 |
| 32 | 2048 | 1704.031944 | 1022.799969 |
| 64 | 64 | 120.576002 | 74.207999 |
| 64 | 128 | 233.088002 | 148.256004 |
| 64 | 256 | 443.136007 | 274.623990 |
| 64 | 512 | 860.224009 | 522.783995 |
| 64 | 1024 | 1700.991988 | 1021.952033 |
| 64 | 2048 | 3366.015911 | 2000.895977 |
| 128 | 64 | 233.375996 | 148.128003 |
| 128 | 128 | 443.071991 | 274.015993 |
| 128 | 256 | 860.239983 | 522.144020 |
| 128 | 512 | 1700.688004 | 1022.207975 |
| 128 | 1024 | 3364.768028 | 2003.328085 |
| 128 | 2048 | 6704.607964 | 3968.960047 |

H200:
✅ All implementations match

per-tensor-quant-fp8-performance:

| batch_size | seq_len | VLLM | SGL Kernel |
| ---: | ---: | ---: | ---: |
| 16 | 64 | 35.200000 | 25.024001 |
| 16 | 128 | 57.312001 | 36.352001 |
| 16 | 256 | 110.239998 | 63.263997 |
| 16 | 512 | 220.384002 | 116.544001 |
| 16 | 1024 | 423.103988 | 211.935997 |
| 16 | 2048 | 822.607994 | 399.199992 |
| 32 | 64 | 57.567999 | 36.320001 |
| 32 | 128 | 110.335998 | 63.295998 |
| 32 | 256 | 220.223993 | 116.448000 |
| 32 | 512 | 423.119992 | 211.888000 |
| 32 | 1024 | 821.503997 | 399.183989 |
| 32 | 2048 | 1627.104044 | 775.839984 |
| 64 | 64 | 110.431999 | 63.327998 |
| 64 | 128 | 220.543995 | 116.608001 |
| 64 | 256 | 423.168004 | 211.935997 |
| 64 | 512 | 821.695983 | 399.071991 |
| 64 | 1024 | 1625.263929 | 776.368022 |
| 64 | 2048 | 3222.768068 | 1522.783995 |
| 128 | 64 | 220.128000 | 116.576001 |
| 128 | 128 | 423.135996 | 211.840004 |
| 128 | 256 | 821.407974 | 399.183989 |
| 128 | 512 | 1625.616074 | 775.776029 |
| 128 | 1024 | 3221.407890 | 1522.495985 |
| 128 | 2048 | 6415.455818 | 3015.967846 |
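The tables compare per-call latency of the vLLM op against the new kernel across the (batch_size, seq_len) grid; lower is better (units are not stated above). A minimal timing harness in that spirit, assuming a CUDA device, Triton's do_bench for timing, an illustrative hidden size of 4096 (the actual benchmark's hidden size is not shown), and reusing per_tensor_quant_fp8_ref from the sketch earlier:

```python
import torch
import triton.testing

# per_tensor_quant_fp8_ref is the reference sketch defined above; the
# same harness could time vLLM's ops.scaled_fp8_quant or the new
# sgl-kernel op instead.

def bench_quant(fn, batch_size: int, seq_len: int, hidden: int = 4096) -> float:
    """Average latency of one quantization call, in microseconds.

    `fn` takes a (num_tokens, hidden) activation tensor; hidden=4096 is
    an assumed placeholder, not necessarily what the tables used.
    """
    x = torch.randn(batch_size * seq_len, hidden,
                    device="cuda", dtype=torch.float16)
    ms = triton.testing.do_bench(lambda: fn(x))  # do_bench reports ms
    return ms * 1e3

if __name__ == "__main__":
    # Sweep the same (batch_size, seq_len) grid as the tables above.
    for bs in (16, 32, 64, 128):
        for seq in (64, 128, 256, 512, 1024, 2048):
            us = bench_quant(per_tensor_quant_fp8_ref, bs, seq)
            print(f"batch_size={bs:<4} seq_len={seq:<5} {us:10.3f} us")
```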