Optimize GQA/MQA #1649
Conversation
    qk *= sm_scale
    # NOTE: inf - inf = nan, and nan will lead to error
    qk_mask = history_len >= (start_n + offs_n)
    if window_size > 0:
Is it related to local attention?
    head_size_v: tl.constexpr,
    SPLIT_K: tl.constexpr,
    BLOCK_DMODEL: tl.constexpr,
    BLOCK_DV: tl.constexpr,
what does 'DV' refer to?
Head dim of Value.
Value could have a different head_dim from key and query (whose head dim is referred to as BLOCK_DMODEL).
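For illustration, a minimal sketch (not the PR's exact code) of how the two block sizes can be derived independently. pick_block_sizes is a hypothetical helper; the Lk=576 / Lv=512 shapes are just one example of a layout where the value head_dim differs from the query/key head_dim:

```python
import triton

# Hypothetical helper: BLOCK_DMODEL covers the query/key head_dim,
# BLOCK_DV covers the (possibly different) value head_dim.
def pick_block_sizes(Lk: int, Lv: int):
    BLOCK_DMODEL = triton.next_power_of_2(Lk)  # q/k head dim, padded to a power of 2
    BLOCK_DV = triton.next_power_of_2(Lv)      # v head dim, padded independently
    return BLOCK_DMODEL, BLOCK_DV

print(pick_block_sizes(Lk=576, Lv=512))  # -> (1024, 512)
```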
    else:
        BLOCK_DV = triton.next_power_of_2(Lv)
    BLOCK_M = max(16, min(BLOCK, 16384 // BLOCK_DMODEL))
    if Lk > 512 and BLOCK > 32:
"and".
Lk>512 => BLOCK_DMODEL>=1024
the key smem usage is BLOCK * BLOCK_DMODEL * sizeof(half).
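A back-of-the-envelope check of that argument (a sketch with assumed sizes, not measured numbers):

```python
# Rough shared-memory estimate for one key tile, assuming fp16 (2 bytes per element).
def key_tile_smem_bytes(BLOCK: int, BLOCK_DMODEL: int, sizeof_half: int = 2) -> int:
    return BLOCK * BLOCK_DMODEL * sizeof_half

# Lk > 512 pads BLOCK_DMODEL up to >= 1024, so any BLOCK > 32 already needs
# more than 64 KB for the key tile alone.
print(key_tile_smem_bytes(BLOCK=64, BLOCK_DMODEL=1024))  # 131072 bytes = 128 KB
```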
    sm_scale = 1.0 / (Lq**0.5)
    if sm_scale is None:
        sm_scale = 1.0 / (Lq**0.5)
Just out of curiosity, which model uses an sm_scale other than 1/sqrt(dim)?
https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite-Chat/blob/main/modeling_deepseek.py#L749
And MLA might use the sm_scale of the original dim.
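As a sketch of the default-vs-override behaviour shown in the diff above (a hypothetical standalone helper, not the PR's function):

```python
# Callers may pass an explicit sm_scale (e.g. DeepSeek-V2's adjusted scale);
# otherwise fall back to the usual 1/sqrt(head_dim).
def resolve_sm_scale(Lq: int, sm_scale: float = None) -> float:
    if sm_scale is None:
        sm_scale = 1.0 / (Lq ** 0.5)
    return sm_scale
```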
    max_seqlen: int,
    window_size: int = None,
    sm_scale: float = None,
    shared_kv: int = False,
int -> bool.
What does shared_kv mean?
https://kexue.fm/archives/10091#Part%203
K is [ci, rope], V is [ci]. V shares the same memory with K.
And it is not recommended to enable this flag, since the layout of the shared V is not friendly to matmul.
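A small illustration of what "V shares the same memory with K" means in practice (hypothetical cache shapes, assuming an MLA-style layout where Lk = Lv + rope dims):

```python
import torch

Lk, Lv = 576, 512                              # e.g. 512 latent dims + 64 rope dims
k_cache = torch.randn(1, 1024, 1, Lk).half()   # hypothetical [batch, seq, kv_head, dim] layout
v_cache = k_cache[..., :Lv]                    # V is a view into K's buffer, not a copy
assert v_cache.data_ptr() == k_cache.data_ptr()
# v_cache rows are strided by Lk rather than Lv, which is why this shared layout
# is awkward for matmul and the flag is not recommended.
```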
Enable tensor cores in the decoding MHA kernel.
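As a rough sketch of the idea (the config numbers are assumptions, not taken from the PR): in decoding, each request contributes a single query token per head, so a per-head qk product is effectively a GEMV and tl.dot cannot use tensor cores; with GQA/MQA, the query heads that share one KV head can be stacked along the M dimension so the tile reaches tl.dot's minimum size:

```python
import triton

# Decode step: 1 query token per request. Stacking the query heads that share a
# KV head along M turns (1 x d) @ (d x n) GEMVs into a real matmul for tl.dot.
num_q_heads, num_kv_heads = 48, 8             # assumed GQA config (48 q heads, 8 kv heads)
heads_per_kv = num_q_heads // num_kv_heads    # 6 query heads share each KV head
BLOCK_H = max(16, triton.next_power_of_2(heads_per_kv))  # M-tile >= 16, tl.dot's minimum
print(BLOCK_H)  # 16
```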
internlm2-chat-20b tp=2 batch_size=256 num_reqs=3000 (origin vs. this PR)
internlm2-chat-20b tp=1 batch_size=128 num_reqs=3000 (origin vs. this PR)
LLama-3-8b-instruct tp=1 batch_size=128 num_reqs=3000 (origin vs. this PR)