Sequence Parallel Decode Attn Kernel #5

Merged
ZYHowell merged 14 commits into main from pr-sp-decode-kernel on Aug 8, 2024

Conversation

@ivanium (Owner) commented on Aug 1, 2024

This PR implements the SP decode kernel:

  • Initialize the flashinfer wrapper with the actual seq_lens.
  • The kernel replicates the Q tensors of the decoding batch across SP workers, gathers the output o and s tensors at the end, and merges their states (see the sketch after this list).
  • Fixed a bug in prefill communication that could lead to a deadlock due to an incorrect send/recv order.
  • Incorporate the KV cache store logic. This needs out_cache_loc support here.
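
To illustrate the gather-and-merge step, here is a minimal sketch (not the kernel code in this PR) of how partial decode-attention results from SP workers can be combined. The assumption is that each worker attends over only its local KV shard and produces a partial output o_i together with the log-sum-exp s_i of its local attention scores; weighting each o_i by exp(s_i - logsumexp(s)) then recovers attention over the full sequence. The function name and tensor layout below are hypothetical and chosen only for illustration.

```python
# Hypothetical sketch of merging gathered SP attention states (not the PR's kernel).
import torch

def merge_sp_attention_states(o_parts: torch.Tensor, s_parts: torch.Tensor):
    """Merge partial decode-attention outputs gathered from SP workers.

    o_parts: [num_sp, batch, num_heads, head_dim]  partial outputs per worker
    s_parts: [num_sp, batch, num_heads]            per-worker log-sum-exp of scores
    """
    # The global log-sum-exp across workers is the true softmax denominator.
    s_global = torch.logsumexp(s_parts, dim=0)              # [batch, num_heads]
    # Each worker's contribution is reweighted by exp(s_i - s_global).
    weights = torch.exp(s_parts - s_global.unsqueeze(0))    # [num_sp, batch, num_heads]
    o = (weights.unsqueeze(-1) * o_parts).sum(dim=0)        # [batch, num_heads, head_dim]
    return o, s_global
```

With this formulation, the merge is associative, so the gathered o and s tensors can be combined in any order (or pairwise in a tree) and still yield the same result as attending over the full, unsharded KV cache.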

@ivanium requested a review from ZYHowell on Aug 1, 2024, 23:30
Comment thread: python/sglang/srt/layers/radix_attention.py
@ZYHowell merged commit 1695aed into main on Aug 8, 2024