Llama3.2 vision model support #1551

Merged
hnyls2002 merged 48 commits into main from llama-3.2 on Oct 21, 2024
Conversation

hnyls2002 (Collaborator) commented Oct 1, 2024

Motivation

  • Support encoder-decoder architectures in SGLang.
  • Support the Llama vision model.
  • Support CUDA graph and prefix caching for the Llama vision model.

Note that to support CUDA graph for an encoder-decoder architecture like Llama vision (mllama), we must make encoder_lens part of the CUDA graph inputs, because full_text_row_masked_out_mask is derived from encoder_lens to skip the text-only requests in a mixed batch.

However, the current CUDA graph backend (flashinfer) seems to have trouble handling mixed batches, so for now we only accept pure image decoding batches.
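To illustrate the idea: a mask like full_text_row_masked_out_mask can be derived from encoder_lens by zeroing every decoder-token row that belongs to a request with no encoder (image) input. This is only a minimal sketch; the function name and shapes are hypothetical and not SGLang's actual implementation.

```python
def build_full_text_row_mask(encoder_lens, seq_lens):
    """Hypothetical sketch: produce one mask value per decoder token.

    Tokens of requests with encoder_len == 0 (text-only requests) are
    masked out, so their cross-attention output is zeroed in a mixed batch.
    """
    mask = []
    for enc_len, seq_len in zip(encoder_lens, seq_lens):
        # All seq_len tokens of this request share the same mask value.
        mask.extend([1.0 if enc_len > 0 else 0.0] * seq_len)
    return mask

# Request 1 is text-only (encoder_len == 0), so its 3 tokens are masked out.
print(build_full_text_row_mask([6, 0, 4], [2, 3, 1]))
# → [1.0, 1.0, 0.0, 0.0, 0.0, 1.0]
```

Since this mask depends on encoder_lens, capturing it correctly requires encoder_lens to be a CUDA graph input rather than a compile-time constant.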

Todo in the following PRs:

  • Split attention backends: sliding_window, single_attention, cross_attention
  • Optimize encoder cache location indexing and reduce memory usage.

Modifications

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@hnyls2002 hnyls2002 marked this pull request as draft October 1, 2024 06:46
@hnyls2002 hnyls2002 force-pushed the llama-3.2 branch 2 times, most recently from 00cd46a to 2aebd9f on October 1, 2024 08:12
Comment thread python/pyproject.toml Outdated
@hnyls2002 hnyls2002 marked this pull request as ready for review October 21, 2024 03:52
Comment thread python/sglang/srt/layers/attention/triton_backend.py
Comment thread python/sglang/srt/mem_cache/memory_pool.py Outdated
Comment thread python/sglang/srt/models/qwen2_vl.py
@hnyls2002 hnyls2002 merged commit 94cde10 into main Oct 21, 2024
@hnyls2002 hnyls2002 deleted the llama-3.2 branch October 21, 2024 22:01
    def set_kv_buffer(
        self,
-       layer_id: int,
+       layer: RadixAttention,
Contributor

Why is it necessary to change the data type from int to RadixAttention here?

@zhaochenyang20 zhaochenyang20 mentioned this pull request Mar 3, 2025
timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025

4 participants