[Model Runner V2] Migration from v1 to v2, with more Llama and Mistral dense models [2/N]#42665
[Model Runner V2] Migration from v1 to v2, with more Llama and Mistral dense models [2/N]#42665yewentao256 wants to merge 14 commits into
Conversation
Signed-off-by: yewentao256 <zhyanwentao@126.com>
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
There was a problem hiding this comment.
Code Review
This pull request expands the DEFAULT_V2_MODEL_RUNNER_ARCHITECTURES in vllm/config/vllm.py to include LlamaForCausalLM and MistralForCausalLM. Additionally, it adds corresponding test cases for Llama 3.2 and Mistral 7B models in tests/test_config.py to ensure correct environment configuration. I have no feedback to provide as there were no review comments to evaluate.
| architectures=["LlamaForCausalLM"], | ||
| runner_type="generate", | ||
| is_moe=False, | ||
| is_quantized=False, |
There was a problem hiding this comment.
Why should quantization matter at all to mrv2? This might be too incremental an approach
There was a problem hiding this comment.
We are very conservative about adding supported features, let's do it step by step. As you can see in #41286 we fixed a bunch of issues just for qwen dense model. We don't want to involve too much possilble CI failures in one PR
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
| # Note: torch stock compile is not supported in v2, so keep | ||
| # the runner fixed across compile modes so this test only checks | ||
| # compilation correctness | ||
| base_env = {"VLLM_USE_V2_MODEL_RUNNER": "0"} |
There was a problem hiding this comment.
Can we not add this to the oracle? We should have it fall back automatically I think?
There was a problem hiding this comment.
#43233
This should be solved here, CI failure related, a lot of similar issues.
| # TODO: ngram / ngram_gpu / eagle are not supported by the v2 | ||
| # model runner yet. | ||
| if speculative_config.method in ("ngram", "ngram_gpu"): | ||
| unsupported.append("ngram/ngram_gpu speculative decoding") | ||
| elif speculative_config.method not in ("eagle", "eagle3", "mtp"): | ||
| elif speculative_config.method not in ("eagle3", "mtp"): |
There was a problem hiding this comment.
What is the reason for this? MRV2 does support eagle.
There was a problem hiding this comment.
We talked about this before, OOM issue related https://buildkite.com/vllm/ci/builds/66718#019e3bb8-eba0-4c70-8abd-eab98274facf
I temporally remove support of eagle, as there is a behavior change between v1 and v2
#35214 This PR might be related as it removes the full fp32 allocation which saves memory
There was a problem hiding this comment.
OK thanks, I think we should understand/fix the behavior change instead
Signed-off-by: Nick Hill <nickhill123@gmail.com>
|
@yewentao256 fyi I pushed another small test change that's needed |
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
Thanks @yewentao256, have now merged your commits here + additional fixes in #43458. |
Purpose
Make progress for #41286
Based on #39337 (review) by @NickLucche , we firstly expand to llama and mistral, if this works well we will expand to all dense later.
Test
Covered in CI