[Model] Support Mistral-Nemo #6548
Conversation
|
Tested merging your commits into 0.5.2 and it works fine. The model works up to 100k tokens (the max I can fit into my A100 with fp8 weights/fp8 cache). |
|
Yes! |
|
env:
VLLM_ATTENTION_BACKEND=XFORMERS CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server --model neuralmagic/Mistral-Nemo-Instruct-2407-FP8 --gpu-memory-utilization 0.75 --quantization fp8 --host 0.0.0.0 --port 1237 -tp 2 --max-model-len 17000 --served-model-name gpt --trust-remote-code --enable-prefix-caching
error: |
|
@maxin9966 You would need to apply the patch manually. It hasn't been released yet. See https://docs.vllm.ai/en/latest/getting_started/installation.html#build-from-source. I'm running mistralai/Mistral-Nemo-Instruct-2407 on an A100 with 100k and no issues. Built via docker... DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag nemo-vllm |
|
@jasonacox Alright, thank you very much. @mgoin Could you please confirm if the latest code supports the mistral-nemo models running in gptq or awq modes? FP8 is a bit too slow. |
|
Yes, mistral-nemo should have the same quantization support as mistral. |
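For reference, a minimal offline-inference sketch of what that looks like through vLLM's Python API; the FP8 checkpoint is the one linked in this thread, while the quantization mode, max_model_len, and memory fraction are illustrative values, not prescribed by this PR:
```python
# Minimal sketch (not from this PR): loading a quantized Mistral-Nemo checkpoint
# with vLLM's offline LLM API. For AWQ/GPTQ you would point `model` at a
# checkpoint in that format and set `quantization` accordingly.
from vllm import LLM, SamplingParams

llm = LLM(
    model="neuralmagic/Mistral-Nemo-Instruct-2407-FP8",
    quantization="fp8",
    max_model_len=17000,          # cap the 128k default so the KV cache fits
    gpu_memory_utilization=0.75,
)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Describe Mistral-Nemo in one sentence."], params)
print(outputs[0].outputs[0].text)
```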
|
@w013nad how did you test fp8 with an A100? I thought fp8 was only supported on newer hardware. thanks! |
How much GPU memory is needed for the fp16 model with 128K tokens?
|
Testing a single A100, 128k max-model-len, dtype=auto: weights take 23GB, but the full VRAM running footprint is 57GB. I'm getting an average of 42 TPS per session with aggregate throughput of 1,422 TPS using 512 concurrent threads (load testing). Docker:
git clone https://github.com/vllm-project/vllm.git
cd vllm
DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm-nemo
docker run -d --runtime nvidia --gpus '"device=0"' \
-v ${PWD}/models:/root/.cache/huggingface \
-p 8000:8000 \
-e NVIDIA_DISABLE_REQUIRE=true \
--env "HF_TOKEN=*******" \
--ipc=host \
--name vllm \
--restart unless-stopped \
vllm-nemo \
--model mistralai/Mistral-Nemo-Instruct-2407 \
--max-model-len 128000 \
--tensor-parallel-size 1 |
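On the memory question above, here is a rough back-of-the-envelope sketch of where a ~57GB footprint can come from. The layer/head/head_dim numbers are assumptions taken from Mistral-Nemo's published config, and vLLM actually preallocates the KV cache according to --gpu-memory-utilization rather than the exact token count, so treat this only as an estimate:
```python
# Back-of-the-envelope KV-cache sizing for Mistral-Nemo at 128k tokens.
# Config values below are assumptions (40 layers, 8 KV heads, head_dim 128);
# fp16 = 2 bytes per element.
num_layers, num_kv_heads, head_dim, dtype_bytes = 40, 8, 128, 2

per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K and V
cache_gib = per_token_bytes * 128_000 / 1024**3
weights_gib = 12e9 * dtype_bytes / 1024**3  # ~12B params in fp16

print(f"KV cache per token: {per_token_bytes / 1024:.0f} KiB")
print(f"KV cache @128k:     {cache_gib:.1f} GiB")
print(f"fp16 weights:       {weights_gib:.1f} GiB")
# ~20 GiB of cache plus ~22 GiB of weights, with activations and other overhead
# on top, is roughly consistent with the 57GB running footprint reported above.
```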
This is great! Do you know if it will work like this with a LoRA adapter currently? |
|
Tested with FP8 on 2x A100s, getting 86.60 tok/s |
|
@tensimixt I would love to see what aggregate (concurrent) tok/s you get with that setup. I use this simple load generator: https://github.com/jasonacox/TinyLLM/blob/main/loadtest.py |
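Not the linked loadtest.py, just a minimal sketch of that kind of concurrent measurement against a vLLM OpenAI-compatible endpoint; the host/port, model name, prompt, and worker counts are placeholders:
```python
# Minimal concurrent-throughput sketch (not the linked loadtest.py), assuming a
# vLLM OpenAI-compatible server on localhost:8000 serving Mistral-Nemo.
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def one_request(_):
    resp = client.chat.completions.create(
        model="mistralai/Mistral-Nemo-Instruct-2407",
        messages=[{"role": "user", "content": "Write a short paragraph about GPUs."}],
        max_tokens=128,
    )
    return resp.usage.completion_tokens

start = time.time()
with ThreadPoolExecutor(max_workers=64) as pool:   # 512 threads in the test above
    tokens = sum(pool.map(one_request, range(256)))
elapsed = time.time() - start
print(f"aggregate throughput: {tokens / elapsed:.0f} tok/s")
```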
|
@simon-mo will the latest docker image include this next week? Thanks |
|
Need help: why can't I use fp8?
|
I'm getting |
|
Ensure that your input sequence length doesn’t exceed the model’s maximum limit. Trim or truncate the input to fit within 1024 tokens. |
Initialization config
From model's config.json:
Input length is irrelevant, because it's |
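As a side note (not from this thread), one quick sanity check is to print what the upstream HF config itself declares as the maximum position embeddings, which for this model is in the 128k range rather than 1024:
```python
# Print the context length declared by the upstream HF config (assumes the
# mistralai/Mistral-Nemo-Instruct-2407 checkpoint is accessible).
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")
print(cfg.max_position_embeddings)  # a 128k-range value, not 1024
```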
|
@Isotr0py would you have an idea about GGUF issues with this architecture? |
|
@vladfaust Can you try updating to the latest vLLM? |
|
@Isotr0py yep, it works with the latest vLLM. |
FIX #6545
Patch was ported from huggingface/transformers#32050
Essentially there was a new head_dim override added to MistralConfig. We will look for that optional argument in the config and default to the previous self.hidden_size // self.total_num_heads behavior. We have also produced and validated an FP8 quantized checkpoint: https://huggingface.co/neuralmagic/Mistral-Nemo-Instruct-2407-FP8
Note that by default it will use a very large model length (128k) and may need max_model_len to be specified.
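The resolution logic described above amounts to roughly the following pattern (an illustrative sketch, not the literal diff; attribute names follow transformers' MistralConfig):
```python
# Sketch of the head_dim resolution described in the PR text above: prefer the
# optional `head_dim` field now present on MistralConfig, otherwise fall back to
# the old hidden_size // num_attention_heads derivation. Illustrative only.
def resolve_head_dim(config) -> int:
    head_dim = getattr(config, "head_dim", None)
    if head_dim is not None:
        return head_dim
    return config.hidden_size // config.num_attention_heads

# Mistral-Nemo example: hidden_size=5120 with 32 heads would derive 160,
# but the config's explicit head_dim override specifies 128.
class _NemoLikeConfig:
    hidden_size = 5120
    num_attention_heads = 32
    head_dim = 128

print(resolve_head_dim(_NemoLikeConfig()))  # -> 128
```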