Contact Details
No response
What happened?
Issue Description:
Token generation in llamafile is roughly half as fast as in the upstream llama.cpp project on the same model and hardware.
Details:
llamafile version 0.8.12, with ggml-cuda built using the command:
nvcc -arch=all -DIGNORE123 -O3 --shared --use_fast_math --forward-unknown-to-host-compiler --compiler-options "/nologo /EHsc /O2 /GR /MT" -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_MMV_Y=1 -DGGML_USE_CUBLAS -DTEHFLASH -o ggml-cuda.dll.all ggml-cuda.cu -lcublas -lcuda
llama.cpp versions tested:
- b3567 (latest version)
- b2968 (the version from May 22nd, meaning llamafile 0.8.12 should already include all changes up to this release)
In comparison (a reproduction sketch follows this list):
- llamafile only achieves 26 tokens/s.
- Both versions of llama.cpp achieve 51 tokens/s.
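For reference, a comparable measurement can be reproduced with something like the following; the prompt and flag values are illustrative rather than my exact invocation. Tokens/s comes from llama-bench's result table and from the timing summary llamafile prints on exit:

# llama.cpp: the bundled benchmark tool reports tokens/s directly
llama-bench -m qwen2-7b-instruct-q3_k_m.gguf -ngl 99 -n 128

# llamafile: tokens/s appears in the timing summary printed on exit
llamafile -m qwen2-7b-instruct-q3_k_m.gguf -ngl 999 -p "Hello" -n 128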
GPU Utilization:
- Using nvidia-smi, the GPU utilization for llamafile is observed to be 41%, whereas for llama.cpp it reaches 80% (sampling command below).
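The utilization figures come from polling nvidia-smi; a once-per-second sample can be taken like this, using standard nvidia-smi query options:

nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1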
Model Used for Testing:
- Model: Qwen/Qwen2-7B-Instruct-GGUF
- Specific file: qwen2-7b-instruct-q3_k_m.gguf
Test Environment:
- Operating System: Windows 10
- GPU: RTX 2080
- CUDA Version: 12.6
Version
llamafile v0.8.12
What operating system are you seeing the problem on?
Windows
Relevant log output
Logs Comparison:
- llama.cpp Log (b3567):
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2080, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size = 0.32 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: CPU buffer size = 223.33 MiB
llm_load_tensors: CUDA0 buffer size = 3402.96 MiB
- llamafile Log:
ggml_cuda_link: welcome to CUDA SDK with cuBLAS
...
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2080, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size = 0.38 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloaded 28/29 layers to GPU
llm_load_tensors: CPU buffer size = 3626.29 MiB
llm_load_tensors: CUDA0 buffer size = 2976.59 MiB
The llamafile log is missing the line "offloading non-repeating layers to GPU", and it reports only 28/29 layers offloaded versus 29/29 for llama.cpp. The buffer sizes tell the same story: llamafile keeps a 3626.29 MiB CPU buffer against llama.cpp's 223.33 MiB, suggesting the non-repeating (output) layer stays on the CPU. I'm wondering if this could be the reason for the performance issue.
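If the missing offload is indeed the cause, one experiment (assuming llamafile honors an explicit layer count the same way llama.cpp does) would be to force all 29 layers onto the GPU and see whether the gap closes:

llamafile -m qwen2-7b-instruct-q3_k_m.gguf -ngl 29

If the log then shows "offloaded 29/29 layers to GPU" but throughput is still 26 tokens/s, the bottleneck is elsewhere.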