Bug: The token generation speed is slower compared to the upstream llama.cpp project #533

@BIGPPWONG

Description

Contact Details

No response

What happened?

Issue Description:

The token generation speed of llamafile is slower compared to the upstream llama.cpp project.

Details:

  • llamafile version 0.8.12; its ggml-cuda module was built with the command:

    nvcc -arch=all -DIGNORE123 -O3 --shared --use_fast_math --forward-unknown-to-host-compiler --compiler-options "/nologo /EHsc /O2 /GR /MT" -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_MMV_Y=1 -DGGML_USE_CUBLAS -DTEHFLASH -o ggml-cuda.dll.all ggml-cuda.cu -lcublas -lcuda
  • llama.cpp versions tested:
    • b3567 (latest version)
    • b2968 (released May 22nd, so llamafile 0.8.12 should already include all upstream changes up to this version)

In comparison:

  • llamafile only achieves 26 tokens/s.
  • Both versions of llama.cpp achieve 51 tokens/s.
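For context, the llama.cpp figures above can be reproduced with the `llama-bench` tool that ships with llama.cpp (the exact invocation used for this report is not stated; the flags below are llama-bench's standard `-m`/`-ngl` options):

```shell
# Benchmark the same model with all layers offloaded to the GPU;
# the tg128 row in the output reports generation speed in tokens/s.
./llama-bench -m qwen2-7b-instruct-q3_k_m.gguf -ngl 99
```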

GPU Utilization:

  • Using nvidia-smi, the GPU utilization for llamafile is observed to be 41%, whereas for llama.cpp, it reaches 80%.
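A convenient way to sample these utilization figures continuously while generation is running (using nvidia-smi's standard query options) is:

```shell
# Print GPU utilization and memory use once per second during a run.
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
```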

Model Used for Testing:

  • Model: Qwen/Qwen2-7B-Instruct-GGUF
  • Specific file: qwen2-7b-instruct-q3_k_m.gguf

Test Environment:

  • Operating System: Windows 10
  • GPU: RTX 2080
  • CUDA Version: 12.6

Version

llamafile v0.8.12

What operating system are you seeing the problem on?

Windows

Relevant log output

Logs Comparison:

  • llama.cpp Log (b3567):

    ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
    ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
    ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GeForce RTX 2080, compute capability 7.5, VMM: yes
    llm_load_tensors: ggml ctx size =    0.32 MiB
    llm_load_tensors: offloading 28 repeating layers to GPU
    llm_load_tensors: offloading non-repeating layers to GPU
    llm_load_tensors: offloaded 29/29 layers to GPU
    llm_load_tensors:        CPU buffer size =   223.33 MiB
    llm_load_tensors:      CUDA0 buffer size =  3402.96 MiB
  • llamafile Log:

    ggml_cuda_link: welcome to CUDA SDK with cuBLAS
    ...
    ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
    ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
    ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GeForce RTX 2080, compute capability 7.5, VMM: yes
    llm_load_tensors: ggml ctx size =    0.38 MiB
    llm_load_tensors: offloading 28 repeating layers to GPU
    llm_load_tensors: offloaded 28/29 layers to GPU
    llm_load_tensors:        CPU buffer size =  3626.29 MiB
    llm_load_tensors:      CUDA0 buffer size =  2976.59 MiB

The llamafile log is missing the line "offloading non-repeating layers to GPU": it offloads only 28/29 layers, and its CPU buffer size is correspondingly much larger (3626.29 MiB vs. 223.33 MiB). I'm wondering if this could be the cause of the performance gap.
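One experiment that might narrow this down, assuming llamafile accepts the same `-ngl`/`--n-gpu-layers` flag as llama.cpp: explicitly request all 29 layers and check whether the non-repeating (output) layer gets offloaded and the speed gap closes.

```shell
# Hypothetical check: force offload of all 29 layers (28 repeating + 1
# non-repeating), then compare tokens/s against the default run.
./llamafile -m qwen2-7b-instruct-q3_k_m.gguf -ngl 29 -p "Hello"
```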
