Bug: The token generation speed is slower compared to the upstream llama.cpp project #533

@BIGPPWONG

Description

Contact Details

No response

What happened?

Issue Description:

The token generation speed of llamafile is slower compared to the upstream llama.cpp project.

Details:

  • llamafile version 0.8.12; its ggml-cuda module was built with the command:

    nvcc -arch=all -DIGNORE123 -O3 --shared --use_fast_math --forward-unknown-to-host-compiler --compiler-options "/nologo /EHsc /O2 /GR /MT" -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_MMV_Y=1 -DGGML_USE_CUBLAS -DTEHFLASH -o ggml-cuda.dll.all ggml-cuda.cu -lcublas -lcuda
  • llama.cpp versions tested:
    • b3567 (latest version)
    • b2968 (released May 22nd, so llamafile 0.8.12 should already include all upstream changes up to this version)

In comparison:

  • llamafile only achieves 26 tokens/s.
  • Both versions of llama.cpp achieve 51 tokens/s.
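For context, the llama.cpp figures above can be reproduced with the `llama-bench` tool that ships with llama.cpp (the exact invocation used for this report is not stated; the flags below are llama-bench's standard `-m`/`-ngl` options):

```shell
# Benchmark the same model with all layers offloaded to the GPU;
# the tg128 row in the output reports generation speed in tokens/s.
./llama-bench -m qwen2-7b-instruct-q3_k_m.gguf -ngl 99
```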

GPU Utilization:

  • Using nvidia-smi, the GPU utilization for llamafile is observed to be 41%, whereas for llama.cpp, it reaches 80%.
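A convenient way to sample these utilization figures continuously while generation is running (using nvidia-smi's standard query options) is:

```shell
# Print GPU utilization and memory use once per second during a run.
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
```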

Model Used for Testing:

  • Model: Qwen/Qwen2-7B-Instruct-GGUF
  • Specific file: qwen2-7b-instruct-q3_k_m.gguf

Test Environment:

  • Operating System: Windows 10
  • GPU: RTX 2080
  • CUDA Version: 12.6

Version

llamafile v0.8.12

What operating system are you seeing the problem on?

Windows

Relevant log output

Logs Comparison:

  • llama.cpp Log (b3567):

    ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
    ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
    ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GeForce RTX 2080, compute capability 7.5, VMM: yes
    llm_load_tensors: ggml ctx size =    0.32 MiB
    llm_load_tensors: offloading 28 repeating layers to GPU
    llm_load_tensors: offloading non-repeating layers to GPU
    llm_load_tensors: offloaded 29/29 layers to GPU
    llm_load_tensors:        CPU buffer size =   223.33 MiB
    llm_load_tensors:      CUDA0 buffer size =  3402.96 MiB
  • llamafile Log:

    ggml_cuda_link: welcome to CUDA SDK with cuBLAS
    ...
    ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
    ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
    ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GeForce RTX 2080, compute capability 7.5, VMM: yes
    llm_load_tensors: ggml ctx size =    0.38 MiB
    llm_load_tensors: offloading 28 repeating layers to GPU
    llm_load_tensors: offloaded 28/29 layers to GPU
    llm_load_tensors:        CPU buffer size =  3626.29 MiB
    llm_load_tensors:      CUDA0 buffer size =  2976.59 MiB

The llamafile log is missing the line "offloading non-repeating layers to GPU": it offloads only 28/29 layers, and its CPU buffer size is correspondingly much larger (3626.29 MiB vs. 223.33 MiB). I'm wondering if this could be the cause of the performance gap.
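One experiment that might narrow this down, assuming llamafile accepts the same `-ngl`/`--n-gpu-layers` flag as llama.cpp: explicitly request all 29 layers and check whether the non-repeating (output) layer gets offloaded and the speed gap closes.

```shell
# Hypothetical check: force offload of all 29 layers (28 repeating + 1
# non-repeating), then compare tokens/s against the default run.
./llamafile -m qwen2-7b-instruct-q3_k_m.gguf -ngl 29 -p "Hello"
```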
