
server : fix swa-full logic #22288

Merged
ggerganov merged 1 commit into master from gg/server-fix-n-swa on Apr 24, 2026

@ggerganov

Member

Overview

fix #21468
alt #21749

Simplify the logic by augmenting llama_model_n_swa with a server_context.n_swa member. When --swa-full is passed, we set n_swa = 0 to simulate a non-SWA model.
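
A minimal sketch of the idea described above, not the code from this PR. Only llama_model_n_swa, server_context.n_swa and --swa-full come from the description; the struct names, the swa_full field and the effective_n_swa helper are hypothetical stand-ins so the snippet compiles on its own.

```cpp
#include <cstdint>

// Hypothetical stand-in for the loaded model; the real server would query
// llama_model_n_swa(model) instead of reading a field.
struct model_stub {
    int32_t n_swa; // sliding-window size reported by the model, 0 for non-SWA models
};

// Hypothetical stand-in for the parsed command-line parameters.
struct params_stub {
    bool swa_full; // assumed to be true when --swa-full was passed
};

// Hypothetical stand-in for server_context, holding the cached value.
struct server_context_stub {
    int32_t n_swa = 0;
};

// With a full-size SWA cache the server treats the model as non-SWA (n_swa = 0);
// otherwise it keeps the window size reported by the model.
inline int32_t effective_n_swa(const model_stub & model, const params_stub & params) {
    return params.swa_full ? 0 : model.n_swa;
}

// Compute the value once at load time so later cache-reuse checks read ctx.n_swa
// instead of querying the model directly.
inline void init_n_swa(server_context_stub & ctx, const model_stub & model, const params_stub & params) {
    ctx.n_swa = effective_n_swa(model, params);
}
```

Under these assumptions, any code path that branches on whether the model uses a sliding window only has to look at ctx.n_swa, which is what lets --swa-full behave like a non-SWA model.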

Requirements

@shipped-it


I'm a bit unsure about the approach. Why would the model be reported as non-SWA? It is still SWA, just with a full-size cache.

Also, you've made the suggestion to use --swa-full --no-mmproj. Correct me if I'm wrong, but in practice I've seen repetition issues when using Gemma 4.

I will confirm shortly if this fixes the issue, or if I get the repetition problem.

@ggerganov

Member Author

> Also, you've made the suggestion to use --swa-full --no-mmproj.

Cache reuse does not work with mmproj.

> Correct me if I'm wrong, but in practice I've seen repetition issues when using Gemma 4.

I am not following. This fixes the cache reuse logic - I am not aware of any repetitions.

@shipped-it


I confirm that it is fixed with this PR or #21749 (tested on ROCm).

Without PR:
warm req: prompt_n=821, prompt_ms=982

With PR:
warm req: prompt_n=5, prompt_ms=71

So about 14x faster (982 ms vs. 71 ms).

ggerganov merged commit ffdd983 into master on Apr 24, 2026
42 of 47 checks passed
ggerganov deleted the gg/server-fix-n-swa branch on April 24, 2026 07:17

Development

Successfully merging this pull request may close these issues.

cache reuse is not supported for Gemma 4 models despite -fa enabled and --swa-full

2 participants