Closed
Conversation
- Combined nmake/Unix Makefile.
- `_alloca` instead of a variable-size array.
- Cast `void *` to `char *` for pointer arithmetic.
- C++20 for designated initializers.

It builds. I haven't run it yet.
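The portability changes listed above can be sketched as follows. This is an illustrative example, not the actual patch; the struct and function names (`params`, `copy_at_offset`) are invented for the demonstration.

```cpp
#include <cstddef>
#include <cstring>
#ifdef _MSC_VER
#include <malloc.h>   // _alloca on MSVC
#else
#include <alloca.h>   // alloca elsewhere
#endif

// Hypothetical parameter struct, only to show a designated initializer.
struct params {
    int   n_ctx;
    float temp;
};

// Copies n floats to base + offset bytes; returns the destination pointer.
void * copy_at_offset(void * base, size_t offset, const float * src, size_t n) {
    // VLAs (`float buf[n]`) are not standard C++ and MSVC rejects them;
    // _alloca/alloca provides a runtime-sized stack buffer instead.
#ifdef _MSC_VER
    float * buf = (float *) _alloca(n * sizeof(float));
#else
    float * buf = (float *) alloca(n * sizeof(float));
#endif
    for (size_t i = 0; i < n; ++i) buf[i] = src[i];

    // Arithmetic on void* is a GNU extension; cast to char* for byte math.
    void * dst = (char *) base + offset;
    memcpy(dst, buf, n * sizeof(float));

    // Designated initializers are standard only since C++20 (hence the bump).
    params p = { .n_ctx = 512, .temp = 0.8f };
    (void) p;
    return dst;
}
```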
Merged
Member
We will merge #31 first and then see how to update the build system - either CMake or what you suggested here.
Member
We already merged CMake support, which provides Windows builds.
rooprob
pushed a commit
to rooprob/llama.cpp
that referenced
this pull request
Aug 2, 2023
jesusmb1995
pushed a commit
to jesusmb1995/llama.cpp
that referenced
this pull request
Oct 30, 2025
rerouting common logs to callback
SamuelOliveirads
pushed a commit
to SamuelOliveirads/llama.cpp
that referenced
this pull request
Dec 29, 2025
* Zen4 Flash Attention: WIP generalize to other types

  Loading of data from K and V is now done via a template parameter, so this should make it easy to generalize to types other than F16 for the K and V cache.

* Zen4 Flash Attention: it works for q4_0 and q8_0
* Zen4 Flash Attention: small q8_0 performance improvement
* Zen4 Flash Attention: add q4_1
* Delete unused stuff

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
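The "loader as a template parameter" pattern that commit describes can be sketched roughly as below. This is a simplified CPU illustration under stated assumptions, not the actual kernel; the names (`LoadF32`, `LoadQ8`, `dot_with_cache`) and the toy one-scale quantization format are invented.

```cpp
#include <cstddef>
#include <cstdint>

// Trivial loader: the cache already holds plain floats.
struct LoadF32 {
    static float load(const void * data, size_t i) {
        return ((const float *) data)[i];
    }
};

// Toy quantized loader: one float scale followed by int8 values.
struct LoadQ8 {
    static float load(const void * data, size_t i) {
        const float scale = *(const float *) data;
        const int8_t * q = (const int8_t *)((const char *) data + sizeof(float));
        return scale * q[i];
    }
};

// The kernel body is written once; the cache format is chosen at compile
// time via the Loader template parameter, as the commit message describes.
template <typename Loader>
float dot_with_cache(const float * q, const void * cache, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += q[i] * Loader::load(cache, i);
    }
    return sum;
}
```

Adding a new cache type (q4_0, q4_1, ...) then means writing one more loader struct rather than duplicating the attention kernel.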
spiritbuun
referenced
this pull request
in spiritbuun/llama-cpp-turboquant-cuda
Mar 27, 2026
Move Q forward rotation from graph-level ggml_turbo_wht op into FA kernels to eliminate a separate kernel launch per layer during decode:

- Vec kernel (decode): shared-memory FWHT with 64-thread parallel butterfly, zero extra kernel launches, CUDA graph compatible
- Prefill MMA: separate k_turbo_fwht_forward kernel with persistent cudaMalloc buffer (avoids cudaMallocAsync NaN on graph replay)
- V inverse rotation remains at graph level for CUDA graph compat

Results: decode 30.14 tok/s (-0.4%), prefill 1146 tok/s (-0.3%), PPL identical to baseline (19.7152 on 10-chunk test). Also adds temporal decay test (experiment TheTom#36) and benchmarks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
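For reference, the butterfly at the core of the fast Walsh-Hadamard transform mentioned in that commit looks roughly like this. This is a plain scalar CPU version for illustration only, not the CUDA shared-memory implementation the commit ships.

```cpp
#include <cstddef>

// In-place fast Walsh-Hadamard transform; n must be a power of two.
// Each pass pairs elements h apart and replaces them with their
// sum and difference (the "butterfly"), doubling h each pass.
void fwht(float * x, size_t n) {
    for (size_t h = 1; h < n; h <<= 1) {
        for (size_t i = 0; i < n; i += h << 1) {
            for (size_t j = i; j < i + h; ++j) {
                float a = x[j];
                float b = x[j + h];
                x[j]     = a + b;  // butterfly: sum
                x[j + h] = a - b;  // butterfly: difference
            }
        }
    }
}
```

The transform is its own inverse up to a factor of n, which is why a forward rotation in one kernel can be undone by an inverse rotation elsewhere in the graph.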
didlawowo
pushed a commit
to didlawowo/llama.cpp
that referenced
this pull request
Mar 27, 2026
Move Q forward rotation from graph-level ggml_turbo_wht op into FA kernels to eliminate a separate kernel launch per layer during decode:

- Vec kernel (decode): shared-memory FWHT with 64-thread parallel butterfly, zero extra kernel launches, CUDA graph compatible
- Prefill MMA: separate k_turbo_fwht_forward kernel with persistent cudaMalloc buffer (avoids cudaMallocAsync NaN on graph replay)
- V inverse rotation remains at graph level for CUDA graph compat

Results: decode 30.14 tok/s (-0.4%), prefill 1146 tok/s (-0.3%), PPL identical to baseline (19.7152 on 10-chunk test). Also adds temporal decay test (experiment ggml-org#36) and benchmarks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>