
Port to Visual C++.#36

Closed
jaykrell wants to merge 6 commits into ggml-org:master from jaykrell:jaykrell/msvc1

Conversation


@jaykrell jaykrell commented Mar 12, 2023

  • Combined nmake/Unix Makefile.
  • _alloca instead of variable size array.
  • Do not do arithmetic on void*. It could be cast to char*, but in this case moving the existing uint8_t* cast is enough.
  • C++20 for designated initializers.
  • Conditionalize on _WIN32, not specific compilers.

It builds. I haven't run it yet.
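Two of the bullets above are common MSVC portability fixes. A minimal sketch of both, with illustrative helper names that are not the actual ggml code:

```cpp
#include <stddef.h>
#include <stdint.h>

// MSVC has no C99 variable-length arrays, so `float buf[n];` can be
// replaced with a stack allocation via _alloca / alloca.
#ifdef _WIN32
#include <malloc.h>
#define STACK_ALLOC(type, n) ((type *) _alloca((n) * sizeof(type)))
#else
#include <alloca.h>
#define STACK_ALLOC(type, n) ((type *) alloca((n) * sizeof(type)))
#endif

// Arithmetic on void* is a GNU extension; casting to uint8_t* first
// makes the pointer math valid in standard C and C++.
static uint8_t * advance(void * base, size_t offset) {
    return (uint8_t *) base + offset;
}

// Example use of the stack buffer in place of a VLA.
static float sum_n_ones(size_t n) {
    float * buf = STACK_ALLOC(float, n);  // instead of: float buf[n];
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        buf[i] = 1.0f;
        s += buf[i];
    }
    return s;
}
```

Note that `_alloca` allocations live until the function returns, not the enclosing scope, so replacing a VLA inside a loop needs care.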

@ggerganov ggerganov mentioned this pull request Mar 12, 2023
@ggerganov
Member

We will merge #31 first and then see how to update the build system - either CMake or what you suggested here

@ggerganov
Member

We already merged CMake support, which provides a Windows build

@ggerganov ggerganov closed this Mar 14, 2023
rooprob pushed a commit to rooprob/llama.cpp that referenced this pull request Aug 2, 2023
jesusmb1995 pushed a commit to jesusmb1995/llama.cpp that referenced this pull request Oct 30, 2025
SamuelOliveirads pushed a commit to SamuelOliveirads/llama.cpp that referenced this pull request Dec 29, 2025
* Zen4 Flash Attention: WIP generalize to other types

Loading of data from K and V is now done via a template parameter,
which should make it easy to generalize to types other than
F16 for the K and V cache.

* Zen4 Flash Attention: it works for q4_0 and q8_0

* Zen4 Flash Attention: small q8_0 performance improvement

* Zen4 Flash Attention: add q4_1

* Delete unused stuff

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
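The template-parameter pattern this commit describes can be sketched as follows; the loader and kernel names here are toy illustrations, not the actual ik_llama.cpp code. The point is that one attention inner loop serves every K/V cache type, and supporting a new quantization only requires a new loader:

```cpp
#include <cstdint>

// Loader for plain f32 storage (stands in for the F16 path).
struct LoadF32 {
    static float load(const void * data, int i) {
        return ((const float *) data)[i];
    }
};

// Loader dequantizing a toy 8-bit block format: value = scale * int8.
struct LoadQ8 {
    struct Block { float scale; int8_t q[4]; };
    static float load(const void * data, int i) {
        const Block * b = (const Block *) data;
        return b[i / 4].scale * (float) b[i / 4].q[i % 4];
    }
};

// One dot-product kernel shared by all K/V cache types: the load
// routine is a template parameter, so it inlines with no dispatch cost.
template <typename Loader>
float dot_with_cache(const float * q, const void * kv, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) {
        s += q[i] * Loader::load(kv, i);
    }
    return s;
}
```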
spiritbuun referenced this pull request in spiritbuun/llama-cpp-turboquant-cuda Mar 27, 2026
Move Q forward rotation from graph-level ggml_turbo_wht op into FA
kernels to eliminate a separate kernel launch per layer during decode:

- Vec kernel (decode): shared memory FWHT with 64-thread parallel
  butterfly, zero extra kernel launches, CUDA graph compatible
- Prefill MMA: separate k_turbo_fwht_forward kernel with persistent
  cudaMalloc buffer (avoids cudaMallocAsync NaN on graph replay)
- V inverse rotation remains at graph level for CUDA graph compat

Results: decode 30.14 tok/s (-0.4%), prefill 1146 tok/s (-0.3%),
PPL identical to baseline (19.7152 on 10-chunk test).

Also adds temporal decay test (experiment TheTom#36) and benchmarks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
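The "parallel butterfly" in this commit refers to the standard fast Walsh-Hadamard transform. A scalar CPU sketch of the same butterfly structure the CUDA kernel parallelizes across threads (illustrative only, not the actual kernel):

```cpp
// In-place fast Walsh-Hadamard transform; n must be a power of two.
// Each pass combines pairs at distance h with an add/subtract butterfly;
// in the CUDA version, the inner pairs are handled by parallel threads
// operating on shared memory instead of this sequential loop.
void fwht(float * v, int n) {
    for (int h = 1; h < n; h <<= 1) {
        for (int i = 0; i < n; i += h << 1) {
            for (int j = i; j < i + h; ++j) {
                const float a = v[j];
                const float b = v[j + h];
                v[j]     = a + b;
                v[j + h] = a - b;
            }
        }
    }
}
```

Applying the transform twice multiplies every element by n, which is why a matching inverse rotation (here kept at graph level for V) is needed to round-trip.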
didlawowo pushed a commit to didlawowo/llama.cpp that referenced this pull request Mar 27, 2026
(same commit message as above)

2 participants