
Port to Visual C++.#36

Closed
jaykrell wants to merge 6 commits into ggml-org:master from jaykrell:jaykrell/msvc1

Conversation


@jaykrell jaykrell commented Mar 12, 2023

  • Combined nmake/Unix Makefile.
  • _alloca instead of variable size array.
  • Do not do arithmetic on void*. It could be cast to char*, but in this case moving the existing uint8_t* cast is enough.
  • C++20 for designated initializers.
  • Conditionalize on _WIN32, not specific compilers.

It builds. I haven't run it yet.
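Two of the bullets above are common MSVC portability fixes. A minimal sketch of both, with illustrative helper names that are not the actual ggml code:

```cpp
#include <stddef.h>
#include <stdint.h>

// MSVC has no C99 variable-length arrays, so `float buf[n];` can be
// replaced with a stack allocation via _alloca / alloca.
#ifdef _WIN32
#include <malloc.h>
#define STACK_ALLOC(type, n) ((type *) _alloca((n) * sizeof(type)))
#else
#include <alloca.h>
#define STACK_ALLOC(type, n) ((type *) alloca((n) * sizeof(type)))
#endif

// Arithmetic on void* is a GNU extension; casting to uint8_t* first
// makes the pointer math valid in standard C and C++.
static uint8_t * advance(void * base, size_t offset) {
    return (uint8_t *) base + offset;
}

// Example use of the stack buffer in place of a VLA.
static float sum_n_ones(size_t n) {
    float * buf = STACK_ALLOC(float, n);  // instead of: float buf[n];
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        buf[i] = 1.0f;
        s += buf[i];
    }
    return s;
}
```

Note that `_alloca` allocations live until the function returns, not the enclosing scope, so replacing a VLA inside a loop needs care.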

@ggerganov ggerganov mentioned this pull request Mar 12, 2023
@ggerganov
Member

We will merge #31 first and then see how to update the build system - either CMake or what you suggested here

@ggerganov
Member

We already merged CMake support, which provides a Windows build

@ggerganov ggerganov closed this Mar 14, 2023
rooprob pushed a commit to rooprob/llama.cpp that referenced this pull request Aug 2, 2023
jesusmb1995 pushed a commit to jesusmb1995/llama.cpp that referenced this pull request Oct 30, 2025
SamuelOliveirads pushed a commit to SamuelOliveirads/llama.cpp that referenced this pull request Dec 29, 2025
* Zen4 Flash Attention: WIP generalize to other types

Loading of data from K and V is now done via a template parameter,
which should make it easy to generalize to types other than
F16 for the K and V cache.

* Zen4 Flash Attention: it works for q4_0 and q8_0

* Zen4 Flash Attention: small q8_0 performance improvement

* Zen4 Flash Attention: add q4_1

* Delete unused stuff

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
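The template-parameter pattern this commit describes can be sketched as follows; the loader and kernel names here are toy illustrations, not the actual ik_llama.cpp code. The point is that one attention inner loop serves every K/V cache type, and supporting a new quantization only requires a new loader:

```cpp
#include <cstdint>

// Loader for plain f32 storage (stands in for the F16 path).
struct LoadF32 {
    static float load(const void * data, int i) {
        return ((const float *) data)[i];
    }
};

// Loader dequantizing a toy 8-bit block format: value = scale * int8.
struct LoadQ8 {
    struct Block { float scale; int8_t q[4]; };
    static float load(const void * data, int i) {
        const Block * b = (const Block *) data;
        return b[i / 4].scale * (float) b[i / 4].q[i % 4];
    }
};

// One dot-product kernel shared by all K/V cache types: the load
// routine is a template parameter, so it inlines with no dispatch cost.
template <typename Loader>
float dot_with_cache(const float * q, const void * kv, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) {
        s += q[i] * Loader::load(kv, i);
    }
    return s;
}
```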
spiritbuun referenced this pull request in spiritbuun/llama-cpp-turboquant-cuda Mar 27, 2026
Move Q forward rotation from graph-level ggml_turbo_wht op into FA
kernels to eliminate a separate kernel launch per layer during decode:

- Vec kernel (decode): shared memory FWHT with 64-thread parallel
  butterfly, zero extra kernel launches, CUDA graph compatible
- Prefill MMA: separate k_turbo_fwht_forward kernel with persistent
  cudaMalloc buffer (avoids cudaMallocAsync NaN on graph replay)
- V inverse rotation remains at graph level for CUDA graph compat

Results: decode 30.14 tok/s (-0.4%), prefill 1146 tok/s (-0.3%),
PPL identical to baseline (19.7152 on 10-chunk test).

Also adds temporal decay test (experiment TheTom#36) and benchmarks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
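The "parallel butterfly" in this commit refers to the standard fast Walsh-Hadamard transform. A scalar CPU sketch of the same butterfly structure the CUDA kernel parallelizes across threads (illustrative only, not the actual kernel):

```cpp
// In-place fast Walsh-Hadamard transform; n must be a power of two.
// Each pass combines pairs at distance h with an add/subtract butterfly;
// in the CUDA version, the inner pairs are handled by parallel threads
// operating on shared memory instead of this sequential loop.
void fwht(float * v, int n) {
    for (int h = 1; h < n; h <<= 1) {
        for (int i = 0; i < n; i += h << 1) {
            for (int j = i; j < i + h; ++j) {
                const float a = v[j];
                const float b = v[j + h];
                v[j]     = a + b;
                v[j + h] = a - b;
            }
        }
    }
}
```

Applying the transform twice multiplies every element by n, which is why a matching inverse rotation (here kept at graph level for V) is needed to round-trip.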
didlawowo pushed a commit to didlawowo/llama.cpp that referenced this pull request Mar 27, 2026
(same commit message as above)

2 participants