Reduce model loading time by maekawatoshiki · Pull Request #43 · ggml-org/llama.cpp

maekawatoshiki · 2023-03-12T10:43:37Z

Hello!

I noticed that the model loader does not use buffered I/O, so I added a read buffer to the input stream.
I measured the loading time only for LLaMA 7B on my M1 Pro MacBook, where it dropped from 1316 ms to 749 ms.
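
For readers who want to see the idea in isolation, here is a minimal sketch of giving a `std::ifstream` a larger read buffer via `pubsetbuf`. It is not the code from this PR: the helper name `sum_bytes_buffered` and the byte-summing workload are made up for illustration, and the buffer is installed before `open()` on the assumption that the standard library may ignore `pubsetbuf` on an already-open file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the PR): sums every byte of a file while
// reading through a stream buffer of the requested size.
// Returns 0 if the file cannot be opened.
std::uint64_t sum_bytes_buffered(const std::string & path, std::size_t buf_size) {
    assert(buf_size > 0);

    // Declared before the stream so the buffer outlives it.
    std::vector<char> f_buf(buf_size);

    std::ifstream fin;
    // Install the buffer before opening the file; the effect of calling
    // pubsetbuf on an open file stream is implementation-defined.
    fin.rdbuf()->pubsetbuf(f_buf.data(), static_cast<std::streamsize>(f_buf.size()));
    fin.open(path, std::ios::binary);
    if (!fin) {
        return 0;
    }

    std::uint64_t sum = 0;
    char chunk[4096];
    // read() fails on the final partial chunk, so also check gcount().
    while (fin.read(chunk, sizeof(chunk)) || fin.gcount() > 0) {
        for (std::streamsize i = 0; i < fin.gcount(); ++i) {
            sum += static_cast<unsigned char>(chunk[i]);
        }
    }
    return sum;
}
```

Calling `sum_bytes_buffered(path, 1024 * 1024)` reads through a 1 MiB buffer. Whether this helps, and by how much, depends on the platform and standard library, so it is worth measuring as above.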

maekawatoshiki · 2023-03-12T10:46:39Z

        fin.close();
    }

+    free(f_buf);


f_buf will not be freed if this function returns early, but I think it does not matter since it is a small amount of memory :)
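
One way to avoid the leak on early return is to let an owning object release the buffer. The sketch below is an illustration, not the PR's code: `load_with_owned_buffer`, `counting_free`, and the `g_frees` counter are hypothetical names, and the counter exists only to make the release observable. A `std::vector<char>` would give the same guarantee with less ceremony.

```cpp
#include <cassert>
#include <cstdlib>
#include <memory>

// Counts buffer releases, for illustration only.
static int g_frees = 0;

// Deleter that frees a malloc'ed buffer and records that it did so.
struct counting_free {
    void operator()(char * p) const {
        std::free(p);
        ++g_frees;
    }
};

// Hypothetical loader: returns false early when `fail_early` is set.
bool load_with_owned_buffer(bool fail_early) {
    std::unique_ptr<char, counting_free> f_buf(
        static_cast<char *>(std::malloc(1024 * 1024)));
    if (!f_buf) {
        return false; // allocation failed, nothing to release
    }
    if (fail_early) {
        return false; // early return: f_buf is still released here
    }
    // ... read the model through the buffer ...
    return true;      // normal return: f_buf is released here as well
}
```

Both return paths release the buffer exactly once, so no explicit `free(f_buf)` is needed at the end of the function.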

maekawatoshiki · 2023-03-13T01:39:16Z

Thank you for your review. Fixed as you mentioned.

Use buffering

640fd77

maekawatoshiki commented Mar 12, 2023

ggerganov requested changes Mar 12, 2023

Comment thread main.cpp Outdated

Use vector

efaa30e

maekawatoshiki requested a review from ggerganov March 13, 2023 05:31

Minor

3419f88

ggerganov approved these changes Mar 13, 2023

ggerganov merged commit 63fd76f into ggml-org:master Mar 13, 2023

apaz-cli mentioned this pull request Mar 15, 2023

mmap() - backed istream implementation #150

Closed

rooprob pushed a commit to rooprob/llama.cpp that referenced this pull request Aug 2, 2023

Merge pull request ggml-org#43 from krzysztof-jusiak/rmsnorm

669b75d

Speed up rmsnorm by using sqrtf/expf

Bearsaerker mentioned this pull request Mar 12, 2025

Eval bug: Gemma 3 extremly slow prompt processing when using quantized kv cache. #12352

Closed

wine99 pushed a commit to wine99/llama.cpp that referenced this pull request Feb 27, 2026

Merge pull request ggml-org#43 from cavusmustafa/additional_fixes_aft…

76775a5

…er_rebase hardcoded name handling for rope_freqs.weight

EzequielDM pushed a commit to EzequielDM/llama.cpp-bad that referenced this pull request Apr 2, 2026

Fix turbo4 C reference WHT dequant mismatch (ggml-org#43)

63b832b

Fix/turbo4 wht dequant

InfernalDread referenced this pull request in InfernalDread/llama.cpp Apr 4, 2026

Fix turbo4 C reference WHT dequant mismatch (#43)

a32f7c9

Fix/turbo4 wht dequant

itme-brain pushed a commit to itme-brain/llama.cpp that referenced this pull request Apr 16, 2026

Fix turbo4 C reference WHT dequant mismatch (ggml-org#43)

71a96cf

Fix/turbo4 wht dequant

erazortt pushed a commit to erazortt/llama.cpp that referenced this pull request Apr 17, 2026

Fix turbo4 C reference WHT dequant mismatch (ggml-org#43)

fe2ead9

Fix/turbo4 wht dequant

ausshir pushed a commit to ausshir/llama.cpp-iso-rocm that referenced this pull request Apr 20, 2026

Fix turbo4 C reference WHT dequant mismatch (ggml-org#43)

4ef1b0a

Fix/turbo4 wht dequant

Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026

Reduce model loading time (ggml-org#43)

a81c113

* Use buffering * Use vector * Minor --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026

iq2_tn: slightly faster PP (ggml-org#43)

f2ef628

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

ggerganov merged 3 commits into ggml-org:master from maekawatoshiki:master
