llama : quantize up to 31% faster on Linux with mmap #3206
ggerganov merged 3 commits into ggml-org:master
Conversation
How does incorporating mmap improve performance here?
When the `quantize` tool reads from disk, it normally has to load a whole tensor into memory before it can start converting it to f32 and quantizing it. This change allows the input tensor to be paged in on demand in 4096-byte chunks, so it can be read and converted simultaneously.

I tested this with 7B f16 to q4_0 on Windows and got ~15% faster times with mmap when the model is cached, and no difference when it is not cached. Under WSL2, mmap is always about 35% faster, cached or uncached. So I think mmap can be enabled on Windows too.
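For context, here is a minimal sketch of what on-demand paging looks like on POSIX systems. It illustrates the technique rather than the PR's actual code: the file name is a placeholder, and llama.cpp wraps this logic inside its model loader instead of calling mmap directly at the quantize call site.

```cpp
#include <sys/mman.h>   // mmap, munmap
#include <sys/stat.h>   // fstat
#include <fcntl.h>      // open
#include <unistd.h>     // close
#include <cstdio>       // perror

int main() {
    // "model.bin" is a placeholder input file for this sketch.
    int fd = open("model.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    // Instead of read()-ing each tensor into a buffer up front, map the whole
    // file. Pages are faulted in (typically 4096 bytes at a time) only as the
    // quantizer walks the tensor data, so disk I/O overlaps with conversion.
    void * addr = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    // ... convert to f32 and quantize directly out of `addr` ...

    munmap(addr, st.st_size);
    close(fd);
    return 0;
}
```

On Windows the equivalent technique is CreateFileMapping/MapViewOfFile; llama.cpp's mmap wrapper already has a Windows code path built on those APIs.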
llama.cpp
Outdated
```cpp
std::unique_ptr<llama_model_loader> ml(new llama_model_loader(fname_inp, /*use_mmap*/ false));
// mmap consistently increases speed on Linux, is inconsistent on macOS
// (possibly related to free memory), and has not been tested on Windows.
#ifdef __linux__
```
Suggested change:
```diff
-#ifdef __linux__
+#if defined(__linux__) || defined(_WIN32)
```
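Put together, the guard and the loader construction would look roughly like this after applying the suggestion. This is a sketch that assumes the surrounding llama.cpp quantize code: llama_model_loader and fname_inp come from the enclosing function, and the use_mmap constant is named here purely for illustration.

```cpp
// Sketch only: mirrors the diff above, with a named constant added for clarity.
#if defined(__linux__) || defined(_WIN32)
    constexpr bool use_mmap = true;   // mmap measured faster on Linux, WSL2, and Windows
#else
    constexpr bool use_mmap = false;  // inconsistent on macOS (see review comment below)
#endif

    std::unique_ptr<llama_model_loader> ml(new llama_model_loader(fname_inp, use_mmap));
```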
Let me run a few tests this week and we can merge.
ggerganov left a comment:
On M1 Pro with 32GB, quantizing 13B with mmap enabled is ~2x slower, so let's leave mmap off on Mac until we figure out something that always improves performance, regardless of model size.
…example
* 'master' of github.com:ggerganov/llama.cpp:
  ggml-cuda : perform cublas mat mul of quantized types as f16 (ggml-org#3412)
  llama.cpp : add documentation about rope_freq_base and scale values (ggml-org#3401)
  train : fix KQ_pos allocation (ggml-org#3392)
  llama : quantize up to 31% faster on Linux and Windows with mmap (ggml-org#3206)
  readme : update hot topics + model links (ggml-org#3399)
  readme : add link to grammars app (ggml-org#3388)
  swift : fix build on xcode 15 (ggml-org#3387)
  build : enable more non-default compiler warnings (ggml-org#3200)
  ggml_tensor: update the structure comments. (ggml-org#3283)
  ggml : release the requested thread pool resource (ggml-org#3292)
  llama.cpp : split llama_context_params into model and context params (ggml-org#3301)
  ci : multithreaded builds (ggml-org#3311)
  train : finetune LORA (ggml-org#2632)
  gguf : basic type checking in gguf_get_* (ggml-org#3346)
  gguf : make token scores and types optional (ggml-org#3347)
  ci : disable freeBSD builds due to lack of VMs (ggml-org#3381)
  llama : custom attention mask + parallel decoding + no context swaps (ggml-org#3228)
  docs : mark code as Bash (ggml-org#3375)
  readme : add Mistral AI release 0.1 (ggml-org#3362)
  ggml-cuda : perform cublas fp16 matrix multiplication as fp16 (ggml-org#3370)
…l-org#3206)
* llama : enable mmap in quantize on Linux -> 31% faster
* also enable mmap on Windows
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This is a follow-up to #3115. It enables mmap for quantize on Linux, since no one seems to have reported a performance decrease on that platform. Windows has not been tested, and macOS has seen both a speed-up and a slow-down.