Use the same batch size threshold for enabling OpenBLAS and disabling ggml threading #577
ggerganov merged 1 commit into ggml-org:master from Piezoid:oblas_thread_limit
Conversation
@linouxis9 Does this improve the performance on your machine for processing the initial prompt when it is larger than 31 tokens and less than 256?
@ggerganov It's slightly faster than no BLAS (34s vs 40s for initial ingestion on llama-30B with the new chat example), but it heavily depends on the number of threads chosen and the batch size. I'm having a hard time finding the best parameters to evaluate the performance (number of BLAS and ggml threads, batch sizes...) and monitoring the speed of each run.
This seemed to provide a small but noticeable bump in performance for me. |
Commit 4640eff disabled ggml's multi-threading when OpenBLAS is used for processing large prompts.
This avoids running two thread pools at the same time.
However, ggml dispatches to OpenBLAS for tensors with dims >= 32, while llama.cpp only reduces the number of threads for batch sizes > 255, so the two thresholds disagree for batch sizes between 32 and 255.
See also this discussion: #229 (reply in thread) and issue #578