LlamaLib is a high-level C++ and C# library for running Large Language Models (LLMs) anywhere - from PCs to mobile devices and VR headsets.

High-Level API
C++ and C# implementations with intuitive object-oriented design.

Self-Contained and Embedded
Runs embedded within your application. No need for a separate server or external processes. Zero external dependencies.

Runs Anywhere
Cross-platform and cross-device. Works on all major platforms:
- Desktop: Windows, macOS, Linux
- Mobile: Android, iOS
- VR/AR: Meta Quest, Apple Vision, Magic Leap

and hardware architectures:
- CPU: Intel, AMD, Apple Silicon
- GPU: NVIDIA, AMD, Metal

Architecture Detection at Runtime
Automatically selects the optimal backend at runtime, supporting all major GPU and CPU architectures.

Tiny Footprint
Integration requires only 10-200 MB depending on the embedded architectures. A custom implementation of tinyBLAS reduces the CUDA integration from 1.3 GB to 130 MB (cuBLAS also supported).

Production Ready
Designed for easy integration into C++ and C# applications. Supports both local and client-server deployment.

Developer experience:
Direct implementation of LLM operations (completion, tokenization, embeddings).
Clean implementation of the LLM service and clients, server-client architecture, and LLM agents.

Universal deployment:
LlamaLib is the only library that lets you build your application for any hardware.
Unlike alternatives that only let you build for a specific GPU vendor or for CPU-only execution, LlamaLib's architecture detection happens at runtime.
If your application targets the GPU, the GPU backend matching the user's hardware (NVIDIA, AMD, Metal) is selected automatically, with fallback to CPU.
CPU detection automatically identifies the user's CPU instruction set to select the optimal backend.
LlamaLib works on all platforms, from PC to mobile and VR.

Production ready:
Embeds directly in your application without opening ports or starting external servers.
LlamaLib has minimal disk space requirements, allowing compact builds, e.g. for mobile deployment.

- Star the repo and spread the word to support development!
- Join our Discord community.
- Contribute with feature requests, bug reports, or pull requests.
- LLM for Unity: The most widely used solution to integrate LLMs in games.
LlamaLib can be used with just a few lines of code.
The main classes are:
- LLMService: Implementation of the LLM service. For desktop environments, it performs runtime detection across multiple GPU and CPU backends.
- LLMClient: Implementation of local or remote clients.
- LLMAgent: High-level conversational AI with persistent chat history.
Core functionality:
- LLM core methods: completion, embeddings, tokenization, LoRAs
- Agent functionality: chat template formatting, chat history
- Server-client functionality: start/stop server, connect to local/remote server, SSL and authentication support
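
As a rough illustration of how these pieces might fit together, the sketch below builds a conversational agent on top of a local service. The LLMService constructor, start(), and completion() follow the completion example later in this section; the LLMAgent constructor and its chat() method are assumptions for illustration and may not match the actual API.

```cpp
#include <iostream>
#include <string>
#include "LlamaLib.h"

int main() {
    // Local service: constructor, start() and completion() follow the examples below
    LLMService llm("path/to/model.gguf");
    llm.start();

    // Direct completion on the service
    std::string reply = llm.completion("Say hi in one sentence.");
    std::cout << reply << std::endl;

    // Hypothetical agent API: keeps a persistent chat history on top of the service
    LLMAgent agent(llm);
    std::cout << agent.chat("Hi, I'm Alice.") << std::endl;
    std::cout << agent.chat("What's my name?") << std::endl;  // earlier turns are remembered

    return 0;
}
```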
The methods API can be found here:
and basic examples here:
More detailed function-level documentation can be found in the docs:
```cpp
#include <iostream>
#include <string>
#include "LlamaLib.h"

// LlamaLib automatically detects your hardware and selects the optimal backend
LLMService llm("path/to/model.gguf");
/* You can also specify:
   threads=-1,   // number of CPU threads to use
   gpu_layers=0, // number of layers to offload to GPU (if 0, GPU is not used)
   num_slots=1   // number of slots / clients supported in parallel
*/

// Start the service
llm.start();

// Generate a completion
std::string response = llm.completion("Hello, how are you?");
std::cout << response << std::endl;
```

```csharp
using System;
using LlamaLib;

// Same API, different language
LLMService llm = new LLMService("path/to/model.gguf");
/* You can also specify:
   threads=-1,   // number of CPU threads to use
   gpu_layers=0, // number of layers to offload to GPU (if 0, GPU is not used)
   num_slots=1   // number of slots / clients supported in parallel
*/

llm.Start();
string response = llm.Completion("Hello, how are you?");
Console.WriteLine(response);
```
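
The comments in both examples mention threads, gpu_layers, and num_slots, but not how they are passed. The line below is only a guess at what such a call might look like; the parameter order and names are assumptions, so check the API reference for the actual signature.

```cpp
// Hypothetical signature, for illustration only; the real constructor may differ.
// Offload 32 layers to the GPU and serve up to 2 clients in parallel.
LLMService llm("path/to/model.gguf", /*threads=*/-1, /*gpu_layers=*/32, /*num_slots=*/2);
```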