LLamaSharp is a cross-platform C# library that provides .NET bindings for llama.cpp, enabling local execution of Large Language Models (LLMs) on CPU and GPU. The library wraps the native llama.cpp C++ implementation using P/Invoke and provides high-level APIs for model loading, inference, chat sessions, text embeddings, and multimodal processing.
This page introduces the fundamental architecture, package structure, and capabilities of LLamaSharp. For installation instructions and configuration, see Installation and Setup. For detailed API usage, see Core Architecture, Executors and Inference, and Chat and Conversation Management.
LLamaSharp enables running GGUF-formatted language models (LLaMA, Mistral, Phi, Gemma, Qwen, and others) directly in .NET applications without requiring Python or external APIs. The library handles model loading, tokenization, inference, sampling, and detokenization locally. Its primary entry points are:

- LLamaWeights (LLama/LLamaWeights.cs)
- ILLamaExecutor implementations (LLama/Abstractions/ILLamaExecutor.cs)
- ChatSession (LLama/ChatSession.cs)
- LLamaEmbedder (LLama/LLamaEmbedder.cs)
- NativeApi (LLama/Native/NativeApi.cs)

The current version (v0.25.0) targets netstandard2.0 and net8.0, based on llama.cpp commit 11dd5a44eb180e1d69fac24d3852b5222d66fb7f (LLama/LLamaSharp.csproj10-25).
Sources: README.md1-23 LLama/LLamaSharp.csproj1-33
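As a quick orientation, the sketch below shows the most common entry path: load a GGUF model with LLamaWeights, create an executor, and stream generated text. Treat it as a minimal sketch; the model path is a placeholder and the ContextSize/GpuLayerCount values are illustrative defaults, not recommendations.

```csharp
using System;
using LLama;
using LLama.Common;

// Minimal sketch: one-shot text generation with StatelessExecutor.
// "model.gguf" is a placeholder path; tune ContextSize/GpuLayerCount for your hardware.
var parameters = new ModelParams("model.gguf")
{
    ContextSize = 4096,   // prompt + generation window
    GpuLayerCount = 0     // >0 offloads layers to the GPU when a GPU backend package is installed
};

using var weights = LLamaWeights.LoadFromFile(parameters);
var executor = new StatelessExecutor(weights, parameters);

var inferenceParams = new InferenceParams { MaxTokens = 128 };
await foreach (var token in executor.InferAsync("What is a GGUF file?", inferenceParams))
{
    Console.Write(token);
}
```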
LLamaSharp is distributed as a modular NuGet package ecosystem consisting of a core library, platform-specific backend packages, and framework integration packages.
Package Selection: Applications install LLamaSharp plus one or more backend packages depending on target hardware. Backends contain native binaries (*.dll, *.so, *.dylib) loaded at runtime via NativeLibraryConfig (LLama/Native/NativeLibraryConfig.cs).
Sources: README.md86-106 LLama/LLamaSharp.csproj60-72
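Backend selection can also be steered explicitly before any model is loaded. The sketch below is a hedged example of configuring NativeLibraryConfig; the exact fluent method names have changed across releases, so verify them against the version you are using.

```csharp
using System;
using LLama.Native;

// Must run before the first call into the native library (e.g. before LLamaWeights.LoadFromFile).
// Member names below (All, WithCuda, WithLogCallback) reflect recent releases and may differ
// in older versions of LLamaSharp.
NativeLibraryConfig.All
    .WithCuda()                               // prefer the CUDA backend when its package is installed
    .WithLogCallback((level, message) =>      // surface native llama.cpp log output
        Console.Write($"[{level}] {message}"));
```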
| Package | Target Framework | Dependencies | Primary Interfaces |
|---|---|---|---|
| LLamaSharp | netstandard2.0, net8.0 | Microsoft.Extensions.AI.Abstractions | IChatClient, IEmbeddingGenerator |
| LLamaSharp.semantic-kernel | netstandard2.0, net8.0 | Microsoft.SemanticKernel.Abstractions | IChatCompletionService, ITextGenerationService |
| LLamaSharp.kernel-memory | net8.0 | Microsoft.KernelMemory.Abstractions | ITextGenerator, ITextEmbeddingGenerator |
Sources: LLama/LLamaSharp.csproj44-57 LLama.SemanticKernel/LLamaSharp.SemanticKernel.csproj36-38 LLama.KernelMemory/LLamaSharp.KernelMemory.csproj29-31
LLamaSharp implements a five-layer architecture from high-level user APIs down to native llama.cpp integration:
Layer Responsibilities:
- SafeHandle pattern with reference counting

Sources: LLama/ChatSession.cs LLama/LLamaWeights.cs LLama/LLamaContext.cs LLama/Native/NativeApi.cs
LLamaSharp provides multiple execution patterns and features, each implemented by specific classes:
| Capability | Primary Class | Location | Purpose |
|---|---|---|---|
| Model Loading | LLamaWeights | LLama/LLamaWeights.cs | Load GGUF files, manage model lifetime |
| Context Management | LLamaContext | LLama/LLamaContext.cs | Tokenization, encoding, decoding, KV cache |
| Chat Sessions | ChatSession | LLama/ChatSession.cs | Structured conversation management |
| Interactive Chat | InteractiveExecutor | LLama/InteractiveExecutor.cs | Stateful multi-turn conversations |
| Stateless Inference | StatelessExecutor | LLama/StatelessExecutor.cs | One-shot text generation |
| Instruction Following | InstructExecutor | LLama/InstructExecutor.cs | Instruction-tuned models |
| Parallel Processing | BatchedExecutor | LLama.Batched/BatchedExecutor.cs | Multiple concurrent conversations |
| Text Embeddings | LLamaEmbedder | LLama/LLamaEmbedder.cs | Vector generation for RAG |
| Sampling Control | DefaultSamplingPipeline | LLama/Sampling/DefaultSamplingPipeline.cs | Token selection strategies |
| Grammar Constraints | Grammar | LLama/Grammars/Grammar.cs | Structured output generation |
| Multimodal | LLavaWeights | LLama/LLavaWeights.cs | LLaVA image+text models |
| State Persistence | SessionState | LLama/SessionState.cs | Save/load conversation state |
Sources: LLama/InteractiveExecutor.cs LLama/StatelessExecutor.cs LLama/LLamaEmbedder.cs LLama/Grammars/Grammar.cs
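To make the table concrete, the following sketch wires several of these pieces together: LLamaWeights for loading, LLamaContext for state, InteractiveExecutor for multi-turn inference, and ChatSession on top. It assumes a chat-tuned GGUF model at a placeholder path and the ChatAsync overload that accepts a ChatHistory.Message (present in recent releases).

```csharp
using System;
using System.Collections.Generic;
using LLama;
using LLama.Common;

// Sketch: stateful chat using ChatSession over InteractiveExecutor.
var parameters = new ModelParams("chat-model.gguf") { ContextSize = 4096 };

using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
var executor = new InteractiveExecutor(context);
var session = new ChatSession(executor);

var inferenceParams = new InferenceParams
{
    MaxTokens = 256,
    AntiPrompts = new List<string> { "User:" }   // stop when the model starts the next user turn
};

await foreach (var text in session.ChatAsync(
    new ChatHistory.Message(AuthorRole.User, "Summarize what GGUF is."),
    inferenceParams))
{
    Console.Write(text);
}
```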
Data flows from user input through tokenization, inference, and sampling, then back to text output.
Key Methods:
- LLamaWeights.Tokenize(): Converts text to token IDs (LLama/LLamaWeights.cs)
- LLamaContext.Decode(): Runs an inference batch through native llama.cpp (LLama/LLamaContext.cs)
- ISamplingPipeline.Sample(): Selects the next token from logits (LLama/Abstractions/ISamplingPipeline.cs)
- StreamingTokenDecoder.Add(): Converts tokens back to text (LLama/Native/StreamingTokenDecoder.cs)

Sources: LLama/LLamaWeights.cs LLama/LLamaContext.cs LLama/Native/StreamingTokenDecoder.cs LLama/Sampling/DefaultSamplingPipeline.cs
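The sampling step in this flow is configurable. Below is a hedged sketch of plugging DefaultSamplingPipeline into InferenceParams; properties such as Temperature, TopP, and TopK exist in recent releases, but the available knobs and their defaults vary between versions.

```csharp
using LLama.Common;
using LLama.Sampling;

// Sketch: customize how the next token is selected from the logits produced by Decode().
// Property names reflect recent DefaultSamplingPipeline releases; consult the version you use.
var inferenceParams = new InferenceParams
{
    MaxTokens = 200,
    SamplingPipeline = new DefaultSamplingPipeline
    {
        Temperature = 0.6f,   // lower = more deterministic
        TopP = 0.9f,          // nucleus sampling cutoff
        TopK = 40             // restrict sampling to the 40 most likely tokens
    }
};
// Pass inferenceParams to any ILLamaExecutor.InferAsync(...) or ChatSession.ChatAsync(...) call.
```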
LLamaSharp implements standard interfaces from Microsoft's AI ecosystem, enabling drop-in compatibility with existing frameworks:
Interface Implementation Map:
| Framework Interface | LLamaSharp Implementation | Package |
|---|---|---|
| IChatClient | LLamaChatClient | LLamaSharp core |
| IEmbeddingGenerator<string, Embedding<float>> | LLamaEmbeddingGenerator | LLamaSharp core |
| IChatCompletionService | LLamaSharpChatCompletion | LLamaSharp.semantic-kernel |
| ITextGenerationService | LLamaSharpTextCompletion | LLamaSharp.semantic-kernel |
| ITextEmbeddingGenerationService | LLamaSharpTextEmbedding | LLamaSharp.semantic-kernel |
| ITextGenerator | LlamaSharpTextGenerator | LLamaSharp.kernel-memory |
| ITextEmbeddingGenerator | LLamaSharpTextEmbeddingGenerator | LLamaSharp.kernel-memory |
Sources: LLama/LLamaSharp.csproj54 LLama.SemanticKernel/LLamaSharp.SemanticKernel.csproj37 LLama.KernelMemory/LLamaSharp.KernelMemory.csproj30
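The embedding-oriented adapters in the table ultimately call into LLamaEmbedder. A direct-use sketch is shown below; note that the GetEmbeddings return shape has changed between releases (a single vector in older versions, a list of vectors in newer ones), so the shape handling here is an assumption, as is the Embeddings context flag (called EmbeddingMode in older versions).

```csharp
using System;
using LLama;
using LLama.Common;

// Sketch: generate an embedding vector directly with LLamaEmbedder.
var parameters = new ModelParams("embedding-model.gguf")
{
    Embeddings = true   // produce embeddings instead of logits (EmbeddingMode in older versions)
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var embedder = new LLamaEmbedder(weights, parameters);

// Assumed return type: a list of float[] vectors (recent releases).
var embeddings = await embedder.GetEmbeddings("LLamaSharp runs GGUF models in .NET.");
Console.WriteLine($"Vectors: {embeddings.Count}, dimensions: {embeddings[0].Length}");
```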
LLamaSharp downloads pre-compiled native binaries from a GitHub release (SciSharp/LLamaSharpBinaries) during the build process. The BinaryReleaseId property in the project file specifies the llama.cpp commit hash used for the current version.
The build system automatically:
- Checks runtimes/release_id.txt for the cached binary version (LLama/LLamaSharp.csproj82-84)
- Downloads deps.zip from https://github.com/SciSharp/LLamaSharpBinaries/releases/download/{BinaryReleaseId}/deps.zip if not cached (LLama/LLamaSharp.csproj69-72)
- Extracts the native libraries (*.dll, *.so, *.dylib, *.metal) to runtimes/deps/ (LLama/LLamaSharp.csproj75-78)

At runtime, NativeLibraryConfig loads the appropriate binary based on platform and available acceleration (CPU, CUDA, Vulkan, Metal).
Sources: LLama/LLamaSharp.csproj60-90
LLamaSharp supports GGUF-formatted models compatible with llama.cpp. Common model families include LLaMA, Mistral, Phi, Gemma, and Qwen.
Models must be in GGUF format. PyTorch (.pth) or Hugging Face (.bin) checkpoints require conversion using llama.cpp's Python scripts (e.g., convert_hf_to_gguf.py). The library recommends quantized models (Q4_0, Q5_K_M, Q8_0) for reduced memory usage with minimal quality loss.
Sources: README.md108-117 README.md242-269
The repository includes a comprehensive examples project demonstrating various usage patterns.
The examples project references all integration packages (LLama.Examples/LLama.Examples.csproj32-37) and includes dependencies for Semantic Kernel, Kernel Memory, and multimodal processing (LLama.Examples/LLama.Examples.csproj17-30).
Sources: LLama.Examples/LLama.Examples.csproj README.md119-174
LLamaSharp version 0.25.0 is built against llama.cpp commit 11dd5a44eb180e1d69fac24d3852b5222d66fb7f. Custom-compiled backends must use this exact commit to ensure ABI compatibility. The version history mapping is maintained in the README (README.md242-269).
Breaking Change Policy: Minor version increments (0.x.0) may introduce breaking API changes. Patch versions (0.x.y) maintain backward compatibility within the same minor version.