LLamaSharp is a cross-platform C# library that provides .NET bindings for llama.cpp, enabling local execution of Large Language Models (LLMs) on CPU and GPU. The library wraps the native llama.cpp C++ implementation using P/Invoke and provides high-level APIs for model loading, inference, chat sessions, text embeddings, and multimodal processing.
This page introduces the fundamental architecture, package structure, and capabilities of LLamaSharp. For installation instructions and configuration, see Installation and Setup. For detailed API usage, see Core Architecture, Executors and Inference, and Chat and Conversation Management.
LLamaSharp enables running GGUF-formatted language models (LLaMA, Mistral, Phi, Gemma, Qwen, and others) directly in .NET applications without requiring Python or external APIs. The library handles model loading, tokenization, inference, sampling, and detokenization locally. Its primary entry points are:

- LLamaWeights (LLama/LLamaWeights.cs)
- ILLamaExecutor implementations (LLama/Abstractions/ILLamaExecutor.cs)
- ChatSession (LLama/ChatSession.cs)
- LLamaEmbedder (LLama/LLamaEmbedder.cs)
- NativeApi (LLama/Native/NativeApi.cs)

The current version (v0.25.0) targets netstandard2.0 and net8.0, based on llama.cpp commit 11dd5a44eb180e1d69fac24d3852b5222d66fb7f (LLama/LLamaSharp.csproj10-25).
Sources: README.md1-23 LLama/LLamaSharp.csproj1-33
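As a quick orientation, the sketch below shows the most common entry path: load a GGUF model with LLamaWeights, create an executor, and stream generated text. Treat it as a minimal sketch; the model path is a placeholder and the ContextSize/GpuLayerCount values are illustrative defaults, not recommendations.

```csharp
using System;
using LLama;
using LLama.Common;

// Minimal sketch: one-shot text generation with StatelessExecutor.
// "model.gguf" is a placeholder path; tune ContextSize/GpuLayerCount for your hardware.
var parameters = new ModelParams("model.gguf")
{
    ContextSize = 4096,   // prompt + generation window
    GpuLayerCount = 0     // >0 offloads layers to the GPU when a GPU backend package is installed
};

using var weights = LLamaWeights.LoadFromFile(parameters);
var executor = new StatelessExecutor(weights, parameters);

var inferenceParams = new InferenceParams { MaxTokens = 128 };
await foreach (var token in executor.InferAsync("What is a GGUF file?", inferenceParams))
{
    Console.Write(token);
}
```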
LLamaSharp is distributed as a modular NuGet package ecosystem consisting of a core library, platform-specific backend packages, and framework integration packages.
Package Selection: Applications install LLamaSharp plus one or more backend packages depending on target hardware. Backends contain native binaries (*.dll, *.so, *.dylib) loaded at runtime via NativeLibraryConfig (LLama/Native/NativeLibraryConfig.cs).
Sources: README.md86-106 LLama/LLamaSharp.csproj60-72
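Backend selection can also be steered explicitly before any model is loaded. The sketch below is a hedged example of configuring NativeLibraryConfig; the exact fluent method names have changed across releases, so verify them against the version you are using.

```csharp
using System;
using LLama.Native;

// Must run before the first call into the native library (e.g. before LLamaWeights.LoadFromFile).
// Member names below (All, WithCuda, WithLogCallback) reflect recent releases and may differ
// in older versions of LLamaSharp.
NativeLibraryConfig.All
    .WithCuda()                               // prefer the CUDA backend when its package is installed
    .WithLogCallback((level, message) =>      // surface native llama.cpp log output
        Console.Write($"[{level}] {message}"));
```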
| Package | Target Framework | Dependencies | Primary Interfaces |
|---|---|---|---|
| LLamaSharp | netstandard2.0, net8.0 | Microsoft.Extensions.AI.Abstractions | IChatClient, IEmbeddingGenerator |
| LLamaSharp.semantic-kernel | netstandard2.0, net8.0 | Microsoft.SemanticKernel.Abstractions | IChatCompletionService, ITextGenerationService |
| LLamaSharp.kernel-memory | net8.0 | Microsoft.KernelMemory.Abstractions | ITextGenerator, ITextEmbeddingGenerator |
Sources: LLama/LLamaSharp.csproj44-57 LLama.SemanticKernel/LLamaSharp.SemanticKernel.csproj36-38 LLama.KernelMemory/LLamaSharp.KernelMemory.csproj29-31
LLamaSharp implements a five-layer architecture from high-level user APIs down to native llama.cpp integration:
Layer Responsibilities:
- SafeHandle pattern with reference counting

Sources: LLama/ChatSession.cs LLama/LLamaWeights.cs LLama/LLamaContext.cs LLama/Native/NativeApi.cs
LLamaSharp provides multiple execution patterns and features, each implemented by specific classes:
| Capability | Primary Class | Location | Purpose |
|---|---|---|---|
| Model Loading | LLamaWeights | LLama/LLamaWeights.cs | Load GGUF files, manage model lifetime |
| Context Management | LLamaContext | LLama/LLamaContext.cs | Tokenization, encoding, decoding, KV cache |
| Chat Sessions | ChatSession | LLama/ChatSession.cs | Structured conversation management |
| Interactive Chat | InteractiveExecutor | LLama/InteractiveExecutor.cs | Stateful multi-turn conversations |
| Stateless Inference | StatelessExecutor | LLama/StatelessExecutor.cs | One-shot text generation |
| Instruction Following | InstructExecutor | LLama/InstructExecutor.cs | Instruction-tuned models |
| Parallel Processing | BatchedExecutor | LLama.Batched/BatchedExecutor.cs | Multiple concurrent conversations |
| Text Embeddings | LLamaEmbedder | LLama/LLamaEmbedder.cs | Vector generation for RAG |
| Sampling Control | DefaultSamplingPipeline | LLama/Sampling/DefaultSamplingPipeline.cs | Token selection strategies |
| Grammar Constraints | Grammar | LLama/Grammars/Grammar.cs | Structured output generation |
| Multimodal | LLavaWeights | LLama/LLavaWeights.cs | LLaVA image+text models |
| State Persistence | SessionState | LLama/SessionState.cs | Save/load conversation state |
Sources: LLama/InteractiveExecutor.cs LLama/StatelessExecutor.cs LLama/LLamaEmbedder.cs LLama/Grammars/Grammar.cs
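To make the table concrete, the following sketch wires several of these pieces together: LLamaWeights for loading, LLamaContext for state, InteractiveExecutor for multi-turn inference, and ChatSession on top. It assumes a chat-tuned GGUF model at a placeholder path and the ChatAsync overload that accepts a ChatHistory.Message (present in recent releases).

```csharp
using System;
using System.Collections.Generic;
using LLama;
using LLama.Common;

// Sketch: stateful chat using ChatSession over InteractiveExecutor.
var parameters = new ModelParams("chat-model.gguf") { ContextSize = 4096 };

using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);
var executor = new InteractiveExecutor(context);
var session = new ChatSession(executor);

var inferenceParams = new InferenceParams
{
    MaxTokens = 256,
    AntiPrompts = new List<string> { "User:" }   // stop when the model starts the next user turn
};

await foreach (var text in session.ChatAsync(
    new ChatHistory.Message(AuthorRole.User, "Summarize what GGUF is."),
    inferenceParams))
{
    Console.Write(text);
}
```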
Data flows from user input through tokenization, inference, and sampling, then back to text output.
Key Methods:
- LLamaWeights.Tokenize(): Converts text to token IDs (LLama/LLamaWeights.cs)
- LLamaContext.Decode(): Runs an inference batch through native llama.cpp (LLama/LLamaContext.cs)
- ISamplingPipeline.Sample(): Selects the next token from logits (LLama/Abstractions/ISamplingPipeline.cs)
- StreamingTokenDecoder.Add(): Converts tokens back to text (LLama/Native/StreamingTokenDecoder.cs)

Sources: LLama/LLamaWeights.cs LLama/LLamaContext.cs LLama/Native/StreamingTokenDecoder.cs LLama/Sampling/DefaultSamplingPipeline.cs
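The sampling step in this flow is configurable. Below is a hedged sketch of plugging DefaultSamplingPipeline into InferenceParams; properties such as Temperature, TopP, and TopK exist in recent releases, but the available knobs and their defaults vary between versions.

```csharp
using LLama.Common;
using LLama.Sampling;

// Sketch: customize how the next token is selected from the logits produced by Decode().
// Property names reflect recent DefaultSamplingPipeline releases; consult the version you use.
var inferenceParams = new InferenceParams
{
    MaxTokens = 200,
    SamplingPipeline = new DefaultSamplingPipeline
    {
        Temperature = 0.6f,   // lower = more deterministic
        TopP = 0.9f,          // nucleus sampling cutoff
        TopK = 40             // restrict sampling to the 40 most likely tokens
    }
};
// Pass inferenceParams to any ILLamaExecutor.InferAsync(...) or ChatSession.ChatAsync(...) call.
```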
LLamaSharp implements standard interfaces from Microsoft's AI ecosystem, enabling drop-in compatibility with existing frameworks:
Interface Implementation Map:
| Framework Interface | LLamaSharp Implementation | Package |
|---|---|---|
| IChatClient | LLamaChatClient | LLamaSharp core |
| IEmbeddingGenerator<string, Embedding<float>> | LLamaEmbeddingGenerator | LLamaSharp core |
| IChatCompletionService | LLamaSharpChatCompletion | LLamaSharp.semantic-kernel |
| ITextGenerationService | LLamaSharpTextCompletion | LLamaSharp.semantic-kernel |
| ITextEmbeddingGenerationService | LLamaSharpTextEmbedding | LLamaSharp.semantic-kernel |
| ITextGenerator | LlamaSharpTextGenerator | LLamaSharp.kernel-memory |
| ITextEmbeddingGenerator | LLamaSharpTextEmbeddingGenerator | LLamaSharp.kernel-memory |
Sources: LLama/LLamaSharp.csproj54 LLama.SemanticKernel/LLamaSharp.SemanticKernel.csproj37 LLama.KernelMemory/LLamaSharp.KernelMemory.csproj30
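The embedding-oriented adapters in the table ultimately call into LLamaEmbedder. A direct-use sketch is shown below; note that the GetEmbeddings return shape has changed between releases (a single vector in older versions, a list of vectors in newer ones), so the shape handling here is an assumption, as is the Embeddings context flag (called EmbeddingMode in older versions).

```csharp
using System;
using LLama;
using LLama.Common;

// Sketch: generate an embedding vector directly with LLamaEmbedder.
var parameters = new ModelParams("embedding-model.gguf")
{
    Embeddings = true   // produce embeddings instead of logits (EmbeddingMode in older versions)
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var embedder = new LLamaEmbedder(weights, parameters);

// Assumed return type: a list of float[] vectors (recent releases).
var embeddings = await embedder.GetEmbeddings("LLamaSharp runs GGUF models in .NET.");
Console.WriteLine($"Vectors: {embeddings.Count}, dimensions: {embeddings[0].Length}");
```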
LLamaSharp downloads pre-compiled native binaries from a GitHub release (SciSharp/LLamaSharpBinaries) during the build process. The BinaryReleaseId property in the project file specifies the llama.cpp commit hash used for the current version.
The build system automatically:
- Checks runtimes/release_id.txt for the cached binary version (LLama/LLamaSharp.csproj82-84)
- Downloads deps.zip from https://github.com/SciSharp/LLamaSharpBinaries/releases/download/{BinaryReleaseId}/deps.zip if not cached (LLama/LLamaSharp.csproj69-72)
- Extracts the native libraries (*.dll, *.so, *.dylib, *.metal) to runtimes/deps/ (LLama/LLamaSharp.csproj75-78)

At runtime, NativeLibraryConfig loads the appropriate binary based on platform and available acceleration (CPU, CUDA, Vulkan, Metal).
Sources: LLama/LLamaSharp.csproj60-90
LLamaSharp supports GGUF-formatted models compatible with llama.cpp. Common model families include LLaMA, Mistral, Phi, Gemma, and Qwen.
Models must be in GGUF format. PyTorch (.pth) or Hugging Face (.bin) checkpoints require conversion using llama.cpp's Python scripts (e.g., convert_hf_to_gguf.py). The library recommends quantized models (Q4_0, Q5_K_M, Q8_0) for reduced memory usage with minimal quality loss.
Sources: README.md108-117 README.md242-269
The repository includes a comprehensive examples project demonstrating various usage patterns.
The examples project references all integration packages (LLama.Examples/LLama.Examples.csproj32-37) and includes dependencies for Semantic Kernel, Kernel Memory, and multimodal processing (LLama.Examples/LLama.Examples.csproj17-30).
Sources: LLama.Examples/LLama.Examples.csproj README.md119-174
LLamaSharp version 0.25.0 is built against llama.cpp commit 11dd5a44eb180e1d69fac24d3852b5222d66fb7f. Custom-compiled backends must use this exact commit to ensure ABI compatibility. The version history mapping is maintained in the README (README.md242-269).
Breaking Change Policy: Minor version increments (0.x.0) may introduce breaking API changes. Patch versions (0.x.y) maintain backward compatibility within the same minor version.