TensorRT-LLM is an inference optimization library for Large Language Models on NVIDIA GPUs. The library is architected on PyTorch and provides a Python API (tensorrt_llm.LLM) for single-GPU to multi-node deployments, command-line tools (trtllm-build, trtllm-bench, trtllm-serve), and an OpenAI-compatible HTTP server. The current version is 1.3.0rc7.
The library supports model quantization (FP8, FP4, NVFP4, INT4 AWQ, INT8 SmoothQuant), distributed execution (tensor/pipeline/expert/context parallelism), speculative decoding (Eagle3, MTP, N-Gram), and optimized attention kernels (FlashAttention, PagedAttention, Multi-head Latent Attention).
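As an illustration of the N-Gram variant of speculative decoding, here is a minimal sketch (a hypothetical helper, not the library's implementation): draft tokens are proposed by matching the sequence's most recent n-gram against an earlier occurrence, and the target model then verifies the whole draft in a single forward pass.

```python
def ngram_draft(tokens, n=2, max_draft=4):
    """Propose draft tokens by matching the last n-gram against an earlier
    occurrence in the sequence (the idea behind N-Gram speculation)."""
    if len(tokens) < n:
        return []
    key = tuple(tokens[-n:])
    # Scan backwards for the most recent earlier occurrence of the n-gram.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tuple(tokens[i:i + n]) == key:
            # Propose the tokens that followed it as the draft.
            return tokens[i + n:i + n + max_draft]
    return []

# The last bigram (1, 2) also occurred earlier, followed by 3, 9, 1, 2.
print(ngram_draft([5, 1, 2, 3, 9, 1, 2], n=2))  # → [3, 9, 1, 2]
```

The target model accepts the longest prefix of the draft that matches its own predictions, so a hit yields several tokens for the cost of one verification step.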
This document provides an overview of the system architecture and main components. For detailed subsystem information:
- LLM class: see page 3

Sources: README.md 1-6, README.md 236-243, tensorrt_llm/version.py 15, tensorrt_llm/__init__.py 124
TensorRT-LLM provides the following inference optimization capabilities:
| Capability | Implementation | Key Classes/Functions |
|---|---|---|
| Quantization | Per-tensor and block-wise weight/activation quantization | QuantConfig, QuantAlgo, FP8QDQLinearMethod, NvFP4LinearMethod |
| Distributed Execution | Multi-GPU/multi-node parallelism strategies | Mapping (tp_size, pp_size, ep_size, cp_size), MpiSession |
| Speculative Decoding | Multi-token prediction and verification | SpeculativeDecodingMode, Eagle3, MtpDecoder |
| Attention Backends | Optimized attention kernel implementations | TRTLLMAttention, FlashInferAttention, MultiHeadLatentAttention |
| Kernel Selection | Runtime kernel profiling and selection | AutoTuner, TunableRunner, min-latency mode |
| KV Cache Management | Paged memory allocation with block reuse | KVCacheManager, KVCacheManagerV1, KVCacheManagerV2 |
Sources: README.md 238, tensorrt_llm/__init__.py 122-134
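To make the paged KV cache row in the table concrete, here is a minimal sketch of block-based allocation with reuse. It is a toy stand-in, not the actual KVCacheManager: sequences draw fixed-size blocks from a shared free pool instead of reserving memory for the maximum sequence length up front.

```python
class PagedKVCache:
    """Toy sketch of paged KV-cache allocation with block reuse."""

    def __init__(self, num_blocks, tokens_per_block=64):
        self.tokens_per_block = tokens_per_block
        self.free_blocks = list(range(num_blocks))
        self.block_table = {}  # seq_id -> list of block ids

    def allocate(self, seq_id, num_tokens):
        needed = -(-num_tokens // self.tokens_per_block)  # ceiling division
        if needed > len(self.free_blocks):
            raise MemoryError("KV cache exhausted; request must wait")
        self.block_table[seq_id] = [self.free_blocks.pop() for _ in range(needed)]

    def free(self, seq_id):
        # Finished sequences return their blocks to the pool for reuse.
        self.free_blocks.extend(self.block_table.pop(seq_id))

cache = PagedKVCache(num_blocks=8)
cache.allocate(seq_id=0, num_tokens=150)  # 150 tokens -> 3 blocks of 64
print(len(cache.free_blocks))  # → 5
cache.free(0)
print(len(cache.free_blocks))  # → 8
```

Because allocation is block-granular, a finished or preempted sequence immediately frees capacity for queued requests, which is what enables high batch occupancy.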
High-Level Component Architecture
The architecture consists of five layers:
- tensorrt_llm.commands and Python API classes (LLM, AsyncLLM) in tensorrt_llm.llmapi
- LlmArgs and subclasses (TorchLlmArgs, TrtLlmArgs) define model setup; SamplingParams controls generation
- GenerationExecutorProxy coordinates worker processes; PyExecutor._executor_loop implements the main inference loop; RequestScheduler manages request batching
- ResourceManager allocates KV cache blocks (KVCacheManager) and sequence slots (SeqSlotManager)
- PyTorchModelEngine executes DecoderModel.forward(); Sampler generates tokens

Sources: tensorrt_llm/__init__.py 124-134, README.md 236-243
Python API Entry Point
The LLM class in tensorrt_llm.llmapi.llm is the primary programmatic interface. Users call LLM(model='path') to initialize, then generate() for synchronous inference or generate_async() for asynchronous streaming. The preprocess() method tokenizes prompts without running inference.
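A minimal usage sketch, assuming an installed tensorrt_llm and a GPU; the model name is a placeholder for any supported Hugging Face checkpoint, and the sampling values are illustrative:

```python
from tensorrt_llm import LLM, SamplingParams

# Placeholder checkpoint; any model supported by TensorRT-LLM works here.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

# Synchronous batch inference; generate_async() instead yields results
# as they stream in.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```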
Command-Line Tools
TensorRT-LLM ships three command-line tools in tensorrt_llm.commands:
- trtllm-build: converts checkpoints to TensorRT engines; outputs config.json and model weights
- trtllm-bench: measures throughput/latency; generates CSV and JSON reports
- trtllm-serve: launches a FastAPI server with OpenAI-compatible endpoints at /v1/completions and /v1/chat/completions

Sources: tensorrt_llm/__init__.py 124, README.md 247-255
TensorRT-LLM implements model architectures in tensorrt_llm._torch.models:
| Model Family | Implementation File | Architecture Features |
|---|---|---|
| Llama | modeling_llama.py | RoPE, GQA, SwiGLU MLP |
| DeepSeek | modeling_deepseekv3.py | Multi-head Latent Attention (MLA), MoE with DeepEP |
| Qwen | modeling_qwen.py | MoE, supports Eagle3 speculative decoding |
| GPT-OSS | modeling_gptoss.py | W4AFP8 quantization, Harmony adapter |
| Mixtral | modeling_mixtral.py | Sparse MoE with expert parallelism |
| Gemma | modeling_gemma.py | Multi-query attention (MQA) |
| Mistral | modeling_mistral.py | Sliding window attention |
| Mamba | modeling_mamba.py | State-space model (SSM) architecture |
Models inherit from DecoderModel and implement architecture-specific attention mechanisms, MLP layers, and weight loading strategies. Each model class defines a forward() method and integrates with quantization methods and distributed execution patterns.
Sources: README.md 250, tensorrt_llm/__init__.py 107
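As a concrete example of one architecture feature from the table, grouped-query attention (GQA, used by Llama) shares each KV head across a group of query heads, shrinking the KV cache. A small sketch of the head mapping (illustrative, not library code):

```python
def kv_head_index(q_head, num_q_heads, num_kv_heads):
    """In grouped-query attention, consecutive groups of query heads share
    one KV head, cutting KV-cache size by num_q_heads / num_kv_heads."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

# Llama-3-8B-style config: 32 query heads share 8 KV heads (groups of 4).
print([kv_head_index(q, 32, 8) for q in range(8)])  # → [0, 0, 0, 0, 1, 1, 1, 1]
```

MQA (as in Gemma's table entry) is the extreme case of the same mapping with num_kv_heads = 1.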
Quantization Configuration and Algorithms
Quantization in TensorRT-LLM uses a strategy pattern with QuantConfig specifying the algorithm (QuantAlgo enum) and LinearMethodBase subclasses implementing quantization/dequantization. Supported algorithms:
- FP8 per-tensor quantization (FP8QDQLinearMethod)
- NVFP4 block-wise quantization (NvFP4LinearMethod)
- INT4 AWQ weight-only quantization (AWQLinearMethod)

KV cache quantization is configured via the kv_cache_dtype parameter and supports FP8, NVFP4, and INT8 formats.
Sources: README.md 238, tensorrt_llm/__init__.py 111
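The per-tensor scheme can be sketched in a few lines. This is a float simulation of the idea only; real FP8 execution additionally rounds values to the E4M3 grid, whose maximum representable magnitude is 448:

```python
def quantize_per_tensor(x, fmt_max=448.0):
    """Per-tensor symmetric quantization sketch: divide by one shared scale
    so the largest value lands at the format's maximum magnitude."""
    amax = max(abs(v) for v in x)
    scale = amax / fmt_max
    q = [v / scale for v in x]  # on device this would be cast to FP8
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

x = [0.5, -3.0, 896.0]
q, scale = quantize_per_tensor(x)        # scale = 896 / 448 = 2.0
assert max(abs(v) for v in q) <= 448.0   # all values fit the format range
print(dequantize(q, scale))  # → [0.5, -3.0, 896.0] (exact here; real FP8 rounds)
```

Block-wise schemes such as NVFP4 apply the same recipe per small block of elements, trading extra scale storage for tighter error bounds.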
Parallelism Configuration
Distributed execution is configured via the Mapping class with five parallelism dimensions:
- Tensor parallelism (tp_size): shards weights/activations horizontally; uses AllReduce for reductions
- Pipeline parallelism (pp_size): shards layers vertically; uses Send/Recv for activation passing
- Expert parallelism (ep_size): shards MoE experts; uses AlltoAll for token routing (see page 9 for MoE details)
- Context parallelism (cp_size): splits the sequence length; uses ring/helix attention patterns
- Data parallelism (dp_size): runs independent model replicas for throughput scaling

Process management uses MpiSession, which either spawns workers (MpiPoolSession) or connects to an existing MPI communicator (MpiCommSession).
Sources: tensorrt_llm/__init__.py 127, README.md 240
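A simplified sketch of how a global rank could decompose into parallelism coordinates. The TP-fastest layout below is an assumption for illustration, not necessarily Mapping's actual convention:

```python
def decompose_rank(rank, tp_size, pp_size):
    """Map a global rank to (tp_rank, pp_rank), assuming ranks are laid out
    TP-fastest: rank = pp_rank * tp_size + tp_rank."""
    tp_rank = rank % tp_size
    pp_rank = rank // tp_size
    return tp_rank, pp_rank

# 8 GPUs as tp_size=4, pp_size=2: ranks 0-3 form pipeline stage 0.
print([decompose_rank(r, 4, 2) for r in range(8)])
# → [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1), (2, 1), (3, 1)]
```

Keeping tensor-parallel peers adjacent matters in practice, since AllReduce traffic within a TP group is far heavier than the Send/Recv traffic between pipeline stages.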
Test Organization
The testing framework (see page 5 for details) consists of:
- Test modules: test_llm_api_pytorch.py (accuracy tests), test_e2e.py (integration tests), test_perf.py (performance tests)
- Test lists in tests/integration/test_lists/test-db/ define GPU-specific test suites (H100, B200, DGX H100)
- Reference files in tests/integration/defs/accuracy/references/ contain expected accuracy metrics for benchmarks (GSM8K, MMLU, CNN/DailyMail)
- waives.txt tracks known issues and temporarily skipped tests

Tests are executed via pytest with GPU-specific selection based on test database conditions.
Sources: README.md 254, constraints.txt 1-3
PyExecutor Inference Pipeline
The inference pipeline for a request in PyExecutor:
1. The client submits an ExecutorRequest to the ExecutorRequestQueue.
2. PyExecutor._executor_loop() continuously processes requests.
3. RequestScheduler.schedule() selects requests that fit in the batch.
4. ResourceManager.allocate() assigns KV cache blocks via KVCacheManager.allocate_blocks().
5. PyTorchModelEngine.execute() runs DecoderModel.forward().
6. Sampler.sample() generates next tokens using the configured strategy.
7. update_sequences() appends tokens and updates sequence state.
8. Results are returned to the client as RequestOutput.

See page 7 for detailed PyExecutor documentation.
Sources: README.md 240, tensorrt_llm/__init__.py 124
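The pipeline above can be sketched as a simplified continuous-batching loop, with stub callables standing in for the real RequestScheduler, ResourceManager, model engine, and Sampler:

```python
from collections import deque

def executor_loop(request_queue, scheduler, resources, engine, sampler):
    """Toy continuous-batching loop: each iteration admits requests that fit,
    runs one forward step for the batch, samples one token per sequence,
    and retires finished requests so their resources are reused."""
    active, finished = [], []
    while request_queue or active:
        # Admit queued requests while the scheduler says the batch has room.
        while request_queue and scheduler(active):
            req = request_queue.popleft()
            resources[req["id"]] = []        # stand-in for KV block allocation
            active.append(req)
        logits = engine(active)              # one forward step for the batch
        for req, token in zip(active, sampler(logits)):
            req["tokens"].append(token)
        still = []
        for req in active:
            if len(req["tokens"]) >= req["max_tokens"]:
                resources.pop(req["id"])     # return KV blocks to the pool
                finished.append(req)
            else:
                still.append(req)
        active = still
    return finished

queue = deque({"id": i, "tokens": [], "max_tokens": 3} for i in range(4))
done = executor_loop(
    queue,
    scheduler=lambda active: len(active) < 2,   # max batch size 2
    resources={},
    engine=lambda batch: [None] * len(batch),   # dummy logits
    sampler=lambda logits: [42] * len(logits),  # dummy greedy sampler
)
print(len(done), done[0]["tokens"])  # → 4 [42, 42, 42]
```

The key property this sketch preserves is that scheduling happens every iteration, so new requests join and finished ones leave mid-stream rather than waiting for a whole batch to drain.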
For typical usage, the workflow is:
1. Install TensorRT-LLM: see Installation and Dependencies for setup instructions.
2. Load a model and run inference through the Python API, as described under Python API Entry Point above.

For detailed configuration options and advanced features, refer to the specific subsystem documentation linked at the top of this page.
Sources: tests/integration/defs/accuracy/test_llm_api_pytorch.py 78-169, tests/integration/defs/test_e2e.py 446-608, README.md 1-100