TensorRT-LLM is an inference optimization library for Large Language Models on NVIDIA GPUs. The library is architected on PyTorch and provides a Python API (tensorrt_llm.LLM) for single-GPU to multi-node deployments, command-line tools (trtllm-build, trtllm-bench, trtllm-serve), and an OpenAI-compatible HTTP server. The current version is 1.3.0rc7.
The library supports model quantization (FP8, FP4, NVFP4, INT4 AWQ, INT8 SmoothQuant), distributed execution (tensor/pipeline/expert/context parallelism), speculative decoding (Eagle3, MTP, N-Gram), and optimized attention kernels (FlashAttention, PagedAttention, Multi-head Latent Attention).
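As an illustration of the N-Gram variant of speculative decoding, here is a minimal sketch (a hypothetical helper, not the library's implementation): draft tokens are proposed by matching the sequence's most recent n-gram against an earlier occurrence, and the target model then verifies the whole draft in a single forward pass.

```python
def ngram_draft(tokens, n=2, max_draft=4):
    """Propose draft tokens by matching the last n-gram against an earlier
    occurrence in the sequence (the idea behind N-Gram speculation)."""
    if len(tokens) < n:
        return []
    key = tuple(tokens[-n:])
    # Scan backwards for the most recent earlier occurrence of the n-gram.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tuple(tokens[i:i + n]) == key:
            # Propose the tokens that followed it as the draft.
            return tokens[i + n:i + n + max_draft]
    return []

# The last bigram (1, 2) also occurred earlier, followed by 3, 9, 1, 2.
print(ngram_draft([5, 1, 2, 3, 9, 1, 2], n=2))  # → [3, 9, 1, 2]
```

The target model accepts the longest prefix of the draft that matches its own predictions, so a hit yields several tokens for the cost of one verification step.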
This document provides an overview of the system architecture and main components. For detailed subsystem information:
- LLM class: see page 3

Sources: README.md 1-6, README.md 236-243, tensorrt_llm/version.py 15, tensorrt_llm/__init__.py 124
TensorRT-LLM provides the following inference optimization capabilities:
| Capability | Implementation | Key Classes/Functions |
|---|---|---|
| Quantization | Per-tensor and block-wise weight/activation quantization | QuantConfig, QuantAlgo, FP8QDQLinearMethod, NvFP4LinearMethod |
| Distributed Execution | Multi-GPU/multi-node parallelism strategies | Mapping (tp_size, pp_size, ep_size, cp_size), MpiSession |
| Speculative Decoding | Multi-token prediction and verification | SpeculativeDecodingMode, Eagle3, MtpDecoder |
| Attention Backends | Optimized attention kernel implementations | TRTLLMAttention, FlashInferAttention, MultiHeadLatentAttention |
| Kernel Selection | Runtime kernel profiling and selection | AutoTuner, TunableRunner, min-latency mode |
| KV Cache Management | Paged memory allocation with block reuse | KVCacheManager, KVCacheManagerV1, KVCacheManagerV2 |
Sources: README.md 238, tensorrt_llm/__init__.py 122-134
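To make the paged KV cache row in the table concrete, here is a minimal sketch of block-based allocation with reuse. It is a toy stand-in, not the actual KVCacheManager: sequences draw fixed-size blocks from a shared free pool instead of reserving memory for the maximum sequence length up front.

```python
class PagedKVCache:
    """Toy sketch of paged KV-cache allocation with block reuse."""

    def __init__(self, num_blocks, tokens_per_block=64):
        self.tokens_per_block = tokens_per_block
        self.free_blocks = list(range(num_blocks))
        self.block_table = {}  # seq_id -> list of block ids

    def allocate(self, seq_id, num_tokens):
        needed = -(-num_tokens // self.tokens_per_block)  # ceiling division
        if needed > len(self.free_blocks):
            raise MemoryError("KV cache exhausted; request must wait")
        self.block_table[seq_id] = [self.free_blocks.pop() for _ in range(needed)]

    def free(self, seq_id):
        # Finished sequences return their blocks to the pool for reuse.
        self.free_blocks.extend(self.block_table.pop(seq_id))

cache = PagedKVCache(num_blocks=8)
cache.allocate(seq_id=0, num_tokens=150)  # 150 tokens -> 3 blocks of 64
print(len(cache.free_blocks))  # → 5
cache.free(0)
print(len(cache.free_blocks))  # → 8
```

Because allocation is block-granular, a finished or preempted sequence immediately frees capacity for queued requests, which is what enables high batch occupancy.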
High-Level Component Architecture
The architecture consists of five layers:
- tensorrt_llm.commands and Python API classes (LLM, AsyncLLM) in tensorrt_llm.llmapi
- LlmArgs and subclasses (TorchLlmArgs, TrtLlmArgs) define model setup; SamplingParams controls generation
- GenerationExecutorProxy coordinates worker processes; PyExecutor._executor_loop implements the main inference loop; RequestScheduler manages request batching
- ResourceManager allocates KV cache blocks (KVCacheManager) and sequence slots (SeqSlotManager)
- PyTorchModelEngine executes DecoderModel.forward(); Sampler generates tokens

Sources: tensorrt_llm/__init__.py 124-134, README.md 236-243
Python API Entry Point
The LLM class in tensorrt_llm.llmapi.llm is the primary programmatic interface. Users call LLM(model='path') to initialize, then generate() for synchronous inference or generate_async() for asynchronous streaming. The preprocess() method tokenizes prompts without running inference.
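A minimal usage sketch, assuming an installed tensorrt_llm and a GPU; the model name is a placeholder for any supported Hugging Face checkpoint, and the sampling values are illustrative:

```python
from tensorrt_llm import LLM, SamplingParams

# Placeholder checkpoint; any model supported by TensorRT-LLM works here.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

# Synchronous batch inference; generate_async() instead yields results
# as they stream in.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```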
Command-Line Tools
TensorRT-LLM ships three command-line tools in tensorrt_llm.commands:
- trtllm-build: converts checkpoints to TensorRT engines; outputs config.json and model weights
- trtllm-bench: measures throughput/latency; generates CSV and JSON reports
- trtllm-serve: launches a FastAPI server with OpenAI-compatible endpoints at /v1/completions and /v1/chat/completions

Sources: tensorrt_llm/__init__.py 124, README.md 247-255
TensorRT-LLM implements model architectures in tensorrt_llm._torch.models:
| Model Family | Implementation File | Architecture Features |
|---|---|---|
| Llama | modeling_llama.py | RoPE, GQA, SwiGLU MLP |
| DeepSeek | modeling_deepseekv3.py | Multi-head Latent Attention (MLA), MoE with DeepEP |
| Qwen | modeling_qwen.py | MoE, supports Eagle3 speculative decoding |
| GPT-OSS | modeling_gptoss.py | W4AFP8 quantization, Harmony adapter |
| Mixtral | modeling_mixtral.py | Sparse MoE with expert parallelism |
| Gemma | modeling_gemma.py | Multi-query attention (MQA) |
| Mistral | modeling_mistral.py | Sliding window attention |
| Mamba | modeling_mamba.py | State-space model (SSM) architecture |
Models inherit from DecoderModel and implement architecture-specific attention mechanisms, MLP layers, and weight loading strategies. Each model class defines a forward() method and integrates with quantization methods and distributed execution patterns.
Sources: README.md 250, tensorrt_llm/__init__.py 107
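As a concrete example of one architecture feature from the table, grouped-query attention (GQA, used by Llama) shares each KV head across a group of query heads, shrinking the KV cache. A small sketch of the head mapping (illustrative, not library code):

```python
def kv_head_index(q_head, num_q_heads, num_kv_heads):
    """In grouped-query attention, consecutive groups of query heads share
    one KV head, cutting KV-cache size by num_q_heads / num_kv_heads."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

# Llama-3-8B-style config: 32 query heads share 8 KV heads (groups of 4).
print([kv_head_index(q, 32, 8) for q in range(8)])  # → [0, 0, 0, 0, 1, 1, 1, 1]
```

MQA (as in Gemma's table entry) is the extreme case of the same mapping with num_kv_heads = 1.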
Quantization Configuration and Algorithms
Quantization in TensorRT-LLM uses a strategy pattern with QuantConfig specifying the algorithm (QuantAlgo enum) and LinearMethodBase subclasses implementing quantization/dequantization. Supported algorithms:
- FP8 per-tensor quantization (FP8QDQLinearMethod)
- NVFP4 block-wise quantization (NvFP4LinearMethod)
- INT4 AWQ weight-only quantization (AWQLinearMethod)

KV cache quantization is configured via the kv_cache_dtype parameter and supports FP8, NVFP4, and INT8 formats.
Sources: README.md 238, tensorrt_llm/__init__.py 111
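The per-tensor scheme can be sketched in a few lines. This is a float simulation of the idea only; real FP8 execution additionally rounds values to the E4M3 grid, whose maximum representable magnitude is 448:

```python
def quantize_per_tensor(x, fmt_max=448.0):
    """Per-tensor symmetric quantization sketch: divide by one shared scale
    so the largest value lands at the format's maximum magnitude."""
    amax = max(abs(v) for v in x)
    scale = amax / fmt_max
    q = [v / scale for v in x]  # on device this would be cast to FP8
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

x = [0.5, -3.0, 896.0]
q, scale = quantize_per_tensor(x)        # scale = 896 / 448 = 2.0
assert max(abs(v) for v in q) <= 448.0   # all values fit the format range
print(dequantize(q, scale))  # → [0.5, -3.0, 896.0] (exact here; real FP8 rounds)
```

Block-wise schemes such as NVFP4 apply the same recipe per small block of elements, trading extra scale storage for tighter error bounds.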
Parallelism Configuration
Distributed execution is configured via the Mapping class with five parallelism dimensions:
- Tensor parallelism (tp_size): shards weights/activations horizontally; uses AllReduce for reductions
- Pipeline parallelism (pp_size): shards layers vertically; uses Send/Recv for activation passing
- Expert parallelism (ep_size): shards MoE experts; uses AlltoAll for token routing (see page 9 for MoE details)
- Context parallelism (cp_size): splits the sequence length; uses ring/helix attention patterns
- Data parallelism (dp_size): runs independent model replicas for throughput scaling

Process management uses MpiSession, which either spawns workers (MpiPoolSession) or connects to an existing MPI communicator (MpiCommSession).
Sources: tensorrt_llm/__init__.py 127, README.md 240
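A simplified sketch of how a global rank could decompose into parallelism coordinates. The TP-fastest layout below is an assumption for illustration, not necessarily Mapping's actual convention:

```python
def decompose_rank(rank, tp_size, pp_size):
    """Map a global rank to (tp_rank, pp_rank), assuming ranks are laid out
    TP-fastest: rank = pp_rank * tp_size + tp_rank."""
    tp_rank = rank % tp_size
    pp_rank = rank // tp_size
    return tp_rank, pp_rank

# 8 GPUs as tp_size=4, pp_size=2: ranks 0-3 form pipeline stage 0.
print([decompose_rank(r, 4, 2) for r in range(8)])
# → [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1), (2, 1), (3, 1)]
```

Keeping tensor-parallel peers adjacent matters in practice, since AllReduce traffic within a TP group is far heavier than the Send/Recv traffic between pipeline stages.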
Test Organization
The testing framework (see page 5 for details) consists of:
- Test modules: test_llm_api_pytorch.py (accuracy tests), test_e2e.py (integration tests), test_perf.py (performance tests)
- Test lists in tests/integration/test_lists/test-db/ define GPU-specific test suites (H100, B200, DGX H100)
- Reference files in tests/integration/defs/accuracy/references/ contain expected accuracy metrics for benchmarks (GSM8K, MMLU, CNN/DailyMail)
- waives.txt tracks known issues and temporarily skipped tests

Tests are executed via pytest with GPU-specific selection based on test database conditions.
Sources: README.md 254, constraints.txt 1-3
PyExecutor Inference Pipeline
The inference pipeline for a request in PyExecutor:
1. The client submits an ExecutorRequest to the ExecutorRequestQueue.
2. PyExecutor._executor_loop() continuously processes requests.
3. RequestScheduler.schedule() selects requests that fit in the batch.
4. ResourceManager.allocate() assigns KV cache blocks via KVCacheManager.allocate_blocks().
5. PyTorchModelEngine.execute() runs DecoderModel.forward().
6. Sampler.sample() generates next tokens using the configured strategy.
7. update_sequences() appends tokens and updates sequence state.
8. Results are returned to the client as RequestOutput.

See page 7 for detailed PyExecutor documentation.
Sources: README.md 240, tensorrt_llm/__init__.py 124
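The pipeline above can be sketched as a simplified continuous-batching loop, with stub callables standing in for the real RequestScheduler, ResourceManager, model engine, and Sampler:

```python
from collections import deque

def executor_loop(request_queue, scheduler, resources, engine, sampler):
    """Toy continuous-batching loop: each iteration admits requests that fit,
    runs one forward step for the batch, samples one token per sequence,
    and retires finished requests so their resources are reused."""
    active, finished = [], []
    while request_queue or active:
        # Admit queued requests while the scheduler says the batch has room.
        while request_queue and scheduler(active):
            req = request_queue.popleft()
            resources[req["id"]] = []        # stand-in for KV block allocation
            active.append(req)
        logits = engine(active)              # one forward step for the batch
        for req, token in zip(active, sampler(logits)):
            req["tokens"].append(token)
        still = []
        for req in active:
            if len(req["tokens"]) >= req["max_tokens"]:
                resources.pop(req["id"])     # return KV blocks to the pool
                finished.append(req)
            else:
                still.append(req)
        active = still
    return finished

queue = deque({"id": i, "tokens": [], "max_tokens": 3} for i in range(4))
done = executor_loop(
    queue,
    scheduler=lambda active: len(active) < 2,   # max batch size 2
    resources={},
    engine=lambda batch: [None] * len(batch),   # dummy logits
    sampler=lambda logits: [42] * len(logits),  # dummy greedy sampler
)
print(len(done), done[0]["tokens"])  # → 4 [42, 42, 42]
```

The key property this sketch preserves is that scheduling happens every iteration, so new requests join and finished ones leave mid-stream rather than waiting for a whole batch to drain.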
For typical usage, the workflow is:
1. Install TensorRT-LLM: see Installation and Dependencies for setup instructions.
2. Load a model and run inference through the Python API, as described under Python API Entry Point above.

For detailed configuration options and advanced features, refer to the specific subsystem documentation linked at the top of this page.
Sources: tests/integration/defs/accuracy/test_llm_api_pytorch.py 78-169, tests/integration/defs/test_e2e.py 446-608, README.md 1-100