Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching
Large language models (LLMs) play an increasingly important role in a wide range of information processing and management tasks in industry. Many of these tasks are performed in large batches or even offline, where the key performance indicator is throughput. These workloads frequently exhibit prefix sharing, where different prompt inputs partially share a common prefix.
BatchLLM achieves:
- Global prefix identification: Explicitly identifies common prefixes across all requests globally. Requests sharing the same prefix are scheduled together to maximize KV context reuse.
- Elimination of redundant computation in inference: Using the prefix-sharing groups, BatchLLM first prefills and caches the KV states of the common prefix (as a single request, without any decoding step), then generates tokens for the remaining non-shared contexts, avoiding redundant computation in both the attention and non-attention parts of the model (illustrated in the sketch below).
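The following is a minimal, illustrative sketch of the grouping idea, not BatchLLM's actual implementation: prompts that share a prefix are bucketed together so the prefix only needs to be prefilled once per group. The fixed-length key and whitespace tokenization are simplifying assumptions.

```python
from collections import defaultdict

def group_by_shared_prefix(prompts, prefix_len=8):
    """Toy grouping: bucket prompts by their first `prefix_len` whitespace
    tokens. BatchLLM identifies shared prefixes globally (prefix trees + DP);
    this only illustrates what a prefix-sharing group is."""
    groups = defaultdict(list)
    for prompt in prompts:
        tokens = prompt.split()                 # stand-in for real tokenization
        groups[tuple(tokens[:prefix_len])].append(tokens[prefix_len:])
    return groups

if __name__ == "__main__":
    instruction = "Summarize the following customer review in one short sentence :"
    prompts = [f"{instruction} review {i} text ..." for i in range(4)]
    for prefix, suffixes in group_by_shared_prefix(prompts).items():
        # Conceptually: prefill the shared prefix once as a single request (no
        # decoding), cache its KV states, then decode each distinct suffix
        # against that cached context.
        print(f"prefill once: {' '.join(prefix)!r}  ->  {len(suffixes)} suffixes")
```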
Paper: [BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching]
Authors: Zhen Zheng, Xin Ji, Taosong Fang, Fanghao Zhou, Chuanjie Liu, Gang Peng
```
┌────────────────────────────┐
│ Prompt Inputs │
└────────────┬───────────────┘
│
┌────────────▼───────────────┐
│ Prefix Clustering │
│ (Global prefix detection │
│ via dynamic programming) │
└────────────┬───────────────┘
│
┌────────────▼───────────────┐
│ Group-based Scheduling │
│ (Decoding-first reorder + │
│ Resource-aware batching) │
└────────────┬───────────────┘
│
┌─────────────────┼─────────────────┐
│ │ │
┌────────▼──────┐ ┌──────▼───────┐ ┌───────▼──────┐
│ Common Prefix │ │ Distinct │ │ Reduction │
│ Attention │ │ Attention │ │ Kernel │
└───────────────┘ └──────────────┘ └──────────────┘
```
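In this design, attention over the shared prefix and attention over each request's distinct context are computed by separate kernels and then combined by a reduction. One standard way to combine partial attention results over disjoint KV segments is log-sum-exp reweighting (as used in split-KV / cascade-attention schemes). The NumPy sketch below demonstrates that identity; it is an assumption about how such a reduction can work, not BatchLLM's actual Triton kernel.

```python
import numpy as np

def attn_partial(q, k, v):
    """Attention over one KV segment, returning the output plus the
    log-sum-exp (LSE) of the scores, which is needed to merge segments."""
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (n_q, n_kv)
    lse = np.log(np.sum(np.exp(scores), axis=-1))    # (n_q,)
    out = np.exp(scores - lse[:, None]) @ v          # softmax(scores) @ v
    return out, lse

def merge(out_a, lse_a, out_b, lse_b):
    """Reduction: combine attention over two disjoint KV segments
    (e.g. shared prefix vs. distinct context) into attention over their union."""
    lse = np.logaddexp(lse_a, lse_b)
    w_a = np.exp(lse_a - lse)[:, None]
    w_b = np.exp(lse_b - lse)[:, None]
    return w_a * out_a + w_b * out_b, lse

rng = np.random.default_rng(0)
q = rng.standard_normal((2, 8))
k_prefix, v_prefix = rng.standard_normal((16, 8)), rng.standard_normal((16, 8))
k_suffix, v_suffix = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))

out_full, _ = attn_partial(q, np.vstack([k_prefix, k_suffix]),
                              np.vstack([v_prefix, v_suffix]))
out_merged, _ = merge(*attn_partial(q, k_prefix, v_prefix),
                      *attn_partial(q, k_suffix, v_suffix))
assert np.allclose(out_full, out_merged)             # merged result matches full attention
```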
| Component | Path | Description |
|---|---|---|
| BatchLLM Backend | vllm/attention/backends/batch_llm.py | Core attention backend with context-sharing metadata |
| Prefix-shared Kernel | vllm/attention/ops/ds_attn_*.py | Triton JIT-compiled kernels with horizontal fusion |
| Prefix Clustering | vllm/inputs/prefix_clustering.py | Global prefix detection using prefix trees and DP |
| CS Group Manager | vllm/core/csgroup_manager.py | Context-sharing group lifecycle management |
| Token Batching Scheduler | vllm/core/scheduler.py | Resource-aware scheduler with decoding-first ordering |
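The Prefix Clustering step detects shared prefixes globally before scheduling. As a rough illustration (hypothetical names, and omitting the cost-model dynamic programming that decides which prefixes are worth grouping), a prefix tree over tokenized prompts can expose maximal shared prefixes like this:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # token -> TrieNode
        self.count = 0       # number of prompts passing through this node

def build_trie(token_seqs):
    """Insert every tokenized prompt into a prefix tree."""
    root = TrieNode()
    for seq in token_seqs:
        node = root
        for tok in seq:
            node = node.children.setdefault(tok, TrieNode())
            node.count += 1
    return root

def shared_prefixes(node, prefix=(), min_count=2):
    """Yield (prefix, count) for the deepest prefixes shared by at least
    `min_count` prompts. A real clustering pass would additionally trade off
    prefix length against group size (the DP step) and emit the groups."""
    deeper = False
    for tok, child in node.children.items():
        if child.count >= min_count:
            deeper = True
            yield from shared_prefixes(child, prefix + (tok,), min_count)
    if not deeper and node.count >= min_count:
        yield prefix, node.count

prompts = [
    ["<sys>", "classify", "sentiment", "review", "A"],
    ["<sys>", "classify", "sentiment", "review", "B"],
    ["<sys>", "translate", "to", "french", "C"],
]
print(list(shared_prefixes(build_trie(prompts))))
# [(('<sys>', 'classify', 'sentiment', 'review'), 2)]
```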
- NVIDIA GPU (CUDA)
- Docker
Step 1: Pull the official vLLM 0.6.4 Docker image:
```
docker pull vllm/vllm-openai:v0.6.4
```
Image available at: https://hub.docker.com/r/vllm/vllm-openai/tags?name=0.6.4
Step 2: Start a container from the image:
```
docker run -itd --shm-size 32g --gpus all -v [your mount path] --ipc=host --ulimit nofile=65536:65536 --ulimit memlock=-1 --ulimit stack=67108864 --privileged --name vllm_064_retest --entrypoint="" vllm/vllm-openai:v0.6.4 sleep infinity
```
Step 3: Inside the container, clone this repository and replace the official vLLM with the BatchLLM branch:
```
git clone https://github.com/microsoft/MixLLM.git -b batchllm_vllm_064 batchllm_vllm_064
cd batchllm_vllm_064
pip install -e .
```
Step 4: Run the test:
```
python3 test_context_share_064.py
```
| Variable | Description |
|---|---|
| VLLM_USE_BATCHLLM | Set to 1 to enable the Triton kernels for BatchLLM in global prefix sharing scenarios |
| DISABLE_CS_KERNELS | Set to 1 to disable the context-sharing kernels (falls back to FlashAttention) |
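As an illustration, the variables can be exported in the shell before Step 4, or set from Python before the engine is created:

```python
import os

# Enable the BatchLLM Triton kernels for global prefix sharing.
os.environ["VLLM_USE_BATCHLLM"] = "1"

# Optionally disable the context-sharing kernels (falls back to FlashAttention).
# os.environ["DISABLE_CS_KERNELS"] = "1"
```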
| Argument | Type | Default | Description |
|---|---|---|---|
| --enable-ahead-of-prefix-clustering | bool | False | Enable global prefix clustering as a preprocessing step |
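Stock vLLM exposes its engine command-line flags as EngineArgs/LLM keyword arguments (hyphens become underscores). Assuming this fork follows the same convention, a hypothetical offline-batch run could look like the sketch below; the keyword argument and model name are illustrative assumptions.

```python
from vllm import LLM, SamplingParams

# Hypothetical kwarg: assumes --enable-ahead-of-prefix-clustering maps to an
# EngineArgs/LLM keyword argument, as CLI flags do in stock vLLM.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_ahead_of_prefix_clustering=True,
)

shared_prefix = "You are a support agent. Summarize the ticket below in one sentence.\n\n"
prompts = [shared_prefix + f"Ticket {i}: ..." for i in range(100)]

outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
for out in outputs[:3]:
    print(out.outputs[0].text)
```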
This is NOT an official Microsoft product. This repository provides research and prototype code for academic and experimental purposes only. APIs, internal abstractions, and performance characteristics may change.
This project is released under the Apache License, Version 2.0. See the LICENSE file for full details.
To be added once the final version of the paper is ready.
For questions related to the paper or this research prototype, please open an issue or contact the authors listed in the paper.