Tiny C++ LLM inference implementation from scratch.

Supported models:

- GPT-2
- Llama 3.2
- Qwen 2.5
- Qwen 3
- Mistral

Features:

- Fast BPE tokenizer, inspired by tiktoken
- CPU / CUDA inference
- FP32 / FP16 / BF16 inference
- KV Cache
- Flash Attention via TinyFA
`tinygpt::tokenizer` is faster than both HuggingFace Tokenizers and OpenAI tiktoken. Encoding speed was measured with the `benches/tokenizer.py` script on a machine with an Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz.
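As a rough illustration of the work such a tokenizer does, here is a minimal byte-pair-merge sketch in Python. The greedy lowest-rank merge loop is the standard BPE scheme; the tiny merge table is made up for the example and is not TinyGPT's real vocabulary.

```python
# Minimal BPE encoder sketch: repeatedly merge the adjacent pair with the
# lowest rank in the merge table. Toy ranks, not TinyGPT's actual tables.
MERGE_RANKS = {("t", "h"): 0, ("th", "e"): 1, ("i", "s"): 2}

def bpe_encode(word):
    parts = list(word)
    while len(parts) > 1:
        # Pick the adjacent pair with the best (lowest) merge rank.
        best = min(
            range(len(parts) - 1),
            key=lambda i: MERGE_RANKS.get((parts[i], parts[i + 1]), float("inf")),
        )
        pair = (parts[best], parts[best + 1])
        if pair not in MERGE_RANKS:
            break  # no mergeable pair left
        parts[best:best + 2] = [pair[0] + pair[1]]
    return parts

print(bpe_encode("the"))   # ['the']
print(bpe_encode("this"))  # ['th', 'is']
```

A production tokenizer (like TinyGPT's) adds a pretokenization regex, byte-level fallback, and cached lookups, but the merge loop above is the core idea.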
Roadmap:

- Distributed Inference
- Paged Attention
- Continuous Batching
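The feature list above mentions a KV cache; the toy sketch below shows why it matters for decoding. Without a cache, every step recomputes keys/values for the whole prefix; with one, only the new token's K/V are appended. The projections here are stand-ins, not TinyGPT's model code.

```python
# Toy KV cache: each decode step appends the new token's key/value instead of
# recomputing them for the whole prefix. Projections are illustrative only.
class KVCache:
    def __init__(self):
        self.keys = []    # one entry per prompt/generated token
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)
        return self.keys, self.values

def decode_step(cache, token_embedding):
    # Stand-in projections; a real model applies learned W_k / W_v matrices.
    k = [x * 0.5 for x in token_embedding]
    v = [x * 2.0 for x in token_embedding]
    keys, values = cache.append(k, v)
    # Attention now scores the new query against *all* cached keys, making
    # each step O(sequence_length) instead of recomputing the full prefix.
    return len(keys)

cache = KVCache()
for emb in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    seen = decode_step(cache, emb)
print(seen)  # 3 tokens cached after 3 decode steps
```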
Clone the repository:

```bash
git clone --recurse-submodules https://github.com/keith2018/TinyGPT.git
cd TinyGPT
```

Download model files from HuggingFace:

```bash
git clone https://huggingface.co/openai-community/gpt2
git clone https://huggingface.co/meta-llama/Llama-3.2-1B
git clone https://huggingface.co/meta-llama/Llama-3.2-3B
git clone https://huggingface.co/Qwen/Qwen2.5-0.5B
git clone https://huggingface.co/Qwen/Qwen2.5-3B
git clone https://huggingface.co/Qwen/Qwen3-1.7B
git clone https://huggingface.co/mistralai/Mistral-7B-v0.3
```

Build:

```bash
mkdir build
cmake -B ./build -DCMAKE_BUILD_TYPE=Release
cmake --build ./build --config Release
```

The `examples/` directory contains independent sub-projects that can be built and run separately.
Benchmark the BPE tokenizer encoding speed:

```bash
cd examples/tokenizer/bin
./TinyGPT_example_tokenizer
```

Run model inference with configurable parameters:

```bash
cd examples/inference/bin
./TinyGPT_example_inference --model /path/to/model
```

Available options:
| Option | Default | Description |
|---|---|---|
| `--model <path>` | (required) | Path to HuggingFace model directory |
| `--device <cpu\|cuda>` | `cuda` | Device type |
| `--dtype <fp32\|fp16\|bf16>` | `bf16` | Data type |
| `--max-tokens <n>` | `32` | Max new tokens to generate |
| `--temperature <f>` | `0.8` | Sampling temperature |
| `--top-p <f>` | `0.9` | Top-p (nucleus) sampling |
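To show how `--temperature` and `--top-p` interact, here is a sketch of temperature scaling followed by nucleus (top-p) filtering. These are the standard definitions of the two knobs, not a copy of TinyGPT's sampler.

```python
import math
import random

def sample(logits, temperature=0.8, top_p=0.9, rng=random.Random(0)):
    # Temperature scaling: lower temperature sharpens the distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus (top-p) filtering: keep the smallest set of tokens whose
    # cumulative probability reaches top_p, then sample from that set.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    r = rng.random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

# With one dominant logit and a low temperature, the top token is kept
# alone in the nucleus and is always chosen.
print(sample([10.0, 1.0, 0.5], temperature=0.5, top_p=0.9))  # 0
```

Setting `top_p` close to 1.0 disables the filtering; setting `temperature` close to 0 makes sampling nearly greedy.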
Example output:

```
[INFO] Load model ...
[INFO] Load model done.
[INFO] Generated Outputs:
[INFO] ------------------------------------------------------------
[INFO] Prompt: 'Hello, my name is'
[INFO] Output: ' Max! I am Phelan and I'm the world's greatest magician! ...'
[INFO] ------------------------------------------------------------
[INFO] Prompt: 'The president of the United States is'
[INFO] Output: ' on a temporary trip to Asia, and the Pentagon has made several announcements ...'
[INFO] ------------------------------------------------------------
[INFO] Time cost: 1907 ms, speed: 83.90 token/s
```
TinyGPT includes an OpenAI-compatible API server with a built-in Web UI.

```bash
cd server/bin
./TinyGPT_server --model /path/to/model
```

Available options:
| Option | Default | Description |
|---|---|---|
| `--model <path>` | (required) | Path to HuggingFace model directory |
| `--host <addr>` | `0.0.0.0` | Server host address |
| `--port <port>` | `8080` | Server port |
| `--max-tokens <n>` | `4096` | Max new tokens per request |
| `--temperature <f>` | `0.7` | Sampling temperature |
| `--top-p <f>` | `0.9` | Top-p sampling |
| `--min-p <f>` | `0.0` | Min-p sampling |
| `--chat-template <s>` | auto | Custom chat template (Jinja2 string or file path) |
| `--web-dir <path>` | auto | Path to web UI directory |
The server implements the following OpenAI-compatible endpoints:

- `GET /v1/models` — List available models
- `POST /v1/completions` — Text completions
- `POST /v1/chat/completions` — Chat completions (supports streaming via SSE)
Once the server is running, open http://localhost:8080 in your browser to access the built-in Web UI.
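As a quick way to exercise the chat endpoint from Python, the helper below builds an OpenAI-style chat-completions request. The payload fields follow the OpenAI API shape; the `"tinygpt"` model name is a placeholder, and actually sending the request requires a running `TinyGPT_server`.

```python
import json
import urllib.request

def build_chat_request(host="localhost", port=8080, prompt="Hello!"):
    # OpenAI-style chat completions payload; "model" is a placeholder name.
    url = f"http://{host}:{port}/v1/chat/completions"
    payload = {
        "model": "tinygpt",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
        "temperature": 0.7,
    }
    return url, payload

def send(url, payload):
    # Requires a running TinyGPT_server; not executed here.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

url, payload = build_chat_request(prompt="Hello, TinyGPT!")
print(url)  # http://localhost:8080/v1/chat/completions
```

Because the endpoints are OpenAI-compatible, existing OpenAI client libraries should also work by pointing their base URL at the server.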
TinyGPT also provides a Python binding for the tokenizer:

```python
# pip install .
import tinygpt

enc = tinygpt.Tokenizer()
enc.init_with_config("tokenizer.json", "tokenizer_config.json")
ids = enc.encode("This is a test")
```

Dependencies:

| Library | Purpose |
|---|---|
| TinyTorch | Tensor operations |
| TinyFA | Flash Attention |
| RapidJSON | JSON parsing |
| pcre2 | Regex |
| utf8proc | Unicode |
| ankerl::unordered_dense | HashMap |
| moodycamel::ConcurrentQueue | Concurrent queue |
This code is licensed under the MIT License (see LICENSE).

