Structural Conventions: Why Claude Loves XML and GPT Loves JSON
Different models respond better to different formatting conventions. Here's what actually works across Claude, GPT, Llama, and others.
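As a minimal illustration of the two conventions (the section names `document` and `instructions` are made up for the example), the same content can be delimited with XML-style tags, commonly recommended for Claude, or serialized as JSON, common with GPT tooling:

```python
import json

document = "Quarterly revenue rose 12%."
instructions = "Summarize in one sentence."

# XML-style tags: delimit sections with named tags in plain text.
xml_prompt = (
    f"<document>\n{document}\n</document>\n"
    f"<instructions>\n{instructions}\n</instructions>"
)

# JSON-style: the same fields as a serialized object.
json_prompt = json.dumps(
    {"document": document, "instructions": instructions}, indent=2
)

print(xml_prompt)
print(json_prompt)
```

Either way, the point is explicit, unambiguous boundaries between content and instructions.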
The same prompt behaves differently across models. The reason isn't the model weights - it's the chat template. Here's what you need to know.
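A sketch of the point, using simplified ChatML-style and Llama-3-style templates (special tokens abbreviated; real templates also handle BOS tokens, default system prompts, and tool calls): the same messages serialize to very different strings.

```python
messages = [
    {"role": "system", "content": "You are terse."},
    {"role": "user", "content": "Hi"},
]

def render_chatml(msgs):
    # ChatML-style template (used by many GPT-family and Qwen models).
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in msgs
    ) + "<|im_start|>assistant\n"

def render_llama3(msgs):
    # Llama-3-style header tokens; simplified, omitting BOS handling.
    return "".join(
        f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        for m in msgs
    ) + "<|start_header_id|>assistant<|end_header_id|>\n\n"

print(render_chatml(messages))
print(render_llama3(messages))
```

A model fine-tuned on one serialization sees out-of-distribution input when served with the other.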
Static batching wastes GPU cycles waiting for the slowest request. Continuous batching fixes this by adding and removing requests at every iteration. Here's how it works.
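The idea can be sketched with a toy scheduler (request "lengths" here are just remaining decode steps; real servers also track KV-cache memory when admitting requests):

```python
from collections import deque

def continuous_batching(requests, max_batch):
    """Toy simulation: each request is its number of remaining decode steps.
    Finished requests leave the batch and queued ones join at every
    iteration, so no slot idles waiting for the longest request."""
    queue = deque(requests)
    batch = {}  # request id -> decode steps remaining
    next_id = 0
    iterations = 0
    finished = []
    while queue or batch:
        # Backfill free slots immediately (the difference from static batching).
        while queue and len(batch) < max_batch:
            batch[next_id] = queue.popleft()
            next_id += 1
        # One decode iteration advances every active request by one token.
        for rid in list(batch):
            batch[rid] -= 1
            if batch[rid] == 0:
                finished.append(rid)
                del batch[rid]
        iterations += 1
    return iterations, finished

# Static batching in groups [4,1], [1,1], [1] would take 4+1+1 = 6 iterations;
# continuous batching finishes the same work in 4.
print(continuous_batching([4, 1, 1, 1, 1], max_batch=2))
```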
Quantization cuts memory and speeds up inference. But naive 8-bit quantization breaks at 6.7B+ parameters. Here's why, and how modern methods fix it.
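A toy illustration of the failure mode, using naive per-tensor absmax int8 (real methods such as LLM.int8() avoid it by keeping outlier dimensions in higher precision):

```python
def quantize_absmax_int8(xs):
    """Naive absmax quantization: scale so the largest |x| maps to 127."""
    scale = max(abs(x) for x in xs) / 127
    return [round(x / scale) for x in xs], scale

def dequantize(q, scale):
    return [v * scale for v in q]

normal = [0.1, -0.2, 0.3, 0.05]
# Large-magnitude outlier features like those that emerge at ~6.7B params:
with_outlier = normal + [60.0]

for name, xs in [("normal", normal), ("with outlier", with_outlier)]:
    q, s = quantize_absmax_int8(xs)
    err = max(abs(a - b) for a, b in zip(xs, dequantize(q, s)))
    print(f"{name}: max abs error {err:.4f}")
```

One outlier stretches the scale so far that every small value rounds to almost nothing: here 0.1 dequantizes back to exactly 0.0.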
LLM decoding is sequential and memory-bound. Speculative decoding breaks this by guessing multiple tokens and verifying in parallel. Here's how it works and when to use it.
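A greedy-matching sketch of the draft-then-verify loop (the full algorithm uses rejection sampling over token probabilities; the toy `target` and `draft` functions below are stand-ins for real models):

```python
def speculative_step(target, draft, prefix, k=4):
    """One round of greedy speculative decoding with toy next-token functions.
    The draft proposes k tokens; the target checks them in one parallel pass
    and keeps the longest prefix it agrees with, plus its own correction."""
    # Draft proposes k tokens autoregressively (cheap model, sequential).
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    # Target scores all k positions "in parallel" (one big-model forward pass).
    accepted, ctx = [], list(prefix)
    for t in proposed:
        expect = target(ctx)
        if expect != t:
            accepted.append(expect)  # target's correction ends the round
            break
        accepted.append(t)
        ctx.append(t)
    else:
        accepted.append(target(ctx))  # bonus token when every draft matched
    return accepted

# Toy models: the target repeats "a b c"; the draft agrees except on "c".
cycle = ["a", "b", "c"]
target = lambda ctx: cycle[len(ctx) % 3]
draft = lambda ctx: "b" if cycle[len(ctx) % 3] == "c" else cycle[len(ctx) % 3]

print(speculative_step(target, draft, ["a"], k=4))
```

Here one target pass yields two tokens instead of one; acceptance rate, and thus speedup, depends on how often the draft agrees with the target.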
Data parallelism hits a wall when your model doesn't fit on one GPU. Tensor parallelism solves this by sharding the model itself.
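A pure-Python sketch of column-parallel sharding, the basic building block of tensor parallelism (devices are simulated by a loop; a real implementation places each shard on its own GPU and all-gathers over NCCL):

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def column_parallel_linear(x, W, shards):
    """Split W's output columns across `shards` devices, compute each
    shard's slice locally, then all-gather (concatenate) the results."""
    cols = len(W[0])
    step = cols // shards
    pieces = []
    for s in range(shards):
        W_s = [row[s * step:(s + 1) * step] for row in W]  # shard on device s
        pieces.append(matmul(x, W_s))
    # All-gather: concatenate each row's partial outputs.
    return [sum((p[i] for p in pieces), []) for i in range(len(x))]

x = [[1.0, 2.0]]
W = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0]]
assert column_parallel_linear(x, W, shards=2) == matmul(x, W)
print(column_parallel_linear(x, W, shards=2))
```

Each device holds only `1/shards` of the weights, which is the point: the model no longer has to fit on one GPU.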
A batch size of 1 on an A100 typically achieves 10-20% utilization. Understanding why—and how to fix it—is key to cost-effective inference.
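A back-of-envelope roofline model makes the mechanism concrete. It assumes fp16 weights, A100-class peaks of roughly 312 TFLOPS and 2 TB/s, and counts only weight traffic (no KV cache or overheads); it estimates model-FLOPs utilization, a stricter metric than the coarse utilization figure nvidia-smi reports:

```python
def decode_mfu(params, batch, flops_peak=312e12, bw=2.0e12, bytes_per_param=2):
    """Roofline estimate of model-FLOPs utilization for one decode step.
    Each step reads all weights once and does ~2*params FLOPs per sequence."""
    t_compute = 2 * params * batch / flops_peak
    t_memory = params * bytes_per_param / bw
    return t_compute / max(t_compute, t_memory)

for b in (1, 8, 64, 256):
    print(f"batch {b:>3}: ~{decode_mfu(7e9, b):.1%} MFU")
```

Under these assumptions, utilization grows roughly linearly with batch size until the step becomes compute-bound, which is why batching is the main lever for cost-effective serving.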
Time-to-first-token is compute-bound. Token generation is memory-bound. Understanding this split is key to optimizing inference.
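The split falls out of a simple cost model (same hedged assumptions: fp16 weights, ~312 TFLOPS and ~2 TB/s A100-class peaks, weight traffic only):

```python
def step_time(params, tokens, flops_peak=312e12, bw=2.0e12):
    """Time for one forward pass over `tokens` tokens: the slower of
    compute (~2*params FLOPs per token) and weight traffic (fp16, read once)."""
    t_compute = 2 * params * tokens / flops_peak
    t_memory = 2 * params / bw
    bound = "compute" if t_compute > t_memory else "memory"
    return max(t_compute, t_memory), bound

p = 7e9
ttft, b1 = step_time(p, tokens=2048)  # prefill: whole prompt in one pass
tpot, b2 = step_time(p, tokens=1)     # decode: one token per pass
print(f"prefill ~{ttft*1e3:.0f} ms ({b1}-bound), per token ~{tpot*1e3:.1f} ms ({b2}-bound)")
```

Prefill amortizes one weight read over thousands of tokens; decode pays the full read for a single token, so the two phases want different optimizations.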
Flash Attention doesn't reduce computation—it reduces memory traffic. Understanding why that matters is key to optimizing transformers.
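A rough traffic count for one attention head makes the point (idealized: it ignores softmax statistics and Flash Attention's recomputation, and note the FLOPs are identical in both cases):

```python
def attn_hbm_traffic(n, d, bytes_el=2):
    """Approximate HBM bytes for one attention head, fp16.
    Standard attention writes and re-reads the n x n score matrix;
    Flash Attention tiles it through on-chip SRAM, so HBM only
    sees Q, K, V in and O out."""
    standard = (3 * n * d + 2 * n * n + n * d) * bytes_el  # Q,K,V in; S out+in; O out
    flash = (3 * n * d + n * d) * bytes_el                 # Q,K,V in; O out
    return standard, flash

std, fl = attn_hbm_traffic(n=8192, d=128)
print(f"standard ~{std/1e9:.2f} GB, flash ~{fl/1e9:.3f} GB ({std/fl:.0f}x less traffic)")
```

The n² term dominates at long sequence lengths, which is why eliminating it speeds up a kernel whose FLOP count never changed.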
Multi-Head, Grouped-Query, and Multi-Head Latent Attention explained with memory calculations you can verify yourself.
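One such calculation: a per-sequence KV-cache estimate in fp16. The MHA/GQA shapes below are Llama-style examples, and the MLA latent width follows DeepSeek's published 512+64 split, used here purely for illustration:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_el=2):
    """Per-sequence KV cache: keys + values, per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_el

seq = 8192
mha = kv_cache_bytes(32, 32, 128, seq)  # e.g. a Llama-2-7B-style config
gqa = kv_cache_bytes(32, 8, 128, seq)   # same model with 8 KV heads (GQA)
# MLA caches one compressed latent per token per layer (DeepSeek-style 512+64):
mla = 32 * (512 + 64) * seq * 2

print(f"MHA: {mha/2**30:.1f} GiB, GQA: {gqa/2**30:.2f} GiB, MLA: {mla/2**30:.2f} GiB")
```

At 8K context the MHA cache alone is 4 GiB per sequence, which is why KV-cache compression, not weight size, often decides your maximum batch size.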
DeepSeek-V3 routes across 256 experts per layer; Llama 4 uses as few as 16, each much larger. Why expert count matters, and what the tradeoffs actually are.
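A quick parameter count shows the knob being turned. The configs below are hypothetical, chosen so both layers spend the same compute per token; fine-grained routing then buys more total capacity for the same active FLOPs:

```python
def moe_ffn_params(d_model, d_ff, n_experts, top_k):
    """FFN parameters for a gated (SwiGLU-style) MoE layer: three weight
    matrices of shape (d_model, d_ff) per expert.
    Returns (total params, params active per token)."""
    per_expert = 3 * d_model * d_ff
    return n_experts * per_expert, top_k * per_expert

# Hypothetical configs: many small experts vs. few large ones.
total_a, active_a = moe_ffn_params(d_model=4096, d_ff=1024, n_experts=256, top_k=8)
total_b, active_b = moe_ffn_params(d_model=4096, d_ff=8192, n_experts=8, top_k=1)
print(f"fine-grained: {total_a/1e9:.2f}B total / {active_a/1e6:.0f}M active")
print(f"coarse:       {total_b/1e9:.2f}B total / {active_b/1e6:.0f}M active")
```

Identical active parameters, 4x the total capacity for the fine-grained layer; the costs show up elsewhere, in routing overhead and load balancing.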
GPT-2 popularized Pre-Norm. Now models like OLMo 2 and Gemma 3 are switching back to Post-Norm. What changed?
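The two placements in miniature, with stub `norm` and `sublayer` functions standing in for RMSNorm and attention/FFN. (A hedged caveat: OLMo 2's actual variant normalizes the sublayer output inside the residual rather than the whole sum; this sketch shows the classic forms.)

```python
def norm(x):
    # Stand-in for LayerNorm/RMSNorm: rescale a vector to ~unit RMS.
    rms = (sum(v * v for v in x) / len(x)) ** 0.5
    return [v / (rms + 1e-6) for v in x]

def sublayer(x):
    return [2 * v for v in x]  # stand-in for attention or the FFN

def pre_norm_block(x):
    # Pre-Norm: normalize the sublayer's input; the residual
    # stream itself flows through unnormalized.
    return [a + b for a, b in zip(x, sublayer(norm(x)))]

def post_norm_block(x):
    # Post-Norm (original Transformer): normalize the sum, so the
    # residual stream is renormalized at every block.
    return norm([a + b for a, b in zip(x, sublayer(x))])

print(pre_norm_block([3.0, 4.0]))
print(post_norm_block([3.0, 4.0]))
```

The difference is where normalization sits relative to the residual connection, which changes how activation magnitudes grow with depth.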
Standard attention is O(n²). Linear variants promise O(n). Qwen3 and Kimi K2 use hybrid approaches. Here's what you give up.
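A crude per-head FLOP count shows what the O(n) claim buys (it ignores softmax and normalizer costs; the thing you give up, a fixed-size state instead of exact lookups over the full history, doesn't show up in FLOPs at all):

```python
def attn_flops(n, d):
    """Score-and-mix FLOPs per head: softmax attention is O(n^2 * d);
    linear attention folds K and V into a d x d state, so it is O(n * d^2)."""
    softmax = 2 * n * n * d * 2  # QK^T and PV: ~2nd FLOPs per matmul entry
    linear = 2 * n * d * d * 2   # build K^T V state, then apply Q to it
    return softmax, linear

for n in (1_000, 8_000, 64_000):
    s, l = attn_flops(n, d=128)
    print(f"n={n:>6}: softmax {s:.2e} FLOPs, linear {l:.2e} ({s/l:.0f}x)")
```

The ratio is simply n/d, which is why hybrids keep a few full-attention layers: the linear layers carry the bulk of a long context cheaply while the quadratic ones preserve exact retrieval.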