Advanced ML optimization framework for TinyLlama-1.1B, demonstrating production-ready inference acceleration techniques with measurable performance gains.

This project implements four optimization techniques (mixed precision, INT8 quantization, attention head pruning, and KV-cache quantization) to accelerate inference and reduce memory consumption for TinyLlama-1.1B while maintaining model quality. Targets: 60%+ speedup, 35%+ memory reduction, 97%+ accuracy retention.

Mixed precision (FP16):
- Converts most of the model to FP16 for faster computation
- Keeps numerically sensitive layers (normalization layers, embeddings) in FP32
- Benefits: 40-50% speedup, 50% memory reduction, no measurable quality loss
- Tradeoffs: requires GPU/MPS support for optimal performance

INT8 dynamic quantization:
- Applies PyTorch dynamic quantization to linear layers
- Symmetric INT8 quantization: maps [-max, max] → [-127, 127] (values are clamped to the INT8 range [-128, 127])
- Benefits: 60-75% model size reduction, lower memory footprint
- Tradeoffs: 5-10% slower on some hardware, slight quality degradation (<2%)
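
As a rough sanity check on the size figures, weight storage scales with the number of bytes per parameter. The sketch below assumes about 1.1 billion parameters (taken from the model name, not measured) and that every weight is stored in a single dtype. `estimate_size_mb` is a hypothetical helper, not part of this repository.

```python
# Back-of-the-envelope weight storage by dtype.
# Assumptions: ~1.1e9 parameters (from the model name), one dtype for all weights.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}
NUM_PARAMS = 1_100_000_000


def estimate_size_mb(num_params: int, dtype: str) -> float:
    """Return the approximate weight storage in MiB for a uniform dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


fp32_size = estimate_size_mb(NUM_PARAMS, "fp32")
for dtype in BYTES_PER_PARAM:
    size = estimate_size_mb(NUM_PARAMS, dtype)
    reduction = 1 - size / fp32_size
    print(f"{dtype}: {size:,.0f} MiB ({reduction:.0%} smaller than fp32)")
```

Under these assumptions the estimate comes to about 4,196 MiB in FP32, 2,098 MiB in FP16, and 1,049 MiB in INT8. A real INT8 model is usually larger than the uniform estimate when only the linear layers are quantized and the remaining parameters stay in higher precision.
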

Attention head pruning:
- Scores each attention head by the L2 norm of its output-projection weights
- Zeros out the least important 25% of attention heads
- Benefits: 15-25% speedup, reduced computation
- Tradeoffs: requires careful tuning, ~1-3% quality loss
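
The selection step can be illustrated without any framework. This is a minimal sketch, assuming the output projection is an (out, in) matrix whose input columns are grouped contiguously by head; `head_importance` and `heads_to_prune` are hypothetical names, and the toy matrix is made up for illustration.

```python
import math


def head_importance(o_proj_weight, num_heads):
    """L2 norm of each head's block of input columns in an (out, in) matrix."""
    in_features = len(o_proj_weight[0])
    head_dim = in_features // num_heads
    scores = []
    for h in range(num_heads):
        cols = range(h * head_dim, (h + 1) * head_dim)
        squared_sum = sum(row[c] ** 2 for row in o_proj_weight for c in cols)
        scores.append(math.sqrt(squared_sum))
    return scores


def heads_to_prune(scores, prune_ratio=0.25):
    """Indices of the lowest-scoring heads, lowest first."""
    k = int(len(scores) * prune_ratio)
    ranked = sorted(range(len(scores)), key=lambda h: scores[h])
    return ranked[:k]


# Toy 2x8 matrix: 4 heads of 2 columns each; head 2 has the smallest weights.
weight = [
    [1.0, 1.0, 0.5, 0.5, 0.1, 0.1, 2.0, 2.0],
    [1.0, 1.0, 0.5, 0.5, 0.1, 0.1, 2.0, 2.0],
]
scores = head_importance(weight, num_heads=4)
pruned_heads = heads_to_prune(scores, prune_ratio=0.25)
print(scores, pruned_heads)
```

Ranking by score and taking the lowest `k` heads prunes a fixed number of heads even when several scores are tied, which a strict threshold comparison does not guarantee.
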

KV-cache quantization:
- Quantizes the key-value cache to INT8 during generation
- Reduces cache memory by 50-75%
- Benefits: enables longer context windows, lower peak memory
- Tradeoffs: small (~5%) compute overhead for the quantize/dequantize steps
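
To make the cache idea concrete, here is a toy sketch of per-vector symmetric INT8 storage using only the standard library. It is not the repository's implementation: `QuantizedKVCache` is a hypothetical class, the 64-element vector length is an assumed head dimension, and a real cache would hold tensors per layer and per head.

```python
from array import array


class QuantizedKVCache:
    """Toy cache storing each key or value vector as INT8 codes plus one scale."""

    def __init__(self):
        self._codes = []   # one signed-byte array per cached vector
        self._scales = []  # one float scale per cached vector

    def append(self, vector):
        max_val = max(abs(x) for x in vector)
        scale = 127.0 / max_val if max_val > 0 else 1.0
        codes = array("b", (max(-127, min(127, round(x * scale))) for x in vector))
        self._codes.append(codes)
        self._scales.append(scale)

    def get(self, index):
        """Dequantize one cached vector back to floats."""
        scale = self._scales[index]
        return [code / scale for code in self._codes[index]]

    def payload_bytes(self):
        # 1 byte per element, plus a 4-byte scale per vector (as if stored in FP32)
        return sum(len(codes) for codes in self._codes) + 4 * len(self._scales)


cache = QuantizedKVCache()
key = [((i % 7) - 3) / 3.0 for i in range(64)]  # made-up values in [-1, 1]
cache.append(key)
restored = cache.get(0)
max_error = max(abs(a - b) for a, b in zip(key, restored))
print(cache.payload_bytes(), max_error)
```

With 64-element vectors this layout uses 68 bytes per vector, compared with 128 bytes in FP16 or 256 bytes in FP32, which is roughly in line with the 50-75% range above. The round-trip error per element is bounded by half a quantization step, here 0.5 / 127.
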

Project structure:

    tinyllama-optimization/
    ├── README.md
    ├── requirements.txt
    ├── baseline.py
    ├── optimizations/
    │   ├── mixed_precision.py
    │   ├── quantization.py
    │   ├── head_pruning.py
    │   └── kv_cache_quant.py
    ├── benchmark.py
    ├── run_optimization.py
    └── results/
        ├── baseline_results.json
        ├── consolidated_results.json
        └── visualizations/

Usage:

    python baseline.py

    python optimizations/mixed_precision.py
    python optimizations/quantization.py
    python optimizations/head_pruning.py
    python optimizations/kv_cache_quant.py

    python run_optimization.py

| Method | Tokens/sec | Memory (MB) | Model Size (MB) | Perplexity |
|---|---|---|---|---|
| Baseline | 24.3 | 4521 | 4200 | 15.2 |
| Mixed FP16 | 35.7 (+47%) | 2340 (-48%) | 2100 (-50%) | 15.2 (100%) |
| INT8 Quant | 21.8 (-10%) | 2850 (-37%) | 1260 (-70%) | 15.6 (98%) |
| Head Prune | 29.1 (+20%) | 3900 (-14%) | 4200 (0%) | 15.7 (97%) |
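
The relative changes in the first three columns can be recomputed from the raw values. The sketch below copies the numbers from the table above; `percent_change` is a hypothetical helper, and rounding to a whole percent is an assumption about how the table was produced.

```python
# Values copied from the results table: (tokens/sec, memory MB, model size MB).
baseline = (24.3, 4521, 4200)
methods = {
    "Mixed FP16": (35.7, 2340, 2100),
    "INT8 Quant": (21.8, 2850, 1260),
    "Head Prune": (29.1, 3900, 4200),
}


def percent_change(new: float, old: float) -> int:
    """Relative change from old to new, rounded to a whole percent."""
    return round((new / old - 1) * 100)


changes = {
    name: tuple(percent_change(v, b) for v, b in zip(values, baseline))
    for name, values in methods.items()
}
for name, (speed, memory, size) in changes.items():
    print(f"{name}: speed {speed:+d}%, memory {memory:+d}%, size {size:+d}%")
```

Recomputing the percentages this way is a quick check that the table stays consistent whenever new benchmark results are added.
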
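
- Edge deployment: the 70% size reduction from INT8 quantization makes on-device inference practical
- Cost savings: a ~50% memory reduction lets roughly twice as many model instances fit on each GPU, for up to 2x throughput per GPU
- Latency: a ~45% speedup improves responsiveness in chatbots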
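
Mixed precision conversion:

    import torch
    from torch import nn
    from transformers import AutoModelForCausalLM

    # Load weights in FP16, then move numerically sensitive modules back to FP32
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
    )
    for module in model.modules():
        # Llama-style models use RMSNorm rather than nn.LayerNorm, so match by class name too
        if isinstance(module, (nn.LayerNorm, nn.Embedding)) or "RMSNorm" in type(module).__name__:
            module.to(torch.float32)

Symmetric INT8 quantization:

    max_val = tensor.abs().max()  # largest magnitude defines the symmetric range
    scale = 127.0 / max_val
    quantized = torch.clamp(torch.round(tensor * scale), -128, 127).to(torch.int8)
    dequantized = quantized.float() / scale

Head importance scoring:

    # o_proj.weight has shape (hidden, num_heads * head_dim); score each head's column block
    weight = attention.o_proj.weight.float()
    per_head = weight.reshape(weight.shape[0], num_heads, head_dim)
    importance = torch.linalg.vector_norm(per_head, ord=2, dim=(0, 2))
    threshold = torch.quantile(importance, 0.25)
    mask = importance > threshold  # True marks heads to keep

References:
- Mixed Precision Training - Micikevicius et al. (2017)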
- Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference - Jacob et al. (2018)
- Are Sixteen Heads Really Better than One? - Michel et al. (2019)
- KV-cache quantization - Sheng et al. (2023)

Out of memory:
- Reduce `max_new_tokens` in generation
- Use quantization first, before other optimizations
- Close other applications

Slow inference:
- Ensure the model uses the MPS device: `device="mps"`
- Check Activity Monitor for thermal throttling
- Mixed precision gives the best performance on M2

Quality degradation:
- Reduce the pruning ratio (try 0.15 instead of 0.25)
- Use FP16 instead of INT8 quantization
- Combine fewer optimizations

Adjusting the pruning ratio:

    pruned = PrunedInference(prune_ratio=0.15)

Custom test prompts:

    test_prompts = [
        "your custom prompt 1:",
        "your custom prompt 2:",
    ]

MIT License