Unified Projections for Efficient Language Models
We introduce Yocto, a 484K parameter language model that reduces attention parameters by 67% while achieving better perplexity than models 2-4× larger. The key insight: Q, K, V projections share structure and can be unified into a single projection.
Standard transformer attention uses three separate projections (Q, K, V), each with d² parameters. We show this is redundant.
We introduce Unified Attention: a single projection whose output splits into [seeking|offering|content] bands. Through training, these bands learn the functions of Q, K, and V respectively—but with 67% fewer attention parameters.
Results:
- 484,272 total parameters (1.85 MB at float32, <1 MB quantized)
- 700+ tokens/sec on CPU (no GPU required)
- 5.7% of parameters in attention (vs ~25% in standard transformers)
- 9.58 validation perplexity on TinyStories (matching models 2-4× larger)
- Geometric preservation: Controlled experiments show Berry phase and layer orthogonality within 2% of standard attention
We interpret our findings through wave physics: vectors are waveforms, weight matrices are fields, and projection computes amplitude resonance with phase alignment. This interpretation predicted unification would work—the three projections share structure because they transform the same input and optimize the same objective.
The physics of attention is simpler than standard architectures suggest.
Transformer attention computes:
where Q = W_Q·x, K = W_K·x, V = W_V·x are separate linear projections. This requires 3d² attention parameters per layer.
But why three separate matrices? The original paper [1] offered no theoretical justification. It worked, so the field adopted it. Nine years later, we still use three matrices.
Our question: What is the minimal parameterization for attention?
Our finding: One matrix suffices. A single projection, split into three bands, achieves the same geometric properties with 67% fewer attention parameters—and better perplexity.
- Unified Attention: Single projection → [seeking|offering|content] bands
- 67% reduction in attention parameters (from 3d² to d²)
- Improved perplexity (9.58 vs ~10-12 baseline)
- Geometric verification via Berry phase and orthogonality measurements
- Wave-field interpretation explaining why unification works
Every vector is a waveform with d dimensions. Each dimension carries:
- Amplitude: |vᵢ| — activation strength
- Phase: sign(vᵢ) — positive or negative direction
The dot product between vectors is wave interference:
Same sign → constructive interference (positive contribution) Opposite sign → destructive interference (negative contribution)
The three projections are not independent:
- Same input: All three transform x
- Same objective: All three optimize the same loss
- Coupled function: Q must find what K offers
If Q, K, V share underlying structure, learning them jointly (one matrix) should be more efficient than learning them separately (three matrices). The single matrix acts as an implicit regularizer.
Prediction: Unified projection will match or exceed standard attention with fewer parameters.
Standard: Q = W_Q·x, K = W_K·x, V = W_V·x [3d² params]
Unified: u = W·x, Q = u[:d/3], K = u[d/3:2d/3], V = u[2d/3:] [d² params]
We apply Rotary Position Embedding (RoPE) to Q and K bands but not to V. Position affects routing (who attends to whom) but not content (what information transfers).
| Component | Value |
|---|---|
| Embedding dimension | 72 |
| Layers | 4 |
| Attention heads | 3 |
| FFN hidden | 288 |
| Vocabulary | 4,000 |
| Context length | 512 |
| Total parameters | 484,272 |
| Component | Parameters | Share |
|---|---|---|
| Embeddings | 288,000 | 59.5% |
| Attention | 27,648 | 5.7% |
| FFN | 166,464 | 34.4% |
| Other | 2,160 | 0.4% |
Standard transformers allocate ~25% to attention. Ours uses 5.7%.
- Data: TinyStories [3]
- Training: 82,000 steps, batch size 64, AdamW
- Learning rate: 1e-3 → 1e-4 (cosine decay)
| Model | Total Params | Attention Share | Val PPL |
|---|---|---|---|
| Ours (Unified) | 484K | 5.7% | 9.58 |
| TinyStories-1M | 1M | ~25% | ~10-12 |
| seangoedecke [4] | 1.8M | ~25% | ~9.6 |
With 52% fewer total parameters than TinyStories-1M, we achieve better perplexity. With 73% fewer parameters than seangoedecke, we match their perplexity.
| Metric | Score |
|---|---|
| Quality score | 99.6/100 |
| Vocabulary diversity | 67.0% |
| 2-gram repetition | 4.7% |
| Story elements | 68.8% |
Example (prompt: "Once upon a time"):
Once upon a time there was a little girl named Lily. She loved to run and run, but one day she didn't have any friends to play with.
Lily went home and bought her favorite toy to play. When she woke up, she saw something in the closet. It was a ball of yarn! She played with it all night long.
Named characters, temporal progression, narrative coherence—from 484K parameters.
| Hardware | Tokens/sec |
|---|---|
| Apple M-series CPU | 700+ |
| Standard laptop CPU | 200-400 |
| HuggingFace Spaces (shared CPU) | 100-300 |
The model is so small that CPU outperforms GPU—memory transfer overhead exceeds compute savings. No GPU required.
Does unified attention preserve the geometric properties of standard attention? We ran controlled experiments comparing attention mechanisms on identical architectures (6-layer transformers, embed_dim=126, trained on synthetic language tasks).
Berry phase measures accumulated rotation through layers—the "geometric memory" of the path through representation space.
| Attention Type | Berry Phase | vs Baseline |
|---|---|---|
| Standard Q/K/V | 135.23° | 100% |
| Unified [seek|offer|content] | 137.32° | 101.5% |
Within 2%: geometric path preserved.
Average angle between consecutive layer representations:
| Attention Type | Mean Angle |
|---|---|
| Standard Q/K/V | 22.54° |
| Unified | 22.89° |
Within 2%: rotation structure preserved.
These controlled experiments demonstrate that unified attention traces an equivalent geometric path through representation space. The 67% parameter reduction does not distort the fundamental geometry—it removes redundancy while preserving structure.
The Yocto model (4 layers, embed_dim=72) applies this same unified architecture to TinyStories, achieving 9.58 perplexity with 5.7% attention parameters.
Shared structure: Q, K, V transform the same input for the same objective. Separate matrices learn redundant structure. A unified matrix learns it once.
Implicit regularization: Fewer parameters may prevent overfitting. Our improved perplexity (9.58 vs ~10-12) supports this.
Our parameter distribution (59.5% embeddings, 5.7% attention, 34.4% FFN) suggests:
- Attention is routing: It decides WHERE information flows, not what it becomes
- Embeddings carry meaning: Most capacity represents tokens well
- FFN is essential: Nonlinear transformation cannot be reduced
- CPU is optimal: At 484K params, GPU memory transfer overhead exceeds compute benefits. The model runs faster on CPU than MPS/CUDA.
- Domain: TinyStories is constrained. General language may need more capacity.
- Scale: 484K tested. Behavior at 100M+ unknown.
- Context: 512 tokens tested. Long-context behavior unexplored.
Multi-Query Attention [5]: Shares K, V across heads. We go further—unifying Q, K, V into a single projection.
LoRA [6]: Reduces parameters post-training via low-rank adaptation. We reduce architecturally, during training.
Efficient Attention: Linear attention and sparse attention reduce O(n²) complexity. We reduce parameters while keeping full attention.
F-Net [7]: Replaces attention with Fourier transforms. Loses differential weighting—all tokens mixed identically, cannot focus on relevant context.
Mamba [8]: Replaces attention with selective state spaces. Loses Q/K asymmetry—"what I seek" merges with "what I offer" in a single state.
Ours: Preserves both differential weighting and asymmetry. Removes only the redundant parameterization.
We've demonstrated unified attention at 484K parameters. Two questions remain:
Scale: Does the 67% reduction hold at 100M or 1B parameters? The redundancy argument should apply regardless of scale, but this requires empirical verification.
Domains: TinyStories uses constrained vocabulary. Does unified attention maintain its advantage on general language, code, or multilingual data?
Yocto is the first step toward AI that needs no cloud, no API, no internet:
| Platform | Use Case |
|---|---|
| Video games | NPCs with real dialogue, not scripted lines |
| Browsers | Local assistants without sending data to servers |
| Phones | Private AI that works offline |
| IoT/Embedded | Smart devices that think locally |
At 946 KB and 700+ tokens/sec on CPU, Yocto proves this is possible. The goal is to scale unified attention to build larger models that remain small enough to run anywhere — private, fast, and free.
We release all code, weights, and training configs to enable this future.
We asked: What is the minimal parameterization for attention?
Answer: One projection suffices. The three matrices Q, K, V share structure that a unified projection captures more efficiently.
Results:
- 67% fewer attention parameters
- 5.7% of model in attention (vs 25% standard)
- 700+ tokens/sec on CPU (no GPU needed)
- Better perplexity than larger models
- Geometric properties preserved within 2%
Implication: Standard attention is over-parameterized. The wave-field interpretation predicted this—and experiments confirm it.
[1] Vaswani et al., "Attention Is All You Need," NeurIPS 2017.
[2] Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding," 2021.
[3] Eldan & Li, "TinyStories: How Small Can Language Models Be?," 2023.
[4] Goedecke, "Training a Language Model on a Laptop," 2024.
[5] Shazeer, "Fast Transformer Decoding: One Write-Head is All You Need," 2019.
[6] Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models," ICLR 2022.
[7] Lee-Thorp et al., "FNet: Mixing Tokens with Fourier Transforms," NAACL 2022.
[8] Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces," 2023.
class UnifiedAttention(nn.Module):
def __init__(self, embed_dim, num_heads):
self.third = embed_dim // 3
self.W_unified = nn.Linear(embed_dim, embed_dim, bias=False)
self.W_out = nn.Linear(self.third, embed_dim, bias=False)
self.rope = RotaryPositionEmbedding(self.third // num_heads)
def forward(self, x):
# Single projection → three bands
u = self.W_unified(x)
seeking, offering, content = u.split(self.third, dim=-1)
# RoPE on seeking/offering only (not content)
cos, sin = self.rope(x)
seeking, offering = apply_rope(seeking, offering, cos, sin)
# Standard attention computation
out = F.scaled_dot_product_attention(
seeking, offering, content, is_causal=True
)
return self.W_out(out)484,272 Parameters · 946 KB (fp16) · 700+ tok/s · 67% Less Attention · Open Source
git clone https://github.com/reinforceai/yocto
cd yocto
pip install -r requirements.txt
python inference.py --prompt "Once upon a time"700+ tokens/sec on CPU — no GPU needed.
🤗 Try it now: HuggingFace Space
📦 Model weights: HuggingFace Model
If you use this work, please cite:
@misc{deshwal2026yocto,
title={Attention Fields: Unified Projections for Efficient Language Models},
author={Deshwal, Viraj},
year={2026},
url={https://www.reinforceai.com/yocto},
howpublished={\url{https://github.com/reinforceai/yocto}}
}