Experiment: CAT diagonal alignment for head_dim=128 #9

@TheTom

Description

Hypothesis

Per-channel scaling before FWHT reduces the head_dim=128 quality gap by aligning channel variances.

Background

head_dim=128 degrades perplexity (PPL) by 2-4% relative to head_dim=256. Root cause: fewer FWHT butterfly stages (7 vs. 8) give weaker mixing and heavier tails in the post-rotation distribution. The CAT paper (arXiv:2603.04359) proposes per-channel scaling m_i = sqrt(E[k_i^2] / sum_j W_K[i,j]^2) applied before the rotation.
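The scale can be computed offline from a small calibration set. A minimal numpy sketch (function name, array layout, and the `W_K` row convention are assumptions here, not the CAT reference implementation):

```python
import numpy as np

def cat_channel_scales(k_calib: np.ndarray, W_K: np.ndarray) -> np.ndarray:
    """Per-channel CAT scales m_i = sqrt(E[k_i^2] / sum_j W_K[i,j]^2).

    k_calib: (n_tokens, head_dim) calibration keys for one head
    W_K:     (head_dim, d_model)  key-projection rows for that head (assumed layout)
    """
    second_moment = np.mean(k_calib ** 2, axis=0)   # E[k_i^2], shape (head_dim,)
    row_energy = np.sum(W_K ** 2, axis=1)           # sum_j W_K[i,j]^2
    return np.sqrt(second_moment / row_energy)
```

With unit-variance calibration keys and an identity projection, every scale comes out as 1, i.e. the rotation is left unchanged.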

What to test

  • Compute the per-channel variance calibration (128 multiplies per token)
  • Apply the scaling before the WHT rotation in SET_ROWS
  • Measure PPL on head_dim=128 models (Qwen2.5-7B, Llama-3.1-8B)
  • Compare against the turbo3 baseline on the same models
  • Measure the throughput overhead of the extra per-channel multiply
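The hot-path change is small: one element-wise multiply ahead of the butterfly stages. A numpy sketch of the scaled rotation (the real kernel lives in SET_ROWS; names here are illustrative):

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal fast Walsh-Hadamard transform along the last axis.
    Length must be a power of two; head_dim=128 takes 7 butterfly stages."""
    x = x.astype(np.float64).copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b          # butterfly sum
            x[..., i + h:i + 2 * h] = a - b  # butterfly difference
        h *= 2
    return x / np.sqrt(n)

def rotate_keys_cat(k: np.ndarray, m: np.ndarray) -> np.ndarray:
    """CAT-scaled rotation: one extra per-channel multiply before the FWHT."""
    return fwht(k * m)
```

Because the normalized transform is orthonormal (and its own inverse), applying it twice recovers the input, which makes the kernel easy to unit-test.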

Expected outcome

Close 25-39% of the head_dim=128 PPL gap, per the CAT paper, with minimal speed impact (one extra per-channel multiply in SET_ROWS).

Priority

Medium — addresses the biggest open quality problem.

Source

AutoRepl: TODO-002 (buun, fork_dc582a), arXiv:2603.04359
