[WIP] SSM LRU Baseline — First State Space Model Submission#220

Draft
timothywangdev wants to merge 1 commit into openai:main from timothywangdev:ssm-lru-baseline
Conversation

@timothywangdev

Summary

  • First non-transformer submission to parameter golf — uses a Linear Recurrent Unit (LRU) state space model
  • SSM blocks are 36% smaller than attention blocks at equivalent model dimension, allowing 12-15 layers in the 16MB budget where transformers fit only 9-10
  • Complex diagonal recurrence with parallel scan (cumsum trick), gated projection, ReLU^2 MLP
  • MuonAdamW optimizer with SSM-aware parameter groups
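
The "cumsum trick" for the complex diagonal recurrence can be sketched in a few lines. This is not the PR's implementation (which is in PyTorch and would use mamba_ssm kernels on H100); it is a minimal NumPy illustration with toy dimensions and hypothetical function names:

```python
import numpy as np

def lru_scan(log_a, b):
    """Parallel form of the recurrence h_t = a * h_{t-1} + b_t for complex diagonal a.

    Cumsum trick: h_t = a^t * sum_{s<=t} a^{-s} b_s, with a kept in log space.
    Only stable for short sequences, since |a|^{-s} grows as s increases;
    production kernels use an associative scan instead.

    log_a: (d,) complex log of the diagonal weights, Re(log_a) < 0 so |a| < 1
    b:     (T, d) complex per-step inputs
    """
    T = b.shape[0]
    t = np.arange(1, T + 1)[:, None]   # (T, 1) step indices
    log_at = t * log_a                 # (T, d): elementwise log of a^t
    return np.exp(log_at) * np.cumsum(np.exp(-log_at) * b, axis=0)
```

A sequential loop `h = a * h + b_t` over the same inputs reproduces the output exactly, which is a convenient correctness check for any fused-kernel replacement.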

Status: WIP

Applying for compute credits to validate on H100. Current results on RTX 3090 (5 min budget):

  • val_bpb: 1.848 (bottlenecked by pure PyTorch scan speed, 2.8% MFU)
  • With mamba_ssm CUDA kernels on H100, expect 10-50x speedup → competitive results

Why SSMs Could Win

  • Parameter efficiency: SSM-specific params are <0.2% of total; projections dominate
  • No KV cache: Native sliding window eval without recomputation
  • Linear complexity: More tokens processed in fixed time budget
  • Compressibility: Research shows 50% of SSM weights can be pruned with zero accuracy loss (SparseSSM)

Research Backing

  • 6 deep-dive research documents (LinOSS, Mamba-3, SSM taxonomy, compression frontiers)
  • Autonomous experiment loop with brainstorming, arXiv paper reading, and self-reflection
  • Concrete 16MB configs identified from parameter analysis

Test plan

  • Validate on H100 with mamba_ssm CUDA kernels
  • Establish competitive val_bpb baseline
  • Optimize architecture (depth, width, d_state) for 16MB budget
  • Add int8+zlib quantization pipeline
  • Statistical significance testing (3+ seeds, p < 0.01)
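
The int8+zlib quantization step is still on the to-do list above; a minimal sketch of such a pipeline (symmetric per-tensor scaling, function names hypothetical, a real submission may prefer per-channel scales) could look like:

```python
import zlib
import numpy as np

def quantize_compress(w, level=9):
    """Symmetric per-tensor int8 quantization followed by zlib.

    Returns (compressed bytes, scale); the tensor shape must be stored
    separately to reconstruct.
    """
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)  # avoid div-by-zero
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes(), level), scale

def decompress_dequantize(blob, scale, shape):
    """Inverse: zlib-decompress, reinterpret as int8, rescale to float32."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale
```

Round-trip error is bounded by half a quantization step (scale / 2), and the zlib pass only helps to the extent the int8 weights have low entropy, which is where the SparseSSM-style pruning cited above would compound with compression.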

First non-transformer submission to parameter golf. Uses Linear Recurrent
Unit (LRU) with complex diagonal recurrence and parallel scan. SSM blocks
are 36% smaller than attention — can fit 12-15 layers in 16MB vs 9-10
for transformers. WIP pending H100 compute validation.
