
Non-record: JEPA-LM — When Synthetic Success Doesn't Transfer to Real…#1012

Open
himanshudongre wants to merge 2 commits into openai:main from himanshudongre:nonrecord/jepa-lm-research

Conversation


himanshudongre commented Mar 28, 2026

Summary

  • Implements JEPA (Joint Embedding Predictive Architecture) as a training-time auxiliary loss for language modeling
  • On synthetic Markov chain data, JEPA reduced cross-entropy by 19.5%
  • On real English text, the improvement collapsed to 0.24%, with a +40% training-throughput overhead
  • Key finding: Markov chains have exploitable repetitive statistical structure that JEPA excels at, but natural language doesn't
  • Checks off "JEPA" from Requests for PRs

Why This Matters

This is a negative result, but an informative one. Understanding why it doesn't work at this scale is as valuable as showing something that does.

The Trap: Synthetic vs Real Text

Benchmark                   JEPA CE Improvement   Throughput Overhead
Synthetic Markov chains     -19.5%                +40%
Real English text           -0.24%                +40%

The synthetic result was completely misleading. Markov chains have fixed transition probabilities that JEPA's representation prediction can exploit. Natural language has semantic ambiguity, long-range dependencies, and non-stationary statistics — JEPA's predictions become nearly meaningless.
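
To make the contrast concrete, here is a minimal sketch (not from this PR) of the kind of synthetic benchmark described above: every sequence is sampled from one fixed transition matrix, so the next-token distribution depends only on the previous token. The function name and parameters (`make_markov_batch`, `vocab_size`, etc.) are illustrative assumptions.

```python
import numpy as np

def make_markov_batch(vocab_size=64, seq_len=256, batch=8, seed=0):
    """Sample token sequences from a single fixed-transition-matrix Markov chain."""
    rng = np.random.default_rng(seed)
    # One transition matrix shared by every sequence: the stationary, repetitive
    # structure that a representation predictor can latch onto.
    logits = rng.normal(size=(vocab_size, vocab_size))
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

    seqs = np.empty((batch, seq_len), dtype=np.int64)
    seqs[:, 0] = rng.integers(vocab_size, size=batch)
    for t in range(1, seq_len):
        for b in range(batch):
            seqs[b, t] = rng.choice(vocab_size, p=probs[seqs[b, t - 1]])
    return seqs
```

Natural language offers no such fixed table to recover, which is the gap the numbers above quantify.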

How JEPA-LM Works

Standard LM: tokens → encoder → LM head → CE loss

JEPA-LM adds: tokens → target encoder (EMA) → target representations
Plus: tokens → online encoder → predictor → predicted representations → JEPA loss
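
A minimal PyTorch-style sketch of that dataflow, assuming the JEPA term is simply added on top of the usual next-token CE loss; the module names and the weighting `lambda_jepa` are hypothetical, not the PR's actual identifiers.

```python
import torch
import torch.nn.functional as F

def jepa_lm_loss(online_encoder, target_encoder, predictor, lm_head,
                 tokens, targets, lambda_jepa=1.0):
    # Online branch: representations that also feed the LM head (standard path).
    h_online = online_encoder(tokens)                              # (B, T, D)
    ce = F.cross_entropy(lm_head(h_online).flatten(0, 1), targets.flatten())

    # Target branch: EMA copy of the encoder; no gradients flow through it.
    with torch.no_grad():
        h_target = target_encoder(tokens)                          # (B, T, D)

    # Predictor maps online representations onto the target representations.
    jepa = F.mse_loss(predictor(h_online), h_target)

    return ce + lambda_jepa * jepa                                 # joint objective
```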

The target encoder is training-only and discarded at export, so there is zero eval-time overhead. But the +40% training overhead costs roughly 1,500 training steps within the fixed 600-second run budget, which a -0.24% quality improvement cannot justify.
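
For the "training-only, discarded at export" part, the usual pattern is an exponential moving average of the online weights; the decay value below is illustrative, and the export comment is a reading of what "zero eval-time overhead" implies rather than the PR's exact code.

```python
import torch

@torch.no_grad()
def update_target_encoder(online_encoder, target_encoder, decay=0.999):
    # Target weights trail the online weights; they are never updated by backprop.
    for p_t, p_o in zip(target_encoder.parameters(), online_encoder.parameters()):
        p_t.mul_(decay).add_(p_o, alpha=1.0 - decay)

# At export, only the online encoder and LM head are saved; the target encoder
# and predictor are dropped, so inference cost is unchanged.
```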

Companion PR

See S4D-Lin SSM Hybrid PR for where this research led next. The key lesson from JEPA (always validate on real text) informed the SSM approach — though even real-text validation at small scale turned out to be unreliable.

See full README

… Language

Implements JEPA (Joint Embedding Predictive Architecture) as training-time
auxiliary loss for language modeling. On synthetic Markov chain data, JEPA
showed -19.5% cross-entropy improvement. On real English text, the improvement
collapsed to -0.24% with +40% throughput overhead.

Key finding: Markov chains have exploitable repetitive statistical structure
that JEPA excels at, but natural language doesn't. This is a cautionary tale
about synthetic benchmark validation.

Checks off "JEPA" from Requests for PRs.

See companion PR (S4D-Lin SSM Hybrid) for where this research led next.
himanshudongre added a commit to himanshudongre/parameter-golf that referenced this pull request Mar 28, 2026
