Loss function comparison (CE vs P2 variants) under parameter-golf constraints#1180

Open
estesryan wants to merge 1 commit into openai:main from estesryan:sr-cm-p2loss

Conversation

@estesryan

Submission: SR-CM-P2Loss

Key features:

  • P2 loss ((1-p)^2) for difficulty-aware training
  • Wallclock-aware LR warmdown aligned to the 10-minute cap
  • Residual mixing + conv token mixer
  • Muon (matrix) + Adam (scalar/embed) optimizer split
  • Compression-aware training with int6 + late QAT

Final:

  • val_loss: 1.78588484
  • val_bpb: 1.05770160
  • bytes_total: 15058763

Included:

  • train_gpt.py
  • train.log
  • README.md
  • submission.json

@dexhunter

I tested the P2 loss (1-p)^2 training approach on our stack (11L, 512d, Full Hessian GPTQ, XSA-all, BigramHash) and wanted to flag a potential metric issue.

The concern: The P2 loss reweights the cross-entropy during training, focusing gradients on hard tokens. This is a legitimate training technique. However, if the same P2-weighted loss is also used during evaluation (via model.forward() returning the P2-weighted loss), the reported val_bpb is not standard cross-entropy — it's a reweighted metric that will be lower than the true BPB.
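To make the distinction concrete, here is a minimal NumPy sketch (illustrative only, not the submission's actual code) contrasting standard cross-entropy with the P2-weighted variant. Because the weight `(1 - p_t)^2` is at most 1, the P2-weighted value can never exceed the true CE, which is why reporting it as `val_bpb` understates the loss:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ce_and_p2(logits, targets):
    """Return (standard CE, P2-weighted loss) in nats, averaged over tokens."""
    p = softmax(logits)
    p_t = p[np.arange(len(targets)), targets]  # probability of the true token
    ce = -np.log(p_t)                          # standard cross-entropy
    p2 = (1.0 - p_t) ** 2 * ce                 # down-weights easy (high-p_t) tokens
    return ce.mean(), p2.mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 50))
targets = rng.integers(0, 50, size=8)
ce, p2 = ce_and_p2(logits, targets)
# (1 - p_t)^2 <= 1, so the P2-weighted average is never above the CE average.
assert p2 <= ce
```

The same inequality holds per-token, so any evaluation path that returns the P2-weighted value will systematically report a number below the true cross-entropy.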

Per the README and Issue #1017:

"val_bpb is the prequential code length of a causal predictor"

This requires standard cross-entropy evaluation (Condition 2: full normalized distribution scored with log p(x_t)), not a reweighted variant.

My test results:

| Metric | Standard CE training | P2 loss training |
| --- | --- | --- |
| Post-EMA (`model.forward`) | 1.131 | 0.985 ← P2-weighted, not CE |
| Sliding window (`F.cross_entropy`) | 1.111 | 1.157 ← true BPB, worse |

When evaluated with standard F.cross_entropy (the correct metric), the P2-trained model scores 1.157 BPB — significantly worse than standard CE training (1.111). The P2 loss sacrifices easy-token accuracy to focus on hard tokens, which nets a worse overall BPB.

The reported 1.0577 bpb may be using the P2-weighted loss path for evaluation rather than standard cross-entropy. Could you confirm whether eval_val() uses model.forward() (P2-weighted) or F.cross_entropy(model.forward_logits()) (standard CE)?
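For reference, the "prequential code length" definition of val_bpb is just the unweighted cross-entropy summed in nats and converted to bits per byte. A small sketch under assumed inputs (the function name and numbers are illustrative, not from the PR):

```python
import math

def val_bpb(sum_nats: float, total_bytes: int) -> float:
    """Prequential code length: unweighted CE summed over the validation
    stream in nats, divided by ln(2) * bytes to give bits per byte."""
    return sum_nats / (math.log(2) * total_bytes)

# Hypothetical example: 1,000,000 validation bytes costing 700,000 nats
# of summed (unweighted) cross-entropy.
bpb = val_bpb(700_000.0, 1_000_000)
```

Any per-token reweighting applied before the sum breaks this identity, so the weighted value is no longer a code length.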

cc @0hq @valerio-oai

@estesryan
Author

Thanks for the careful review. You are correct. Validation is currently using the P2-weighted loss via model.forward().

I will update the evaluation to use standard cross-entropy (unweighted), rerun, and resubmit with corrected metrics.

Appreciate you flagging this.

@estesryan estesryan changed the title SR-CM-P2Loss: 1.0577 bpb (~15.06MB) Loss function comparison (CE vs P2 variants) under parameter-golf constraints Apr 1, 2026