Loss function comparison (CE vs P2 variants) under parameter-golf constraints#1180

Open
estesryan wants to merge 1 commit into openai:main from estesryan:sr-cm-p2loss

Conversation

@estesryan

Submission: SR-CM-P2Loss

Key features:

  • P2 loss ((1-p)^2) for difficulty-aware training
  • Wallclock-aware LR warmdown aligned to the 10-minute cap
  • Residual mixing + conv token mixer
  • Muon (matrix) + Adam (scalar/embed) optimizer split
  • Compression-aware training with int6 + late QAT

Final:

  • val_loss: 1.78588484
  • val_bpb: 1.05770160
  • bytes_total: 15058763

Included:

  • train_gpt.py
  • train.log
  • README.md
  • submission.json

@dexhunter

I tested the P2 loss (1-p)^2 training approach on our stack (11L, 512d, Full Hessian GPTQ, XSA-all, BigramHash) and wanted to flag a potential metric issue.

The concern: The P2 loss reweights the cross-entropy during training, focusing gradients on hard tokens. This is a legitimate training technique. However, if the same P2-weighted loss is also used during evaluation (via model.forward() returning the P2-weighted loss), the reported val_bpb is not standard cross-entropy — it's a reweighted metric that will be lower than the true BPB.
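To make the distinction concrete, here is a minimal NumPy sketch (illustrative only, not the submission's actual code) contrasting standard cross-entropy with the P2-weighted variant. Because the weight `(1 - p_t)^2` is at most 1, the P2-weighted value can never exceed the true CE, which is why reporting it as `val_bpb` understates the loss:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ce_and_p2(logits, targets):
    """Return (standard CE, P2-weighted loss) in nats, averaged over tokens."""
    p = softmax(logits)
    p_t = p[np.arange(len(targets)), targets]  # probability of the true token
    ce = -np.log(p_t)                          # standard cross-entropy
    p2 = (1.0 - p_t) ** 2 * ce                 # down-weights easy (high-p_t) tokens
    return ce.mean(), p2.mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 50))
targets = rng.integers(0, 50, size=8)
ce, p2 = ce_and_p2(logits, targets)
# (1 - p_t)^2 <= 1, so the P2-weighted average is never above the CE average.
assert p2 <= ce
```

The same inequality holds per-token, so any evaluation path that returns the P2-weighted value will systematically report a number below the true cross-entropy.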

Per the README and Issue #1017:

"val_bpb is the prequential code length of a causal predictor"

This requires standard cross-entropy evaluation (Condition 2: full normalized distribution scored with log p(x_t)), not a reweighted variant.

My test results:

| Metric | Standard CE training | P2 loss training |
| --- | --- | --- |
| Post-EMA (`model.forward`) | 1.131 | 0.985 ← P2-weighted, not CE |
| Sliding window (`F.cross_entropy`) | 1.111 | 1.157 ← true BPB, worse |

When evaluated with standard F.cross_entropy (the correct metric), the P2-trained model scores 1.157 BPB — significantly worse than standard CE training (1.111). The P2 loss sacrifices easy-token accuracy to focus on hard tokens, which nets a worse overall BPB.

The reported 1.0577 bpb may be using the P2-weighted loss path for evaluation rather than standard cross-entropy. Could you confirm whether eval_val() uses model.forward() (P2-weighted) or F.cross_entropy(model.forward_logits()) (standard CE)?
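For reference, the "prequential code length" definition of val_bpb is just the unweighted cross-entropy summed in nats and converted to bits per byte. A small sketch under assumed inputs (the function name and numbers are illustrative, not from the PR):

```python
import math

def val_bpb(sum_nats: float, total_bytes: int) -> float:
    """Prequential code length: unweighted CE summed over the validation
    stream in nats, divided by ln(2) * bytes to give bits per byte."""
    return sum_nats / (math.log(2) * total_bytes)

# Hypothetical example: 1,000,000 validation bytes costing 700,000 nats
# of summed (unweighted) cross-entropy.
bpb = val_bpb(700_000.0, 1_000_000)
```

Any per-token reweighting applied before the sum breaks this identity, so the weighted value is no longer a code length.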

cc @0hq @valerio-oai

@estesryan
Author

Thanks for the careful review. You are correct. Validation is currently using the P2-weighted loss via model.forward().

I will update the evaluation to use standard cross-entropy (unweighted), rerun, and resubmit with corrected metrics.

Appreciate you flagging this.

@estesryan estesryan changed the title SR-CM-P2Loss: 1.0577 bpb (~15.06MB) Loss function comparison (CE vs P2 variants) under parameter-golf constraints Apr 1, 2026