[Cross-entropy-loss] return mean token accuracy metric with CE loss by kashif · Pull Request #910 · linkedin/Liger-Kernel

kashif · 2025-10-16T19:47:03Z

Summary

Returns the mean token accuracy metric when minimizing the cross-entropy loss without materializing the logits
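For readers unfamiliar with the metric, here is a minimal reference sketch in plain Python (no PyTorch, no Triton). It assumes that "mean token accuracy" means the fraction of non-ignored positions whose argmax over the vocabulary equals the target, and that `-100` is the ignore index; both are assumptions made for illustration. The function name `mean_token_accuracy` is hypothetical and is not the kernel's API.

```python
from typing import Optional, Sequence


def mean_token_accuracy(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    ignore_index: int = -100,  # assumed default, mirroring common PyTorch usage
) -> Optional[float]:
    """Fraction of non-ignored positions whose argmax equals the target."""
    correct = 0
    counted = 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # masked / padding positions are excluded from the mean
        # argmax over the vocabulary; max() returns the first index on ties
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == target)
        counted += 1
    # No countable positions: return None rather than dividing by zero
    return correct / counted if counted else None


logits = [
    [0.1, 2.0, -1.0],  # argmax 1
    [3.0, 0.0, 0.5],   # argmax 0
    [0.2, 0.1, 0.9],   # argmax 2
    [1.0, 1.5, 0.0],   # ignored position
]
targets = [1, 2, 2, -100]
accuracy = mean_token_accuracy(logits, targets)
print(accuracy)
```

This reference version reads every logit explicitly; the point of the PR is to obtain the same number as a byproduct of the loss computation, without holding the full logits tensor.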

https://x.com/jeremyphoward/status/1703246293802586155

Testing Done

- Hardware Type:
- run `make test` to ensure correctness
- run `make checkstyle` to ensure code style
- run `make test-convergence` to ensure convergence

kashif · 2025-10-20T09:57:59Z

@vaibhavjindal would you be able to kindly review?

kashif · 2025-10-28T11:15:13Z

@shimizust this will be a breaking change i believe BTW

vaibhavjindal · 2025-11-03T21:13:23Z

@kashif could you please elaborate on how it will be a breaking change? Will it break the integration with transformers or trl?

kashif · 2025-11-03T21:17:16Z

Yes, if someone is using the raw functions in their library, those functions now return one more value... but on the HF side this PR takes care of it.
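For callers outside the HF integration, one way to tolerate the extra return value is a small adapter. The sketch below is hypothetical: `extract_loss` and `StructuredLoss` are illustrative names, not part of Liger-Kernel, and the tuple branch assumes the loss is the first element.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StructuredLoss:
    """Hypothetical stand-in for a structured loss output (not the library's class)."""
    loss: float
    token_accuracy: Optional[float] = None


def extract_loss(result: Any) -> Any:
    """Return the loss whether `result` is a bare value, a tuple, or a structured object."""
    if hasattr(result, "loss"):
        return result.loss  # structured output exposing a `.loss` attribute
    if isinstance(result, (tuple, list)):
        return result[0]  # assumption: the loss comes first in tuple returns
    return result  # older behaviour: the call returned only the loss


print(extract_loss(1.5))
print(extract_loss((1.5, 0.8)))
print(extract_loss(StructuredLoss(loss=1.5, token_accuracy=0.8)))
```

Code that pins a specific library version does not need this; the adapter only helps when the same call site has to run against versions with different return shapes.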

kashif · 2025-11-03T21:18:36Z

@vaibhavjindal see here https://github.com/linkedin/Liger-Kernel/pull/910/files#diff-7654a885e261ec4986c229d953681635ab4c8761c89342d2dde604f169783b35L350

vaibhavjindal · 2025-11-03T21:35:06Z

@kashif got it. So if i understand correctly, it will make sure that liger remains compatible with newer versions from HF. However, just want to confirm it will break liger support with older transformers/trl versions?

kashif · 2025-11-03T21:42:28Z

No, I believe my changes here will work with older versions of HF. I just meant non-HF frameworks.

kashif · 2025-11-03T21:43:34Z

TRL relies on HF integration for the CE loss so in TRL I will just pin to the liger version that has these changes

kashif · 2025-11-05T16:02:04Z

@vaibhavjindal let me fix up the new qwen3-vl model to update its API

kashif · 2025-11-05T21:02:48Z

@vaibhavjindal all good from my side

vaibhavjindal · 2025-11-05T21:04:41Z

> @vaibhavjindal all good from my side

Thanks a lot! I will do some final checks on correctness and benchmarks and will try to get it merged soon.

kashif · 2025-11-05T21:07:23Z

thank you so much.. also see here: huggingface/trl#4302 (comment)

kashif · 2025-11-06T08:01:27Z

thanks @vaibhavjindal for the typo fix and making it more robust!

## Summary

Add a `return_predicted_tokens` flag to `LigerCrossEntropyLoss` and `LigerFusedLinearCrossEntropyLoss` that returns per-token argmax predictions (as `int64` tensor) **without materializing full logits**.

## Motivation

During training, it is often useful to access the model's predicted tokens (argmax of logits) for logging, visualization, and metric computation, for example inspecting what the model actually predicts at each position, or tracking prediction distributions over time.

Currently, obtaining predicted tokens requires either:

1. **Materializing full logits** and calling `.argmax(dim=-1)`, which defeats the memory savings of `FusedLinearCrossEntropy`, or
2. **Recomputing** the forward pass separately for metrics.

Since the cross-entropy kernel already tracks `argmax` internally (for `return_token_accuracy`, introduced in #910), we can return the predicted token indices as a byproduct at near-zero additional cost.

## Design

This builds on the `return_token_accuracy` infrastructure (#910). The existing `argmax_idx` tracking in the Triton kernel is reused, so:

- When `return_predicted_tokens=False` (default), there is **zero overhead**: the `RETURN_PREDICTED_TOKENS` constexpr is compiled out.
- When both `return_token_accuracy` and `return_predicted_tokens` are enabled, the argmax computation is **shared** (no duplicate work).
- Ignored tokens (`ignore_index`) return `-1` as a sentinel value.

## Changes

- **`ops/cross_entropy.py`**, **`ops/fused_linear_cross_entropy.py`**: Add `RETURN_PREDICTED_TOKENS` constexpr to the Triton kernel; store `argmax_idx` for non-ignored tokens, `-1` for ignored tokens.
- **`transformers/cross_entropy.py`**, **`transformers/fused_linear_cross_entropy.py`**, **`transformers/functional.py`**: Propagate `return_predicted_tokens` through module and functional APIs. Return `CrossEntropyOutput` when any extra output is requested.
- **`transformers/model/loss_utils.py`**: Thread `return_predicted_tokens` through `LigerForCausalLMLoss` → `fixed_fused_linear_cross_entropy`.
- **`transformers/model/output_classes.py`**: Add `predicted_tokens` field to all `Liger*CausalLMOutputWithPast` dataclasses.
- **`transformers/model/*.py`** (32 model files): Unpack and forward `predicted_tokens` in both tuple and dict return paths, following the same pattern as `token_accuracy`.

## Usage

```python
# Standalone
loss_fn = LigerCrossEntropyLoss(return_predicted_tokens=True)
result = loss_fn(logits, target)  # logits: (B*T, V), target: (B*T,)
result.loss              # scalar loss
result.predicted_tokens  # (B*T,) int64 tensor, -1 for ignored tokens

# Fused (no logits materialization)
loss_fn = LigerFusedLinearCrossEntropyLoss(return_predicted_tokens=True)
result = loss_fn(lm_head_weight, hidden_states, target)  # hidden_states: (B*T, H)
result.predicted_tokens  # (B*T,) int64 tensor

# Can combine with token_accuracy
loss_fn = LigerCrossEntropyLoss(
    return_token_accuracy=True,
    return_predicted_tokens=True,
)
result = loss_fn(logits, target)
result.token_accuracy    # scalar
result.predicted_tokens  # (B*T,) int64 tensor
```

> **Note:** `predicted_tokens` is returned as a flat `(B*T,)` tensor, matching the input shape convention of the cross-entropy API (which expects `(B*T, V)` logits and `(B*T,)` targets, consistent with `torch.nn.CrossEntropyLoss`). Reshape as needed:
> ```python
> result.predicted_tokens.view(B, T)
> ```

## Testing Done

- Hardware Type: NVIDIA GPU
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence

### New/updated tests:

- `test_correctness_with_predicted_tokens` (cross-entropy): Verifies predicted tokens match reference argmax, ignored tokens are `-1`, backward works. Tests multiple dtypes, shapes, and ignore indices.
- `test_correctness_with_predicted_tokens` (fused linear cross-entropy): Same coverage with logit-value comparison (handles chunked bfloat16 matmul tie-breaking).
- `test_liger_cross_entropy_structured_output`: Extended to parametrize `return_predicted_tokens` across all 8 combinations of `(return_z_loss, return_token_accuracy, return_predicted_tokens)`. Includes consistency check between `predicted_tokens` and `token_accuracy` when both are enabled.

Co-authored-by: Chun-Mao (Michael) Lai <72752478+Mecoli1219@users.noreply.github.com>
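The design described above relies on tracking a running argmax while the vocabulary is scanned in blocks. The following plain-Python sketch illustrates that general idea, including the `-1` sentinel for ignored positions. It is not the Triton kernel: `streaming_argmax`, `predicted_tokens`, the `block_size` parameter, and the `-100` ignore index are all assumptions chosen for illustration.

```python
from typing import List, Sequence

IGNORED = -1  # sentinel for ignored positions, as in the description above


def streaming_argmax(row: Sequence[float], block_size: int) -> int:
    """Argmax of `row`, looking at only `block_size` entries at a time."""
    best_value = float("-inf")
    best_index = -1
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        local = max(range(len(block)), key=block.__getitem__)
        if block[local] > best_value:  # strict '>' keeps the earliest index on ties
            best_value = block[local]
            best_index = start + local
    return best_index


def predicted_tokens(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    ignore_index: int = -100,  # assumed ignore index
    block_size: int = 2,       # illustrative block width
) -> List[int]:
    """Per-position argmax, with IGNORED written for ignored targets."""
    return [
        IGNORED if target == ignore_index else streaming_argmax(row, block_size)
        for row, target in zip(logits, targets)
    ]


logits = [
    [0.1, 2.0, -1.0],
    [3.0, 0.0, 0.5],
    [0.2, 0.1, 0.9],
    [1.0, 1.5, 0.0],  # ignored position
]
targets = [1, 2, 2, -100]
tokens = predicted_tokens(logits, targets)
print(tokens)
```

Because only the running best value and index are carried between blocks, the per-row state stays constant in size regardless of vocabulary width, which is why the prediction can be emitted alongside the loss without a second pass.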

kashif added 10 commits October 16, 2025 19:46

add return_token_accuracy flag to fused_linear_cross_entropy

a254769

rename to token_accuracy

b670b7d

return token_accuracy in transformer models

e872bf4

formatting

d11b24d

add missing output class

002c0ec

typos

d67c511

more typos

3a4a883

added test_correctness_with_token_accuracy

1e6da16

formatting

e9d0954

consistency

038035d

albertvillanova reviewed Oct 17, 2025

Comment thread src/liger_kernel/transformers/functional.py Outdated

albertvillanova reviewed Oct 17, 2025

Comment thread src/liger_kernel/transformers/functional.py Outdated

albertvillanova reviewed Oct 17, 2025

Comment thread src/liger_kernel/transformers/model/falcon_h1.py Outdated

kashif mentioned this pull request Oct 18, 2025

[SFT] Log mean token accuracy from Liger kernel huggingface/trl#4302

Merged

5 tasks

kashif added 3 commits October 20, 2025 11:46

use CrossEntropyOutput

2212623

Merge branch 'main' into mean_token_accuracy

33a999b

update qwen3 next

a50e03e

kashif and others added 5 commits October 20, 2025 12:04

formatting

338e70a

add missing return_dict

d1d9f52

Merge branch 'main' into mean_token_accuracy

c5857fd

Merge branch 'main' into mean_token_accuracy

ddfdb0b

Merge branch 'main' into mean_token_accuracy

f268c27

shimizust assigned vaibhavjindal Oct 28, 2025

kashif changed the title ~~[Cross-entropy-loss] add return_token_accuracy flag to fused_linear_cross_entropy~~ [Cross-entropy-loss] return mean token accuracy metric with CE loss Nov 1, 2025

Merge branch 'main' into mean_token_accuracy

704c3b4

kashif and others added 5 commits November 5, 2025 17:02

Merge branch 'main' into mean_token_accuracy

c6c2d27

checktyle fixes

181b11f

Merge branch 'main' into mean_token_accuracy

a06c5db

fix qwen3_vl

0069dcf

checkstyle

3ef06ee

vaibhavjindal added 3 commits November 5, 2025 14:49

Merge branch 'main' into mean_token_accuracy

f855c29

fix circular import

dd0790c

fix output classes for different transformers versions

d20e8b6

vaibhavjindal approved these changes Nov 5, 2025

vaibhavjindal merged commit 7dd8ecc into linkedin:main Nov 5, 2025
3 of 7 checks passed

baeseongsu mentioned this pull request Nov 11, 2025

[Qwen3]: TypeError: liger_fused_linear_cross_entropy() got an unexpected keyword argument 'return_dict' #925

Closed

kashif deleted the mean_token_accuracy branch November 22, 2025 22:09

yukiu00 mentioned this pull request Feb 10, 2026

Add return_predicted_tokens support for cross-entropy kernels #1091

Merged

3 tasks

4 participants
