Add FusedLinearJSD by Tcc0403 · Pull Request #300 · linkedin/Liger-Kernel

Tcc0403 · 2024-10-08T23:44:53Z

Summary

Similar to the fused linear CE (cross entropy).

It fuses the final linear layer with the JSD loss across the forward and backward passes, which avoids materializing the large logits tensor. Since JSD is the last layer, the gradient can be computed during the forward pass.
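To give a sense of the scale involved, here is a back-of-the-envelope sketch of how large the logits tensor gets for the vocab size used in the testing setup (128256). The row count BT, the chunk size, and fp32 storage are illustrative assumptions, not values taken from this PR.

```python
# Rough memory estimate for a (rows x vocab) logits tensor.
# V comes from the test setup in this PR; BT, CHUNK_ROWS and fp32
# storage are hypothetical values chosen only for illustration.

V = 128256          # vocab size
BT = 8192           # assumed batch_size * seq_len
CHUNK_ROWS = 256    # assumed number of rows processed per chunk
BYTES_PER_ELEM = 4  # assumed fp32 storage


def logits_bytes(rows: int, vocab: int, bytes_per_elem: int) -> int:
    """Bytes needed to hold a rows x vocab logits tensor."""
    return rows * vocab * bytes_per_elem


full = logits_bytes(BT, V, BYTES_PER_ELEM)
chunk = logits_bytes(CHUNK_ROWS, V, BYTES_PER_ELEM)

print(f"full logits:  {full / 2**30:.2f} GiB")
print(f"chunk logits: {chunk / 2**20:.2f} MiB")
```

Under these assumptions, only one chunk of logits has to be alive at a time, which is where the memory saving of a chunked, fused approach comes from.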

Testing Done

Hidden size: 4096, Vocab size: 128256

Hardware Type: NVIDIA H100 80GB HBM3 (SXM5)
run make test to ensure correctness
run make checkstyle to ensure code style
run make test-convergence to ensure convergence

Tcc0403 · 2024-10-09T13:42:34Z

+                out=grad_weight,
+            )
+
+    loss = torch.sum(loss_1d) / BT


I just noticed that torch.sum might overflow when BT is large. It can be fixed by either:

1. dividing before summing, i.e. torch.sum(loss_1d / BT), or
2. moving the division inside the JSD kernel.

I prefer the second solution, but I'm not sure whether it is OK to modify another kernel in this PR.
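The overflow concern can be illustrated with a small pure-Python emulation of half precision. This is only a sketch: it uses `struct`'s `e` format to round to fp16 instead of torch, it assumes the sum is accumulated in higher precision and then stored in fp16, and the row count and per-row loss values are made up. Actual torch.sum behavior depends on dtype and device.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE half-precision value (inf on overflow)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # struct refuses to pack finite values beyond the fp16 range
        return math.copysign(math.inf, x)


BT = 100_000          # hypothetical number of rows
loss_1d = [1.0] * BT  # hypothetical per-row losses

# Sum first, then divide: the sum (100000) exceeds the largest finite
# fp16 value (65504), so it is stored as inf and stays inf after division.
sum_then_div = to_fp16(to_fp16(sum(loss_1d)) / BT)

# Divide each row first, then sum: every term and the final sum stay
# well inside the fp16 range.
div_then_sum = to_fp16(sum(to_fp16(v / BT) for v in loss_1d))

print(sum_then_div)  # inf
print(div_then_sum)  # close to 1.0, with some fp16 rounding error
```

Dividing early trades a little rounding error per term for a result that stays finite, which is the same effect as doing the division inside the kernel.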

Can the existing division in JSD handle this? 🤔 I saw you changed the n_row for each chunk to be BT now.

Currently the n_row parameter in JSD is only used for the gradient calculation.
It could be modified to work like cross_entropy does, which computes the loss according to the expected reduction method:

Liger-Kernel/src/liger_kernel/ops/cross_entropy.py

Line 143 in de12602

loss = loss / n_non_ignore

I saw you change the n_row for each chunk to be BT now.

Simply passing BT gives the correct result without the extra alpha scaling that flce needs:

Liger-Kernel/src/liger_kernel/ops/fused_linear_cross_entropy.py

Line 112 in de12602

alpha = n_non_ignore / total_n_non_ignore if total_n_non_ignore > 0 else 0.0
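The two normalization styles discussed above can be compared with a minimal sketch. The per-row losses and chunk size are hypothetical, and since there is no ignore index here, the chunk row count stands in for the non-ignored count used in the alpha expression.

```python
# Hypothetical per-row losses for BT = 10 rows, processed in chunks of 4.
loss_rows = [0.2, 0.4, 0.1, 0.3, 0.5, 0.7, 0.6, 0.9, 0.8, 0.05]
BT = len(loss_rows)
CHUNK = 4

chunks = [loss_rows[i:i + CHUNK] for i in range(0, BT, CHUNK)]

# Reference: mean over all rows at once.
reference = sum(loss_rows) / BT

# Style 1: every chunk divides by the global BT, so chunk results simply add up.
global_norm = sum(sum(c) / BT for c in chunks)

# Style 2: every chunk takes its own mean, then is rescaled by
# alpha = chunk_rows / total_rows before being accumulated.
alpha_norm = sum((sum(c) / len(c)) * (len(c) / BT) for c in chunks)

print(reference, global_norm, alpha_norm)
```

Both styles agree with the reference up to floating-point rounding; passing the global BT just skips the per-chunk rescaling step.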

Can the existing division in JSD handle this?

It has a potential overflow issue too; that's one of the reasons why I think moving the division into the kernel is better.

Gotcha. I'm good with moving it inside the JSD kernel. For ce/flce, let's keep them for now. Thanks!

qingquansong

LGTM! We can add the ignore index later similar to here so it can be easily used for the SFT context. Great work!

qingquansong · 2024-10-09T22:29:09Z

+                out=grad_weight,
+            )
+
+    loss = torch.sum(loss_1d) / BT


Can the existing division in JSD handle this? 🤔 I saw you changed the n_row for each chunk to be BT now.

qingquansong · 2024-10-09T22:33:33Z

+
+
+@triton.jit
+def element_mul_kernel(


Maybe we can delete the one in the original ce and share this.

Yeah, I can add the new ce file.

Tcc0403 added 10 commits October 9, 2024 03:13

Add fused_linear_jsd draft

ebb5a9f

Fix formula in fl_jsd

e1c61f5

Fix forward pass

57d0e7b

Fix loss_1d shape

227518c

Fix backward pass

eb9d885

Fix test to avoid exploding gradients

f7459ba

Update test config

407928d

Cleanup

ee10f0c

Make checkstyle

20b5581

Modify student's hidden size to show hidden sizes can be different

ec16230

Tcc0403 marked this pull request as ready for review October 9, 2024 01:51

Tcc0403 and others added 3 commits October 9, 2024 10:11

Add the fused_linear_jsd benchmark script

62b8fc7

Lower range of x_label in benchmark to avoid OOM and add benchmark data

3459bb1

Remove low BT data in benchmark

ecd8b88


qingquansong previously approved these changes Oct 9, 2024


Tcc0403 added 2 commits October 10, 2024 08:47

Move division into kernel

5d621c6

Move element_mul_kernel into util

efb7f72

Tcc0403 dismissed qingquansong’s stale review via efb7f72 October 10, 2024 00:48

Set scalar in test cases to 1.0

38215ce

qingquansong approved these changes Oct 10, 2024


Merge branch 'main' into fl-jsd

b947ec7

qingquansong merged commit ff6650b into linkedin:main Oct 11, 2024

ByronHsu mentioned this pull request Oct 31, 2024: 2024 Q4 Roadmap #285 (Open)

Tcc0403 deleted the fl-jsd branch December 1, 2024 03:12

Tcc0403 mentioned this pull request Dec 1, 2024: Introduce Distillation with a Chunked, Fused Linear JS-divergence Loss #408 (Closed, 5 tasks)

qingquansong merged 17 commits into linkedin:main from Tcc0403:fl-jsd
