Introduce Knowledge Distillation Base by austin362667 · Pull Request #417 · linkedin/Liger-Kernel

austin362667 · 2024-12-02T09:48:04Z

Summary

Thanks to the nice suggestions from @Tcc0403 and @hongpeng-guo. This PR is the first split from #408, focusing solely on introducing the Knowledge Distillation base class. As a result, this PR does not include any tests at the moment.

Code Changes

1. Refactor beta into two weights, weight_hard_loss and weight_soft_loss, which serve as the coefficients of hard_loss and soft_loss. @Tcc0403 also pointed out that we could use torch.lerp if applicable.
2. Pass teacher_logits and student_logits directly to the divergence loss function. This avoids the redundant computation of converting logits to log probabilities and then reverting them to raw logits. Note, however, that we do not reuse the student_log_probs value calculated during ce_loss in the distillation base.
   1. Remove the unnecessary get_batch_logps in test/utils.py.
3. Modify the chunking dimension from B to B * T. Thanks to @hongpeng-guo's great advice.
   1. Fix the loss calculation to use per-token values instead of averaging across the sequence length dimension.
4. Normalize the distillation_loss using (full_target != ignore_index).sum().
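As a rough illustration of the weighting and normalization items above, here is a minimal plain-Python sketch (lists stand in for tensors). The helper names, the default weights of 0.5, and the `-100` sentinel are assumptions made for this example and are not taken from the PR. It also assumes that ignored positions are masked out of the sum, which the description does not state explicitly.

```python
IGNORE_INDEX = -100  # assumed sentinel for padded/masked target positions


def combine_losses(hard_loss, soft_loss, weight_hard_loss=0.5, weight_soft_loss=0.5):
    """Weighted sum of the supervised (hard) and distillation (soft) terms."""
    return weight_hard_loss * hard_loss + weight_soft_loss * soft_loss


def normalize_by_valid_tokens(per_token_losses, full_target, ignore_index=IGNORE_INDEX):
    """Sum per-token losses over non-ignored positions and divide by their count."""
    valid = [t != ignore_index for t in full_target]
    n_valid = sum(valid)  # plays the role of (full_target != ignore_index).sum()
    if n_valid == 0:
        return 0.0
    total = sum(loss for loss, keep in zip(per_token_losses, valid) if keep)
    return total / n_valid


# Flattened (B * T) targets: two of the five positions are padding.
full_target = [3, 7, IGNORE_INDEX, 1, IGNORE_INDEX]
per_token_soft = [0.2, 0.4, 9.9, 0.6, 9.9]  # values at ignored positions are dropped

soft_loss = normalize_by_valid_tokens(per_token_soft, full_target)
total_loss = combine_losses(hard_loss=1.0, soft_loss=soft_loss)
```

When the two weights sum to 1, the weighted sum equals the linear interpolation `hard + weight_soft_loss * (soft - hard)`, which is the case where `torch.lerp` could apply.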

TODO

Although a slight slowdown is reasonable, we need to investigate why this PR's implementation is significantly slower than the naive approach. Thanks to @Tcc0403's clarification.

The issue arises because we are not properly configuring the chunk_size for the B * T dimension, which is extremely large (a few thousand). The previous default of 1 results in an excessive number of chunks.

In contrast, this problem does not occur with the preference loss, as chunking is performed on the B dimension. This produces fewer than 10 chunks, which is efficient and works as expected.

In conclusion, setting chunk_size to 1024 works well in the new benchmark results, as shown in Add JSD Loss for Distillation #425.
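To make the chunk-count argument concrete, here is a small sketch. The shapes `B = 8` and `T = 512` are hypothetical, and the ceiling-division rule is an assumption about how rows are split, not a description of the library's internals.

```python
import math


def num_chunks(num_rows, chunk_size):
    """Number of chunks needed to cover num_rows rows, chunk_size rows at a time."""
    return math.ceil(num_rows / chunk_size)


B, T = 8, 512  # hypothetical batch size and sequence length

# Chunking over the flattened B * T token dimension.
chunks_tokens_size_1 = num_chunks(B * T, 1)        # one chunk per token
chunks_tokens_size_1024 = num_chunks(B * T, 1024)  # a handful of chunks

# Chunking over the batch dimension only, as the preference loss does.
chunks_batch_size_1 = num_chunks(B, 1)

print(chunks_tokens_size_1, chunks_tokens_size_1024, chunks_batch_size_1)
```

With these shapes, a chunk size of 1 over `B * T` yields thousands of tiny chunks, while the same setting over `B` stays under ten.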

Knowledge Distillation

Knowledge Distillation (KD; Hinton et al. 2015, Gou et al. 2020) is a straightforward way to build a smaller, cheaper model (“student model”) to speed up inference by transferring skills from a pre-trained expensive model (“teacher model”) into the student.

In knowledge distillation, a student model is trained to replicate the outputs of a teacher model using a distillation loss. Neural networks typically include a softmax layer; for instance, a large language model produces a probability distribution over tokens. Let z_t and z_s represent the logits before the softmax layer for the teacher and student models, respectively. The distillation loss reduces the discrepancy between the two softmax outputs at a high temperature T. When ground truth labels y are available, this approach can be combined with a supervised learning objective, such as cross-entropy, to compare the student’s outputs with the ground truth.

The combined loss function is defined as:

$$\mathcal{L}_{\text{knowledge distillation}} = w_{\text{soft}} \cdot \mathcal{L}_{\text{distill}}(\mathbf{z_t}, \mathbf{z_s}, T) + w_{\text{hard}} \cdot \mathcal{L}_{\text{cross entropy}}(\mathbf{y}, \mathbf{z_s})$$

Here, we directly pass in logits rather than logprobs. @Tcc0403
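The combined loss above can be sketched for a single token position in plain Python. This example uses forward KL between the temperature-scaled teacher and student distributions as the distillation term. That choice is an assumption for illustration only, since the divergence is meant to be pluggable. The function names, the default temperature, and the default weights are likewise hypothetical.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - lse for s in scaled]


def cross_entropy(student_logits, label):
    """Hard loss: negative log-probability of the ground-truth label."""
    return -log_softmax(student_logits)[label]


def kl_distill(teacher_logits, student_logits, temperature):
    """Soft loss (illustrative choice): KL(teacher || student) at temperature T."""
    t_logp = log_softmax(teacher_logits, temperature)
    s_logp = log_softmax(student_logits, temperature)
    return sum(math.exp(tp) * (tp - sp) for tp, sp in zip(t_logp, s_logp))


def kd_loss(teacher_logits, student_logits, label,
            temperature=2.0, w_soft=0.5, w_hard=0.5):
    """w_soft * L_distill(z_t, z_s, T) + w_hard * L_cross_entropy(y, z_s)."""
    soft = kl_distill(teacher_logits, student_logits, temperature)
    hard = cross_entropy(student_logits, label)
    return w_soft * soft + w_hard * hard


z_t = [2.0, 1.0, 0.1]  # teacher logits
z_s = [1.5, 1.2, 0.3]  # student logits
loss = kd_loss(z_t, z_s, label=0)
```

Both terms take raw logits as input, matching the note that logits rather than logprobs are passed in. The soft term vanishes when the student's logits equal the teacher's.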

Shared `DistillationBase`

To support various distillation learning objectives, this PR aims to add a LigerFusedLinearDistillationBase which is basically same as propose by @hongpeng-guo within this discussion #371 (comment). Thank you @hongpeng-guo for thinking through this.

Testing Done

I'll post JSD tests and benchmark results in the next PR: #425

- Hardware Type: L40S
- [ ] run make test to ensure correctness
- [ ] run make checkstyle to ensure code style
- [ ] run make test-convergence to ensure convergence

winglian · 2024-12-05T14:43:11Z

+            hard_loss,
+        ) = forward_output
+
+        soft_loss = self.distillation_loss(student_logits, teacher_logits)


The method uses logprobs: def distillation_loss(self, student_logps, teacher_logps): but you pass logits here.

I'd actually like to see both a logit and logprob implementation since it's easy to get logprobs offline from vllm and that is a faster way to generate the dataset.

> The method uses logprobs: def distillation_loss(self, student_logps, teacher_logps): but you pass logits here.

@winglian Nice catch! Thank you so much.

> I'd actually like to see both a logit and logprob implementation since it's easy to get logprobs offline from vllm and that is a faster way to generate the dataset.

Sure, I think it's doable. That said, I'm not sure I fully understand the need for a logprobs implementation. Would you mind elaborating on the vLLM use case?

So rather than having to keep the teacher model loaded during training, depending on the workload type, it can be faster and more compute-efficient to pre-compute the logits/logprobs offline beforehand. However, vllm and sglang only provide the logprobs, and those are not easily back-calculated to logits.
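One way to see why logprobs cannot be mapped back to the original logits: log-softmax discards any constant offset, so many logit vectors share the same logprobs. The plain-Python sketch below illustrates this. It also shows that logprobs over the full vocabulary are themselves a valid set of logits for the same distribution. Whether an inference server returns the full vocabulary is a separate question that this example does not address.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


logits = [2.0, 1.0, 0.1]
shifted = [z + 5.0 for z in logits]  # a different logit vector

logprobs = log_softmax(logits)
logprobs_shifted = log_softmax(shifted)

# Both logit vectors yield the same logprobs, so the offset cannot be recovered.
max_gap = max(abs(a - b) for a, b in zip(logprobs, logprobs_shifted))

# Applying log-softmax to full-vocabulary logprobs returns them unchanged.
roundtrip = log_softmax(logprobs)
roundtrip_gap = max(abs(a - b) for a, b in zip(logprobs, roundtrip))
```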

I see. That makes a lot of sense to me. Thank you!

@winglian curious if vllm/sglang support temperature scaled logprobs. This would be needed to enable https://github.com/huggingface/trl/blob/9c5388b69e0842f76edc46a2ff9d0b51e1db4337/trl/trainer/gkd_trainer.py#L174

I believe we can address this ask in a subsequent PR
@ByronHsu what do you think?

austin362667 · 2024-12-06T03:21:40Z

Although a slight slowdown is reasonable, we need to investigate why this PR's implementation is significantly slower than the naive approach. Thanks to @Tcc0403's clarification.

The issue arises because we are not properly configuring the chunk_size for the B * T dimension, which is extremely large (a few thousand). The previous default of 1 results in an excessive number of chunks.

In contrast, this problem does not occur with the preference loss, as chunking is performed on the B dimension. This produces fewer than 10 chunks, which is efficient and works as expected.

In conclusion, setting chunk_size to 1024 works well in the new benchmark results, as shown in #425.

hongpeng-guo

@austin362667 FWIW, to run the Modal GPU CIs, this PR needs to be made from the main repo, i.e., linkedin/Liger-Kernel, instead of the forked repo.
A similar example is: I closed #399 and moved to #400 to enable the CI pipeline.

shivam15s

LGTM

shivam15s

can you create another PR in linkedin? Some tests fail for me locally so I'd like to confirm before merging

austin362667 · 2024-12-07T03:34:12Z

@shivam15s Certainly, right here: #432. Thanks a lot!

austin362667 · 2024-12-07T03:35:38Z

Move discussion to #432

@Tcc0403

## Summary

> [!CAUTION]
> This PR depends on #417. Do not merge until #417 (later #432) is merged.

This is a pure torch compiled, chunked fused linear JSD Loss, aiming for knowledge distillation.

#### Jensen-Shannon Divergence Loss

This PR implements Jensen-Shannon Divergence (JSD) loss as the soft learning objective in a distillation setting (teacher & student). This component can be replaced with other losses (e.g., KL divergence) as `distillation_loss_fn`.

JSD is defined as the average of the KL divergences between each distribution and the mean distribution:

```math
\text{JSD}(P || Q) = \frac{1}{2} \text{KL}(P || M) + \frac{1}{2} \text{KL}(Q || M), \quad \text{where } M = \frac{1}{2}(P + Q)
```

Here, `P` and `Q` are the two probability distributions, and `M` is their average.

## Testing Done

Below figures are benchmark results with different `chunk_size`, which also significantly affects performance.

#### Hint:

User can tune their `chunk_size` as suggested by the liger [paper](https://arxiv.org/pdf/2306.13649) for the moment:

```math
2^{\lceil \log_2 \lceil \frac{BT}{V/H} \rceil \rceil}
```

#### Memory

1. `chunk_size` = 1
   ![distill_jsd_loss_memory_chunk_size_1](https://github.com/user-attachments/assets/e00b2044-e075-4e34-b302-3808f7216837)
2. `chunk_size` = 1024
   ![distill_jsd_loss_memory_chunk_size_1024](https://github.com/user-attachments/assets/abe9fe17-726c-4fd0-899f-5d0e563ceb05)

#### Speed (Elapsed Time)

1. `chunk_size` = 1
   ![distill_jsd_loss_speed_chunk_size_1](https://github.com/user-attachments/assets/e2da495e-ff20-4e63-b7df-d6e1837774c8)
2. `chunk_size` = 1024
   ![distill_jsd_loss_speed_chunk_size_1024](https://github.com/user-attachments/assets/c2767754-a984-4f11-b5a1-cb21e8117ef6)

- Hardware Type: NVIDIA H100 80GB HBM3 (SXM5)
- [X] run `make test` to ensure correctness
- [X] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence

---------

Signed-off-by: Austin Liu <austin362667@gmail.com>
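The JSD definition and the `chunk_size` hint above can be sketched in plain Python as follows. Lists of probabilities stand in for tensors. The function names and the example shapes (`B = 8`, `T = 512`, `V = 32000`, `H = 4096`) are hypothetical and chosen only for illustration.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: mean of KL(p || m) and KL(q || m), m = (p + q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def suggested_chunk_size(B, T, V, H):
    """Evaluate 2 ** ceil(log2(ceil(B * T / (V / H)))) from the hint above."""
    inner = math.ceil((B * T) / (V / H))
    return 2 ** math.ceil(math.log2(inner))


teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]

divergence = jsd(teacher, student)
chunk_size = suggested_chunk_size(B=8, T=512, V=32000, H=4096)
```

Unlike KL divergence, JSD is symmetric in its arguments and bounded above by `log 2` (in nats), which it reaches for distributions with disjoint support.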

austin362667 mentioned this pull request Dec 2, 2024

Introduce Distillation with a Chunked, Fused Linear JS-divergence Loss #408

Closed

5 tasks

austin362667 changed the title ~~Feat/distill base~~ Introduce Knowledge Distillation Base Dec 2, 2024

austin362667 force-pushed the feat/distill_base branch 2 times, most recently from 5257d26 to 3a9f125 Compare December 4, 2024 15:55

austin362667 mentioned this pull request Dec 4, 2024

Add JSD Loss for Distillation #425

Merged

3 tasks

winglian reviewed Dec 5, 2024

austin362667 and others added 9 commits December 7, 2024 00:03

Add liger and naive distill base

0769e97

Signed-off-by: Austin Liu <austin362667@gmail.com>

Format

a81c959

Signed-off-by: Austin Liu <austin362667@gmail.com>

Refactor beta

e13994a

Signed-off-by: Austin Liu <austin362667@gmail.com>

Remove imports

720b5cb

Signed-off-by: Austin Liu <austin362667@gmail.com>

Fix distill base chunk_size scaling

17c5b33

Signed-off-by: Austin Liu <austin362667@gmail.com> Set default `chunk_size` to `1024` Signed-off-by: Austin Liu <austin362667@gmail.com> Rebase Signed-off-by: Austin Liu <austin362667@gmail.com>

Fix chunk division

e3dada0

Signed-off-by: Austin Liu <austin362667@gmail.com>

Remove chunk arg

5662554

Signed-off-by: Austin Liu <austin362667@gmail.com>

Fix distillation_loss arg typo

7acb5ca

Signed-off-by: Austin Liu <austin362667@gmail.com>

use torch no grad and change normalization term

e381569

shivam15s force-pushed the feat/distill_base branch from 4ada908 to e381569 Compare December 7, 2024 00:03

hongpeng-guo reviewed Dec 7, 2024

shivam15s added 2 commits December 7, 2024 00:33

rearrange fns for readability

8aa842a

add no grad in tests

3561525

shivam15s approved these changes Dec 7, 2024

Merge branch 'main' into feat/distill_base

076c220

shivam15s requested changes Dec 7, 2024

austin362667 mentioned this pull request Dec 7, 2024

Introduce Knowledge Distillation Base #432

Merged

5 tasks

austin362667 closed this Dec 7, 2024

austin362667 mentioned this pull request Dec 9, 2024

Support offline logits for teacher model #441

Open

austin362667 wants to merge 12 commits into linkedin:main from austin362667:feat/distill_base
