
add repetition penalty support #5703

Closed
XiaobingSuper wants to merge 6 commits into sgl-project:main from XiaobingSuper:xiaobing/repetition_penalty

Conversation

XiaobingSuper commented Apr 24, 2025

Motivation

This PR adds repetition penalty support.
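For context, the repetition penalty (as defined in HF Transformers) rescales the logits of every token that has already appeared in the sequence, prompt included. A minimal standalone sketch of that rule, with illustrative names rather than the actual sglang code:

```python
import torch

def apply_repetition_penalty(
    logits: torch.Tensor,           # [vocab_size] next-token logits
    seen_token_ids: torch.Tensor,   # 1-D tensor of token ids already in the sequence
    penalty: float,                 # > 1.0 discourages repetition; 1.0 is a no-op
) -> torch.Tensor:
    scores = logits[seen_token_ids]
    # Positive logits shrink (divide); negative logits grow more negative (multiply).
    logits[seen_token_ids] = torch.where(scores > 0, scores / penalty, scores * penalty)
    return logits
```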

Modifications

Checklist

XiaobingSuper (Author) commented Apr 24, 2025

@merrymercy, I re-added the repetition penalty support that was removed in #3988. I found this issue while comparing sglang output against HF output with a repetition penalty applied. Please help review it. Thank you.

merrymercy (Contributor) commented Apr 26, 2025

Why do you need this? The OpenAI API does not provide this functionality. Why are the frequency and presence penalties not enough?

XiaobingSuper (Author) commented Apr 27, 2025

@merrymercy, one of my use cases does offline generation and needs to align with HF output; another mainly uses Python requests. Do you mean that using the frequency and presence penalties would give me the same behavior as the repetition penalty method?

@merrymercy (Contributor)

Can you disable repetition penalty for your HF use cases?

@XiaobingSuper (Author)

> Can you disable repetition penalty for your HF use cases?

I can, but I don't think we should rule out this use case for our users.

@merrymercy (Contributor)

Makes sense. We can merge this. Can you add some test cases here?

```python
def test_frequency_penalty(self):
    self.run_decode({"frequency_penalty": 2})

def test_min_new_tokens(self):
    self.run_decode({"min_new_tokens": 16})

def test_presence_penalty(self):
    self.run_decode({"presence_penalty": 2})
```

XiaobingSuper force-pushed the xiaobing/repetition_penalty branch from e78a8e4 to 9104dbd on April 27, 2025 06:13
@XiaobingSuper (Author)

> Makes sense. We can merge this. Can you add some test cases here?

Done.

@XiaobingSuper (Author)

@merrymercy can we merge it?

@XiaobingSuper (Author)

@merrymercy

@XiaobingSuper (Author)

@merrymercy, could you help review it? Thanks!

THU-LIJX commented Jun 6, 2025

@merrymercy We initially used vllm for model inference, and recently we planned to integrate sglang so that users can choose between vllm and sglang for inference. However, we previously set the repetition_penalty parameter for some tasks when using vllm, and since sglang currently does not support this parameter, it is difficult for us to align the results between vllm and sglang. We hope the repetition penalty feature can be reintegrated into sglang. Thanks!

syskn commented Aug 12, 2025

Is this going to be merged? The HF repetition penalty and the OpenAI frequency/presence penalties have highly different behavior (mainly because the HF repetition penalty takes the whole prefill context into account, while the frequency/presence penalties only take currently generated tokens into account), and it is quite a pain having to manually merge this for every SGLang version for our use case.
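To make the behavioral gap concrete, here is a minimal standalone sketch (all names are illustrative; this is not the sglang or HF implementation):

```python
import torch

def hf_repetition_penalty(logits, prompt_ids, output_ids, penalty):
    # Multiplicative penalty over every token seen so far, prompt included.
    seen = torch.cat([prompt_ids, output_ids]).unique()
    scores = logits[seen]
    logits[seen] = torch.where(scores > 0, scores / penalty, scores * penalty)
    return logits

def openai_freq_presence(logits, output_ids, freq_pen, pres_pen):
    # Additive penalties over generated tokens only; the prompt is ignored.
    counts = torch.bincount(output_ids, minlength=logits.numel()).to(logits.dtype)
    logits -= freq_pen * counts                          # grows with repeat count
    logits -= pres_pen * (counts > 0).to(logits.dtype)   # flat once a token appears
    return logits
```

The key asymmetry: the multiplicative penalty sees `prompt_ids`, while the additive penalties never do, so outputs diverge whenever the prompt shares tokens with likely continuations.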

@XiaobingSuper (Author)

cc @merrymercy

@junliu-mde (Contributor)

I support this PR. For small models, or those without sufficient SFT, having a way to keep the model from repeating itself is still quite necessary.

liguodongiot commented Oct 29, 2025

@XiaobingSuper Hi, currently the repetition penalty only considers the generated text, not the original input text. Both HF Transformers and vLLM also consider the original input text.

Modify as follows:

```python
import torch

class BatchedRepetitionPenalizer(_BatchedPenalizer):

    def _prepare(self):
        # Seed the cumulated penalties with the prompt tokens so that the
        # original input text is penalized too, matching HF/vLLM behavior.
        batch_cumulated_repetition_penalties = []
        for req in self.orchestrator.reqs():
            cumulated_repetition_penalties_lst = [1] * self.orchestrator.vocab_size
            for idx in req.origin_input_ids:
                cumulated_repetition_penalties_lst[idx] = req.sampling_params.repetition_penalty
            batch_cumulated_repetition_penalties.append(cumulated_repetition_penalties_lst)

        self.cumulated_repetition_penalties = torch.tensor(
            data=batch_cumulated_repetition_penalties,
            dtype=torch.float32,
            device=self.orchestrator.device,
        )

        # Per-request penalty values, shaped [batch_size, 1] for broadcasting.
        self.repetition_penalties = torch.tensor(
            data=[
                req.sampling_params.repetition_penalty
                for req in self.orchestrator.reqs()
            ],
            dtype=torch.float32,
            device=self.orchestrator.device,
        ).unsqueeze_(1)
```
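For completeness, the prepared tensor would then be consumed in the penalizer's apply step, roughly like the sketch below (modeled on the usual HF-style rule, not copied from sglang):

```python
def _apply(self, logits: torch.Tensor) -> torch.Tensor:
    # Divide positive logits by the cumulated penalty and multiply negative
    # ones, making every previously seen token less likely either way.
    return torch.where(
        logits > 0,
        logits / self.cumulated_repetition_penalties,
        logits * self.cumulated_repetition_penalties,
    )
```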

@hnyls2002 (Collaborator)

Inactive and a duplicate of #21258.

hnyls2002 closed this Mar 30, 2026