
feat: implement top-nsigma sampling method#825

Closed
AlpinDale wants to merge 7 commits into main from top-nsigma

Conversation

@AlpinDale
Member

This PR adds the top-nσ sampling method from the paper "Top-nσ: Not All Logits Are You Need".

Top-nσ seems to be a novel sampling method that operates directly on pre-softmax logits by leveraging statistical properties of the distribution. Instead of using complex probability manipulations like top-p/top-k, it filters tokens based on their distance from the maximum logit in terms of standard deviations.
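As a rough illustration of the rule described above (a minimal NumPy sketch, not the PR's actual implementation; the function name and signature are made up here):

```python
import numpy as np

def top_nsigma_filter(logits: np.ndarray, nsigma: float) -> np.ndarray:
    """Mask out tokens whose logit falls more than `nsigma` standard
    deviations below the maximum logit (applied pre-softmax)."""
    threshold = logits.max() - nsigma * logits.std()
    return np.where(logits >= threshold, logits, -np.inf)
```

With a small `nsigma`, only tokens close to the maximum logit survive; larger values keep more of the tail.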

I haven't tested it much, but values should be between 0.0 (disabled) and 2.0, probably.

TODO:

  • See if this sampler should disable all other truncation samplers (top-p/k/a, min_p)
  • Write tests

@AlpinDale
Member Author

@Tomorrowdawn hi! I was wondering if it's a good idea to disable alphabet sampling (and other truncators) when nsigma is being used? We used to do this for mirostat as well, but it would disable all other samplers instead of just truncators.

@Tomorrowdawn

Tomorrowdawn commented Nov 20, 2024

It's an honor to see this PR! nsigma and top-p/top-a/min-p definitely conflict, so combining them is not recommended. It doesn't conflict with top-k, though, since k typically only reduces the number of available tokens. I'm not sure what the practical requirements are, but technically this is how it works.

@Tomorrowdawn

Correction: After careful thought, the order is important. If you apply top-nsigma before all the alphabet samplers, it won't cause errors; however, if you apply top-nsigma after alphabet sampling, it will cause errors due to the negative infinities they introduce (you will get a -inf mean and std).
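This failure mode is easy to reproduce: truncation samplers typically mask rejected tokens to -inf, which then poisons the statistics top-nsigma depends on (a small NumPy illustration, not code from this PR):

```python
import numpy as np

# Logits after a prior truncation step (e.g. top-k) has masked rejects to -inf.
masked = np.array([3.0, 2.5, -np.inf, -np.inf])

mean = masked.mean()  # -inf
std = masked.std()    # nan (inf - inf during the variance computation)
# Any nsigma threshold derived from these values is meaningless,
# so top-nsigma has to run before the other truncation samplers.
```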

@AlpinDale
Member Author

AlpinDale commented Nov 20, 2024

The current order is penalties -> temperature -> top-n-sigma -> alphabet. We also allow users to perform temperature sampling last if they wish. Do you recommend moving it to before or after temperature?

I had some users test this sampler offline, and they noted it performs better when temperature is applied before top-nsigma. I guess all that remains is whether to disable alphabet sampling when this one is applied -- it won't cause errors, but will it cause degraded generations?

@Tomorrowdawn

It is the same before or after temperature scaling; see the temperature invariance in Sec. 3.2.
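The invariance holds because for T > 0, max(x/T) − n·std(x/T) = (max(x) − n·std(x)) / T, so the set of surviving tokens is unchanged. A quick numerical check (illustrative sketch only, helper name is made up):

```python
import numpy as np

def nsigma_mask(logits: np.ndarray, nsigma: float) -> np.ndarray:
    """Boolean mask of tokens that survive top-nsigma filtering."""
    return logits >= logits.max() - nsigma * logits.std()

logits = np.array([4.0, 3.0, 1.0, -2.0])
for temp in (0.5, 0.7, 1.5):
    # Dividing by a positive temperature scales max and std equally,
    # so the mask is identical.
    assert np.array_equal(nsigma_mask(logits, 1.0),
                          nsigma_mask(logits / temp, 1.0))
```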

As for the quality concern: I think the best choice is to let the user decide. All these methods "denoise" the distribution, so the final result is bounded by the strictest sampler; at worst, it just tends toward greedy decoding.

@AlpinDale AlpinDale marked this pull request as ready for review November 20, 2024 03:27
@AlpinDale
Member Author

I suppose you're right. I'll add sampler order soon so the users can control how this happens. Merging for now as it passes all tests.

@AlpinDale
Member Author

AlpinDale commented Nov 20, 2024

Whoops, messed up the git command, but it's in 2242760
