Conversation

@Tomorrowdawn hi! I was wondering if it's a good idea to disable alphabet sampling (and other truncators) when nsigma is being used? We used to do this for mirostat as well, but it would disable all other samplers instead of just the truncators.

It's an honor to see this PR! nsigma and top-p/top-a/min-p definitely conflict, so combining them is not recommended. It doesn't conflict with top-k, though, since top-k only reduces the number of candidate tokens. I'm not sure what the practical requirements are, but technically this is how it works.

Correction: after more careful thought, the order matters. If you apply top-nsigma before any alphabet sampling, it won't cause errors; however, if you apply it after alphabet sampling, it will cause errors due to the negative infinities the truncators introduce (you will get a -inf mean and std).
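In numpy terms, the failure mode looks like this (a minimal sketch of the idea, not the PR's actual implementation):

```python
import numpy as np

def top_nsigma_mask(logits, n=1.0):
    # Keep tokens whose logit lies within n standard deviations of the max.
    return logits >= logits.max() - n * logits.std()

logits = np.array([5.0, 4.5, 1.0, 0.5])
mask = top_nsigma_mask(logits, n=1.0)  # [True, True, False, False]

# If a truncator runs first, discarded tokens are already -inf, and the
# statistics top-nsigma relies on become undefined:
truncated = np.array([5.0, 4.5, -np.inf, -np.inf])
with np.errstate(invalid="ignore"):
    bad_std = truncated.std()  # nan: the mean is -inf, so deviations are nan
```

So running top-nsigma first (on clean logits) is safe, while running it after a truncator poisons the mean/std it computes.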

The current order is penalties -> temperature -> top-n-sigma -> alphabet. We also allow users to apply temperature last if they wish. Do you recommend moving top-nsigma to before or after temperature? I had some users test this sampler offline, and they noted it performs better when temperature is applied before top-nsigma. I guess all that remains is whether to disable alphabet sampling when this one is applied -- it won't cause errors, but will it degrade generations?

It is the same before or after temperature scaling; see the temperature invariance result in Sec. 3.2. As for quality, I think the best choice is to let the user decide. All these methods are "denoising" the distribution, so the final result is bounded by the strictest sampler. At worst, it just tends toward greedy decoding.
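The invariance claim is easy to check numerically: dividing the logits by a positive temperature T scales both the max and the std by the same factor, so the set of surviving tokens is unchanged. A quick sketch, assuming the threshold form max − n·σ:

```python
import numpy as np

def nsigma_keep_set(logits, n=1.0):
    # Indices of tokens surviving the top-nsigma cut.
    return set(np.flatnonzero(logits >= logits.max() - n * logits.std()))

rng = np.random.default_rng(0)
logits = rng.normal(size=1000)
base = nsigma_keep_set(logits)

# Scaling by any positive temperature scales max and std identically,
# so the surviving set is the same -- unlike top-p, which is not invariant.
invariant = all(nsigma_keep_set(logits / t) == base for t in (0.25, 1.0, 4.0))
```

This is why the position relative to temperature only matters for the downstream softmax, not for which tokens get filtered.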

I suppose you're right. I'll add sampler ordering soon so users can control how this happens. Merging for now since it passes all tests.
Whoops, messed up the git command, but it's in 2242760 |
This PR adds the top-nσ sampling method from the paper "Top-nσ: Not All Logits Are You Need".
Top-nσ seems to be a novel sampling method that operates directly on pre-softmax logits by leveraging statistical properties of the distribution. Instead of using complex probability manipulations like top-p/top-k, it filters tokens based on their distance from the maximum logit in terms of standard deviations.
I haven't tested it much, but values should probably be between 0.0 (disabled) and 2.0.
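As a rough illustration of the method described above (a hedged sketch in numpy, not the code this PR adds), the filter-then-sample step might look like:

```python
import numpy as np

def top_nsigma_sample(logits, n=1.0, rng=None):
    # Sketch of the paper's idea: filter on raw pre-softmax logits, keeping
    # tokens within n standard deviations of the maximum logit, then softmax
    # over the survivors and sample.
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    masked = np.where(logits >= logits.max() - n * logits.std(), logits, -np.inf)
    probs = np.exp(masked - masked.max())  # exp(-inf) == 0 drops filtered tokens
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

# With n=1.0 on these logits only the first two tokens survive,
# so the sampled index is always 0 or 1.
token = top_nsigma_sample([5.0, 4.5, 1.0, 0.5], n=1.0,
                          rng=np.random.default_rng(1))
```

Smaller n is stricter (n = 0 collapses to greedy decoding), which matches the 0.0-to-2.0 range suggested above.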
TODO: