Improve block weighting with uniform and hat functions #147
markus583 merged 1 commit into segment-any-text:main from
Conversation
Hi! Thanks a lot for implementing this. Interesting idea, cool stuff! It intuitively makes sense, but I'm unsure whether it makes a practical difference. It would be interesting to test it on some benchmarks. For the time being, I'd be happy to add it as a feature and leave the default to `uniform`.
LGTM! I tried this idea a while ago (when I was working on the original WtP) and didn't see improvements on benchmarks, but maybe it helps on other model/benchmark combinations. I agree that it intuitively makes total sense. So let's add it and leave the default to `uniform` as you suggested, @markus583.
Thanks for the reviews @markus583 @bminixhofer! For your information, I created an inference-only version of wtpsplit called wtpsplit-lite, with minimal dependencies, to make it easier to integrate SaT into projects that only need inference. Thanks for your work!
Cool, thanks for letting us know, and keep up the good work! :)
This PR makes the current uniform weighting scheme explicit and adds an improved hat weighting scheme.
The rationale behind hat weighting is that predictions for tokens near the beginning or end of the block will be less accurate than predictions for tokens near the middle of the block, where the model has maximal context.
For instance, let's say we use `stride=128` and `block_size=256`, and compare the predictions for the token with index 128:

- uniform weighting computes `0.5 * first_block[128] + 0.5 * second_block[0]`;
- hat weighting computes (up to normalization) `1 * first_block[128] + 1/256 * second_block[0]`.

In this example, hat weighting is preferable because the first token of the second block is likely to be much less accurate than the middle token of the first block.
Anecdotally, I've also observed that hat weighting improves output quality on test data.
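To make the idea concrete, here is a minimal sketch of hat weighting for overlapping blocks. The function names (`hat_weights`, `combine_blocks`) and the exact weight formula are hypothetical illustrations of the idea, not the actual wtpsplit implementation; the PR's formula may differ in detail (e.g. the exact edge weight).

```python
import numpy as np


def hat_weights(block_size: int) -> np.ndarray:
    """Triangular "hat" weights: peaking at the block center, small at the edges.

    A floor of 1/block_size keeps edge weights nonzero so every token
    always receives some weight. Hypothetical sketch; the exact formula
    in the PR may differ.
    """
    center = (block_size - 1) / 2
    w = 1.0 - np.abs(np.arange(block_size) - center) / center
    return np.maximum(w, 1.0 / block_size)


def combine_blocks(logits_blocks, offsets, total_len, weights):
    """Weighted average of overlapping per-block predictions.

    Each block's logits are multiplied by the per-position weights,
    accumulated at the block's offset, and normalized by the total
    weight each token received.
    """
    acc = np.zeros(total_len)
    norm = np.zeros(total_len)
    for logits, off in zip(logits_blocks, offsets):
        acc[off:off + len(logits)] += weights[:len(logits)] * logits
        norm[off:off + len(logits)] += weights[:len(logits)]
    return acc / np.maximum(norm, 1e-12)
```

With uniform weights (`np.ones(block_size)`), `combine_blocks` reduces to the current averaging behavior; with `hat_weights`, tokens predicted near a block's center dominate tokens predicted near another block's edge, as in the `first_block[128]` vs. `second_block[0]` example above.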