Embed encoder.json/vocab.bpe in OpenAI connector assembly#1800
Embed encoder.json/vocab.bpe in OpenAI connector assembly#1800stephentoub wants to merge 1 commit intomicrosoft:mainfrom
Conversation
|
@dluc Could you please respond to question in PR description and check changes in the code? Thanks! |
|
@stephentoub the motivation, as you suspect, was indeed to avoid incurring the size penalty. Not sure how effectively we're doing that now though. I'm going to put this one on ice (convert to draft) until you're back, and then we can weight the pros and cons. There may also be a possibility of putting this is in its own nuget so that it can be included in the SK metapackage, but if a caller doesn't want it, they can avoid it by grabbing the specific packages indifividually. |
|
I thought it would be nice not having to access the file system and use embedded resources, however, given that we advocate the kernel as a lightweight SDK and say that instantiating the kernel is a very cheap operation, the considerable size of the dictionary/settings files was concerning, so I went for a file based approach + caching. Currently, the files and that payload is not ever considered, unless one invokes the tokenizer. Only during the first use, this data is loaded and cached in memory, paying the cost on the 1st call. Consider that with lambdas and ad hoc processes, every request might be the "1st request". If we embed the resources, the OpenAI connector DLL will grow considerably (though zip compression might help), and I think it would help understanding what the change looks like on lambdas and edge devices. |
|
Bringing up SharpToken as this would solve 1) embedding of the tokenization resource files 2) extracting the tokenization logic to a separate package (maybe wrapped as an official SK package in the future) and 3) bug #2334. |
This makes these files part of the .dll rather than anyone using the nuget package deplying the separate .json/.bpe files.
1a5978f to
4110ea8
Compare
|
Thanks for your PR and contribution. For now, we are closing old drafts. Please re-open as appropriate. |


Motivation and Context
This makes these files part of the .dll rather than anyone using the nuget package deplying the separate .json/.bpe files.
(If it was an explicit design choice to have these be loose files, e.g. if the goal was to avoid paying the size increase if they weren't being used such that someone could explicitly delete them from their deployment, or being able to somehow update them separately after a build, then this can of course be closed.)
Contribution Checklist
dotnet format