Closed
Labels
kernel: Issues or pull requests impacting the core kernel
Description
It might be nice to add the ability to split text by tokens.
LLMs currently have context limitations; for example, GPT-4 allows at most 8,192 tokens.
For larger texts, it would be useful to split the input according to the model's token limit.
LangChain has a RecursiveCharacterTextSplitter, but it splits by characters rather than tokens, which is not ideal for preparing text for LLMs.
For example, with tiktoken we can split text as shown below; it might be nice to include a similar function in Semantic Kernel.
public static List<string> SplitText(this string text, int maxTokens = 1024)
{
    // Encode with the cl100k_base tokenizer.
    var encoding = GptEncoding.GetEncoding("cl100k_base");
    var tokenizedText = encoding.Encode(text.Trim());

    var chunks = new List<string>();
    var currentChunk = new List<int>();
    int currentLength = 0;

    foreach (var token in tokenizedText)
    {
        currentChunk.Add(token);
        currentLength++;

        // Flush the chunk once it reaches the token limit.
        if (currentLength >= maxTokens)
        {
            chunks.Add(encoding.Decode(currentChunk).Trim(' ', '.', ',', ';'));
            currentChunk.Clear();
            currentLength = 0;
        }
    }

    // Flush any remaining tokens.
    if (currentChunk.Count > 0)
    {
        chunks.Add(encoding.Decode(currentChunk).Trim(' ', '.', ',', ';'));
    }

    return chunks;
}
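The chunking logic above is tokenizer-agnostic: encode the text, slice the token stream into windows of at most maxTokens, and decode each window back to a string. A minimal Python sketch of the same idea follows; the `split_by_tokens` helper and the toy whitespace tokenizer are illustrative stand-ins (in practice you would pass tiktoken's `get_encoding("cl100k_base")` encode/decode, as the issue suggests).

```python
from typing import Callable, List

def split_by_tokens(text: str,
                    encode: Callable[[str], list],
                    decode: Callable[[list], str],
                    max_tokens: int = 1024) -> List[str]:
    """Split text into chunks of at most max_tokens tokens each."""
    tokens = encode(text.strip())
    # Slice the token stream into windows of max_tokens, then decode
    # each window back into text (trimming stray punctuation, as in
    # the C# version above).
    return [decode(tokens[i:i + max_tokens]).strip(" .,;")
            for i in range(0, len(tokens), max_tokens)]

# Toy tokenizer: one "token" per whitespace-separated word, used here
# only so the sketch runs without tiktoken installed.
toy_encode = lambda s: s.split()
toy_decode = lambda toks: " ".join(toks)

chunks = split_by_tokens("one two three four five",
                         toy_encode, toy_decode, max_tokens=2)
# chunks == ["one two", "three four", "five"]
```

With tiktoken, `toy_encode`/`toy_decode` would simply be replaced by `enc.encode` and `enc.decode` for `enc = tiktoken.get_encoding("cl100k_base")`, giving chunks measured in real model tokens rather than words.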
Sprint: Done