
Split text by Tokens #1240

@zs-dima

Description

It might be nice to add the ability to split text by tokens.
LLMs currently have context limits; GPT-4, for example, allows at most 8,192 tokens.
For larger texts it would be useful to split the input according to the model's token limit.
LangChain has a RecursiveCharacterTextSplitter function, but it splits by characters rather than tokens, which is not useful for splitting text for LLMs.

For example, with tiktoken (here via the SharpToken C# port's GptEncoding) we can split text as shown below; it might be nice to include a similar function in Semantic Kernel.

public static List<string> SplitText(this string text, int maxTokens = 1024)
{
    var encoding = GptEncoding.GetEncoding("cl100k_base");

    // Tokenize the whole input, then emit one chunk per maxTokens tokens.
    var tokenizedText = encoding.Encode(text.Trim());
    var chunks = new List<string>();
    var currentChunk = new List<int>();

    foreach (var token in tokenizedText)
    {
        currentChunk.Add(token);

        if (currentChunk.Count >= maxTokens)
        {
            // Decode the chunk back to text, trimming stray separators.
            chunks.Add(encoding.Decode(currentChunk).Trim(' ', '.', ',', ';'));
            currentChunk.Clear();
        }
    }

    // Flush the final partial chunk, if any.
    if (currentChunk.Count > 0)
    {
        chunks.Add(encoding.Decode(currentChunk).Trim(' ', '.', ',', ';'));
    }

    return chunks;
}
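For reference, the same fixed-size chunking logic can be sketched in Python. To keep the sketch self-contained it uses a whitespace tokenizer as a stand-in for the real BPE encoding; with tiktoken installed you would swap in tiktoken.get_encoding("cl100k_base").encode / .decode to count actual model tokens. The function name split_text and the stand-in tokenizer are illustrative, not part of any library.

```python
def split_text(text: str, max_tokens: int = 1024) -> list[str]:
    # Stand-in tokenizer: one "token" per whitespace-separated word.
    # Replace these two with tiktoken's encode/decode for real token counts.
    encode = str.split
    decode = " ".join

    tokens = encode(text.strip())
    chunks = []
    # Emit one chunk per max_tokens-sized slice of the token stream,
    # trimming stray separators, mirroring the C# version above.
    for i in range(0, len(tokens), max_tokens):
        chunks.append(decode(tokens[i:i + max_tokens]).strip(" .,;"))
    return chunks
```

Slicing the token list in strides of max_tokens replaces the running counter from the C# version; the final slice is naturally the leftover partial chunk, so no separate flush step is needed.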

Metadata

Labels: kernel (Issues or pull requests impacting the core kernel)
Status: Sprint: Done