
Split text by Tokens #1240

@zs-dima

Description

It might be nice to add the ability to split text by tokens.
LLMs currently have context limits; GPT-4, for example, allows at most 8,192 tokens.
For larger texts it would be useful to split the input according to the model's token limit.
LangChain has a RecursiveCharacterTextSplitter function, but it splits by characters rather than tokens, which is not useful for splitting text for LLMs.

For example, with tiktoken (here via the SharpToken C# port's GptEncoding) we can split text as shown below; it might be nice to include a similar function in Semantic Kernel.

public static List<string> SplitText(this string text, int maxTokens = 1024)
{
    var encoding = GptEncoding.GetEncoding("cl100k_base");

    // Tokenize the whole input, then emit one chunk per maxTokens tokens.
    var tokenizedText = encoding.Encode(text.Trim());
    var chunks = new List<string>();
    var currentChunk = new List<int>();

    foreach (var token in tokenizedText)
    {
        currentChunk.Add(token);

        if (currentChunk.Count >= maxTokens)
        {
            // Decode the chunk back to text, trimming stray separators.
            chunks.Add(encoding.Decode(currentChunk).Trim(' ', '.', ',', ';'));
            currentChunk.Clear();
        }
    }

    // Flush the final partial chunk, if any.
    if (currentChunk.Count > 0)
    {
        chunks.Add(encoding.Decode(currentChunk).Trim(' ', '.', ',', ';'));
    }

    return chunks;
}
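For reference, the same fixed-size chunking logic can be sketched in Python. To keep the sketch self-contained it uses a whitespace tokenizer as a stand-in for the real BPE encoding; with tiktoken installed you would swap in tiktoken.get_encoding("cl100k_base").encode / .decode to count actual model tokens. The function name split_text and the stand-in tokenizer are illustrative, not part of any library.

```python
def split_text(text: str, max_tokens: int = 1024) -> list[str]:
    # Stand-in tokenizer: one "token" per whitespace-separated word.
    # Replace these two with tiktoken's encode/decode for real token counts.
    encode = str.split
    decode = " ".join

    tokens = encode(text.strip())
    chunks = []
    # Emit one chunk per max_tokens-sized slice of the token stream,
    # trimming stray separators, mirroring the C# version above.
    for i in range(0, len(tokens), max_tokens):
        chunks.append(decode(tokens[i:i + max_tokens]).strip(" .,;"))
    return chunks
```

Slicing the token list in strides of max_tokens replaces the running counter from the C# version; the final slice is naturally the leftover partial chunk, so no separate flush step is needed.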

Metadata

Labels: kernel (Issues or pull requests impacting the core kernel)
Status: Sprint: Done