Allow external tokenizer when splitting text #1387

@gramhagen

Description

I'd like to use tiktoken or a HuggingFace tokenizer when splitting text with the Python text chunker.

Example usage:

from semantic_kernel.text.text_chunker import split_plaintext_lines
import tiktoken

text = "Some long document to be split into token-bounded lines."

# Count tokens with the cl100k_base encoding; the chunker uses this
# callable to measure line length instead of its default counter.
encoding = tiktoken.get_encoding("cl100k_base")
token_counter = lambda x: len(encoding.encode(x))

lines = split_plaintext_lines(text=text, max_token_per_line=256, token_counter=token_counter)
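The `token_counter` argument only needs to be a callable from `str` to `int`, so any tokenizer can be plugged in. A minimal, dependency-free sketch of that contract (the whitespace counter is an illustration, not a real tokenizer; the commented HuggingFace variant assumes `transformers` is installed):

```python
def whitespace_token_counter(text: str) -> int:
    # Naive stand-in for a real tokenizer: counts whitespace-separated words.
    # Any Callable[[str], int] satisfies the token_counter contract.
    return len(text.split())

# A HuggingFace-based counter would look similar (sketch only, not executed):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("gpt2")
# hf_token_counter = lambda s: len(tok.encode(s))

print(whitespace_token_counter("split this line into tokens"))  # → 5
```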

Related Issues:
#1240
#478

Metadata

Labels: python (Pull requests for the Python Semantic Kernel)
Status: Sprint: Done