Skip to content

.Net: cl100k_base tokenizer #2334

@anthonypuppo

Description

@anthonypuppo

The existing GPT3Tokenizer class uses p50k_base encoding. GPT-3.5, GPT-4, and text-embedding-ada-002 all rely on cl100k_base encoding.

The below table is extracted from the OpenAI notebook on token counting.

Encoding name OpenAI models
cl100k_base gpt-4, gpt-3.5-turbo, text-embedding-ada-002
p50k_base Codex models, text-davinci-002, text-davinci-003
r50k_base (or gpt2) GPT-3 models like davinci

Metadata

Metadata

Assignees

Labels

.NETIssue or Pull requests regarding .NET code

Type

No type

Projects

Status

Sprint: Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions