feat(text): Add subword tokenization (BPE, WordPiece, SentencePiece) #103

@noahgift

Summary

Add subword tokenization algorithms to the aprender::text::tokenize module for LLM support.

Current State

The text::tokenize module has basic tokenizers:

  • WhitespaceTokenizer - splits on Unicode whitespace
  • WordTokenizer - separates punctuation
  • CharacterTokenizer - splits into characters

These are suitable for classical NLP (TF-IDF, bag of words) but not for modern LLMs, which expect a fixed subword vocabulary that can still represent words never seen during training.
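To make the limitation concrete, here is a minimal sketch using only the Rust standard library. It is not aprender code: `out_of_vocabulary` is a hypothetical helper, and the tiny vocabulary is invented for illustration. With word-level splitting, any word missing from a fixed vocabulary has no ID at all, which is the gap subword tokenizers are meant to close.

```rust
use std::collections::HashSet;

/// Returns the whitespace-separated words of `text` that are missing from `vocab`.
/// (Hypothetical helper for illustration; not part of aprender.)
fn out_of_vocabulary<'a>(text: &'a str, vocab: &HashSet<&str>) -> Vec<&'a str> {
    text.split_whitespace()
        .filter(|word| !vocab.contains(*word))
        .collect()
}
```

Calling it with a vocabulary of `the`, `model`, `tokenizes`, `text` on the input `"the model tokenizes untokenizable text"` reports `untokenizable` as unrepresentable, even though its pieces (`un`, `token`, `iz`, `able`) are common.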

Proposed Additions

| Algorithm                | Used By             | Priority |
|--------------------------|---------------------|----------|
| BPE (Byte Pair Encoding) | GPT, LLaMA, Mistral | High     |
| WordPiece                | BERT, DistilBERT    | Medium   |
| Unigram/SentencePiece    | T5, mT5             | Medium   |
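For the high-priority item, the core of BPE training can be sketched as a loop that repeatedly merges the most frequent adjacent symbol pair. The code below is a character-level toy under stated assumptions, not the proposed implementation: the function names (`count_pairs`, `merge_pair`, `train_bpe`) are hypothetical, input is pre-counted `(word, frequency)` pairs, ties are broken lexicographically to keep runs deterministic, and byte-level handling, special tokens, and performance are ignored.

```rust
use std::collections::HashMap;

/// Counts adjacent symbol pairs across all words, weighted by word frequency.
fn count_pairs(words: &[(Vec<String>, usize)]) -> HashMap<(String, String), usize> {
    let mut counts = HashMap::new();
    for (symbols, freq) in words {
        for pair in symbols.windows(2) {
            *counts
                .entry((pair[0].clone(), pair[1].clone()))
                .or_insert(0) += *freq;
        }
    }
    counts
}

/// Replaces every adjacent occurrence of `pair` in `symbols` with the merged symbol.
fn merge_pair(symbols: &[String], pair: &(String, String)) -> Vec<String> {
    let mut merged = Vec::with_capacity(symbols.len());
    let mut i = 0;
    while i < symbols.len() {
        if i + 1 < symbols.len() && symbols[i] == pair.0 && symbols[i + 1] == pair.1 {
            merged.push(format!("{}{}", pair.0, pair.1));
            i += 2;
        } else {
            merged.push(symbols[i].clone());
            i += 1;
        }
    }
    merged
}

/// Learns up to `num_merges` merge rules from `(word, frequency)` pairs.
/// Each word starts as a sequence of single-character symbols.
fn train_bpe(corpus: &[(&str, usize)], num_merges: usize) -> Vec<(String, String)> {
    let mut words: Vec<(Vec<String>, usize)> = corpus
        .iter()
        .map(|(word, freq)| (word.chars().map(|c| c.to_string()).collect(), *freq))
        .collect();
    let mut merges = Vec::new();
    for _ in 0..num_merges {
        // Most frequent pair wins; on equal counts the lexicographically smaller pair wins.
        let best = count_pairs(&words)
            .into_iter()
            .max_by(|a, b| a.1.cmp(&b.1).then_with(|| b.0.cmp(&a.0)));
        let pair = match best {
            Some((pair, _)) => pair,
            None => break, // no adjacent pairs left to merge
        };
        for (symbols, _) in words.iter_mut() {
            let merged = merge_pair(symbols, &pair);
            *symbols = merged;
        }
        merges.push(pair);
    }
    merges
}
```

On the toy corpus `low` (5), `lower` (2), `newest` (6), `widest` (3), the first three learned merges are `e`+`s`, `es`+`t`, and `l`+`o`. The ordered merge list is what a `merges.txt` file would typically store, one rule per line.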

API Design

use aprender::text::tokenize::{BpeTokenizer, WordPieceTokenizer};

// Load pre-trained vocabulary
let tokenizer = BpeTokenizer::from_file("vocab.json", "merges.txt")?;

// Or train from a corpus (Rust has no named arguments; the second parameter is the target vocabulary size)
let tokenizer = BpeTokenizer::train(&corpus, 32000)?;

// Encode/decode
let tokens: Vec<u32> = tokenizer.encode("Hello, world!")?;
let text: String = tokenizer.decode(&tokens)?;
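The encode/decode round trip above could look roughly like the following for the WordPiece case. This is a self-contained toy, not the proposed `WordPieceTokenizer`: the type `ToyWordPiece` and its methods are hypothetical, the vocabulary is passed in directly rather than loaded from a file, and it returns `Option` instead of the `Result` implied by the `?` operators in the proposal. Unmatched words yield `None` here, whereas BERT-style tokenizers usually emit an unknown token instead.

```rust
use std::collections::HashMap;

/// Toy WordPiece-style tokenizer (hypothetical; not the proposed aprender type).
struct ToyWordPiece {
    token_to_id: HashMap<String, u32>,
    id_to_token: Vec<String>,
}

impl ToyWordPiece {
    /// Builds the tokenizer from a token list; a token's index is its id.
    fn new(tokens: &[&str]) -> Self {
        let id_to_token: Vec<String> = tokens.iter().map(|t| t.to_string()).collect();
        let token_to_id: HashMap<String, u32> = id_to_token
            .iter()
            .enumerate()
            .map(|(i, t)| (t.clone(), i as u32))
            .collect();
        Self { token_to_id, id_to_token }
    }

    /// Greedy longest-match-first encoding of one word.
    /// Returns `None` if some part of the word matches no vocabulary entry.
    fn encode_word(&self, word: &str) -> Option<Vec<u32>> {
        let chars: Vec<char> = word.chars().collect();
        let mut ids = Vec::new();
        let mut start = 0;
        while start < chars.len() {
            let mut end = chars.len();
            let mut found = None;
            while end > start {
                let piece: String = chars[start..end].iter().collect();
                // Non-initial pieces carry the "##" continuation prefix.
                let key = if start == 0 { piece } else { format!("##{}", piece) };
                if let Some(&id) = self.token_to_id.get(&key) {
                    found = Some(id);
                    break;
                }
                end -= 1;
            }
            ids.push(found?);
            start = end;
        }
        Some(ids)
    }

    /// Encodes whitespace-separated text word by word.
    fn encode(&self, text: &str) -> Option<Vec<u32>> {
        let mut ids = Vec::new();
        for word in text.split_whitespace() {
            ids.extend(self.encode_word(word)?);
        }
        Some(ids)
    }

    /// Maps ids back to text, gluing "##" pieces onto the preceding piece.
    fn decode(&self, ids: &[u32]) -> Option<String> {
        let mut text = String::new();
        for &id in ids {
            let token = self.id_to_token.get(id as usize)?;
            match token.strip_prefix("##") {
                Some(rest) => text.push_str(rest),
                None => {
                    if !text.is_empty() {
                        text.push(' ');
                    }
                    text.push_str(token);
                }
            }
        }
        Some(text)
    }
}
```

With the vocabulary `["un", "##token", "##iz", "##able", "token", "##s"]`, encoding `"untokenizable tokens"` gives the ids `[0, 1, 2, 3, 4, 5]`, and decoding those ids restores the original string. Decoding is only lossless here because the input uses single spaces; whitespace normalization is one design point a real implementation would need to settle.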

Integration

This enables the full sovereign tokenization chain:

References
