Summary
Add subword tokenization algorithms to the aprender::text::tokenize module for LLM support.
Current State
The text::tokenize module has basic tokenizers:
WhitespaceTokenizer - splits on Unicode whitespace
WordTokenizer - separates punctuation
CharacterTokenizer - splits into characters
These are suitable for classical NLP (TF-IDF, bag of words) but not for modern LLMs, which are trained against a fixed subword vocabulary.
Proposed Additions
| Algorithm | Used By | Priority |
| --- | --- | --- |
| BPE (Byte Pair Encoding) | GPT, LLaMA, Mistral | High |
| WordPiece | BERT, DistilBERT | Medium |
| Unigram/SentencePiece | T5, mT5 | Medium |
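To make the highest-priority item concrete, here is a minimal sketch of how BPE merge rules could be learned. It assumes a whitespace-split corpus, character-level starting symbols, and deterministic tie-breaking. The names `Merge`, `count_pairs`, `apply_merge`, and `train_bpe` are hypothetical and not part of aprender; production BPE usually also works on bytes and tracks word frequencies rather than rescanning every word.

```rust
use std::collections::HashMap;

/// One learned merge rule: the adjacent pair (left, right) becomes left + right.
/// Hypothetical type for illustration only.
type Merge = (String, String);

/// Count how often each adjacent symbol pair occurs across all words.
fn count_pairs(words: &[Vec<String>]) -> HashMap<Merge, usize> {
    let mut counts = HashMap::new();
    for word in words {
        for pair in word.windows(2) {
            *counts.entry((pair[0].clone(), pair[1].clone())).or_insert(0) += 1;
        }
    }
    counts
}

/// Replace every non-overlapping occurrence of `merge` in `word`, left to right.
fn apply_merge(word: &[String], merge: &Merge) -> Vec<String> {
    let mut out = Vec::with_capacity(word.len());
    let mut i = 0;
    while i < word.len() {
        if i + 1 < word.len() && word[i] == merge.0 && word[i + 1] == merge.1 {
            out.push(format!("{}{}", merge.0, merge.1));
            i += 2;
        } else {
            out.push(word[i].clone());
            i += 1;
        }
    }
    out
}

/// Learn up to `num_merges` merge rules (assumption: words are split on whitespace
/// and start as sequences of single characters).
fn train_bpe(corpus: &str, num_merges: usize) -> Vec<Merge> {
    let mut words: Vec<Vec<String>> = corpus
        .split_whitespace()
        .map(|w| w.chars().map(|c| c.to_string()).collect())
        .collect();
    let mut merges = Vec::new();
    for _ in 0..num_merges {
        // Pick the most frequent pair; ties go to the lexicographically smallest pair.
        let best = count_pairs(&words)
            .into_iter()
            .max_by(|a, b| a.1.cmp(&b.1).then_with(|| b.0.cmp(&a.0)))
            .map(|(pair, _)| pair);
        let Some(best) = best else { break }; // no pairs left to merge
        words = words.iter().map(|w| apply_merge(w, &best)).collect();
        merges.push(best);
    }
    merges
}
```

Each learned merge adds one symbol to the vocabulary, so a target vocabulary size translates roughly into the number of merge rounds plus the size of the base alphabet.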
API Design
use aprender::text::tokenize::{BpeTokenizer, WordPieceTokenizer};
// Load pre-trained vocabulary
let tokenizer = BpeTokenizer::from_file("vocab.json", "merges.txt")?;
// Or train from corpus (second argument is the target vocabulary size)
let tokenizer = BpeTokenizer::train(&corpus, 32000)?;
// Encode/decode
let tokens: Vec<u32> = tokenizer.encode("Hello, world!")?;
let text: String = tokenizer.decode(&tokens)?;
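The encode/decode pair above could look roughly like the following for WordPiece. This is only a sketch under stated assumptions: `ToyWordPiece`, its constructor, and the `String` error type are hypothetical, the vocabulary is passed in directly rather than loaded from a file, and a word with no matching piece returns an error where BERT-style tokenizers would normally emit an `[UNK]` token.

```rust
use std::collections::HashMap;

/// Hypothetical toy tokenizer; names and error type are illustrative only.
struct ToyWordPiece {
    token_to_id: HashMap<String, u32>,
    id_to_token: Vec<String>,
}

impl ToyWordPiece {
    /// Token ids are the positions of the entries in `vocab`.
    fn new(vocab: &[&str]) -> Self {
        let id_to_token: Vec<String> = vocab.iter().map(|t| t.to_string()).collect();
        let token_to_id = id_to_token
            .iter()
            .enumerate()
            .map(|(i, t)| (t.clone(), i as u32))
            .collect();
        Self { token_to_id, id_to_token }
    }

    /// Greedy longest-match-first: at each position take the longest vocabulary
    /// entry; pieces after the first in a word carry a "##" continuation prefix.
    fn encode(&self, text: &str) -> Result<Vec<u32>, String> {
        let mut ids = Vec::new();
        for word in text.split_whitespace() {
            let chars: Vec<char> = word.chars().collect();
            let mut start = 0;
            while start < chars.len() {
                let mut end = chars.len();
                let mut found = None;
                while end > start {
                    let piece: String = chars[start..end].iter().collect();
                    let key = if start == 0 { piece } else { format!("##{piece}") };
                    if let Some(&id) = self.token_to_id.get(&key) {
                        found = Some(id);
                        break;
                    }
                    end -= 1;
                }
                match found {
                    Some(id) => {
                        ids.push(id);
                        start = end;
                    }
                    // Simplification: a real WordPiece would fall back to [UNK].
                    None => return Err(format!("no vocabulary entry covers {word:?}")),
                }
            }
        }
        Ok(ids)
    }

    /// Join pieces back together; "##" pieces attach to the previous token,
    /// all other tokens start a new space-separated word.
    fn decode(&self, ids: &[u32]) -> Result<String, String> {
        let mut text = String::new();
        for &id in ids {
            let token = self
                .id_to_token
                .get(id as usize)
                .ok_or_else(|| format!("unknown token id {id}"))?;
            match token.strip_prefix("##") {
                Some(rest) => text.push_str(rest),
                None => {
                    if !text.is_empty() {
                        text.push(' ');
                    }
                    text.push_str(token);
                }
            }
        }
        Ok(text)
    }
}
```

With the vocabulary `["un", "##aff", "##able", "hello"]`, encoding "unaffable hello" yields ids 0 through 3, and decoding those ids restores the input. The round trip only holds for single-space-separated text, since this sketch discards the original whitespace.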
Integration
This enables the full sovereign tokenization chain:
aprender - tokenization algorithms (this issue)
entrenar - training with tokenized data (feat: Integrate subword tokenization from aprender for LLM training entrenar#26)
realizar - inference with tokenization (feat: Integrate subword tokenization from aprender for LLM inference realizar#17)
References