feat(text): Add subword tokenization (BPE, WordPiece, SentencePiece) #103

## Summary

Add subword tokenization algorithms to the `aprender::text::tokenize` module so that `aprender` can tokenize text for LLMs.

## Current State

The `text::tokenize` module currently provides three basic tokenizers:
- `WhitespaceTokenizer` - splits on Unicode whitespace
- `WordTokenizer` - separates punctuation
- `CharacterTokenizer` - splits into characters

These are suitable for classical NLP (TF-IDF, bag of words) but **not** for modern LLMs, which expect input as token IDs drawn from a fixed subword vocabulary.

## Proposed Additions

| Algorithm | Used By | Priority |
|-----------|---------|----------|
| **BPE** (Byte Pair Encoding) | GPT, LLaMA, Mistral | High |
| **WordPiece** | BERT, DistilBERT | Medium |
| **Unigram/SentencePiece** | T5, mT5 | Medium |
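
To make the table concrete, here is a minimal sketch of the greedy longest-match-first segmentation that WordPiece uses at encode time. The function name `wordpiece` and the tiny vocabulary in the usage note are illustrative assumptions, not part of the existing `aprender` API; the `##` continuation prefix and `[UNK]` token follow the BERT convention.

```rust
use std::collections::HashSet;

/// Hypothetical helper: segment one word with greedy longest-match-first
/// lookup against a fixed vocabulary (BERT-style WordPiece encoding).
pub fn wordpiece(word: &str, vocab: &HashSet<&str>) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = chars.len();
        let mut found = None;
        // Shrink the candidate from the right until it is in the vocabulary.
        while end > start {
            let mut candidate: String = chars[start..end].iter().collect();
            if start > 0 {
                // Pieces that continue a word carry the "##" prefix.
                candidate.insert_str(0, "##");
            }
            if vocab.contains(candidate.as_str()) {
                found = Some(candidate);
                break;
            }
            end -= 1;
        }
        match found {
            Some(piece) => {
                pieces.push(piece);
                start = end;
            }
            // Nothing matched at this position: the whole word is unknown.
            None => return vec!["[UNK]".to_string()],
        }
    }
    pieces
}
```

With a vocabulary containing `un`, `##aff`, and `##able`, the word `unaffable` segments into those three pieces, while a word with no matching prefix collapses to `[UNK]`. BPE and Unigram differ in how the vocabulary is learned and applied, but all three map a word to a short sequence of vocabulary entries in this way.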

## API Design

```rust
use aprender::text::tokenize::{BpeTokenizer, WordPieceTokenizer};

// Load pre-trained vocabulary
let tokenizer = BpeTokenizer::from_file("vocab.json", "merges.txt")?;

// Or train from a corpus (second argument is the target vocabulary size)
let tokenizer = BpeTokenizer::train(&corpus, 32_000)?;

// Encode/decode
let tokens: Vec<u32> = tokenizer.encode("Hello, world!")?;
let text: String = tokenizer.decode(&tokens)?;
```
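
As a rough illustration of what `train` and `encode` would do internally, the following toy sketch learns BPE merges from an in-memory corpus and applies them to a word. Everything here is an assumption made for illustration: `ToyBpe`, `encode_word`, and `merge_pair` are hypothetical names, the sketch works on characters rather than bytes, it takes a merge count instead of a vocabulary size, and it returns string symbols instead of `u32` IDs.

```rust
use std::collections::HashMap;

/// Hypothetical toy BPE model: learned merges, in priority order.
pub struct ToyBpe {
    merges: Vec<(String, String)>,
}

impl ToyBpe {
    /// Learn up to `num_merges` merges from whitespace-split words.
    pub fn train(corpus: &[&str], num_merges: usize) -> Self {
        // Each word starts as a sequence of single-character symbols.
        let mut words: Vec<Vec<String>> = corpus
            .iter()
            .flat_map(|line| line.split_whitespace())
            .map(|w| w.chars().map(|c| c.to_string()).collect())
            .collect();
        let mut merges = Vec::new();
        for _ in 0..num_merges {
            // Count adjacent symbol pairs across all words.
            let mut counts: HashMap<(String, String), usize> = HashMap::new();
            for w in &words {
                for pair in w.windows(2) {
                    *counts.entry((pair[0].clone(), pair[1].clone())).or_insert(0) += 1;
                }
            }
            // Most frequent pair wins; ties go to the lexicographically
            // smallest pair so that training is deterministic.
            let best = counts
                .into_iter()
                .max_by(|a, b| a.1.cmp(&b.1).then_with(|| b.0.cmp(&a.0)))
                .map(|(pair, _)| pair);
            let Some(best) = best else { break };
            for w in &mut words {
                *w = merge_pair(w, &best);
            }
            merges.push(best);
        }
        ToyBpe { merges }
    }

    /// Encode one word by replaying the merges in the order they were learned.
    pub fn encode_word(&self, word: &str) -> Vec<String> {
        let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();
        for m in &self.merges {
            symbols = merge_pair(&symbols, m);
        }
        symbols
    }
}

/// Replace every adjacent occurrence of `pair` with the concatenated symbol.
fn merge_pair(symbols: &[String], pair: &(String, String)) -> Vec<String> {
    let mut out = Vec::with_capacity(symbols.len());
    let mut i = 0;
    while i < symbols.len() {
        if i + 1 < symbols.len() && symbols[i] == pair.0 && symbols[i + 1] == pair.1 {
            out.push(format!("{}{}", pair.0, pair.1));
            i += 2;
        } else {
            out.push(symbols[i].clone());
            i += 1;
        }
    }
    out
}
```

Training on `"low low low lower lowest"` with two merges learns `l o -> lo` and then `lo w -> low`, so an unseen word such as `lowly` still encodes into known pieces. A production implementation would additionally need byte-level fallback, pre-tokenization rules, and a symbol-to-ID table loaded from `vocab.json` and `merges.txt`.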

## Integration

This enables the full sovereign tokenization chain:
- `aprender` - tokenization algorithms (this issue)
- `entrenar` - training with tokenized data (paiml/entrenar#26)
- `realizar` - inference with tokenization (paiml/realizar#17)

## References

- [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) (Rust implementation to study)
- [BPE Paper](https://arxiv.org/abs/1508.07909)
- [SentencePiece](https://github.com/google/sentencepiece)
