A GPT-2 style transformer language model implemented from scratch in Rust for educational purposes. Companion code for the blog series Building an LLM From Scratch in Rust.
Feste is the fool in Shakespeare's Twelfth Night, known for his wordplay and wit. The model trains on Shakespeare's complete works and generates text in his style, making the name a natural fit.
A complete, trainable transformer that shows how language models work by building every component up from basic tensor operations. No deep learning frameworks are used.
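As an illustration of the kind of primitive everything is built on (this is a sketch, not the repo's actual tensor API), a naive row-major matrix multiply in plain Rust looks like:

```rust
// Naive matmul over flat row-major slices: out (m x n) = a (m x k) * b (k x n).
// Illustrative only; the repo's own tensor types may differ.
fn matmul(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0.0; m * n];
    for i in 0..m {
        for p in 0..k {
            let a_ip = a[i * k + p];
            for j in 0..n {
                out[i * n + j] += a_ip * b[p * n + j];
            }
        }
    }
    out
}

fn main() {
    // Multiplying by the 2x2 identity returns the matrix unchanged.
    let id = [1.0, 0.0, 0.0, 1.0];
    let x = [3.0, 4.0, 5.0, 6.0];
    assert_eq!(matmul(&id, &x, 2, 2, 2), x.to_vec());
}
```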
The implementation trains on Shakespeare's works and generates text in a similar style, with clear perplexity improvements as training progresses.
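Perplexity, the metric tracked during training, is just the exponential of the mean per-token cross-entropy loss. A minimal sketch (the function name is illustrative, not the repo's API):

```rust
// Perplexity = exp(mean cross-entropy loss over tokens).
// Lower is better; a uniform guess over a vocabulary of size V
// gives loss ln(V) and therefore perplexity V.
fn perplexity(token_losses: &[f64]) -> f64 {
    let mean = token_losses.iter().sum::<f64>() / token_losses.len() as f64;
    mean.exp()
}

fn main() {
    // Mean loss here is 2.0, so perplexity is e^2.
    let losses = [2.1, 1.9, 2.0];
    println!("{:.2}", perplexity(&losses)); // prints 7.39
}
```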
Each part of the blog series has a companion document in `docs/` with configuration details and an implementation reference.
```shell
# Get training data
curl -o shakespeare.txt https://www.gutenberg.org/files/100/100-0.txt

# Train a small model (10-15 minutes)
cargo run --release --example 06_train_shakespeare_small
```

The configurable training example lets you reproduce any experiment from the Part 5 blog post using named presets:
```shell
# List available presets
cargo run --release --example train -- --list-presets

# Run a preset
cargo run --release --example train -- --preset pocket-bard

# Override parameters
cargo run --release --example train -- --preset spider --steps 10000

# Fully custom configuration
cargo run --release --example train -- \
    --embd 256 --layers 6 --heads 12 --context 448 --vocab 8192
```

See docs/05_TRAINING_EXAMPLES.md for the full preset table, transfer learning instructions, and details on all training examples.
- `01_train_tokenizers` - BPE tokenization at multiple vocab sizes
- `02_tensor_operations` - Matrix multiplication and operations
- `03_model_architecture` - Transformer architecture exploration
- `04_training_infrastructure` - Training loop components
- `05_train_shakespeare_tiny` - 50K parameters, 2-5 minutes
- `06_train_shakespeare_small` - 200K parameters, 10-20 minutes
- `07_train_shakespeare_medium` - 4M parameters, 1-2 hours
- `08_train_shakespeare_gpt2` - 163M parameters (GPT-2 Small), 24-30 hours
- `train` - Configurable training with blog experiment presets
Apache 2.0