This repository contains a reproduction of the TinyStories language models described in the paper "TinyStories: How Small Can Language Models Be and Still Speak Coherent English?" by Ronen Eldan and Yuanzhi Li.
The goal of this project is to demonstrate that a very small transformer model, when trained on a simplified, synthetic dataset, can generate fluent, grammatically correct, and consistent short stories.
Start the Flask server:

```shell
python app.py
```

Then open the local server endpoint in your browser to talk to the model.
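As a rough illustration, `app.py` presumably wraps the trained model in a small Flask endpoint. The sketch below is a guess at that shape, not the repository's actual code: the `/generate` route, the `prompt`/`story` JSON fields, and the `generate_story` helper are all hypothetical names.

```python
# Hypothetical sketch of a Flask inference endpoint; route and field
# names are assumptions, and generate_story stands in for the real
# TinyStories model call.
from flask import Flask, request, jsonify

app = Flask(__name__)

def generate_story(prompt: str) -> str:
    # Placeholder: the real app would run the trained model here.
    return prompt + " Once upon a time..."

@app.route("/generate", methods=["POST"])
def generate():
    prompt = request.get_json().get("prompt", "")
    return jsonify({"story": generate_story(prompt)})

# To serve locally: app.run()  (Flask defaults to http://127.0.0.1:5000)
```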
```shell
cd tiny-stories-with-hf
python inference.py "<YOUR_PROMPT_HERE>"
```

> "We introduce TinyStories, a synthetic dataset of short stories that only contain words that a typical 3 to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4. We show that TinyStories can be used to train and evaluate LMs that are much smaller than the state-of-the-art models (below 10 million total parameters)... yet still produce fluent and consistent stories." — Eldan & Li (2023)
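`inference.py` presumably generates a story token by token from the prompt. The snippet below sketches one common decoding step, temperature plus top-k sampling over a logits vector; the function name is illustrative and is not taken from the repository, which may decode differently.

```python
import math
import random

def sample_next_token(logits, temperature=0.8, top_k=40, rng=random):
    """Illustrative temperature + top-k sampling step (not the
    repository's actual decoding code)."""
    scaled = [l / temperature for l in logits]
    # Keep only the top_k highest-scoring tokens; mask the rest out.
    kth = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
    masked = [s if s >= kth else float("-inf") for s in scaled]
    # Softmax over the surviving logits.
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one token id according to the resulting distribution.
    return rng.choices(range(len(probs)), weights=probs)[0]
```

With `top_k=1` this reduces to greedy decoding; larger `top_k` and higher `temperature` trade coherence for variety.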
This model is a decoder-only Transformer (GPT-style) designed to fit within 10M trainable parameters.
Configuration 1: Model Size S (3.6M) - HuggingFace
| Hyperparameter | Value |
|---|---|
| Parameters | ~3.6 Million |
| Attention Layers | 8 |
| Hidden Dimension (Embedding Dimensions) | 64 |
| Attention Heads per Layer | 16 |
| Context Window | 512 tokens |
| Vocab Size | ~50,257 (GPT-Neo tokenizer) |
| Dropout | 0.1 |
| Learning Rate | 5e-4 |
Configuration 2: Model Size M (19.3M) - HuggingFace
| Hyperparameter | Value |
|---|---|
| Parameters | ~19.3 Million |
| Attention Layers | 8 |
| Hidden Dimension (Embedding Dimensions) | 256 |
| Attention Heads per Layer | 16 |
| Context Window | 512 tokens |
| Vocab Size | ~50,257 (GPT-Neo tokenizer) |
| Dropout | 0.1 |
| Learning Rate | 5e-4 |
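The stated sizes can be sanity-checked from the tables. Assuming a standard GPT-style block (roughly 4·d² attention weights plus 8·d² MLP weights, ignoring biases and LayerNorms), the counts work out to about 3.6M and 19.3M:

```python
def approx_params(n_layers, d_model, vocab_size=50257, ctx_len=512):
    """Rough GPT-style parameter count: token + position embeddings,
    plus ~12*d_model^2 weights per transformer block (biases and
    LayerNorms ignored)."""
    embeddings = vocab_size * d_model + ctx_len * d_model
    per_block = 12 * d_model * d_model  # 4d^2 attention + 8d^2 MLP
    return embeddings + n_layers * per_block

# Size S: 8 layers, d_model=64  -> ~3.6M parameters
# Size M: 8 layers, d_model=256 -> ~19.3M parameters
```

Note that for Size S the ~3.2M-parameter embedding table dominates the total, a consequence of keeping the full ~50K GPT-Neo vocabulary at a 64-dimensional hidden size.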
The model was trained from scratch on an NVIDIA T4 GPU for around 3 hours, reaching a loss of 2.17 after 0.22 epochs (~55K steps). We used the EleutherAI/gpt-neo-125M tokenizer for both training and inference.
Training regime:
- Epochs: 0.22
- Loss: 2.17
- GPU: NVIDIA T4
- Training Steps: 55,000
- Training Time: ~3 hours
The model was trained from scratch on an NVIDIA A100 GPU for around 4 hours 40 minutes, reaching a loss of 1.40 after 1 epoch (~265K steps). We used the EleutherAI/gpt-neo-125M tokenizer for both training and inference.
Training regime:
- Epochs: 1
- Loss: 1.40
- GPU: NVIDIA A100
- Training Steps: 264,965
- Training Time: ~4 hours 40 minutes
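Assuming the reported losses are mean token-level cross-entropy (the usual language-modeling objective), they can be read as perplexities via exp(loss), which makes the two runs easier to compare:

```python
import math

# Converting the reported cross-entropy losses to perplexity,
# assuming they are mean per-token cross-entropy.
ppl_t4 = math.exp(2.17)    # Size S, T4 run
ppl_a100 = math.exp(1.40)  # Size M, A100 run
print(f"T4 run:   loss 2.17 -> perplexity {ppl_t4:.2f}")
print(f"A100 run: loss 1.40 -> perplexity {ppl_a100:.2f}")
```

So the A100 run's model is, on average, about twice as "certain" about each next token as the T4 run's model.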
The model was trained on the TinyStories dataset, which consists of synthetic short stories generated by GPT-3.5 and GPT-4. The stories use a restricted vocabulary typical of a 3 to 4-year-old child.
- Source: Hugging Face Datasets (roneneldan/TinyStories)
- Size: ~2 GB of text data
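A common way to prepare such a corpus (e.g., after `load_dataset("roneneldan/TinyStories")`) is to concatenate stories with an end-of-text marker and slice the stream into fixed 512-token windows. The sketch below shows that packing step; a whitespace split stands in for the GPT-Neo BPE tokenizer, and the function name is illustrative rather than taken from the repository.

```python
def pack_into_windows(stories, ctx_len=512, eos="<|endoftext|>"):
    """Concatenate stories (separated by an EOS marker) and slice the
    token stream into fixed-length training windows. A whitespace split
    stands in for the GPT-Neo BPE tokenizer used in the real pipeline."""
    stream = []
    for story in stories:
        stream.extend(story.split())
        stream.append(eos)
    # Drop the ragged tail so every window is exactly ctx_len tokens.
    n_windows = len(stream) // ctx_len
    return [stream[i * ctx_len:(i + 1) * ctx_len] for i in range(n_windows)]
```

Packing avoids padding waste: every position in every 512-token window contributes a real next-token prediction to the loss.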