Learn how to prepare your data for training with ModelForge.
ModelForge uses JSONL (JSON Lines) format for training datasets. Each line is a valid JSON object representing one training example.
```json
{"field1": "value1", "field2": "value2"}
{"field1": "value1", "field2": "value2"}
```
- One JSON object per line
- No commas between lines
- UTF-8 encoding
- File extension: `.jsonl`
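These rules can be checked mechanically before uploading. A minimal sketch in Python (the function and file names are illustrative, not part of ModelForge):

```python
import json

def validate_jsonl(path):
    """Return the number of examples, raising on any malformed line."""
    count = 0
    with open(path, encoding='utf-8') as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            obj = json.loads(line)  # raises json.JSONDecodeError on bad syntax
            if not isinstance(obj, dict):
                raise ValueError(f"line {lineno}: expected a JSON object")
            count += 1
    return count
```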
Use Case: Chatbots, instruction following, code generation, creative writing
Format:
```json
{"input": "What is machine learning?", "output": "Machine learning is a subset of artificial intelligence..."}
{"input": "Explain neural networks", "output": "Neural networks are computing systems inspired by biological neural networks..."}
{"input": "Write a Python function to sort a list", "output": "Here's a Python function:\n\ndef sort_list(lst):\n    return sorted(lst)"}
```
Required Fields:
- `input` (string): The prompt or instruction
- `output` (string): The expected response
Example Dataset (customer support):
```json
{"input": "How do I reset my password?", "output": "To reset your password:\n1. Click 'Forgot Password' on the login page\n2. Enter your email address\n3. Check your email for reset link\n4. Follow the link and create new password"}
{"input": "Where is my order?", "output": "To track your order:\n1. Log into your account\n2. Go to 'My Orders'\n3. Click on the order number\n4. View tracking information"}
```
Use Case: Document summarization, article condensing, meeting notes
Format:
```json
{"input": "Long article or document text here...", "output": "Concise summary here."}
```
Required Fields:
- `input` (string): The long text to summarize
- `output` (string): The summary
Example Dataset (news summarization):
```json
{"input": "The Federal Reserve announced today that it will maintain interest rates at their current level of 5.25-5.50%, citing ongoing concerns about inflation despite recent economic slowdowns. Federal Reserve Chair Jerome Powell stated in a press conference that the central bank remains data-dependent and will adjust policy as needed. Markets reacted positively to the news, with the S&P 500 gaining 1.2% in afternoon trading.", "output": "The Federal Reserve kept interest rates unchanged at 5.25-5.50% due to inflation concerns. Chair Powell emphasized data-dependent approach. Markets rose 1.2%."}
{"input": "Scientists at MIT have developed a new battery technology that could potentially triple the range of electric vehicles. The breakthrough involves using solid-state electrolytes instead of traditional liquid electrolytes, which allows for higher energy density and improved safety. The research team, led by Professor Jane Smith, published their findings in Nature Energy this week. Commercial applications are expected within 5-10 years.", "output": "MIT researchers developed solid-state battery technology that could triple EV range. The innovation improves energy density and safety. Commercial use expected in 5-10 years."}
```
Use Case: RAG systems, document search, FAQ systems
Format:
```json
{"context": "Background information and document text", "question": "Question about the context", "answer": "Exact answer from context"}
```
Required Fields:
- `context` (string): The paragraph or document containing the answer
- `question` (string): The question being asked
- `answer` (string): The answer extracted from the context
Example Dataset (FAQ):
```json
{"context": "ModelForge is a no-code toolkit for fine-tuning Large Language Models on your local GPU. It supports text generation, summarization, and question answering tasks. The tool uses LoRA (Low-Rank Adaptation) for efficient fine-tuning and supports both HuggingFace and Unsloth providers.", "question": "What tasks does ModelForge support?", "answer": "text generation, summarization, and question answering"}
{"context": "To install ModelForge, you need Python 3.11, an NVIDIA GPU with at least 4GB VRAM, CUDA installed, and a HuggingFace account. The installation process involves running 'pip install modelforge-finetuning' and then setting up your HuggingFace token.", "question": "What are the prerequisites for ModelForge?", "answer": "Python 3.11, NVIDIA GPU with 4GB+ VRAM, CUDA, HuggingFace account"}
```
Use Case: Aligning the model with human preferences (DPO or RLHF strategy)
Format:
```json
{"prompt": "What is the capital of France?", "chosen": "The capital of France is Paris.", "rejected": "I don't know."}
{"prompt": "Explain quantum computing", "chosen": "Quantum computing uses quantum bits...", "rejected": "It's complicated."}
```
Required Fields:
- `prompt` (string): The input prompt or question
- `chosen` (string): The preferred (better) response
- `rejected` (string): The non-preferred (worse) response
Note: When using the DPO or RLHF strategy, this preference format is required instead of the task-specific formats above. Both strategies also require `"task": "text-generation"`.
| Model Size | Min Examples | Recommended | Optimal |
|---|---|---|---|
| < 1B params | 100 | 500 | 1,000+ |
| 1B-3B params | 200 | 1,000 | 5,000+ |
| 3B-7B params | 500 | 2,000 | 10,000+ |
| 7B+ params | 1,000 | 5,000 | 20,000+ |
Quality > Quantity: 100 high-quality examples are better than 10,000 low-quality ones.
✅ Use clean, grammatically correct text
✅ Ensure input-output pairs are logically related
✅ Include diverse examples covering different scenarios
✅ Use consistent formatting across examples
✅ Remove personal information (PII)
✅ Verify all data is relevant to your use case
❌ Include malformed JSON
❌ Use inconsistent field names
❌ Include duplicate or near-duplicate examples
❌ Mix different tasks in one dataset
❌ Use copyrighted content without permission
❌ Include biased or harmful content
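Some of these practices can be enforced in code. A small sketch, assuming a text-generation dataset, that copies a file while dropping exact duplicate input/output pairs (the function and file names are illustrative):

```python
import json

def drop_duplicates(in_path, out_path):
    """Copy a JSONL dataset, skipping exact duplicate input/output pairs."""
    seen = set()
    kept = 0
    with open(in_path, encoding='utf-8') as fin, \
         open(out_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            obj = json.loads(line)
            key = (obj.get("input"), obj.get("output"))
            if key in seen:
                continue  # exact duplicate, skip it
            seen.add(key)
            fout.write(json.dumps(obj, ensure_ascii=False) + '\n')
            kept += 1
    return kept
```

Near-duplicates (paraphrases, trivial edits) need fuzzy matching and are not caught by this exact-match check.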
Create a file `dataset.jsonl`:

```json
{"input": "Example 1 input", "output": "Example 1 output"}
{"input": "Example 2 input", "output": "Example 2 output"}
```

Or generate it with a Python script:

```python
import json

data = [
    {"input": "What is AI?", "output": "AI stands for Artificial Intelligence..."},
    {"input": "Explain ML", "output": "Machine Learning is..."},
]

with open('dataset.jsonl', 'w', encoding='utf-8') as f:
    for item in data:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')
```

Convert an existing CSV with pandas:

```python
import pandas as pd
import json

# Read CSV
df = pd.read_csv('data.csv')

# Convert to JSONL
with open('dataset.jsonl', 'w', encoding='utf-8') as f:
    for _, row in df.iterrows():
        item = {"input": row['question'], "output": row['answer']}
        f.write(json.dumps(item, ensure_ascii=False) + '\n')
```

Or convert a HuggingFace dataset:

```python
from datasets import load_dataset
import json

# Load dataset
dataset = load_dataset("squad", split="train[:1000]")

# Convert to JSONL
with open('dataset.jsonl', 'w', encoding='utf-8') as f:
    for item in dataset:
        formatted = {
            "context": item['context'],
            "question": item['question'],
            "answer": item['answers']['text'][0]
        }
        f.write(json.dumps(formatted, ensure_ascii=False) + '\n')
```

ModelForge automatically validates datasets before training:
✅ JSON syntax validation
✅ Required fields check
✅ Minimum size check (at least 10 examples)
✅ Field type validation
Error: Missing required field 'output'
- Fix: Ensure all examples have required fields
Error: Invalid JSON on line 42
- Fix: Check line 42 for syntax errors (missing quotes, commas, etc.)
Error: Dataset too small (5 examples, minimum 10)
- Fix: Add more examples to your dataset
Error: Field 'input' must be a string
- Fix: Ensure all field values are strings, not numbers or objects
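You can catch these errors locally before uploading with a pre-flight check. The sketch below approximates the validations listed above; it is not ModelForge's actual validator, and the required-field map is inferred from the formats in this guide:

```python
import json

# Required fields per task, inferred from the format sections above
REQUIRED_FIELDS = {
    "text-generation": ["input", "output"],
    "summarization": ["input", "output"],
    "question-answering": ["context", "question", "answer"],
}

def preflight(path, task="text-generation", min_examples=10):
    """Return a list of human-readable error strings (empty means OK)."""
    errors = []
    required = REQUIRED_FIELDS[task]
    count = 0
    with open(path, encoding='utf-8') as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                errors.append(f"Invalid JSON on line {lineno}")
                continue
            count += 1
            for field in required:
                if field not in obj:
                    errors.append(f"Line {lineno}: missing required field '{field}'")
                elif not isinstance(obj[field], str):
                    errors.append(f"Line {lineno}: field '{field}' must be a string")
    if count < min_examples:
        errors.append(f"Dataset too small ({count} examples, minimum {min_examples})")
    return errors
```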
Use `\n` for line breaks:

```json
{"input": "Write a haiku", "output": "Code flows like water\nBugs hide in shadows unseen\nDebug, test, deploy"}
```

Escape special characters:

```json
{"input": "What is JSON?", "output": "JSON uses \"quotes\" for strings and {\"key\": \"value\"} for objects"}
```

Full UTF-8 support:

```json
{"input": "Translate: Hello", "output": "你好 (Chinese), Hola (Spanish), Bonjour (French)"}
```

There is no hard limit on example length, but stay within the model's max sequence length:

```json
{"input": "Summarize this article", "output": "Summary here"}
```

ModelForge includes sample datasets for testing:
```text
# Located in: ModelForge/test_datasets/
low_text_generation.jsonl          # Text generation examples
low_summarization_train_set.jsonl  # Summarization examples
low_qa_train_set.jsonl             # QA examples
```

Download from the repository:

```shell
curl -O https://raw.githubusercontent.com/forgeopus/modelforge/main/ModelForge/test_datasets/low_text_generation.jsonl
```

ModelForge automatically splits data into train/validation sets:

```jsonc
{
  "eval_split": 0.2  // 20% for validation, 80% for training
}
```

Ensure balanced representation:
- Equal distribution of topics
- Diverse input lengths
- Varied complexity levels
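To inspect what a given `eval_split` produces, an equivalent shuffle-and-split can be sketched as follows (ModelForge performs its own split internally; this standalone version is illustrative):

```python
import json
import random

def split_jsonl(path, eval_split=0.2, seed=42):
    """Shuffle examples deterministically and return (train, eval) lists."""
    with open(path, encoding='utf-8') as f:
        examples = [json.loads(line) for line in f if line.strip()]
    random.Random(seed).shuffle(examples)  # seeded, so reproducible
    n_eval = int(len(examples) * eval_split)
    return examples[n_eval:], examples[:n_eval]
```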
Before creating JSONL:
- Remove duplicates
- Fix typos and grammar
- Normalize formatting
- Remove irrelevant examples
1. Start with a small dataset (100-500 examples)
2. Train and evaluate
3. Identify weak areas
4. Add targeted examples
5. Repeat
Problem: Upload fails
Checks:
- File is valid JSONL (one JSON object per line)
- File size < 500MB
- Proper UTF-8 encoding
- No special characters in filename
Problem: Training starts but fails immediately
Checks:
- All required fields present
- Fields are correct type (strings)
- No empty values
- No extremely long examples (> max_seq_length)
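The length check can be approximated without loading a tokenizer by assuming roughly 4 characters per token (a rule of thumb only; actual token counts vary by model and tokenizer):

```python
import json

def flag_long_examples(path, max_seq_length=2048, chars_per_token=4):
    """Return line numbers whose rough token estimate exceeds max_seq_length."""
    flagged = []
    with open(path, encoding='utf-8') as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            obj = json.loads(line)
            # crude estimate: total characters across all fields / chars_per_token
            approx_tokens = sum(len(str(v)) for v in obj.values()) // chars_per_token
            if approx_tokens > max_seq_length:
                flagged.append(lineno)
    return flagged
```

For accurate counts, tokenize with the target model's tokenizer instead.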
Problem: Model doesn't learn effectively
Solutions:
- Add more examples (aim for 1,000+)
- Improve data quality
- Ensure examples are representative
- Check for data leakage or duplicates
- Configuration Guide - Learn about training parameters
- Training Tasks - Understand different task types
- Quick Start - Train your first model
- Troubleshooting - Common issues
Good data is the foundation of good models! 📊