dataset-formats.md

Dataset Formats

Learn how to prepare your data for training with ModelForge.

Overview

ModelForge uses JSONL (JSON Lines) format for training datasets. Each line is a valid JSON object representing one training example.

General Format

{"field1": "value1", "field2": "value2"}
{"field1": "value1", "field2": "value2"}

One JSON object per line
No commas between lines
UTF-8 encoding
File extension: .jsonl
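
Because each line is an independent JSON object, a file can be parsed one line at a time. A minimal sketch using only Python's standard library; the helper name `read_jsonl` is illustrative and is not part of ModelForge:

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return the JSON objects stored in a JSONL file, one per non-blank line."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the 1-based line number so the bad line is easy to find
                raise ValueError(f"Invalid JSON on line {line_number}: {exc.msg}") from exc
    return records
```

Calling `read_jsonl("dataset.jsonl")` returns a list of dictionaries, which the later checks in this guide can operate on.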

Task-Specific Formats

Text Generation

Use Case: Chatbots, instruction following, code generation, creative writing

Format:

{"input": "What is machine learning?", "output": "Machine learning is a subset of artificial intelligence..."}
{"input": "Explain neural networks", "output": "Neural networks are computing systems inspired by biological neural networks..."}
{"input": "Write a Python function to sort a list", "output": "Here's a Python function:\n\ndef sort_list(lst):\n    return sorted(lst)"}

Required Fields:

input (string): The prompt or instruction
output (string): The expected response

Example Dataset (customer support):

{"input": "How do I reset my password?", "output": "To reset your password:\n1. Click 'Forgot Password' on the login page\n2. Enter your email address\n3. Check your email for reset link\n4. Follow the link and create new password"}
{"input": "Where is my order?", "output": "To track your order:\n1. Log into your account\n2. Go to 'My Orders'\n3. Click on the order number\n4. View tracking information"}

Summarization

Use Case: Document summarization, article condensing, meeting notes

Format:

{"input": "Long article or document text here...", "output": "Concise summary here."}

Required Fields:

input (string): The long text to summarize
output (string): The summary

Example Dataset (news summarization):

{"input": "The Federal Reserve announced today that it will maintain interest rates at their current level of 5.25-5.50%, citing ongoing concerns about inflation despite recent economic slowdowns. Federal Reserve Chair Jerome Powell stated in a press conference that the central bank remains data-dependent and will adjust policy as needed. Markets reacted positively to the news, with the S&P 500 gaining 1.2% in afternoon trading.", "output": "The Federal Reserve kept interest rates unchanged at 5.25-5.50% due to inflation concerns. Chair Powell emphasized data-dependent approach. Markets rose 1.2%."}
{"input": "Scientists at MIT have developed a new battery technology that could potentially triple the range of electric vehicles. The breakthrough involves using solid-state electrolytes instead of traditional liquid electrolytes, which allows for higher energy density and improved safety. The research team, led by Professor Jane Smith, published their findings in Nature Energy this week. Commercial applications are expected within 5-10 years.", "output": "MIT researchers developed solid-state battery technology that could triple EV range. The innovation improves energy density and safety. Commercial use expected in 5-10 years."}

Extractive Question Answering

Use Case: RAG systems, document search, FAQ systems

Format:

{"context": "Background information and document text", "question": "Question about the context", "answer": "Exact answer from context"}

Required Fields:

context (string): The paragraph or document containing the answer
question (string): The question being asked
answer (string): The answer extracted from context
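
Since the answer is extracted from the context, it helps to confirm that each answer really appears there. The sketch below assumes "extracted" means a verbatim, case-sensitive substring; whether ModelForge enforces this is not stated here, and `answer_span` is an illustrative helper name:

```python
def answer_span(example):
    """Return (start, end) character offsets of the answer in the context, or None."""
    answer = example["answer"]
    if not answer:
        return None  # an empty answer would trivially "match" at offset 0
    start = example["context"].find(answer)
    if start == -1:
        return None  # the answer is not a verbatim substring of the context
    return start, start + len(answer)


example = {
    "context": "It supports text generation, summarization, and question answering tasks.",
    "question": "What tasks does ModelForge support?",
    "answer": "text generation, summarization, and question answering",
}
span = answer_span(example)
print(span is not None)
```

An answer that paraphrases the context (for example, "4GB+ VRAM" where the context says "at least 4GB VRAM") returns `None` and should be rewritten to match the source text.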

Example Dataset (FAQ):

{"context": "ModelForge is a no-code toolkit for fine-tuning Large Language Models on your local GPU. It supports text generation, summarization, and question answering tasks. The tool uses LoRA (Low-Rank Adaptation) for efficient fine-tuning and supports both HuggingFace and Unsloth providers.", "question": "What tasks does ModelForge support?", "answer": "text generation, summarization, and question answering"}
{"context": "To install ModelForge, you need Python 3.11, an NVIDIA GPU with at least 4GB VRAM, CUDA installed, and a HuggingFace account. The installation process involves running 'pip install modelforge-finetuning' and then setting up your HuggingFace token.", "question": "What are the prerequisites for ModelForge?", "answer": "Python 3.11, an NVIDIA GPU with at least 4GB VRAM, CUDA installed, and a HuggingFace account"}

Preference Data (DPO/RLHF)

Use Case: Aligning a model with human preferences (DPO or RLHF strategy)

Format:

{"prompt": "What is the capital of France?", "chosen": "The capital of France is Paris.", "rejected": "I don't know."}
{"prompt": "Explain quantum computing", "chosen": "Quantum computing uses quantum bits...", "rejected": "It's complicated."}

Required Fields:

prompt (string): The input prompt or question
chosen (string): The preferred/better response
rejected (string): The non-preferred/worse response

Note: When using the DPO or RLHF strategy, this format is required instead of the task-specific formats above. Both strategies also require "task": "text-generation".
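
A preference record can be checked before training with a few lines of Python. This is a sketch, not ModelForge's own validator: the function name is hypothetical, and the check that `chosen` and `rejected` differ is an added assumption rather than a documented requirement.

```python
PREFERENCE_FIELDS = ("prompt", "chosen", "rejected")


def check_preference_example(example):
    """Return a list of problems found in one preference record (empty if none)."""
    problems = []
    for field in PREFERENCE_FIELDS:
        if field not in example:
            problems.append(f"Missing required field '{field}'")
        elif not isinstance(example[field], str):
            problems.append(f"Field '{field}' must be a string")
    # Assumption: identical responses carry no preference signal
    if not problems and example["chosen"] == example["rejected"]:
        problems.append("'chosen' and 'rejected' are identical")
    return problems


print(check_preference_example(
    {"prompt": "What is the capital of France?", "chosen": "The capital of France is Paris.", "rejected": "I don't know."}
))
```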

Dataset Size Recommendations

Model Size	Min Examples	Recommended	Optimal
< 1B params	100	500	1,000+
1B-3B params	200	1,000	5,000+
3B-7B params	500	2,000	10,000+
7B+ params	1,000	5,000	20,000+

Quality > Quantity: 100 high-quality examples are better than 10,000 low-quality ones.
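
The table above can be expressed as a small lookup. In this sketch the boundary handling is an assumption: the table's ranges share their endpoints (3B appears in two rows), so each upper bound is treated as exclusive, and the function name is illustrative.

```python
# (exclusive upper bound in billions of parameters, minimum, recommended, optimal)
SIZE_TABLE = [
    (1, 100, 500, 1_000),
    (3, 200, 1_000, 5_000),
    (7, 500, 2_000, 10_000),
    (float("inf"), 1_000, 5_000, 20_000),
]


def recommended_examples(params_billions):
    """Return the table row that applies to a model of the given size."""
    for upper, minimum, recommended, optimal in SIZE_TABLE:
        if params_billions < upper:
            return {"min": minimum, "recommended": recommended, "optimal": optimal}


print(recommended_examples(1.5))
```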

Data Quality Guidelines

DO:

✅ Use clean, grammatically correct text
✅ Ensure input-output pairs are logically related
✅ Include diverse examples covering different scenarios
✅ Use consistent formatting across examples
✅ Remove personal information (PII)
✅ Verify all data is relevant to your use case

DON'T:

❌ Include malformed JSON
❌ Use inconsistent field names
❌ Include duplicate or near-duplicate examples
❌ Mix different tasks in one dataset
❌ Use copyrighted content without permission
❌ Include biased or harmful content
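
Two of the DON'Ts above, inconsistent field names and mixed tasks, both show up as records with different sets of keys. A quick way to spot them, sketched with the standard library (the helper name is illustrative):

```python
from collections import Counter


def field_name_report(records):
    """Count how many records use each distinct set of field names."""
    return Counter(tuple(sorted(record)) for record in records)


records = [
    {"input": "What is AI?", "output": "AI stands for Artificial Intelligence..."},
    {"input": "Explain ML", "output": "Machine Learning is..."},
    {"prompt": "Explain ML", "response": "Machine Learning is..."},  # inconsistent names
]
report = field_name_report(records)
if len(report) > 1:
    for fields, count in report.most_common():
        print(f"{count} record(s) with fields {fields}")
```

A clean single-task dataset should produce exactly one entry in the report.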

Creating Your Dataset

Method 1: Manual Creation

Create a file dataset.jsonl:

{"input": "Example 1 input", "output": "Example 1 output"}
{"input": "Example 2 input", "output": "Example 2 output"}

Method 2: Python Script

import json

data = [
    {"input": "What is AI?", "output": "AI stands for Artificial Intelligence..."},
    {"input": "Explain ML", "output": "Machine Learning is..."},
]

with open('dataset.jsonl', 'w', encoding='utf-8') as f:
    for item in data:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

Method 3: Convert from CSV

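import pandas as pd
import json

# Read CSV (assumes columns named 'question' and 'answer')
df = pd.read_csv('data.csv')

# Drop rows with a missing question or answer; NaN would produce invalid JSON
df = df.dropna(subset=['question', 'answer'])

# Convert to JSONL, casting to str because all field values must be strings
with open('dataset.jsonl', 'w', encoding='utf-8') as f:
    for _, row in df.iterrows():
        item = {"input": str(row['question']), "output": str(row['answer'])}
        f.write(json.dumps(item, ensure_ascii=False) + '\n')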

Method 4: From HuggingFace Dataset

from datasets import load_dataset
import json

# Load dataset
dataset = load_dataset("squad", split="train[:1000]")

# Convert to JSONL
with open('dataset.jsonl', 'w', encoding='utf-8') as f:
    for item in dataset:
        formatted = {
            "context": item['context'],
            "question": item['question'],
            "answer": item['answers']['text'][0]
        }
        f.write(json.dumps(formatted, ensure_ascii=False) + '\n')

Validation

ModelForge automatically validates datasets before training:

✅ JSON syntax validation
✅ Required fields check
✅ Minimum size check (at least 10 examples)
✅ Field type validation

Common Validation Errors

Error: Missing required field 'output'

Fix: Ensure all examples have required fields

Error: Invalid JSON on line 42

Fix: Check line 42 for syntax errors (missing quotes, commas, etc.)

Error: Dataset too small (5 examples, minimum 10)

Fix: Add more examples to your dataset

Error: Field 'input' must be a string

Fix: Ensure all field values are strings, not numbers or objects
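
To catch these errors before uploading, the same four checks can be approximated locally. This is a sketch under stated assumptions, not ModelForge's validator: the function name and exact messages are illustrative, and `REQUIRED_FIELDS` must be adjusted to the task.

```python
import json

REQUIRED_FIELDS = ("input", "output")  # e.g. ("context", "question", "answer") for QA
MIN_EXAMPLES = 10  # minimum size stated in this guide


def validate_jsonl_lines(lines, required_fields=REQUIRED_FIELDS):
    """Return a list of error strings; an empty list means no problems were found."""
    errors = []
    count = 0
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        count += 1
        try:
            example = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f"Invalid JSON on line {line_number}")
            continue
        if not isinstance(example, dict):
            errors.append(f"Line {line_number} is not a JSON object")
            continue
        for field in required_fields:
            if field not in example:
                errors.append(f"Line {line_number}: missing required field '{field}'")
            elif not isinstance(example[field], str):
                errors.append(f"Line {line_number}: field '{field}' must be a string")
    if count < MIN_EXAMPLES:
        errors.append(f"Dataset too small ({count} examples, minimum {MIN_EXAMPLES})")
    return errors
```

The function accepts any iterable of lines, so an open file works directly: `validate_jsonl_lines(open("dataset.jsonl", encoding="utf-8"))`.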

Advanced Features

Multi-line Text

Use \n for line breaks:

{"input": "Write a haiku", "output": "Code flows like water\nBugs hide in shadows unseen\nDebug, test, deploy"}

Special Characters

Escape special characters:

{"input": "What is JSON?", "output": "JSON uses \"quotes\" for strings and {\"key\": \"value\"} for objects"}
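
Escaping by hand is error-prone. When generating a dataset from Python, `json.dumps` produces the required escapes for quotes, backslashes, and newlines automatically, as this short sketch shows:

```python
import json

output = 'JSON uses "quotes" for strings\nand backslashes like C:\\temp need escaping too'
line = json.dumps({"input": "What is JSON?", "output": output}, ensure_ascii=False)
print(line)  # a single physical line, safe to write to a .jsonl file

# Parsing the line gives back the original text, so nothing was lost
restored = json.loads(line)["output"]
print(restored == output)
```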

Unicode Support

Full UTF-8 support:

{"input": "Translate: Hello", "output": "你好 (Chinese), Hola (Spanish), Bonjour (French)"}

Long Context

There is no hard limit on example length, but keep each example within the model's maximum sequence length (max_seq_length):

{"input": "Summarize this article", "output": "Summary here"}
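
To find examples that may exceed the sequence length, a rough pre-check can flag the longest records. This sketch counts whitespace-separated words by default, which only approximates (and typically undercounts) real token counts; pass the model tokenizer's counting function as `count_tokens` for accurate numbers. The function name and the limit passed by the caller are illustrative assumptions.

```python
def find_long_examples(records, max_length, count_tokens=lambda text: len(text.split())):
    """Return (index, length) pairs for records whose string fields exceed max_length."""
    too_long = []
    for index, record in enumerate(records):
        # Sum the length of every string field in the record
        length = sum(count_tokens(value) for value in record.values() if isinstance(value, str))
        if length > max_length:
            too_long.append((index, length))
    return too_long


records = [
    {"input": "Summarize this article", "output": "Summary here"},
    {"input": "word " * 3000, "output": "Summary here"},
]
print(find_long_examples(records, max_length=2048))  # 2048 is a placeholder limit
```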

Sample Datasets

ModelForge includes sample datasets for testing:

# Located in: ModelForge/test_datasets/
low_text_generation.jsonl          # Text generation examples
low_summarization_train_set.jsonl  # Summarization examples
low_qa_train_set.jsonl             # QA examples

Download from repository:

curl -O https://raw.githubusercontent.com/forgeopus/modelforge/main/ModelForge/test_datasets/low_text_generation.jsonl

Best Practices

1. Data Splitting

ModelForge automatically splits data into train/validation sets. With the setting below, 20% of the examples are held out for validation and the remaining 80% are used for training:

{
  "eval_split": 0.2
}
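
To illustrate what an eval_split of 0.2 means, the sketch below performs an equivalent 80/20 split by hand. ModelForge does this for you; the function name, the shuffling, and the seed are assumptions made for illustration and do not describe ModelForge's actual implementation.

```python
import random


def split_dataset(records, eval_split=0.2, seed=42):
    """Shuffle a copy of records and return (train, validation) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_eval = int(len(shuffled) * eval_split)
    return shuffled[n_eval:], shuffled[:n_eval]


records = [{"input": f"question {i}", "output": f"answer {i}"} for i in range(100)]
train, validation = split_dataset(records, eval_split=0.2)
print(len(train), len(validation))
```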

2. Data Balancing

Ensure balanced representation:

Equal distribution of topics
Diverse input lengths
Varied complexity levels
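
One simple way to check for diverse input lengths is to summarize the length distribution. The sketch below measures length in whitespace-separated words, which is an assumption; topic and complexity balance need labels that this format does not carry, so they are not covered here.

```python
import statistics


def input_length_summary(records, field="input"):
    """Summarize word counts of one field across the dataset."""
    lengths = [len(record[field].split()) for record in records]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }


records = [
    {"input": "What is AI?", "output": "..."},
    {"input": "Explain ML", "output": "..."},
    {"input": "Write a Python function to sort a list", "output": "..."},
]
print(input_length_summary(records))
```

If minimum, median, and maximum are nearly identical, the dataset probably lacks variety in input length.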

3. Data Cleaning

Before creating JSONL:

Remove duplicates
Fix typos and grammar
Normalize formatting
Remove irrelevant examples
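
Duplicate removal and formatting normalization can be sketched together. The version below treats two records as duplicates when their fields match after lowercasing and collapsing whitespace; that definition is an assumption, and true near-duplicate detection (paraphrases, small edits) needs fuzzier matching.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivially different strings compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def remove_duplicates(records, fields=("input", "output")):
    """Keep the first occurrence of each record, comparing normalized field values."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(normalize(record[field]) for field in fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    {"input": "What is AI?", "output": "AI stands for Artificial Intelligence..."},
    {"input": "what  is AI? ", "output": "AI stands for Artificial Intelligence..."},
    {"input": "Explain ML", "output": "Machine Learning is..."},
]
cleaned = remove_duplicates(records)
print(len(records), "->", len(cleaned))
```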

4. Iterative Improvement

Start with a small dataset (100-500 examples)
Train and evaluate
Identify weak areas
Add targeted examples
Repeat

Troubleshooting

Dataset Won't Upload

Problem: Upload fails

Checks:

File is valid JSONL (one JSON object per line)
File size < 500MB
Proper UTF-8 encoding
No special characters in filename
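
Most of these checks can be run locally before retrying the upload. In the sketch below, several details are assumptions: "special characters" is read as anything other than letters, digits, '.', '_' and '-', "500MB" is read as 500 * 1024 * 1024 bytes, and the function name is illustrative.

```python
import re
from pathlib import Path

MAX_BYTES = 500 * 1024 * 1024  # assumption: "500MB" interpreted as 500 MiB


def upload_precheck(path):
    """Return a list of problems that might block an upload (empty if none found)."""
    path = Path(path)
    problems = []
    if path.suffix != ".jsonl":
        problems.append("File extension is not .jsonl")
    if not re.fullmatch(r"[A-Za-z0-9._-]+", path.name):
        problems.append("Filename contains special characters")
    if path.stat().st_size >= MAX_BYTES:
        problems.append("File is 500MB or larger")
    try:
        with path.open(encoding="utf-8") as f:
            for _ in f:
                pass  # decoding every line surfaces invalid UTF-8
    except UnicodeDecodeError as exc:
        problems.append(f"File is not valid UTF-8: {exc.reason}")
    return problems
```

This covers extension, filename, size, and encoding; whether each line is valid JSON is a separate check.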

Training Fails with Dataset Error

Problem: Training starts but fails immediately

Checks:

All required fields present
Fields are correct type (strings)
No empty values
No extremely long examples (> max_seq_length)

Poor Training Results

Problem: Model doesn't learn effectively

Solutions:

Add more examples (aim for 1,000+)
Improve data quality
Ensure examples are representative
Check for data leakage or duplicates

Next Steps

Configuration Guide - Learn about training parameters
Training Tasks - Understand different task types
Quick Start - Train your first model
Troubleshooting - Common issues

Good data is the foundation of good models! 📊
