Optimize ModelForge for your GPU with hardware-aware model recommendations and configurations.
Hardware profiles automatically detect your system capabilities and recommend optimal models and settings. This ensures you get the best performance without manual configuration.
```text
System Scan
    ↓
Detect GPU VRAM + System RAM
    ↓
Classify into Profile (low_end / mid_range / high_end)
    ↓
Recommend Models + Settings
    ↓
Apply Optimizations
```
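Under the hood, the detection step amounts to reading total GPU memory and system RAM. Here is a minimal sketch of what that might look like, assuming `torch` and `psutil` are available (the actual implementation may differ):

```python
import psutil
import torch

def detect_hardware():
    """Return (gpu_vram_gb, ram_gb); VRAM is 0.0 when no CUDA device is present."""
    gpu_vram_gb = 0.0
    if torch.cuda.is_available():
        # Total memory of the first CUDA device, in GB
        gpu_vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    # Total physical system RAM, in GB
    ram_gb = psutil.virtual_memory().total / 1024**3
    return gpu_vram_gb, ram_gb
```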
ModelForge classifies hardware into three profiles based on GPU VRAM and system RAM.

Low-End Profile

Hardware Requirements:
- GPU VRAM: < 7.2 GB
- OR: GPU VRAM < 15.2 GB AND System RAM < 15.2 GB
Typical Hardware:
- NVIDIA GTX 1650 (4GB)
- NVIDIA GTX 1660 (6GB)
- NVIDIA RTX 3050 (4-6GB)
- NVIDIA RTX A2000 (6GB)
Recommended Settings:
```json
{
  "compute_specs": "low_end",
  "use_4bit": true,
  "per_device_train_batch_size": 1,
  "gradient_accumulation_steps": 8,
  "max_seq_length": 512
}
```

Recommended Models:
| Task | Primary Model | Size | VRAM Usage |
|---|---|---|---|
| Text Generation | qwen/Qwen2.5-3B | 3B params | ~4-5 GB |
| Summarization | google/flan-t5-large | 770M params | ~3-4 GB |
| Question Answering | deepset/roberta-base-squad2 | 125M params | ~2-3 GB |
Optimization Tips:
- ✅ Use 4-bit quantization (`use_4bit: true`)
- ✅ Small batch size (1-2)
- ✅ Higher gradient accumulation (8-16)
- ✅ Shorter sequences (512-1024 tokens)
- ✅ QLoRA strategy for memory efficiency (see the sketch below)
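As a concrete illustration of the 4-bit + QLoRA combination above, here is a minimal sketch using the Hugging Face `transformers` and `peft` libraries. The model name is taken from the table above; the LoRA hyperparameters are illustrative assumptions, not ModelForge's defaults:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization keeps the 3B base weights in a few GB of VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen2.5-3B", quantization_config=bnb_config, device_map="auto"
)

# LoRA adapters are the only trainable weights, so optimizer state stays tiny
lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```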
Mid-Range Profile

Hardware Requirements:
- GPU VRAM: 7.2 - 15.2 GB AND System RAM ≥ 15.2 GB
- OR: GPU VRAM ≥ 15.2 GB AND System RAM < 15.2 GB
Typical Hardware:
- NVIDIA RTX 2070/2080 (8GB)
- NVIDIA RTX 3060 Ti (8GB)
- NVIDIA RTX 3070 (8GB)
- NVIDIA RTX 4060 Ti (8-16GB)
- NVIDIA RTX A4000 (12GB)
Recommended Settings:
```json
{
  "compute_specs": "mid_range",
  "use_4bit": true,
  "per_device_train_batch_size": 2,
  "gradient_accumulation_steps": 4,
  "max_seq_length": 1024
}
```

Recommended Models:
| Task | Primary Model | Size | VRAM Usage |
|---|---|---|---|
| Text Generation | mistralai/Mistral-Small-3.1-24B-Base-2503 | 24B params | ~12-14 GB |
| Text Generation (alt) | meta-llama/Llama-3.1-8B-Instruct | 8B params | ~8-10 GB |
| Summarization | google/flan-t5-large | 770M params | ~4-5 GB |
| Question Answering | meta-llama/Llama-3.1-8B-Instruct | 8B params | ~8-10 GB |
Optimization Tips:
- ✅ 4-bit quantization recommended
- ✅ Moderate batch size (2-4)
- ✅ Standard gradient accumulation (4-8)
- ✅ Medium sequences (1024-2048 tokens)
- ✅ Unsloth provider for 2x speedup (see the sketch after this list)
- ✅ Both SFT and QLoRA strategies work well
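For reference, loading a model through Unsloth looks roughly like the sketch below; exact arguments may vary by Unsloth version, and this is not necessarily how ModelForge invokes it internally:

```python
from unsloth import FastLanguageModel

# Load the recommended mid-range model in 4-bit with Unsloth's optimized kernels
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,
)
```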
High-End Profile

Hardware Requirements:
- GPU VRAM: ≥ 15.2 GB
- System RAM: ≥ 15.2 GB
Typical Hardware:
- NVIDIA RTX 3080/3090 (10-24GB)
- NVIDIA RTX 4080/4090 (12-24GB)
- NVIDIA RTX A5000/A6000 (24-48GB)
- NVIDIA Tesla V100 (16-32GB)
- NVIDIA A100 (40-80GB)
Recommended Settings:
```json
{
  "compute_specs": "high_end",
  "use_4bit": false,
  "bf16": true,
  "per_device_train_batch_size": 4,
  "gradient_accumulation_steps": 2,
  "max_seq_length": 2048
}
```

Recommended Models:
| Task | Primary Model | Size | VRAM Usage |
|---|---|---|---|
| Text Generation | meta-llama/Llama-4-Maverick-17B-128E-Instruct | 17B params | ~16-18 GB |
| Text Generation (large) | qwen/Qwen2.5-32B | 32B params | ~20-24 GB |
| Summarization | meta-llama/Llama-4-Maverick-17B-128E-Instruct | 17B params | ~16-18 GB |
| Question Answering | qwen/Qwen2.5-32B | 32B params | ~20-24 GB |
Optimization Tips:
- ✅ Optional 4-bit quantization (not required)
- ✅ Use BF16 precision on Ampere+ GPUs (RTX 30xx/40xx); see the sketch after this list
- ✅ Larger batch size (4-8)
- ✅ Lower gradient accumulation (2-4)
- ✅ Longer sequences (2048-4096 tokens)
- ✅ Unsloth provider highly recommended
- ✅ Can use advanced strategies (RLHF, DPO)
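A minimal sketch of guarding the BF16 setting at runtime with `transformers`; the fall-back-to-FP16 logic is an assumption for illustration, not ModelForge's documented behavior:

```python
import torch
from transformers import TrainingArguments

# BF16 needs Ampere or newer; fall back to FP16 on older GPUs
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    bf16=use_bf16,
    fp16=not use_bf16,
)
```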
ModelForge uses these rules to classify your hardware:
```python
def classify_profile(gpu_vram_gb: float, ram_gb: float) -> str:
    if gpu_vram_gb < 7.2:
        return "low_end"
    elif gpu_vram_gb < 15.2 and ram_gb < 15.2:
        return "low_end"
    elif gpu_vram_gb < 15.2 and ram_gb >= 15.2:
        return "mid_range"
    elif gpu_vram_gb >= 15.2 and ram_gb < 15.2:
        return "mid_range"
    else:  # gpu_vram_gb >= 15.2 and ram_gb >= 15.2
        return "high_end"
```

When you start training in the UI:
- Select your task
- Click "Detect Hardware"
- ModelForge automatically:
  - Scans GPU and RAM
  - Classifies into profile
  - Recommends optimal models
  - Pre-fills configuration
```bash
curl -X POST http://localhost:8000/api/finetune/detect \
  -H "Content-Type: application/json" \
  -d '{"task": "text-generation"}'
```

Response:
```json
{
  "hardware_specs": {
    "gpu_name": "NVIDIA RTX 3070",
    "gpu_memory_gb": 8.0,
    "ram_gb": 16.0,
    "cuda_version": "12.6"
  },
  "compute_profile": "mid_range",
  "recommended_model": "meta-llama/Llama-3.1-8B-Instruct",
  "possible_models": [
    "meta-llama/Llama-3.1-8B-Instruct",
    "qwen/Qwen2.5-7B",
    "mistralai/Mistral-Small-3.1-24B-Base-2503"
  ]
}
```

You can override automatic detection:
```json
{
  "compute_specs": "mid_range",  // Force mid-range profile
  "model_name": "meta-llama/Llama-3.1-8B-Instruct",
  ...
}
```

Settings by Profile:

| Profile | Quantization | Batch Size | Grad Accum | Max Seq Len |
|---|---|---|---|---|
| Low End | 4-bit (required) | 1 | 8-16 | 512-1024 |
| Mid Range | 4-bit (recommended) | 2-4 | 4-8 | 1024-2048 |
| High End | Optional | 4-8 | 2-4 | 2048-4096 |
Provider Recommendations:

| Profile | Primary Provider | Secondary | Speedup |
|---|---|---|---|
| Low End | HuggingFace | - | 1x |
| Mid Range | Unsloth | HuggingFace | 2x |
| High End | Unsloth | HuggingFace | 2x |
Note: Unsloth requires Linux, WSL, or Docker. Not available on native Windows.
Strategy Recommendations:

| Profile | Recommended Strategy | Alternative |
|---|---|---|
| Low End | QLoRA | SFT |
| Mid Range | QLoRA or SFT | RLHF, DPO |
| High End | SFT, QLoRA, RLHF, DPO | Any |
Estimating VRAM Usage:

VRAM Usage ≈ Model Size × Precision Factor × Overhead Factor
- 4-bit: ~0.5 GB per billion parameters
- 8-bit: ~1 GB per billion parameters
- 16-bit (FP16/BF16): ~2 GB per billion parameters
- 32-bit (FP32): ~4 GB per billion parameters
Examples:

- 7B model with 4-bit quantization: 7 × 0.5 GB × 1.5 (overhead) ≈ 5.25 GB VRAM
- 7B model with 16-bit precision: 7 × 2 GB × 1.5 (overhead) ≈ 21 GB VRAM
- 32B model with 4-bit quantization: 32 × 0.5 GB × 1.5 (overhead) ≈ 24 GB VRAM
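The same estimate as a small helper function (a sketch of the rule of thumb above, not ModelForge's internal calculation):

```python
# GB of VRAM per billion parameters, by precision
BYTES_PER_BILLION_GB = {"4bit": 0.5, "8bit": 1.0, "16bit": 2.0, "32bit": 4.0}

def estimate_vram_gb(params_billions: float, precision: str, overhead: float = 1.5) -> float:
    """Rough VRAM estimate: model size x precision factor x overhead factor."""
    return params_billions * BYTES_PER_BILLION_GB[precision] * overhead

print(estimate_vram_gb(7, "4bit"))    # ~5.25 GB
print(estimate_vram_gb(7, "16bit"))   # ~21.0 GB
print(estimate_vram_gb(32, "4bit"))   # ~24.0 GB
```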
Error: CUDA out of memory
Solutions:
- Reduce batch size: `per_device_train_batch_size: 1`
- Increase gradient accumulation: `gradient_accumulation_steps: 16`
- Enable 4-bit quantization: `use_4bit: true`
- Reduce sequence length: `max_seq_length: 512`
- Enable gradient checkpointing: `gradient_checkpointing: true`
- Try a smaller model
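Taken together, a memory-constrained configuration that combines these mitigations might look like:

```json
{
  "per_device_train_batch_size": 1,
  "gradient_accumulation_steps": 16,
  "use_4bit": true,
  "max_seq_length": 512,
  "gradient_checkpointing": true
}
```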
Problem: Training is taking too long
Solutions:
- Use Unsloth provider for 2x speedup (if on Linux/WSL)
- Increase batch size if you have VRAM headroom
- Reduce gradient accumulation steps
- Use mixed precision (BF16 on RTX 30xx/40xx)
- Consider a smaller model
Problem: ModelForge detects wrong profile
Solutions:
- Manually specify profile: `"compute_specs": "mid_range"`
- Check GPU drivers are up to date
- Verify CUDA is properly installed
- Check `nvidia-smi` output matches expected VRAM
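A quick way to cross-check what PyTorch sees against `nvidia-smi` (a standalone snippet, not part of ModelForge):

```python
import torch

print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
```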
For Low-End Systems:

- ✅ Always use 4-bit quantization
- ✅ Start with smallest recommended models
- ✅ Use QLoRA strategy
- ✅ Batch size = 1, gradient accumulation = 8-16
- ✅ Keep sequences short (512-1024)
- ✅ Close other GPU applications
For Mid-Range Systems:

- ✅ Use 4-bit quantization for large models (7B+)
- ✅ Unsloth provider for best performance
- ✅ Batch size = 2-4
- ✅ Try both SFT and QLoRA strategies
- ✅ Medium sequences (1024-2048)
For High-End Systems:

- ✅ Unsloth provider strongly recommended for speed
- ✅ Can skip quantization for small models
- ✅ Use BF16 on Ampere+ GPUs
- ✅ Larger batch sizes (4-8)
- ✅ Try advanced strategies (RLHF, DPO)
- ✅ Longer sequences (2048-4096)
Example Training Times:

| Model Size | Profile | Provider | Strategy | Time |
|---|---|---|---|---|
| 3B | Low End | HuggingFace | QLoRA | ~2 hours |
| 7B | Mid Range | Unsloth | QLoRA | ~45 min |
| 7B | Mid Range | HuggingFace | QLoRA | ~90 min |
| 17B | High End | Unsloth | SFT | ~60 min |
| 32B | High End | Unsloth | QLoRA | ~90 min |
Times are approximate and vary based on exact hardware and configuration.
Related documentation:

- Configuration Guide - Detailed configuration options
- Provider Overview - Choose HuggingFace or Unsloth
- Training Strategies - Select optimal strategy
- Performance Optimization - Fine-tune performance
Hardware profiles make ModelForge accessible to everyone! From 4GB to 80GB VRAM, we've got you covered.