Optimize ModelForge for your GPU with hardware-aware model recommendations and configurations.
Hardware profiles automatically detect your system capabilities and recommend optimal models and settings. This ensures you get the best performance without manual configuration.
```text
System Scan
    ↓
Detect GPU VRAM + System RAM
    ↓
Classify into Profile (low_end / mid_range / high_end)
    ↓
Recommend Models + Settings
    ↓
Apply Optimizations
```
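Under the hood, the detection step amounts to reading total GPU memory and system RAM. Here is a minimal sketch of what that might look like, assuming `torch` and `psutil` are available (the actual implementation may differ):

```python
import psutil
import torch

def detect_hardware():
    """Return (gpu_vram_gb, ram_gb); VRAM is 0.0 when no CUDA device is present."""
    gpu_vram_gb = 0.0
    if torch.cuda.is_available():
        # Total memory of the first CUDA device, in GB
        gpu_vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    # Total physical system RAM, in GB
    ram_gb = psutil.virtual_memory().total / 1024**3
    return gpu_vram_gb, ram_gb
```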
ModelForge classifies hardware into three profiles based on GPU VRAM and system RAM.

Low-End Profile

Hardware Requirements:
- GPU VRAM: < 7.2 GB
- OR: GPU VRAM < 15.2 GB AND System RAM < 15.2 GB
Typical Hardware:
- NVIDIA GTX 1650 (4GB)
- NVIDIA GTX 1660 (6GB)
- NVIDIA RTX 3050 (4-6GB)
- NVIDIA RTX A2000 (6GB)
Recommended Settings:
```json
{
  "compute_specs": "low_end",
  "use_4bit": true,
  "per_device_train_batch_size": 1,
  "gradient_accumulation_steps": 8,
  "max_seq_length": 512
}
```

Recommended Models:
| Task | Primary Model | Size | VRAM Usage |
|---|---|---|---|
| Text Generation | qwen/Qwen2.5-3B | 3B params | ~4-5 GB |
| Summarization | google/flan-t5-large | 770M params | ~3-4 GB |
| Question Answering | deepset/roberta-base-squad2 | 125M params | ~2-3 GB |
Optimization Tips:
- ✅ Use 4-bit quantization (`use_4bit: true`)
- ✅ Small batch size (1-2)
- ✅ Higher gradient accumulation (8-16)
- ✅ Shorter sequences (512-1024 tokens)
- ✅ QLoRA strategy for memory efficiency (see the sketch below)
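As a concrete illustration of the 4-bit + QLoRA combination above, here is a minimal sketch using the Hugging Face `transformers` and `peft` libraries. The model name is taken from the table above; the LoRA hyperparameters are illustrative assumptions, not ModelForge's defaults:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization keeps the 3B base weights in a few GB of VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen2.5-3B", quantization_config=bnb_config, device_map="auto"
)

# LoRA adapters are the only trainable weights, so optimizer state stays tiny
lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```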
Mid-Range Profile

Hardware Requirements:
- GPU VRAM: 7.2 - 15.2 GB AND System RAM ≥ 15.2 GB
- OR: GPU VRAM ≥ 15.2 GB AND System RAM < 15.2 GB
Typical Hardware:
- NVIDIA RTX 2070/2080 (8GB)
- NVIDIA RTX 3060 Ti (8GB)
- NVIDIA RTX 3070 (8GB)
- NVIDIA RTX 4060 Ti (8-16GB)
- NVIDIA RTX A4000 (12GB)
Recommended Settings:
```json
{
  "compute_specs": "mid_range",
  "use_4bit": true,
  "per_device_train_batch_size": 2,
  "gradient_accumulation_steps": 4,
  "max_seq_length": 1024
}
```

Recommended Models:
| Task | Primary Model | Size | VRAM Usage |
|---|---|---|---|
| Text Generation | mistralai/Mistral-Small-3.1-24B-Base-2503 | 24B params | ~12-14 GB |
| Text Generation (alt) | meta-llama/Llama-3.1-8B-Instruct | 8B params | ~8-10 GB |
| Summarization | google/flan-t5-large | 770M params | ~4-5 GB |
| Question Answering | meta-llama/Llama-3.1-8B-Instruct | 8B params | ~8-10 GB |
Optimization Tips:
- ✅ 4-bit quantization recommended
- ✅ Moderate batch size (2-4)
- ✅ Standard gradient accumulation (4-8)
- ✅ Medium sequences (1024-2048 tokens)
- ✅ Unsloth provider for 2x speedup (see the sketch after this list)
- ✅ Both SFT and QLoRA strategies work well
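For reference, loading a model through Unsloth looks roughly like the sketch below; exact arguments may vary by Unsloth version, and this is not necessarily how ModelForge invokes it internally:

```python
from unsloth import FastLanguageModel

# Load the recommended mid-range model in 4-bit with Unsloth's optimized kernels
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,
)
```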
High-End Profile

Hardware Requirements:
- GPU VRAM: ≥ 15.2 GB
- System RAM: ≥ 15.2 GB
Typical Hardware:
- NVIDIA RTX 3080/3090 (10-24GB)
- NVIDIA RTX 4080/4090 (12-24GB)
- NVIDIA RTX A5000/A6000 (24-48GB)
- NVIDIA Tesla V100 (16-32GB)
- NVIDIA A100 (40-80GB)
Recommended Settings:
```json
{
  "compute_specs": "high_end",
  "use_4bit": false,
  "bf16": true,
  "per_device_train_batch_size": 4,
  "gradient_accumulation_steps": 2,
  "max_seq_length": 2048
}
```

Recommended Models:
| Task | Primary Model | Size | VRAM Usage |
|---|---|---|---|
| Text Generation | meta-llama/Llama-4-Maverick-17B-128E-Instruct | 17B params | ~16-18 GB |
| Text Generation (large) | qwen/Qwen2.5-32B | 32B params | ~20-24 GB |
| Summarization | meta-llama/Llama-4-Maverick-17B-128E-Instruct | 17B params | ~16-18 GB |
| Question Answering | qwen/Qwen2.5-32B | 32B params | ~20-24 GB |
Optimization Tips:
- ✅ Optional 4-bit quantization (not required)
- ✅ Use BF16 precision on Ampere+ GPUs (RTX 30xx/40xx); see the sketch after this list
- ✅ Larger batch size (4-8)
- ✅ Lower gradient accumulation (2-4)
- ✅ Longer sequences (2048-4096 tokens)
- ✅ Unsloth provider highly recommended
- ✅ Can use advanced strategies (RLHF, DPO)
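A minimal sketch of guarding the BF16 setting at runtime with `transformers`; the fall-back-to-FP16 logic is an assumption for illustration, not ModelForge's documented behavior:

```python
import torch
from transformers import TrainingArguments

# BF16 needs Ampere or newer; fall back to FP16 on older GPUs
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    bf16=use_bf16,
    fp16=not use_bf16,
)
```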
ModelForge uses these rules to classify your hardware:
```python
def classify_profile(gpu_vram_gb: float, ram_gb: float) -> str:
    if gpu_vram_gb < 7.2:
        return "low_end"
    elif gpu_vram_gb < 15.2 and ram_gb < 15.2:
        return "low_end"
    elif gpu_vram_gb < 15.2 and ram_gb >= 15.2:
        return "mid_range"
    elif gpu_vram_gb >= 15.2 and ram_gb < 15.2:
        return "mid_range"
    else:  # gpu_vram_gb >= 15.2 and ram_gb >= 15.2
        return "high_end"
```

When you start training in the UI:
- Select your task
- Click "Detect Hardware"
- ModelForge automatically:
  - Scans GPU and RAM
  - Classifies into profile
  - Recommends optimal models
  - Pre-fills configuration
```bash
curl -X POST http://localhost:8000/api/finetune/detect \
  -H "Content-Type: application/json" \
  -d '{"task": "text-generation"}'
```

Response:
```json
{
  "hardware_specs": {
    "gpu_name": "NVIDIA RTX 3070",
    "gpu_memory_gb": 8.0,
    "ram_gb": 16.0,
    "cuda_version": "12.6"
  },
  "compute_profile": "mid_range",
  "recommended_model": "meta-llama/Llama-3.1-8B-Instruct",
  "possible_models": [
    "meta-llama/Llama-3.1-8B-Instruct",
    "qwen/Qwen2.5-7B",
    "mistralai/Mistral-Small-3.1-24B-Base-2503"
  ]
}
```

You can override automatic detection:
```json
{
  "compute_specs": "mid_range",  // Force mid-range profile
  "model_name": "meta-llama/Llama-3.1-8B-Instruct",
  ...
}
```

Settings by Profile:

| Profile | Quantization | Batch Size | Grad Accum | Max Seq Len |
|---|---|---|---|---|
| Low End | 4-bit (required) | 1 | 8-16 | 512-1024 |
| Mid Range | 4-bit (recommended) | 2-4 | 4-8 | 1024-2048 |
| High End | Optional | 4-8 | 2-4 | 2048-4096 |
Provider Recommendations:

| Profile | Primary Provider | Secondary | Speedup |
|---|---|---|---|
| Low End | HuggingFace | - | 1x |
| Mid Range | Unsloth | HuggingFace | 2x |
| High End | Unsloth | HuggingFace | 2x |
Note: Unsloth requires Linux, WSL, or Docker. Not available on native Windows.
Strategy Recommendations:

| Profile | Recommended Strategy | Alternative |
|---|---|---|
| Low End | QLoRA | SFT |
| Mid Range | QLoRA or SFT | RLHF, DPO |
| High End | SFT, QLoRA, RLHF, DPO | Any |
Estimating VRAM Usage:

VRAM Usage ≈ Model Size × Precision Factor × Overhead Factor
- 4-bit: ~0.5 GB per billion parameters
- 8-bit: ~1 GB per billion parameters
- 16-bit (FP16/BF16): ~2 GB per billion parameters
- 32-bit (FP32): ~4 GB per billion parameters
Examples:

- 7B model with 4-bit quantization: 7 × 0.5 GB × 1.5 (overhead) ≈ 5.25 GB VRAM
- 7B model with 16-bit precision: 7 × 2 GB × 1.5 (overhead) ≈ 21 GB VRAM
- 32B model with 4-bit quantization: 32 × 0.5 GB × 1.5 (overhead) ≈ 24 GB VRAM
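The same estimate as a small helper function (a sketch of the rule of thumb above, not ModelForge's internal calculation):

```python
# GB of VRAM per billion parameters, by precision
BYTES_PER_BILLION_GB = {"4bit": 0.5, "8bit": 1.0, "16bit": 2.0, "32bit": 4.0}

def estimate_vram_gb(params_billions: float, precision: str, overhead: float = 1.5) -> float:
    """Rough VRAM estimate: model size x precision factor x overhead factor."""
    return params_billions * BYTES_PER_BILLION_GB[precision] * overhead

print(estimate_vram_gb(7, "4bit"))    # ~5.25 GB
print(estimate_vram_gb(7, "16bit"))   # ~21.0 GB
print(estimate_vram_gb(32, "4bit"))   # ~24.0 GB
```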
Error: CUDA out of memory
Solutions:
- Reduce batch size: `per_device_train_batch_size: 1`
- Increase gradient accumulation: `gradient_accumulation_steps: 16`
- Enable 4-bit quantization: `use_4bit: true`
- Reduce sequence length: `max_seq_length: 512`
- Enable gradient checkpointing: `gradient_checkpointing: true`
- Try a smaller model
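Taken together, a memory-constrained configuration that combines these mitigations might look like:

```json
{
  "per_device_train_batch_size": 1,
  "gradient_accumulation_steps": 16,
  "use_4bit": true,
  "max_seq_length": 512,
  "gradient_checkpointing": true
}
```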
Problem: Training is taking too long
Solutions:
- Use Unsloth provider for 2x speedup (if on Linux/WSL)
- Increase batch size if you have VRAM headroom
- Reduce gradient accumulation steps
- Use mixed precision (BF16 on RTX 30xx/40xx)
- Consider a smaller model
Problem: ModelForge detects wrong profile
Solutions:
- Manually specify profile: `"compute_specs": "mid_range"`
- Check GPU drivers are up to date
- Verify CUDA is properly installed
- Check `nvidia-smi` output matches expected VRAM
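A quick way to cross-check what PyTorch sees against `nvidia-smi` (a standalone snippet, not part of ModelForge):

```python
import torch

print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
```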
For Low-End Systems:

- ✅ Always use 4-bit quantization
- ✅ Start with smallest recommended models
- ✅ Use QLoRA strategy
- ✅ Batch size = 1, gradient accumulation = 8-16
- ✅ Keep sequences short (512-1024)
- ✅ Close other GPU applications
For Mid-Range Systems:

- ✅ Use 4-bit quantization for large models (7B+)
- ✅ Unsloth provider for best performance
- ✅ Batch size = 2-4
- ✅ Try both SFT and QLoRA strategies
- ✅ Medium sequences (1024-2048)
For High-End Systems:

- ✅ Unsloth provider strongly recommended for speed
- ✅ Can skip quantization for small models
- ✅ Use BF16 on Ampere+ GPUs
- ✅ Larger batch sizes (4-8)
- ✅ Try advanced strategies (RLHF, DPO)
- ✅ Longer sequences (2048-4096)
Example Training Times:

| Model Size | Profile | Provider | Strategy | Time |
|---|---|---|---|---|
| 3B | Low End | HuggingFace | QLoRA | ~2 hours |
| 7B | Mid Range | Unsloth | QLoRA | ~45 min |
| 7B | Mid Range | HuggingFace | QLoRA | ~90 min |
| 17B | High End | Unsloth | SFT | ~60 min |
| 32B | High End | Unsloth | QLoRA | ~90 min |
Times are approximate and vary based on exact hardware and configuration.
Related documentation:

- Configuration Guide - Detailed configuration options
- Provider Overview - Choose HuggingFace or Unsloth
- Training Strategies - Select optimal strategy
- Performance Optimization - Fine-tune performance
Hardware profiles make ModelForge accessible to everyone! From 4GB to 80GB VRAM, we've got you covered.