Troubleshooting guide for frequent ModelForge issues.
Symptom: torch.cuda.is_available() returns False
Solutions:
- Verify NVIDIA drivers:
nvidia-smi
- Check CUDA installation:
nvcc --version
- Reinstall PyTorch with correct CUDA version:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
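The three checks above can also be gathered from Python in one pass. The following is a minimal sketch, not part of ModelForge: `cuda_report` is a hypothetical helper name, and it assumes that `nvidia-smi` and `nvcc` are on PATH when the driver and toolkit are installed.

```python
import importlib.util
import shutil


def cuda_report():
    """Collect the facts needed to diagnose a False torch.cuda.is_available()."""
    report = {
        # Driver and toolkit binaries are normally on PATH when installed.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "nvcc_on_path": shutil.which("nvcc") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        # Stays None if torch is missing or the wheel was built without CUDA.
        "torch_cuda_build": None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        import torch

        report["torch_cuda_build"] = torch.version.cuda
        report["cuda_available"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `torch_cuda_build` is `None` while `nvidia_smi_on_path` is true, the installed wheel is most likely CPU-only, and the reinstall step is the one to try first.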
Symptom: Installation fails with "requires Python 3.11"
Solution: Install Python 3.11:
# Linux
sudo apt install python3.11
# Or use pyenv
pyenv install 3.11.0
Symptom: Training crashes with OOM error
Solutions (in order of preference):
- Use QLoRA strategy:
{"strategy": "qlora", "use_4bit": true}
- Reduce batch size:
{"per_device_train_batch_size": 1}
- Reduce sequence length:
{"max_seq_length": 1024}
- Enable gradient checkpointing:
{"gradient_checkpointing": true}
- Use smaller model
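One way to work through the list above is to apply the settings cumulatively and retry after each step. This sketch only manipulates a plain config dictionary; `escalate` and `OOM_STEPS` are hypothetical names, and the starting values in `base` are made up for illustration.

```python
import json

# Settings from the list above, in order of preference.
OOM_STEPS = [
    {"strategy": "qlora", "use_4bit": True},
    {"per_device_train_batch_size": 1},
    {"max_seq_length": 1024},
    {"gradient_checkpointing": True},
]


def escalate(config, steps=OOM_STEPS):
    """Yield copies of config with one more memory-saving step applied each time."""
    current = dict(config)
    for step in steps:
        current = {**current, **step}
        yield dict(current)


# Hypothetical starting point; replace with your own config.
base = {"strategy": "lora", "per_device_train_batch_size": 4, "max_seq_length": 2048}

for attempt, cfg in enumerate(escalate(base), start=1):
    print(f"attempt {attempt}: {json.dumps(cfg)}")
```

Each yielded config keeps the earlier steps, so the last one combines all four settings. Switching to a smaller model stays a manual decision and is not modeled here.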
Solutions:
- Use Unsloth provider (2x faster)
- Use bf16 on Ampere+ GPUs:
{"bf16": true}
- Increase batch size if VRAM allows
- Use NVMe SSD for dataset
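Whether a GPU counts as "Ampere+" can be read from its CUDA compute capability, which is 8.0 or higher for Ampere and newer architectures. The sketch below is an illustration under that assumption; `is_ampere_or_newer` and `bf16_setting` are hypothetical helper names, not ModelForge functions.

```python
import importlib.util


def is_ampere_or_newer(capability):
    """Ampere GPUs report compute capability 8.x; newer architectures report higher."""
    major, _minor = capability
    return major >= 8


def bf16_setting():
    """Return {"bf16": True} only when an Ampere+ CUDA GPU is visible to PyTorch."""
    if importlib.util.find_spec("torch") is None:
        return {"bf16": False}
    import torch

    if not torch.cuda.is_available():
        return {"bf16": False}
    # get_device_capability returns a (major, minor) tuple for the given device.
    return {"bf16": is_ampere_or_newer(torch.cuda.get_device_capability(0))}


if __name__ == "__main__":
    print(bf16_setting())
```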
Symptom: "Model X not found on HuggingFace Hub"
Solutions:
- Check model ID is correct
- Set HuggingFace token:
export HUGGINGFACE_TOKEN=your_token
- For gated models, accept license on HuggingFace
Symptom: "Unsloth is not installed" on Windows
Solution: Unsloth requires Linux. Use WSL or Docker.
See Windows Installation for details.
Solutions:
- Update to latest Windows version
- Update NVIDIA drivers (525.60+)
- Ensure WSL 2:
wsl --status
- Restart WSL:
wsl --shutdown
Symptom: "Missing required field 'output'"
Solution: Ensure all examples have required fields:
{"input": "text", "output": "text"}
Symptom: "Invalid JSON on line X"
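Solution: Validate JSON. For a .jsonl file, pass --json-lines (Python 3.8+) so that each line is parsed as a separate JSON object; without it, json.tool rejects any file with more than one line:
python -m json.tool --json-lines dataset.jsonl
Symptom: "Unknown provider 'unsloth'"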
Solution: Install provider:
pip install unsloth
Symptom: "max_seq_length cannot be -1"
Solution: Set fixed value:
{"max_seq_length": 2048}
Symptom: "Address already in use: 8000"
Solutions:
- Find process:
lsof -i :8000
- Kill process or use different port:
modelforge --port 8080
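On systems without `lsof`, the port check above can be done from Python with the standard `socket` module. This is a minimal sketch; `port_in_use` and `find_free_port` are hypothetical helper names.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


def find_free_port(host="127.0.0.1"):
    """Ask the OS for an unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    if port_in_use(8000):
        print(f"Port 8000 is busy; try: modelforge --port {find_free_port()}")
    else:
        print("Port 8000 is free")
```

The port returned by `find_free_port` is released before it is reported, so another process could claim it in the meantime; treat it as a suggestion rather than a reservation.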
Solutions:
- Check ModelForge is running:
ps aux | grep modelforge
- Check firewall settings
- Try localhost:
http://localhost:8000
Solutions:
- Use gradient checkpointing
- Use 4-bit quantization
- Reduce batch size
- Close other applications
Solutions:
- Use smaller model
- Reduce max_seq_length
- Use quantization
- Batch requests
Symptom: Validation error when starting training with DPO or RLHF strategy
Solution: DPO and RLHF strategies only support "task": "text-generation". Change your task:
{"strategy": "dpo", "task": "text-generation"}
Symptom: Validation error when using Unsloth provider with summarization or QA
Solution: Unsloth only supports "task": "text-generation". Use "provider": "huggingface" for other tasks.
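The two compatibility rules above can be checked before a run is submitted. The sketch below is not ModelForge's validator; `compatibility_errors` is a hypothetical helper, and the lowercase value `"rlhf"` and the task name `"summarization"` are assumed spellings.

```python
TEXT_GENERATION = "text-generation"


def compatibility_errors(config):
    """Return a list of problems; an empty list means the combination is allowed."""
    errors = []
    task = config.get("task")
    strategy = config.get("strategy")
    provider = config.get("provider")

    # DPO and RLHF strategies only support text generation.
    if strategy in {"dpo", "rlhf"} and task != TEXT_GENERATION:
        errors.append(f'strategy "{strategy}" only supports task "{TEXT_GENERATION}"')

    # The Unsloth provider only supports text generation.
    if provider == "unsloth" and task != TEXT_GENERATION:
        errors.append(
            f'provider "unsloth" only supports task "{TEXT_GENERATION}"; '
            'use provider "huggingface" for other tasks'
        )
    return errors


if __name__ == "__main__":
    bad = {"strategy": "dpo", "provider": "unsloth", "task": "summarization"}
    for problem in compatibility_errors(bad):
        print(problem)
```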
Symptom: ImportError: bitsandbytes is required for quantization
Solution: Install the quantization extra:
pip install "modelforge-finetuning[quantization]"
Symptom: ImportError for questionary or rich when running modelforge cli
Solution: Install the CLI extra:
pip install "modelforge-finetuning[cli]"
Symptom: torch.backends.mps.is_available() returns False
Solutions:
- Verify you're on macOS 12.3 or later
- Verify you have Apple Silicon (M1/M2/M3/M4/M5)
- Update PyTorch to latest version:
pip install --upgrade torch
- Check MPS build:
import torch
print(f"MPS built: {torch.backends.mps.is_built()}")
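The platform requirements and the build check above can be combined into one report. This is a sketch using only `platform` from the standard library plus PyTorch when it is installed; `macos_meets_mps_requirements` and `mps_report` are hypothetical helper names.

```python
import importlib.util
import platform


def macos_meets_mps_requirements(system, machine, macos_version):
    """Check the platform requirements listed above: macOS 12.3+ on Apple Silicon."""
    if system != "Darwin" or machine != "arm64":
        return False
    parts = [int(p) for p in macos_version.split(".")[:2] if p.isdigit()]
    parts += [0] * (2 - len(parts))  # pad "13" to [13, 0]
    return tuple(parts) >= (12, 3)


def mps_report():
    """Summarize platform support and what the installed PyTorch reports."""
    report = {
        "platform_ok": macos_meets_mps_requirements(
            platform.system(), platform.machine(), platform.mac_ver()[0]
        ),
        "mps_built": False,
        "mps_available": False,
    }
    if importlib.util.find_spec("torch") is not None:
        import torch

        # Very old PyTorch releases have no torch.backends.mps module.
        mps = getattr(torch.backends, "mps", None)
        if mps is not None:
            report["mps_built"] = bool(mps.is_built())
            report["mps_available"] = bool(mps.is_available())
    return report


if __name__ == "__main__":
    print(mps_report())
```

A Python interpreter running under Rosetta reports `x86_64` as its machine type even on Apple Silicon, so `platform_ok` being false on an M-series Mac usually points at an Intel build of Python.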
Symptom: Error when trying to use Unsloth on macOS
Solution: Unsloth requires NVIDIA CUDA and is not compatible with Apple MPS. Use HuggingFace provider instead:
{
  "provider": "huggingface",
  "device": "mps"
}
Symptom: Error or warning about quantization on MPS
Solution: The bitsandbytes library doesn't support MPS. Disable quantization:
{
  "use_4bit": false,
  "use_8bit": false,
  "fp16": true
}
Note: ModelForge automatically disables quantization on MPS, but you may still see this warning.
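If you generate configs programmatically, the MPS adjustments described above can be applied in one place so the warning never appears. This is a sketch of that idea, not ModelForge's own logic; `adapt_config_for_mps` is a hypothetical helper.

```python
def adapt_config_for_mps(config):
    """Return a copy of config with MPS-incompatible options switched off."""
    adapted = dict(config)
    if adapted.get("device") != "mps":
        return adapted  # nothing to change for CUDA or CPU runs

    # bitsandbytes quantization is unavailable on MPS, so fall back to fp16.
    adapted.update({"use_4bit": False, "use_8bit": False, "fp16": True})

    # Unsloth requires NVIDIA CUDA; the HuggingFace provider works on MPS.
    if adapted.get("provider") == "unsloth":
        adapted["provider"] = "huggingface"
    return adapted


if __name__ == "__main__":
    print(adapt_config_for_mps({"device": "mps", "provider": "unsloth", "use_4bit": True}))
```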
Symptom: "MPS backend out of memory" error during training
Solutions (in order of preference):
- Use a smaller model (3B instead of 7B)
- Reduce max_seq_length:
{"max_seq_length": 512}
- Reduce batch size:
{"per_device_train_batch_size": 1}
- Enable gradient checkpointing:
{"gradient_checkpointing": true}
- Close other applications to free unified memory
Symptom: Training takes much longer than expected
Expected Behavior: MPS is 3-5x slower than high-end NVIDIA GPUs, but still much faster than CPU.
Tips to improve speed:
- Use smaller models (1-3B parameters)
- Reduce max_seq_length to 512 or 1024
- Disable gradient checkpointing if you have enough memory:
{"gradient_checkpointing": false}
- Close other applications to free resources
Symptom: Operation not supported on MPS backend
Cause: Some PyTorch operations are not yet implemented for MPS.
Solutions:
- Update to the latest PyTorch version:
pip install --upgrade torch
- Try a different model architecture
- Fall back to CPU or use an NVIDIA GPU if available
Note: Report unsupported operations to PyTorch via their GitHub issues.
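For the CPU fallback mentioned above, PyTorch provides the `PYTORCH_ENABLE_MPS_FALLBACK` environment variable, which runs operations that MPS does not implement on the CPU instead of raising an error. The sketch below sets it before importing torch, which is the safe ordering; `pick_device` is a hypothetical helper, and whether ModelForge honors this variable in its own process is an assumption you should verify.

```python
import importlib.util
import os

# Set this before torch is imported so the fallback is active for the whole run.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"


def pick_device():
    """Prefer an NVIDIA GPU, then MPS, then CPU."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(f"Using device: {pick_device()}")
```

Operations that fall back to the CPU run more slowly, so this is a workaround for getting a model to run rather than a fix for performance.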
Symptom: Model fails to load or crashes on MPS
Solutions:
- Ensure you're using HuggingFace provider (not Unsloth)
- Disable quantization:
{"use_4bit": false, "use_8bit": false}
- Try loading a different model (some architectures have better MPS support)
- Check you have enough unified memory for the model
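A rough lower bound for the last check is the size of the weights alone: parameter count times bytes per parameter. The sketch below computes only that; `weight_memory_gib` is a hypothetical helper, and real usage is higher because training also needs gradients, optimizer state, and activations, while unified memory is shared with the rest of the system.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def weight_memory_gib(num_params, dtype="fp16"):
    """Memory needed just to hold the model weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for label, params in [("3B", 3e9), ("7B", 7e9)]:
        print(f"{label}: ~{weight_memory_gib(params):.1f} GiB of weights in fp16")
```

With quantization disabled, a 7B model needs about 13 GiB for fp16 weights alone, which is why a 3B model (about 5.6 GiB) is the safer choice on machines with 16 GB of unified memory.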
Still having issues? Create an issue on GitHub.