common-issues.md

Common Issues and Solutions

Troubleshooting guide for frequent ModelForge issues.

Installation Issues

CUDA Not Available

Symptom: torch.cuda.is_available() returns False

Solutions:

Verify NVIDIA drivers: nvidia-smi
Check CUDA installation: nvcc --version

Reinstall PyTorch with a build that matches your CUDA version (the cu126 index below targets CUDA 12.6):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
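The checks above can be gathered into a single report. The following is a minimal sketch, not part of ModelForge: `cuda_diagnostics` is a hypothetical helper name, and it only reports whether the tools are on PATH and what PyTorch (if installed) can see.

```python
import importlib.util
import shutil


def cuda_diagnostics():
    """Collect basic facts that help explain why CUDA is unavailable."""
    report = {
        # Full path to each tool, or None when it is not on PATH.
        "nvidia-smi": shutil.which("nvidia-smi"),
        "nvcc": shutil.which("nvcc"),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_cuda_build": None,
        "cuda_available": None,
    }
    if report["torch_installed"]:
        import torch

        # torch.version.cuda is None for CPU-only builds of PyTorch.
        report["torch_cuda_build"] = torch.version.cuda
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in cuda_diagnostics().items():
        print(f"{key}: {value}")
```

If `nvidia-smi` is found but `torch_cuda_build` is `None`, the installed PyTorch is most likely a CPU-only build, which is the case the reinstall command addresses.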

Python Version Issues

Symptom: Installation fails with "requires Python 3.11"

Solution: Install Python 3.11:

# Linux
sudo apt install python3.11

# Or use pyenv
pyenv install 3.11.0
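To confirm which interpreter is active before reinstalling, a quick check like the one below may help. It is a sketch under one assumption: that "requires Python 3.11" means exactly 3.11.x. If newer versions are also accepted, change `==` to `>=`. The function name `python_is_supported` is hypothetical.

```python
import sys


def python_is_supported(version_info=None, required=(3, 11)):
    """Return True when the interpreter's major.minor equals `required`.

    Assumption: the requirement is an exact match on 3.11, not a minimum.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(required)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    print(f"Python {current} supported: {python_is_supported()}")
```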

Training Issues

CUDA Out of Memory

Symptom: Training crashes with OOM error

Solutions (in order of preference):

Use QLoRA strategy:
```
{"strategy": "qlora", "use_4bit": true}
```
Reduce batch size:
```
{"per_device_train_batch_size": 1}
```
Reduce sequence length:
```
{"max_seq_length": 1024}
```
Enable gradient checkpointing:
```
{"gradient_checkpointing": true}
```
Use a smaller model
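The order of preference above can be expressed as a ladder of configs to try one after another. This is only an illustrative sketch: `oom_fallbacks` is a hypothetical helper, and it assumes the config is a plain dictionary using the keys shown in this section.

```python
def oom_fallbacks(config):
    """Yield configs with the OOM mitigations applied cumulatively, in order."""
    current = dict(config)  # never mutate the caller's config

    # 1. Switch to QLoRA with 4-bit loading.
    current.update({"strategy": "qlora", "use_4bit": True})
    yield dict(current)

    # 2. Reduce the batch size.
    current["per_device_train_batch_size"] = 1
    yield dict(current)

    # 3. Reduce the sequence length (only ever shrink it, never grow it).
    current["max_seq_length"] = min(current.get("max_seq_length", 1024), 1024)
    yield dict(current)

    # 4. Enable gradient checkpointing.
    current["gradient_checkpointing"] = True
    yield dict(current)


if __name__ == "__main__":
    base = {"per_device_train_batch_size": 4, "max_seq_length": 2048}
    for attempt in oom_fallbacks(base):
        print(attempt)
```

Each yielded config keeps the earlier mitigations, so you can stop at the first one that trains without running out of memory. The last item in the list, switching to a smaller model, is left as a manual step.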

Training Very Slow

Solutions:

Use the Unsloth provider (about 2x faster)
Use bf16 on Ampere or newer GPUs:
```
{"bf16": true}
```
Increase the batch size if VRAM allows
Store the dataset on an NVMe SSD
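Whether a GPU counts as "Ampere or newer" can be decided from its CUDA compute capability, since Ampere starts at 8.0. The sketch below assumes you pass in the `(major, minor)` tuple yourself. With PyTorch installed, `torch.cuda.get_device_capability()` returns that tuple. Both function names are hypothetical.

```python
def supports_bf16(compute_capability):
    """Ampere and newer NVIDIA GPUs report compute capability 8.0 or higher."""
    return tuple(compute_capability) >= (8, 0)


def precision_flags(compute_capability):
    """Pick bf16 when the GPU supports it, otherwise fall back to fp16."""
    if supports_bf16(compute_capability):
        return {"bf16": True}
    return {"fp16": True}


if __name__ == "__main__":
    # Example capabilities: (7, 5) is Turing, (8, 6) is Ampere.
    for capability in [(7, 5), (8, 6)]:
        print(capability, precision_flags(capability))
```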

Model Not Found

Symptom: "Model X not found on HuggingFace Hub"

Solutions:

Check that the model ID is correct
Set your HuggingFace token:
```
export HUGGINGFACE_TOKEN=your_token
```
For gated models, accept the license on the model's HuggingFace page
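A small preflight check can catch the first two causes before a training run starts. This is a sketch, not ModelForge's own validation: `model_access_warnings` is a hypothetical helper. It assumes Hub model IDs take the form `name` or `owner/name`, and it reads the `HUGGINGFACE_TOKEN` variable shown above.

```python
import os


def model_access_warnings(model_id, env=None):
    """Return human-readable warnings about a model ID and token setup."""
    env = os.environ if env is None else env
    warnings = []
    if model_id != model_id.strip():
        warnings.append("model ID has leading or trailing whitespace")
    if model_id.count("/") > 1:
        warnings.append("model ID should look like 'name' or 'owner/name'")
    if not env.get("HUGGINGFACE_TOKEN"):
        warnings.append("HUGGINGFACE_TOKEN is not set; gated models will not load")
    return warnings


if __name__ == "__main__":
    for message in model_access_warnings("owner/model-name"):
        print("warning:", message)
```

The check cannot tell whether a license has been accepted; that still has to be done on the model's page.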

Windows-Specific Issues

Unsloth Not Working

Symptom: "Unsloth is not installed" on Windows

Solution: Unsloth requires Linux. Use WSL or Docker.

See Windows Installation for details.

WSL GPU Not Detected

Solutions:

Update Windows to the latest version
Update NVIDIA drivers to 525.60 or later
Confirm you are running WSL 2: wsl --status
Restart WSL: wsl --shutdown
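To check the driver requirement programmatically, the version string can be compared against the 525.60 minimum. The sketch below uses a hypothetical helper, `driver_at_least`, and expects a dotted version string such as the one printed by `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```python
def driver_at_least(version, minimum=(525, 60)):
    """Compare a dotted driver version such as '535.104.05' with a minimum."""
    parts = tuple(int(piece) for piece in version.strip().split("."))
    # Compare only as many components as the minimum specifies.
    return parts[: len(minimum)] >= tuple(minimum)


if __name__ == "__main__":
    for version in ["535.104.05", "525.53", "470.82.01"]:
        print(version, driver_at_least(version))
```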

Dataset Issues

Dataset Validation Failed

Symptom: "Missing required field 'output'"

Solution: Ensure all examples have required fields:

{"input": "text", "output": "text"}
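To find which examples are incomplete, each line can be checked for the required keys. This sketch assumes a JSON Lines dataset (one object per line, as in `dataset.jsonl`) and the two fields shown above. `find_missing_fields` is a hypothetical helper, not a ModelForge function.

```python
import json

REQUIRED_FIELDS = ("input", "output")


def find_missing_fields(lines, required=REQUIRED_FIELDS):
    """Return (line_number, missing_fields) for each incomplete example."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        if not isinstance(record, dict):
            # A non-object line cannot contain any of the fields.
            problems.append((number, list(required)))
            continue
        missing = [field for field in required if field not in record]
        if missing:
            problems.append((number, missing))
    return problems


if __name__ == "__main__":
    sample = ['{"input": "a", "output": "b"}', '{"input": "a"}']
    print(find_missing_fields(sample))
```

For a real file, pass the open file handle, for example `find_missing_fields(open("dataset.jsonl", encoding="utf-8"))`.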

Invalid JSON

Symptom: "Invalid JSON on line X"

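Solution: Validate the file. A .jsonl file holds one JSON document per line, so pass --json-lines (available in Python 3.8+):

python -m json.tool --json-lines dataset.jsonl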
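When several lines are broken, it can be easier to list them all at once. The sketch below is a hypothetical helper, `invalid_json_lines`, that reports the line number and parser message for every line that fails to parse. It assumes a JSON Lines file.

```python
import json


def invalid_json_lines(lines):
    """Return (line_number, message) for every line that is not valid JSON."""
    errors = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are skipped, not reported
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((number, exc.msg))
    return errors


if __name__ == "__main__":
    sample = ['{"input": "a", "output": "b"}', "{'input': 'a'}"]
    for number, message in invalid_json_lines(sample):
        print(f"line {number}: {message}")
```

Common causes are single quotes instead of double quotes, trailing commas, and unescaped newlines inside strings.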

Provider Issues

Provider Not Found

Symptom: "Unknown provider 'unsloth'"

Solution: Install the provider:

pip install unsloth

max_seq_length Error with Unsloth

Symptom: "max_seq_length cannot be -1"

Solution: Set fixed value:

{"max_seq_length": 2048}
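If configs are generated by a script, the fix can be applied automatically. This sketch assumes the config is a plain dictionary and that a missing, `None`, or non-positive `max_seq_length` should be replaced. `resolve_max_seq_length` is a hypothetical name, and 2048 is simply the value used in the example above.

```python
def resolve_max_seq_length(config, default=2048):
    """Return a copy of `config` with a fixed, positive max_seq_length."""
    resolved = dict(config)
    value = resolved.get("max_seq_length")
    if value is None or value <= 0:
        resolved["max_seq_length"] = default
    return resolved


if __name__ == "__main__":
    print(resolve_max_seq_length({"provider": "unsloth", "max_seq_length": -1}))
```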

API Issues

Port Already in Use

Symptom: "Address already in use: 8000"

Solutions:

Find process: lsof -i :8000
Kill the process or use a different port:
```
modelforge --port 8080
```
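On systems without `lsof`, the same check can be done from Python by trying to bind the port. This is a minimal sketch with hypothetical helper names. It only tests the loopback address and scans an arbitrary range of 8000 to 8099.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True when `port` can be bound on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def first_free_port(start=8000, end=8100):
    """Return the first free port in [start, end), or None if all are taken."""
    for port in range(start, end):
        if port_is_free(port):
            return port
    return None


if __name__ == "__main__":
    print("8000 free:", port_is_free(8000))
    print("first free port:", first_free_port())
```

The port returned by `first_free_port` can then be passed to `modelforge --port`.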

Connection Refused

Solutions:

Check that ModelForge is running: ps aux | grep modelforge
Check your firewall settings
Try connecting via localhost: http://localhost:8000
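A quick way to tell "nothing is listening" apart from a browser or proxy problem is to open a raw TCP connection. The sketch below uses a hypothetical helper, `can_connect`, and assumes the default port 8000 used elsewhere in this guide.

```python
import socket


def can_connect(host="localhost", port=8000, timeout=2.0):
    """Return True when a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    print("ModelForge reachable:", can_connect())
```

If this returns `False` while the process is running, the server is probably bound to a different port or interface, or a firewall is blocking the connection.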

Performance Issues

High Memory Usage

Solutions:

Use gradient checkpointing
Use 4-bit quantization
Reduce batch size
Close other applications

Slow Inference

Solutions:

Use smaller model
Reduce max_seq_length
Use quantization
Batch requests

Strategy-Task Incompatibility

DPO/RLHF with Non-Text-Generation Task

Symptom: Validation error when starting training with DPO or RLHF strategy

Solution: DPO and RLHF strategies only support "task": "text-generation". Change your task:

{"strategy": "dpo", "task": "text-generation"}

Unsloth with Non-Text-Generation Task

Symptom: Validation error when using Unsloth provider with summarization or QA

Solution: Unsloth only supports "task": "text-generation". Use "provider": "huggingface" for other tasks.
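Both rules in this section can be checked before a job is submitted. The following is a sketch of such a check, not ModelForge's actual validator: `compatibility_errors` is a hypothetical name, and the strategy identifier "rlhf" is assumed by analogy with "dpo".

```python
TEXT_GENERATION = "text-generation"


def compatibility_errors(config):
    """Return a list of strategy/provider/task conflicts found in `config`."""
    errors = []
    task = config.get("task")
    strategy = config.get("strategy")
    provider = config.get("provider")

    # DPO and RLHF only support text generation ("rlhf" is an assumed name).
    if strategy in {"dpo", "rlhf"} and task != TEXT_GENERATION:
        errors.append(f"strategy '{strategy}' requires task '{TEXT_GENERATION}'")

    # Unsloth only supports text generation.
    if provider == "unsloth" and task != TEXT_GENERATION:
        errors.append(
            f"provider 'unsloth' requires task '{TEXT_GENERATION}'; "
            "use provider 'huggingface' for other tasks"
        )
    return errors


if __name__ == "__main__":
    print(compatibility_errors({"strategy": "dpo", "task": "summarization"}))
```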

Missing Optional Dependencies

bitsandbytes Not Installed

Symptom: ImportError: bitsandbytes is required for quantization

Solution: Install the quantization extra:

pip install "modelforge-finetuning[quantization]"

CLI Wizard Not Starting

Symptom: ImportError for questionary or rich when running modelforge cli

Solution: Install the CLI extra:

pip install "modelforge-finetuning[cli]"
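To see which extras are missing without triggering the ImportError, the modules can be probed with `importlib`. This sketch is not part of ModelForge: the helper names are hypothetical, and the module-to-extra mapping is taken from the two entries above.

```python
import importlib.util

# Import name -> pip extra, based on the two entries in this section.
EXTRA_FOR_MODULE = {
    "bitsandbytes": "quantization",
    "questionary": "cli",
    "rich": "cli",
}


def missing_extras(find_spec=importlib.util.find_spec):
    """Return the sorted extras whose modules cannot be found."""
    missing = {
        extra
        for module, extra in EXTRA_FOR_MODULE.items()
        if find_spec(module) is None
    }
    return sorted(missing)


def install_command(extras):
    """Build one pip command that installs all the given extras."""
    joined = ",".join(extras)
    return f'pip install "modelforge-finetuning[{joined}]"'


if __name__ == "__main__":
    extras = missing_extras()
    print(install_command(extras) if extras else "All optional dependencies found.")
```

The package name is quoted because some shells, zsh among them, treat unquoted square brackets as a glob pattern.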

More Help

Apple Silicon (MPS) Issues

MPS Not Available

Symptom: torch.backends.mps.is_available() returns False

Solutions:

Verify you're on macOS 12.3 or later
Verify you have Apple Silicon (M1/M2/M3/M4/M5)
Update PyTorch to latest version:
```
pip install --upgrade torch
```

Check MPS build:

import torch
print(f"MPS built: {torch.backends.mps.is_built()}")

"Unsloth provider is not supported on Apple MPS"

Symptom: Error when trying to use Unsloth on macOS

Solution: Unsloth requires NVIDIA CUDA and is not compatible with Apple MPS. Use HuggingFace provider instead:

{
  "provider": "huggingface",
  "device": "mps"
}

"4-bit quantization via bitsandbytes is not supported on MPS"

Symptom: Error or warning about quantization on MPS

Solution: The bitsandbytes library does not support MPS. Disable quantization:

{
  "use_4bit": false,
  "use_8bit": false,
  "fp16": true
}

Note: ModelForge automatically disables quantization on MPS, but you may still see this warning.
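The two MPS restrictions above (no Unsloth, no bitsandbytes quantization) can be applied to an existing config in one step. This is an illustrative sketch with a hypothetical helper name. It only touches the keys shown in this section and leaves everything else unchanged.

```python
def adapt_config_for_mps(config):
    """Return a copy of `config` adjusted for Apple Silicon (MPS)."""
    adapted = dict(config)
    adapted["device"] = "mps"
    # Unsloth requires NVIDIA CUDA, so use the HuggingFace provider.
    adapted["provider"] = "huggingface"
    # bitsandbytes quantization is not supported on MPS.
    adapted["use_4bit"] = False
    adapted["use_8bit"] = False
    return adapted


if __name__ == "__main__":
    cuda_config = {"provider": "unsloth", "use_4bit": True, "max_seq_length": 1024}
    print(adapt_config_for_mps(cuda_config))
```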

MPS Backend Out of Memory

Symptom: "MPS backend out of memory" error during training

Solutions (in order of preference):

Use a smaller model (3B instead of 7B)
Reduce max_seq_length:
```
{"max_seq_length": 512}
```
Reduce batch size:
```
{"per_device_train_batch_size": 1}
```
Enable gradient checkpointing:
```
{"gradient_checkpointing": true}
```
Close other applications to free unified memory
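A rough estimate shows why a 3B model is the first suggestion. The sketch below computes the memory needed for the weights alone, assuming 2 bytes per parameter (fp16) and 1 GB = 1e9 bytes. Training needs more than this for gradients, optimizer state, and activations, so treat the result as a lower bound. The function name is hypothetical.

```python
def weight_memory_gb(params_billion, bytes_per_param=2):
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bytes_per_param
    return total_bytes / 1e9


if __name__ == "__main__":
    for size in (3, 7):
        print(f"{size}B parameters in fp16: {weight_memory_gb(size):.1f} GB")
```

Under these assumptions a 7B model needs about 14 GB for weights and a 3B model about 6 GB, all of which comes out of the unified memory shared with the rest of the system.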

MPS Training Very Slow

Symptom: Training takes much longer than expected

Expected Behavior: MPS is typically 3-5x slower than high-end NVIDIA GPUs, but still much faster than CPU.

Tips to improve speed:

Use smaller models (1-3B parameters)
Reduce max_seq_length to 512 or 1024
Disable gradient checkpointing if you have enough memory:
```
{"gradient_checkpointing": false}
```
Close other applications to free resources

"RuntimeError: MPS does not support..."

Symptom: Operation not supported on MPS backend

Cause: Some PyTorch operations are not yet implemented for MPS.

Solutions:

Update to the latest PyTorch version:
```
pip install --upgrade torch
```
Try a different model architecture
Fall back to CPU or use an NVIDIA GPU if available

Note: Report unsupported operations to PyTorch via their GitHub issues.
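For the CPU fallback, PyTorch provides the `PYTORCH_ENABLE_MPS_FALLBACK` environment variable, which runs unsupported operations on the CPU instead of raising an error. The sketch below sets it from Python. Whether ModelForge sets or overrides this variable itself is not covered in this guide, so treat it as something to try. `enable_mps_cpu_fallback` is a hypothetical helper name.

```python
import os


def enable_mps_cpu_fallback():
    """Ask PyTorch to run MPS-unsupported operations on the CPU.

    Call this before importing torch so the setting is in place
    when PyTorch initializes.
    """
    os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"


if __name__ == "__main__":
    enable_mps_cpu_fallback()
    print(os.environ["PYTORCH_ENABLE_MPS_FALLBACK"])
```

Setting `PYTORCH_ENABLE_MPS_FALLBACK=1` in the shell before launching ModelForge has the same effect. Operations that fall back run on the CPU, so expect them to be slower.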

Model Loading Fails on MPS

Symptom: Model fails to load or crashes on MPS

Solutions:

Ensure you're using HuggingFace provider (not Unsloth)
Disable quantization:
```
{"use_4bit": false, "use_8bit": false}
```
Try loading a different model (some architectures have better MPS support)
Check you have enough unified memory for the model

Still having issues? Create an issue on GitHub.
