Troubleshooting¶
Solutions to common issues when using the fine-tuning tool.
Installation Issues¶
Issue: CUDA Not Available¶
Symptoms:
Causes:
- No NVIDIA GPU
- CUDA drivers not installed
- PyTorch installed without CUDA support
Solutions:
Check GPU:
Reinstall PyTorch with CUDA:
Verify installation:
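The three checks above boil down to a few lines of Python; a minimal verification snippet, assuming PyTorch is already installed:

```python
import torch

# A CUDA build reports a version suffix such as "+cu121"; a CPU-only
# build reports "+cpu" and cuda.is_available() returns False.
print(torch.__version__)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```

If `is_available()` prints `False` despite an NVIDIA GPU being present, the PyTorch build (not the GPU) is usually the culprit.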
Issue: Module Not Found Error¶
Symptoms:
Solution:
If issue persists:
Issue: Version Conflicts¶
Symptoms:
Solution:
Create fresh virtual environment:
```bash
python -m venv fresh_env
source fresh_env/bin/activate  # Windows: fresh_env\Scripts\activate
pip install -r requirements.txt
```
Memory Issues¶
Issue: CUDA Out of Memory (OOM)¶
Symptoms:
Solutions (try in order):
1. Reduce batch size:
2. Reduce max sequence length:
3. Reduce LoRA rank:
4. Limit number of samples:
5. Enable gradient checkpointing (edit code):
6. Use smaller model:
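Step 5 requires editing the training code. Gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them; a minimal illustration in plain PyTorch (the module and shapes are toy examples):

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
x = torch.randn(8, 64, requires_grad=True)

# Activations inside `layer` are recomputed on backward, not stored
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)  # torch.Size([8, 64]) — gradients still flow
```

For HuggingFace models, calling `model.gradient_checkpointing_enable()` turns this on model-wide.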
Memory calculation formula:
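As a rough rule of thumb: FP16 weights cost 2 bytes per parameter, and activations scale with batch size × sequence length × hidden size × layers. A back-of-the-envelope estimator (all constants here are approximations, not exact values):

```python
def estimate_gpu_memory_gb(n_params, batch_size, seq_len, hidden_size,
                           n_layers, bytes_per_param=2):
    """Very rough GPU-memory estimate for FP16 LoRA fine-tuning.

    Treat the result as an order of magnitude, not a guarantee.
    """
    weights = n_params * bytes_per_param
    # Activations: a few bytes per element across batch, sequence, layers
    activations = batch_size * seq_len * hidden_size * n_layers * 4
    overhead = 1.2  # CUDA context, fragmentation, temporary buffers
    return (weights + activations) * overhead / 1e9

# gpt2-like: ~124M params, 12 layers, hidden size 768
print(round(estimate_gpu_memory_gb(124e6, 8, 512, 768, 12), 2))  # 0.48
```

Doubling the batch size or sequence length grows only the activation term, which is why those are the first knobs to turn when you hit OOM.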
Issue: CPU Out of Memory¶
Symptoms:
Solutions:
1. Limit dataset size:
2. Use streaming for large datasets:
Modify code to add streaming:
3. Increase system swap:
```bash
# Linux
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```
Training Issues¶
Issue: Loss Not Decreasing¶
Symptoms:
Causes & Solutions:
1. Learning rate too low:
2. Model frozen: Check that LoRA is properly applied:
3. Insufficient training:
4. Data quality issues:
   - Check that the dataset has meaningful text
   - Verify columns are correctly detected
   - Ensure no empty or null values
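The frozen-model check in step 2 can be sketched in plain PyTorch (with PEFT applied, `model.print_trainable_parameters()` reports the same information):

```python
import torch

def count_trainable(model):
    """Return (trainable, total) parameter counts."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# Toy model: freeze the first layer, as LoRA freezes the base weights
model = torch.nn.Sequential(torch.nn.Linear(10, 10), torch.nn.Linear(10, 2))
for p in model[0].parameters():
    p.requires_grad = False

trainable, total = count_trainable(model)
print(f"{trainable}/{total} trainable")  # 22/132 trainable
```

With LoRA correctly applied, the trainable count should be a small fraction of the total (often under 1%); if it is 0, the adapters were never attached, and if it equals the total, the base model was never frozen.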
Issue: Loss Diverging (NaN)¶
Symptoms:
Causes & Solutions:
1. Learning rate too high:
2. Gradient explosion:
Add gradient clipping (edit code):
3. Data issues:
   - Remove extreme outliers
   - Check for special characters causing issues
   - Normalize text inputs
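The gradient clipping mentioned in step 2 is a single call before `optimizer.step()` in plain PyTorch; with HuggingFace `TrainingArguments`, the equivalent knob is `max_grad_norm`. A minimal sketch with deliberately exaggerated inputs:

```python
import torch

model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4) * 1000  # huge inputs to blow up gradients
loss = model(x).pow(2).mean()
loss.backward()

# Rescale gradients so their global norm is at most 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()

total_norm = torch.norm(torch.stack(
    [p.grad.norm() for p in model.parameters()]))
print(total_norm.item() <= 1.0 + 1e-4)  # True
```

Clipping caps the update magnitude without changing the gradient direction, which is usually enough to stop a run from diverging to NaN.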
Issue: Overfitting¶
Symptoms:
- Training loss decreases while validation loss increases
- ROUGE scores decrease on new data
- Model outputs repetitive text
Solutions:
1. Increase dropout:
2. Reduce epochs:
3. Add more training data:
4. Reduce model capacity:
5. Early stopping (edit code):
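Step 5 requires editing the code; with HuggingFace `Trainer` the usual route is `transformers.EarlyStoppingCallback`, but the logic is simple enough to sketch in plain Python (the patience and loss values below are illustrative):

```python
class EarlyStopper:
    """Stop when the monitored metric has not improved for `patience` evals."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss       # improvement: reset the counter
            self.bad_evals = 0
        else:
            self.bad_evals += 1        # no improvement this evaluation
        return self.bad_evals >= self.patience

stopper = EarlyStopper(patience=2)
for loss in [1.0, 0.8, 0.9, 0.95, 0.99]:  # validation loss turning upward
    if stopper.should_stop(loss):
        print("stopping early")  # triggers on the 4th evaluation
        break
```

Checking against validation loss (not training loss) is what makes this catch overfitting: training loss keeps falling while the monitored metric stalls.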
Dataset Issues¶
Issue: No Text Columns Detected¶
Symptoms:
Solutions:
Check dataset structure:
Manual column specification (edit code around line 120):
```python
text_columns = ["my_text_column", "my_content_column"]
tokenized_dataset, _ = finetuner.prepare_dataset(dataset, text_columns=text_columns)
```
Issue: Dataset Too Large¶
Symptoms:
- Slow loading
- Memory issues
- Long preprocessing
Solutions:
1. Use selective file loading:
2. Limit samples aggressively:
3. Use streaming mode:
Modify dataset loading:
Issue: Column Names Not Recognized¶
Symptoms:
Tool doesn't detect your text columns properly.
Common column names recognized:
`text`, `content`, `input`, `output`, `prompt`, `response`, `instruction`, `question`, `answer`, `summary`
Solution:
Rename your columns or modify detection logic (line 103):
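If renaming is impractical, the detection logic can be extended instead; a minimal sketch of name-based detection (the recognized-name list mirrors the one above, and the function name is illustrative, not the tool's actual API):

```python
RECOGNIZED = {"text", "content", "input", "output", "prompt", "response",
              "instruction", "question", "answer", "summary"}

def detect_text_columns(column_names, extra=()):
    """Return columns whose lowercased names look like text fields."""
    wanted = RECOGNIZED | {name.lower() for name in extra}
    return [c for c in column_names if c.lower() in wanted]

print(detect_text_columns(["id", "Article", "summary"], extra=["article"]))
# ['Article', 'summary']
```

Matching case-insensitively and accepting an `extra` allow-list covers the two most common failure modes: capitalized column names and domain-specific names the default list has never seen.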
Model Issues¶
Issue: Model Not Found¶
Symptoms:
Solution:
Verify model exists:
- Check HuggingFace Models
- Ensure exact name match (case-sensitive)
Common model names:
- ✅ `gpt2`
- ✅ `facebook/opt-125m`
- ✅ `EleutherAI/pythia-410m`
- ❌ `GPT-2` (wrong case)
- ❌ `opt-125m` (missing organization)
Issue: Model Architecture Not Supported¶
Symptoms:
Solution:
Check supported architectures:
- ✅ GPT-2, GPT-Neo, GPT-J
- ✅ OPT, BLOOM, LLaMA
- ✅ T5, FLAN-T5
- ❌ BERT (requires different task type)
Manual target module specification:
Find module names:
Specify manually in setup_lora call.
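Finding the module names comes down to walking `named_modules()`; a sketch on a toy model (for a real checkpoint you would load it with `AutoModelForCausalLM` first):

```python
import torch

# Toy stand-in for a transformer block; real models expose names like
# "transformer.h.0.attn.c_attn" (GPT-2) or "q_proj"/"v_proj" (OPT, LLaMA)
model = torch.nn.ModuleDict({
    "attn": torch.nn.ModuleDict({"q_proj": torch.nn.Linear(8, 8),
                                 "v_proj": torch.nn.Linear(8, 8)}),
    "mlp": torch.nn.Linear(8, 8),
})

linear_names = [name for name, module in model.named_modules()
                if isinstance(module, torch.nn.Linear)]
print(linear_names)  # ['attn.q_proj', 'attn.v_proj', 'mlp']
```

LoRA target modules are usually the attention projections, so names containing `q_proj`, `v_proj`, or `c_attn` are the ones to pass to `setup_lora`.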
Issue: Tokenizer Warnings¶
Symptoms:
Solution:
This is informational. To suppress:
Or truncate more aggressively.
Upload Issues¶
Issue: Authentication Failed¶
Symptoms:
Solutions:
1. Check token:
   - Get a new token: https://huggingface.co/settings/tokens
   - Ensure "Write" permission is enabled
2. Login via CLI:
3. Set environment variable:
Issue: Repository Already Exists¶
Symptoms:
Solutions:
1. Use existing repository:
2. Choose different name:
3. Delete old repository:
   - Go to the repository settings on HuggingFace
   - Delete the repository
   - Try again
Issue: Upload Failed¶
Symptoms:
Solutions:
1. Check internet connection
2. Retry upload: The tool supports resumable uploads.
3. Manual upload:
Performance Issues¶
Issue: Training Too Slow¶
Symptoms:
- < 1 iteration/second
- Hours for small datasets
Solutions:
1. Use GPU: Verify CUDA is enabled:
2. Reduce sequence length:
3. Increase batch size:
4. Use mixed precision: Automatically enabled on GPU (FP16).
5. Reduce dataset size for testing:
Issue: Poor Fine-tuning Results¶
Symptoms:
- ROUGE scores barely improve
- Model outputs generic responses
Solutions:
1. Increase model capacity:
2. Train longer:
3. Check data quality:
   - Ensure diverse, high-quality examples
   - Remove duplicates
   - Balance the dataset
4. Use larger base model:
5. Increase training data:
Debugging Tips¶
Enable Verbose Logging¶
Monitor GPU Usage¶
```bash
# Real-time monitoring
watch -n 1 nvidia-smi

# Log to file
nvidia-smi --query-gpu=timestamp,memory.used,memory.free,utilization.gpu --format=csv -l 1 > gpu_log.csv
```
Check Model Size¶
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
print(f"Params: {model.num_parameters() / 1e6:.1f}M")
```
Validate Dataset¶
```python
from datasets import load_dataset

dataset = load_dataset("your_dataset")
print(dataset)                        # splits and sizes
print(dataset["train"][0])            # first example
print(dataset["train"].column_names)  # detected columns
```

Note that `load_dataset` returns a `DatasetDict`, so you index a split (e.g. `"train"`) before indexing examples.
Getting Help¶
If issues persist:
- Check logs: Review error messages carefully
- Search issues: GitHub Issues
- Open new issue: Include:
- Error message
- Configuration used
- System info (GPU, Python version)
- Steps to reproduce
Common Error Messages Reference¶
| Error | Likely Cause | Quick Fix |
|---|---|---|
| CUDA OOM | Memory exceeded | Reduce batch size |
| NaN loss | Learning rate too high | Reduce learning rate |
| No text columns | Column names not recognized | Check dataset structure |
| 401 Unauthorized | Invalid HF token | Re-login to HuggingFace |
| Connection timeout | Network issue | Retry upload |
| Module not found | Missing dependency | Reinstall requirements |
| Model not found | Wrong model name | Check spelling |
Next Steps¶
- Review Configuration Guide for optimization
- Check Examples for working configurations
- See API Reference for programmatic usage