Examples¶
Practical examples for common fine-tuning scenarios.
Example 1: Fine-tune GPT-2 on Custom Dialogue Data¶
Scenario¶
You have a JSONL file with conversational data and want to create a chatbot.
Dataset Format (dialogue.jsonl)¶
{"prompt": "Hello, how are you?", "response": "I'm doing great! How can I help you today?"}
{"prompt": "What's the weather like?", "response": "I don't have access to weather data, but you can check weather.com"}
Configuration¶
python xlmtec.py
# Interactive prompts:
Model name: gpt2
Output directory: ./dialogue_model
Dataset path: ./dialogue.jsonl
Limit samples: yes
Number of samples: 5000
Max sequence length: 256
LoRA r: 8
LoRA alpha: 32
LoRA dropout: 0.1
Epochs: 5
Batch size: 8
Learning rate: 2e-4
Upload to HuggingFace: no
Expected Results¶
📊 PERFORMANCE COMPARISON
Metric Base Model Fine-tuned Improvement
ROUGE1 0.1823 0.3245 +78.03%
ROUGE2 0.0912 0.2134 +133.99%
ROUGEL 0.1645 0.2987 +81.58%
Using the Fine-tuned Model¶
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
# Load base model
base_model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "./dialogue_model")
# Generate response
prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
Example 2: Domain Adaptation for Medical Text¶
Scenario¶
Adapt a model to medical terminology and clinical notes.
Dataset¶
Using HuggingFace medical dataset.
Configuration¶
python xlmtec.py
# Interactive prompts:
Model name: facebook/opt-350m
Output directory: ./medical_model
Dataset name: medical_meadow_medical_flashcards
Dataset config: default
Split: train
Limit samples: yes
Number of samples: 20000
Max sequence length: 512
LoRA r: 16
LoRA alpha: 64
LoRA dropout: 0.15
Epochs: 3
Batch size: 4
Learning rate: 1e-4
Upload to HuggingFace: yes
Repo name: myusername/opt-medical-350m
Why These Settings?¶
- Higher rank (16): Medical domain requires learning specialized terminology
- Higher alpha (64): Stronger adaptation to domain-specific patterns
- More dropout (0.15): Medical text can be noisy, prevent overfitting
- Lower learning rate (1e-4): Conservative to preserve general knowledge
Example 3: Summarization Task¶
Scenario¶
Fine-tune for news article summarization.
Dataset Format (summaries.csv)¶
article,summary
"Long article text here...","Brief summary here..."
"Another article...","Another summary..."
Configuration¶
python xlmtec.py
# Interactive prompts:
Model name: google/flan-t5-base
Output directory: ./summarization_model
Dataset path: ./summaries.csv
Limit samples: yes
Number of samples: 15000
Max sequence length: 1024
LoRA r: 8
LoRA alpha: 32
LoRA dropout: 0.1
Epochs: 3
Batch size: 4
Learning rate: 2e-4
Data Preparation Tips¶
The tool auto-detects columns. For best results:
- Name columns clearly:
article,text,summary,content - Clean your data: Remove HTML tags, special characters
- Balance length: Keep articles similar length when possible
Example 4: Code Generation¶
Scenario¶
Fine-tune on code examples for Python code generation.
Dataset¶
Using HuggingFace code dataset with specific file selection.
Configuration¶
python xlmtec.py
# Interactive prompts:
Model name: Salesforce/codegen-350M-mono
Output directory: ./code_model
Dataset name: codeparrot/github-code
Dataset config: python
Load specific file: yes
File path: train-00000-of-00200.parquet
Number of samples: 10000
Max sequence length: 512
LoRA r: 8
LoRA alpha: 32
LoRA dropout: 0.05
Epochs: 2
Batch size: 8
Learning rate: 2e-4
Why These Settings?¶
- Lower dropout (0.05): Code has consistent structure, less noise
- Fewer epochs (2): Code datasets are large, less overfitting risk
- Specific file selection: Avoids loading entire 200-file dataset
Example 5: Question Answering¶
Scenario¶
Fine-tune on SQuAD-style Q&A data.
Configuration¶
python xlmtec.py
# Interactive prompts:
Model name: distilbert-base-uncased
Output directory: ./qa_model
Dataset name: squad
Dataset config: plain_text
Split: train
Limit samples: yes
Number of samples: 30000
Max sequence length: 384
LoRA r: 8
LoRA alpha: 32
LoRA dropout: 0.1
Epochs: 3
Batch size: 16
Learning rate: 3e-4
Inference Example¶
from transformers import pipeline
from peft import PeftModel, AutoModelForQuestionAnswering
# Load fine-tuned model
base_model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
model = PeftModel.from_pretrained(base_model, "./qa_model")
# Create QA pipeline
qa = pipeline("question-answering", model=model, tokenizer="distilbert-base-uncased")
# Ask question
context = "Paris is the capital of France. It is known for the Eiffel Tower."
question = "What is the capital of France?"
result = qa(question=question, context=context)
print(result['answer']) # "Paris"
Example 6: Multi-language Fine-tuning¶
Scenario¶
Fine-tune multilingual model for specific languages.
Configuration¶
python xlmtec.py
# Interactive prompts:
Model name: xlm-roberta-base
Output directory: ./multilingual_model
Dataset name: Helsinki-NLP/tatoeba
Dataset config: eng-spa
Split: train
Limit samples: yes
Number of samples: 25000
Max sequence length: 128
LoRA r: 16
LoRA alpha: 32
LoRA dropout: 0.1
Epochs: 4
Batch size: 8
Learning rate: 2e-4
Example 7: Sentiment Analysis¶
Scenario¶
Adapt model for sentiment classification.
Dataset Format (sentiment.jsonl)¶
{"text": "This product is amazing!", "sentiment": "positive"}
{"text": "Terrible experience, would not recommend.", "sentiment": "negative"}
Configuration¶
python xlmtec.py
# Interactive prompts:
Model name: roberta-base
Output directory: ./sentiment_model
Dataset path: ./sentiment.jsonl
Limit samples: yes
Number of samples: 10000
Max sequence length: 256
LoRA r: 4
LoRA alpha: 16
LoRA dropout: 0.1
Epochs: 5
Batch size: 16
Learning rate: 3e-4
Why These Settings?¶
- Lower rank (4): Sentiment is relatively simple classification
- Higher batch size (16): Shorter sequences allow larger batches
- More epochs (5): Smaller dataset benefits from more iterations
Example 8: Instruction Following¶
Scenario¶
Fine-tune on instruction-response pairs.
Dataset¶
Using large instruction dataset with selective loading.
Configuration¶
python xlmtec.py
# Interactive prompts:
Model name: EleutherAI/pythia-410m
Output directory: ./instruction_model
Dataset name: HuggingFaceH4/ultrachat_200k
Load specific file: yes
File path: data/train_sft-00000-of-00004.parquet
Number of samples: 15000
Max sequence length: 1024
LoRA r: 16
LoRA alpha: 64
LoRA dropout: 0.1
Epochs: 2
Batch size: 4
Learning rate: 1e-4
Upload to HuggingFace: yes
Repo name: myusername/pythia-instruction-410m
Private: no
Best Practices from Examples¶
Dataset Size Guidelines¶
- Experiments: 1,000-5,000 samples
- Development: 5,000-20,000 samples
- Production: 20,000+ samples
Model Selection Tips¶
- Start small: Test with GPT-2 or OPT-125M
- Match architecture: Use decoder models (GPT) for generation
- Consider license: Check model licensing for commercial use
- Resource awareness: Larger models need more VRAM
Common Pitfalls to Avoid¶
- ❌ Training on too few samples (< 500)
- ❌ Using max_length longer than needed (wastes memory)
- ❌ Setting rank too high for simple tasks (overfitting)
- ❌ Forgetting to validate results before uploading
- ❌ Not monitoring GPU memory usage
Performance Optimization¶
# For faster iteration during development:
- Use smaller model variants
- Limit samples to 1000-5000
- Reduce max_length
- Use rank 4-8
# For production quality:
- Use full dataset
- Increase rank to 16
- Train for more epochs
- Validate on held-out test set
Recipes Summary¶
| Use Case | Model | Rank | Samples | Max Length |
|---|---|---|---|---|
| Chatbot | GPT-2 | 8 | 5k | 256 |
| Medical | OPT-350M | 16 | 20k | 512 |
| Summarization | FLAN-T5 | 8 | 15k | 1024 |
| Code Gen | CodeGen | 8 | 10k | 512 |
| QA | DistilBERT | 8 | 30k | 384 |
| Multilingual | XLM-R | 16 | 25k | 128 |
| Sentiment | RoBERTa | 4 | 10k | 256 |
| Instructions | Pythia | 16 | 15k | 1024 |
Next Steps¶
- Learn about Configuration Options
- Check Troubleshooting for common issues
- Review API Reference for programmatic usage