# Configuration Reference
All YAML config fields for every training method, distillation mode, and pruning command.
## Common sections

Every config file used with `xlmtec train` shares these top-level sections.
### model

```yaml
model:
  name: "gpt2"            # HuggingFace model id or local path
  device: "auto"          # "auto" | "cpu" | "cuda"
  torch_dtype: "float32"  # "float32" | "float16" | "bfloat16"
  load_in_4bit: false     # true for QLoRA
  load_in_8bit: false
```
### dataset

```yaml
dataset:
  source: local_file  # local_file | huggingface_hub | csv | parquet
  path: ./data/sample.jsonl
  max_samples: 1000   # null = use all
  text_columns: null  # null = auto-detect
  shuffle: true
```
### tokenization

```yaml
tokenization:
  max_length: 256
  truncation: true
  padding: max_length  # "max_length" | "longest" | false
```
### training

```yaml
training:
  method: lora          # see methods below
  output_dir: ./outputs/run
  num_epochs: 3
  batch_size: 4
  gradient_accumulation_steps: 4
  learning_rate: 2.0e-4
  warmup_ratio: 0.1
  lr_scheduler_type: cosine
  fp16: false
  gradient_checkpointing: false
  save_strategy: epoch  # "epoch" | "steps" | "no"
  logging_steps: 10
```
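Two derived quantities matter when tuning this section: the effective batch size (`batch_size × gradient_accumulation_steps`) and the number of warmup steps (`warmup_ratio` × total optimizer steps). A sketch of the arithmetic, using a hypothetical `schedule_summary` helper (plain Python for illustration, not an xlmtec API):

```python
# Illustration only: how batch_size, gradient_accumulation_steps, and
# warmup_ratio combine. Names mirror the YAML keys; the arithmetic is
# generic, not xlmtec internals.
def schedule_summary(num_samples: int, num_epochs: int, batch_size: int,
                     gradient_accumulation_steps: int, warmup_ratio: float):
    effective_batch = batch_size * gradient_accumulation_steps
    steps_per_epoch = num_samples // effective_batch
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = int(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps

# 1000 samples with the values shown above:
print(schedule_summary(1000, 3, 4, 4, 0.1))  # (16, 186, 18)
```

Gradient accumulation trades wall-clock time for memory: the optimizer sees the same effective batch as a 4× larger `batch_size` would give, without the 4× activation memory.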
### lora

Used by: `lora`, `qlora`, `instruction_tuning`, `dpo`.

```yaml
lora:
  r: 8                  # rank — higher = more expressive, more memory
  lora_alpha: 16        # scale factor (rule of thumb: 2× r)
  lora_dropout: 0.05
  target_modules: null  # null = auto-detect; or ["c_attn", "c_proj"]
  bias: none
```
Target module guidance:

| Model family | Recommended `target_modules` |
|---|---|
| GPT-2 | `["c_attn", "c_proj"]` |
| LLaMA / Mistral | `["q_proj", "k_proj", "v_proj", "o_proj"]` |
| BERT | `["query", "value"]` |
| Auto | `null` |
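To build intuition for `r` and `lora_alpha`: a LoRA adapter on a `d_out × d_in` linear layer trains two low-rank matrices, and its update is scaled by `lora_alpha / r`. A sketch of the generic LoRA arithmetic (illustrative helper functions, not part of xlmtec):

```python
# Generic LoRA arithmetic (illustrative helpers, not xlmtec internals).
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    # A is (r x d_in), B is (d_out x r); only A and B are trained.
    return r * d_in + d_out * r

def lora_scaling(lora_alpha: int, r: int) -> float:
    # The low-rank update B @ A is multiplied by lora_alpha / r.
    return lora_alpha / r

# GPT-2's c_attn is a 768 -> 2304 projection; with the defaults above:
print(lora_trainable_params(768, 2304, r=8))  # 24576 trainable weights
print(lora_scaling(16, 8))                    # 2.0
```

The frozen `c_attn` weight itself has 768 × 2304 ≈ 1.77M parameters, so `r: 8` trains well under 2% of it.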
## Method-specific config

### LoRA

Full example: `examples/configs/lora_gpt2.yaml`
### QLoRA

Requires `model.load_in_4bit: true`.
```yaml
model:
  name: meta-llama/Llama-3.2-1B
  load_in_4bit: true
  torch_dtype: float16

training:
  method: qlora
  fp16: true
  gradient_checkpointing: true

lora:
  r: 16
  lora_alpha: 32
  target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"]
```
Full example: `examples/configs/qlora_llama.yaml`
### Full fine-tuning

No `lora` section required. Use only on small models (≤300M params) or with ≥24GB VRAM.

Full example: `examples/configs/full_finetuning.yaml`
### Instruction tuning

Dataset must have `instruction`, `input` (optional), and `response` columns (Alpaca format).
```yaml
dataset:
  path: ./data/instructions.jsonl

training:
  method: instruction_tuning

lora:
  r: 8
  lora_alpha: 16
```
Full example: `examples/configs/instruction_tuning.yaml`
### DPO

Requires `pip install 'trl>=0.7.0'`. Dataset must have `prompt`, `chosen`, and `rejected` columns.
```yaml
dataset:
  path: ./data/dpo_sample.jsonl

training:
  method: dpo
  learning_rate: 5.0e-5

lora:
  r: 8
  lora_alpha: 16
```
`beta` (preference-shaping strength, default 0.1) is set via the Python API.
Full example: `examples/configs/dpo.yaml`
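To see what `beta` controls, here is the standard DPO objective on a single preference pair (generic math, not xlmtec's or TRL's internals): `beta` scales the difference in policy-vs-reference log-ratios before the sigmoid, so larger values penalise deviations from the reference model more sharply.

```python
import math

# Standard DPO loss on one (chosen, rejected) pair. Inputs are total
# sequence log-probs under the trained policy and the frozen reference.
def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)): small when the policy prefers the
    # chosen answer more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# The policy favours the chosen answer relative to the reference:
print(round(dpo_loss(-10.0, -14.0, -11.0, -12.0), 4))  # 0.5544
```

Raising `beta` sharpens the penalty: with the same margin, `beta=0.5` gives a loss of ≈0.20 instead of ≈0.55.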
## Distillation config

### Response Distillation
Student model is trained to match the teacher's output distribution. Uses KL divergence blended with cross-entropy loss. No labelled data required beyond the training corpus.
```yaml
training:
  method: vanilla_distillation
  output_dir: ./outputs/response_distillation
  num_epochs: 3
  batch_size: 4
  learning_rate: 1.0e-4

distillation:
  teacher_model_name: gpt2-medium  # any HuggingFace model id
  temperature: 2.0                 # higher = softer probability distribution
  alpha: 0.5                       # loss = alpha×KL + (1-alpha)×CE
```
Tuning guide:
| Parameter | Effect |
|---|---|
| `temperature` | Higher (2–4) softens teacher logits, sharing more relative probability across tokens |
| `alpha` | 1.0 = pure KL distillation; 0.0 = pure CE (standard training); 0.5 = balanced |
Full example: `examples/configs/response_distillation.yaml`
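A minimal sketch of the blended loss at a single token position, assuming the `KL(student || teacher)` direction and the conventional `T²` rescaling of the soft-target term (plain Python for illustration, not xlmtec's implementation):

```python
import math

# Sketch of the blended distillation loss at one token position.
def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label_idx,
                      temperature=2.0, alpha=0.5):
    p_student = softmax(student_logits, temperature)
    p_teacher = softmax(teacher_logits, temperature)
    # Soft-target term: KL(student || teacher), rescaled by T² so its
    # gradient magnitude stays comparable across temperatures.
    kl = temperature ** 2 * sum(
        ps * math.log(ps / pt) for ps, pt in zip(p_student, p_teacher))
    # Hard-label term: ordinary cross-entropy at temperature 1.
    ce = -math.log(softmax(student_logits)[label_idx])
    return alpha * kl + (1 - alpha) * ce

# Identical logits make the KL term vanish, leaving half the CE:
print(round(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], 2), 4))  # 0.2038
```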
### Feature Distillation
Extends response distillation with an additional MSE loss on intermediate hidden states. Provides stronger layer-level supervision, especially useful when the teacher and student share similar architecture.
```yaml
training:
  method: feature_distillation
  output_dir: ./outputs/feature_distillation
  num_epochs: 3
  batch_size: 4
  learning_rate: 1.0e-4

feature_distillation:
  teacher_model_name: gpt2-medium
  temperature: 2.0
  alpha: 0.5            # KL divergence weight
  beta: 0.3             # hidden-state MSE weight
  feature_layers: null  # null = auto-select 4 evenly-spaced student layers
```
Loss breakdown:
```
total_loss = alpha × KL(student || teacher)
           + (1 - alpha - beta) × CE(student, labels)
           + beta × MSE(student_hidden, teacher_hidden)
```
`feature_layers` accepts a list of student layer indices, e.g. `[0, 4, 8, 11]`. Each is mapped to the proportionally corresponding teacher layer. `null` auto-selects 4 evenly-spaced layers.
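The proportional mapping can be sketched as follows (generic arithmetic; the exact rounding xlmtec applies may differ):

```python
# Proportional layer mapping (illustrative; xlmtec's rounding may differ).
def map_to_teacher(student_layer: int, n_student: int, n_teacher: int) -> int:
    # Student layer i maps to the teacher layer at the same relative depth.
    return round(student_layer * (n_teacher - 1) / (n_student - 1))

def auto_feature_layers(n_student: int, k: int = 4) -> list:
    # k evenly spaced student layer indices, first through last.
    return [round(i * (n_student - 1) / (k - 1)) for i in range(k)]

# 12-layer student (e.g. gpt2) distilling from a 24-layer teacher:
print(auto_feature_layers(12))                             # [0, 4, 7, 11]
print([map_to_teacher(l, 12, 24) for l in [0, 4, 8, 11]])  # [0, 8, 17, 23]
```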
Tuning guide:
| Parameter | Effect |
|---|---|
| `beta` | Higher = more hidden-state alignment; start at 0.1–0.3 |
| `feature_layers` | Fewer layers = faster training; more = stronger supervision |
Full example: `examples/configs/feature_distillation.yaml`
## Pruning config

Pruning commands (`prune`, `wanda`) use their own config objects, not `PipelineConfig`. They are standalone operations on a saved model — no `training` or `lora` sections.

### Structured Pruning (`xlmtec prune`)
```yaml
# examples/configs/structured_pruning.yaml
model_path: "./outputs/gpt2_lora"

pruning:
  output_dir: "./outputs/gpt2_pruned"
  sparsity: 0.3           # fraction of heads to zero (0.0–1.0)
  method: "heads"         # "heads" (attention) or "ffn" (feed-forward)
  importance_metric: "magnitude"
  min_heads_per_layer: 1  # safety floor — never collapse a layer entirely
```
Sparsity guidance:
| `sparsity` | Effect |
|---|---|
| 0.1 | Light — minimal accuracy impact, small size reduction |
| 0.3 | Moderate — good balance (recommended starting point) |
| 0.5 | Aggressive — noticeable accuracy drop, significant speedup |
| 0.7+ | Extreme — only for heavily over-parameterised models |
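The interaction between `sparsity` and `min_heads_per_layer` for a single attention layer can be sketched as follows (illustrative arithmetic, not xlmtec's exact implementation):

```python
# Sketch: how many heads get pruned in one attention layer.
def heads_to_prune(n_heads: int, sparsity: float,
                   min_heads_per_layer: int = 1) -> int:
    requested = int(n_heads * sparsity)
    # The floor guarantees at least min_heads_per_layer heads survive.
    return min(requested, n_heads - min_heads_per_layer)

# GPT-2 has 12 heads per layer:
print(heads_to_prune(12, 0.3))  # 3
print(heads_to_prune(12, 1.0))  # 11 (the safety floor keeps one head)
```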
Python API:
```python
from pathlib import Path

from xlmtec.core.types import PruningConfig
from xlmtec.trainers import StructuredPruner

config = PruningConfig(
    output_dir=Path("./outputs/pruned"),
    sparsity=0.3,
    method="heads",
    min_heads_per_layer=1,
)
result = StructuredPruner(model, tokenizer, config).prune()
```
Full example: `examples/configs/structured_pruning.yaml`
### WANDA Pruning (`xlmtec wanda`)
```yaml
# examples/configs/wanda.yaml
model_path: "./outputs/gpt2_lora"

wanda:
  output_dir: "./outputs/gpt2_wanda"
  sparsity: 0.5               # fraction of weights to zero (0.0–1.0)
  n_calibration_samples: 128  # number of calibration forward passes
  calibration_seq_len: 128    # token sequence length for calibration
  use_row_wise: true          # true = per-row threshold (recommended)
  layer_types: null           # null = auto (Linear + Conv1D)
```
Sparsity guidance:
| `sparsity` | Effect |
|---|---|
| 0.3 | Light — minimal accuracy drop |
| 0.5 | Standard — matches original paper results on LLaMA/OPT |
| 0.6 | Aggressive — calibration data recommended |
| 0.7+ | Research-grade — significant accuracy loss expected |
`use_row_wise`: When `true`, the sparsity threshold is computed per output-neuron row — each neuron retains the same fraction of its incoming weights. When `false`, a single global threshold is applied across the entire weight matrix.
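For intuition, WANDA ranks each weight by `|weight| × ‖activation‖₂` of its input feature. A toy sketch of the row-wise versus global threshold (pure Python on nested lists; the real pruner operates on torch tensors):

```python
# Toy WANDA scoring for one linear layer.
def wanda_mask(weights, act_norms, sparsity, use_row_wise=True):
    """weights: one row per output neuron; act_norms: L2 norm of each
    input feature over the calibration set. Returns a keep-mask."""
    scores = [[abs(w) * act_norms[j] for j, w in enumerate(row)]
              for row in weights]
    if use_row_wise:
        # Per-row threshold: every output neuron keeps the same fraction.
        mask = []
        for row in scores:
            k = int(len(row) * sparsity)  # weights to drop in this row
            cut = sorted(row)[k] if k else float("-inf")
            mask.append([s >= cut for s in row])
        return mask
    # Global threshold over the whole matrix.
    flat = sorted(s for row in scores for s in row)
    k = int(len(flat) * sparsity)
    cut = flat[k] if k else float("-inf")
    return [[s >= cut for s in row] for row in scores]

w = [[0.9, 0.1, 0.5, 0.2],
     [0.05, 0.8, 0.3, 0.6]]
norms = [1.0, 1.0, 1.0, 1.0]
print(wanda_mask(w, norms, sparsity=0.5))
# [[True, False, True, False], [False, True, False, True]]
```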
Python API:
```python
from pathlib import Path

from xlmtec.core.types import WandaConfig
from xlmtec.trainers import WandaPruner

config = WandaConfig(
    output_dir=Path("./outputs/wanda"),
    sparsity=0.5,
    n_calibration_samples=128,
    use_row_wise=True,
)
result = WandaPruner(model, tokenizer, config).prune(calibration_input_ids=calib_ids)
```
Full example: `examples/configs/wanda.yaml`
## Configuration recipes

### Quick CPU experiment

```yaml
model:
  name: gpt2
  torch_dtype: float32

dataset:
  max_samples: 500

tokenization:
  max_length: 128

training:
  num_epochs: 1
  batch_size: 2
  gradient_accumulation_steps: 2

lora:
  r: 4
  lora_alpha: 8
```
### Balanced quality — 8GB GPU

```yaml
model:
  name: gpt2-medium
  torch_dtype: float16

training:
  num_epochs: 3
  batch_size: 4
  gradient_accumulation_steps: 4
  fp16: true

lora:
  r: 8
  lora_alpha: 32
```
### Large model — limited VRAM (QLoRA)

```yaml
model:
  name: meta-llama/Llama-3.2-1B
  load_in_4bit: true
  torch_dtype: float16

training:
  method: qlora
  batch_size: 2
  gradient_accumulation_steps: 8
  fp16: true
  gradient_checkpointing: true

lora:
  r: 16
  lora_alpha: 32
  target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"]
```