Checkpoint Resume
Resume a fine-tuning run from a saved checkpoint without re-processing data or starting over.
Quick start
# See what checkpoints exist and what would happen
xlmtec resume output/run1 --dry-run
# Resume from the latest checkpoint
xlmtec resume output/run1
# Resume from a specific checkpoint
xlmtec resume output/run1 --checkpoint checkpoint-500
# Resume and train for more epochs
xlmtec resume output/run1 --epochs 5
How it works
xlmtec resume reads the checkpoint-N directories that the HuggingFace Trainer writes inside your output directory. It restores the optimizer and scheduler state saved in the chosen checkpoint, reads the step count from that checkpoint's trainer_state.json, and continues from the step where training stopped.
No config file is needed: the original config is read from config.yaml inside the run directory.
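To make the saved progress concrete, here is a minimal Python sketch that reads the counters from a checkpoint's trainer_state.json. It is an illustration, not xlmtec's implementation: the helper name read_progress is hypothetical, and the field names (global_step, epoch, num_train_epochs) are the ones the HuggingFace Trainer writes to that file.

```python
import json
from pathlib import Path


def read_progress(checkpoint_dir):
    """Return the step and epoch counters stored in trainer_state.json.

    Hypothetical helper for illustration; not part of xlmtec.
    """
    state_path = Path(checkpoint_dir) / "trainer_state.json"
    with state_path.open(encoding="utf-8") as f:
        state = json.load(f)
    return {
        "global_step": state["global_step"],               # optimizer steps completed
        "epoch": state.get("epoch"),                       # may be fractional mid-epoch
        "num_train_epochs": state.get("num_train_epochs"), # originally configured total
    }
```

Calling read_progress("output/run1/checkpoint-500") would return the counters for that checkpoint, assuming the directory exists and contains the file.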
Options
| Flag | Default | Description |
|---|---|---|
| --checkpoint | latest | Name of the checkpoint to resume from (e.g. checkpoint-500) |
| --epochs | from config | Override the total number of epochs for the continued run |
| --dry-run | off | Validate the checkpoint and show the plan without training |
Dry run output
┌─ Checkpoint Resume Plan ──────────────────────────┐
│ Run dir     : output/run1                         │
│ Checkpoint  : checkpoint-500 (step 500)           │
│ Epochs done : 2 / 5                               │
│ Epochs left : 3                                   │
│ Config      : lora · gpt2 · lr=2e-4               │
└───────────────────────────────────────────────────┘
Remove --dry-run to start training.
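The "Epochs left" line in the plan above is plain arithmetic on the completed epochs and the target total. A small sketch, assuming the target is the --epochs value when given and the configured value otherwise; epochs_left is a hypothetical helper, not an xlmtec function.

```python
def epochs_left(epochs_done, config_epochs, override_epochs=None):
    """Return how many epochs remain for a resumed run.

    Assumption: --epochs sets the new *total*, not an extra amount on top.
    """
    total = override_epochs if override_epochs is not None else config_epochs
    # Never report a negative count if the checkpoint is already past the target.
    return max(total - epochs_done, 0)


print(epochs_left(2, 5))                     # 3, as in the plan above
print(epochs_left(2, 3, override_epochs=5))  # 3, with the total raised from 3 to 5
```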
Checkpoint discovery
xlmtec resume scans the run directory for directories named checkpoint-{N} and sorts them by step number. The default, --checkpoint latest, selects the highest step.
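The discovery rule can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than xlmtec's actual code: find_checkpoints and latest_checkpoint are hypothetical names, and the pattern assumes N is a plain decimal step number.

```python
import re
from pathlib import Path

# Hypothetical pattern: "checkpoint-" followed by a decimal step number.
CHECKPOINT_RE = re.compile(r"checkpoint-(\d+)")


def find_checkpoints(run_dir):
    """Return (step, path) pairs for checkpoint-N directories, lowest step first."""
    found = []
    for entry in Path(run_dir).iterdir():
        match = CHECKPOINT_RE.fullmatch(entry.name)
        if match and entry.is_dir():
            found.append((int(match.group(1)), entry))
    # Sort on the integer step, not on the directory name.
    return sorted(found, key=lambda pair: pair[0])


def latest_checkpoint(run_dir):
    """Return the path of the highest-step checkpoint, or None if there is none."""
    checkpoints = find_checkpoints(run_dir)
    return checkpoints[-1][1] if checkpoints else None
```

Sorting on the integer step matters: a plain string sort would rank checkpoint-90 after checkpoint-1000 and pick the wrong "latest".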
Common scenarios
Training crashed mid-run
xlmtec resume output/run1 --dry-run # confirm latest checkpoint found
xlmtec resume output/run1 # resume from it
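Want to train for more epochs than originally set
xlmtec resume output/run1 --epochs 5 # raise the total epoch count
Resume from an earlier checkpoint (e.g. before overfitting started)
xlmtec resume output/run1 --checkpoint checkpoint-500 # pick a specific checkpoint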
Troubleshooting
| Problem | Fix |
|---|---|
| No checkpoints found | Ensure save_steps was set in the training config |
| checkpoint-N not found | Run xlmtec resume output/run1 --dry-run to list available checkpoints |
| config.yaml missing | Resume requires the original run directory to be intact |
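The three problems in the table can also be checked by hand before resuming. The sketch below is a hypothetical pre-flight check, not something xlmtec ships: the function name preflight and the returned message strings are assumptions chosen to mirror the table.

```python
import re
from pathlib import Path


def preflight(run_dir, checkpoint="latest"):
    """Return a list of problems that would block a resume (empty if none).

    Hypothetical helper for illustration; not part of xlmtec.
    """
    run = Path(run_dir)
    problems = []
    if not (run / "config.yaml").is_file():
        problems.append("config.yaml missing")
    # Collect directory names of the form checkpoint-<digits>.
    names = [
        p.name
        for p in run.glob("checkpoint-*")
        if p.is_dir() and re.fullmatch(r"checkpoint-\d+", p.name)
    ]
    if not names:
        problems.append("No checkpoints found")
    elif checkpoint != "latest" and checkpoint not in names:
        problems.append(f"{checkpoint} not found")
    return problems
```

An empty list means none of the three conditions applies; it does not validate the contents of the checkpoint itself.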