Checkpoint Resume¶

Resume a fine-tuning run from a saved checkpoint — no data re-processing, no starting over.

Quick start¶

# See what checkpoints exist and what would happen
xlmtec resume output/run1 --dry-run

# Resume from the latest checkpoint
xlmtec resume output/run1

# Resume from a specific checkpoint
xlmtec resume output/run1 --checkpoint checkpoint-500

# Resume and train for more epochs
xlmtec resume output/run1 --epochs 5

How it works¶

xlmtec resume reads the checkpoint-N directories written by the HuggingFace Trainer inside your output directory. It picks up training state (optimizer, scheduler, step count) from trainer_state.json and continues from exactly where training stopped.

No config file is needed — the original config is read from config.yaml inside the run directory.

Options¶

Flag	Default	Description
`--checkpoint`	latest	Name of checkpoint to resume from (e.g. `checkpoint-500`)
`--epochs`	from config	Override total epochs for the continued run
`--dry-run`	off	Validate checkpoint and show plan — do not train

Dry run output¶

┌─ Checkpoint Resume Plan ──────────────────────────┐
│  Run dir     : output/run1                        │
│  Checkpoint  : checkpoint-500  (step 500)         │
│  Epochs done : 2 / 5                              │
│  Epochs left : 3                                  │
│  Config      : lora · gpt2 · lr=2e-4              │
└───────────────────────────────────────────────────┘
Remove --dry-run to start training.

Checkpoint discovery¶

xlmtec resume scans for any directory matching checkpoint-{N} inside the run dir, sorted by step number. --checkpoint latest (the default) picks the highest step.

output/run1/
  checkpoint-250/
  checkpoint-500/    ← latest (default)
  config.yaml
  trainer_state.json

Common scenarios¶

Training crashed mid-run

xlmtec resume output/run1 --dry-run   # confirm latest checkpoint found
xlmtec resume output/run1             # resume from it

Want to train for more epochs than originally set

xlmtec resume output/run1 --epochs 10

Resume from an earlier checkpoint (e.g. before overfitting started)

xlmtec resume output/run1 --checkpoint checkpoint-250

Troubleshooting¶

Problem	Fix
`No checkpoints found`	Ensure `save_steps` was set in training config
`checkpoint-N not found`	Run `xlmtec resume output/run1 --dry-run` to list available checkpoints
`config.yaml missing`	Resume requires the original run directory to be intact