Summary
Add --stage generate to apr distill --config <yaml> for text-based synthetic data generation from a teacher model.
Background
Albor distillation uses text-based synthetic data (not logit KD) because the teacher (Qwen3-Coder-30B, vocab=151,936) and student (albor-350m, vocab=32,768) have incompatible vocabularies. The teacher generates Python completions from codeparrot prompts, and the student trains on the tokenized output with standard causal LM loss.
Design
- New config type TextDistillConfig matching distill-30b.yaml schema (teacher/student/synthetic_data sections)
- New stage dispatch: --stage generate in run_config_mode
- Implementation spawns realizar serve --model <teacher.apr> --gpu as subprocess
- Reads prompts from JSONL, POSTs to /generate endpoint, writes output JSONL
- Stops when target_tokens budget is reached
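The read/generate/write loop described in the list above can be sketched roughly as follows. This is an illustrative Python sketch, not the apr implementation. Several details are assumptions: the prompt JSONL field name ("prompt"), whitespace splitting as a stand-in for the real tokenizer, counting only completion tokens against the budget, and skipping completions shorter than min_completion_tokens. The `complete` callable is a hypothetical placeholder for the POST to /generate.

```python
import json
from typing import Callable, Iterable, Iterator


def read_prompts(lines: Iterable[str], field: str = "prompt") -> Iterator[str]:
    """Yield the prompt text from each non-empty JSONL line.

    The field name "prompt" is an assumption about the input schema.
    """
    for line in lines:
        if line.strip():
            yield json.loads(line)[field]


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for the real tokenizer.
    return len(text.split())


def generate_pairs(
    prompts: Iterable[str],
    complete: Callable[[str], str],
    target_tokens: int,
    min_completion_tokens: int,
    log_every: int = 10,
) -> Iterator[dict]:
    """Yield prompt+completion records until the token budget is reached."""
    total = 0
    for i, prompt in enumerate(prompts, start=1):
        if total >= target_tokens:
            break  # budget reached: stop requesting completions
        completion = complete(prompt)  # placeholder for the /generate call
        n = count_tokens(completion)
        if n >= min_completion_tokens:
            # Assumption: only kept completions count against the budget.
            total += n
            yield {"prompt": prompt, "completion": completion}
        if i % log_every == 0:
            print(f"{i} prompts processed, {total} tokens generated")


def write_jsonl(records: Iterable[dict], path: str) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```

Because the budget is checked before each request, the final completion can push the total slightly past target_tokens. Whether the real stage truncates or overshoots is not specified here.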
Config fields used
teacher:
  model: "/path/to/teacher.apr"
  max_tokens: 256
  temperature: 0.8
  top_p: 0.95
  gpu: true
synthetic_data:
  prompts: "data/distill/prompts.jsonl"
  output: "data/distill/synthetic.jsonl"
  target_tokens: 500_000
  min_completion_tokens: 10
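As a rough illustration of how these fields might map onto a typed config, here is a Python sketch using dataclasses. It mirrors only the fields listed above. The student section is omitted because its fields are not listed here. The class names TeacherConfig and SyntheticDataConfig, the from_dict helper, and the validation rules are all assumptions made for illustration, not the actual TextDistillConfig definition in apr.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeacherConfig:
    model: str
    max_tokens: int
    temperature: float
    top_p: float
    gpu: bool


@dataclass(frozen=True)
class SyntheticDataConfig:
    prompts: str
    output: str
    target_tokens: int
    min_completion_tokens: int

    def __post_init__(self) -> None:
        # Assumed sanity checks; the real validation rules are not specified.
        if self.target_tokens <= 0:
            raise ValueError("target_tokens must be positive")
        if self.min_completion_tokens < 0:
            raise ValueError("min_completion_tokens must be non-negative")


@dataclass(frozen=True)
class TextDistillConfig:
    teacher: TeacherConfig
    synthetic_data: SyntheticDataConfig

    @classmethod
    def from_dict(cls, raw: dict) -> "TextDistillConfig":
        """Build the config from an already-parsed YAML mapping."""
        return cls(
            teacher=TeacherConfig(**raw["teacher"]),
            synthetic_data=SyntheticDataConfig(**raw["synthetic_data"]),
        )


# Example values copied from the config fields listed above.
example = TextDistillConfig.from_dict({
    "teacher": {
        "model": "/path/to/teacher.apr",
        "max_tokens": 256,
        "temperature": 0.8,
        "top_p": 0.95,
        "gpu": True,
    },
    "synthetic_data": {
        "prompts": "data/distill/prompts.jsonl",
        "output": "data/distill/synthetic.jsonl",
        "target_tokens": 500_000,
        "min_completion_tokens": 10,
    },
})
```

Unknown keys raise a TypeError in this sketch, which surfaces typos in the config early.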
Acceptance criteria
- apr distill --config distill-30b.yaml --stage generate produces JSONL with prompt+completion pairs
- Progress logging every 10 prompts
- Graceful shutdown of realizar server subprocess
- Respects target_tokens budget
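One way to meet the graceful-shutdown criterion is to wrap the server subprocess in a scope that always terminates it, even when generation fails partway through. The Python sketch below illustrates the pattern only. The helper name managed_server and the 5-second grace period are assumptions, and the actual implementation may handle signals differently.

```python
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def managed_server(cmd, grace_seconds=5.0):
    """Start a subprocess and make sure it is stopped on exit.

    Sends a polite terminate first, then kills the process if it has
    not exited within grace_seconds.
    """
    proc = subprocess.Popen(cmd)
    try:
        yield proc
    finally:
        if proc.poll() is None:  # still running
            proc.terminate()
            try:
                proc.wait(timeout=grace_seconds)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()  # reap the process so it does not linger


# A stand-in command for trying the helper without a model. For the stage
# described here, the command would be the realizar serve invocation from
# the Design section.
STAND_IN_CMD = [sys.executable, "-c", "import time; time.sleep(60)"]
```

Used as `with managed_server(cmd) as proc: ...`, the server is stopped whether the body returns normally or raises.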
Refs #61