apr distill --stage generate: text-based synthetic data generation #455

@noahgift

Summary

Add --stage generate to apr distill --config <yaml> for text-based synthetic data generation from a teacher model.

Background

Albor distillation uses text-based synthetic data rather than logit-based knowledge distillation (KD) because the teacher (Qwen3-Coder-30B, vocab=151,936) and student (albor-350m, vocab=32,768) have incompatible vocabularies, so their output distributions cannot be compared token by token. The teacher generates Python completions from codeparrot prompts, and the student trains on the tokenized output with a standard causal LM loss.

Design

  1. New config type TextDistillConfig matching the distill-30b.yaml schema (teacher, student, and synthetic_data sections)
  2. New stage dispatch: --stage generate in run_config_mode
  3. The implementation spawns realizar serve --model <teacher.apr> --gpu as a subprocess
  4. It reads prompts from the input JSONL, POSTs each to the server's /generate endpoint, and writes completions to the output JSONL
  5. Generation stops once the target_tokens budget is reached
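
Steps 4 and 5 could be sketched roughly as below. This is an illustration in Python, not the proposed implementation: the `generate` and `count_tokens` callables stand in for the HTTP call to the teacher and for the tokenizer, the `"prompt"` field name in the input JSONL is an assumption, and `generate_synthetic` is a hypothetical helper name.

```python
import json
from typing import Callable, IO, Iterable


def generate_synthetic(
    prompt_lines: Iterable[str],
    out: IO[str],
    generate: Callable[[str], str],      # stand-in for the POST to /generate
    count_tokens: Callable[[str], int],  # stand-in for the tokenizer
    target_tokens: int,
    min_completion_tokens: int = 10,
    log_every: int = 10,
) -> int:
    """Write prompt+completion JSONL records until the token budget is met.

    Returns the number of completion tokens written.
    """
    total = 0
    for n, line in enumerate(prompt_lines, start=1):
        if total >= target_tokens:
            break  # budget reached, stop issuing requests
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the prompt file
        prompt = json.loads(line)["prompt"]  # assumed field name
        completion = generate(prompt)
        tokens = count_tokens(completion)
        # Drop completions that are too short to be useful training text.
        if tokens >= min_completion_tokens:
            record = {"prompt": prompt, "completion": completion}
            out.write(json.dumps(record) + "\n")
            total += tokens
        if n % log_every == 0:
            print(f"[generate] prompts={n} tokens={total}/{target_tokens}")
    return total
```

Because the budget is checked before each request, the final total can exceed target_tokens by at most one completion. Whether the real stage should truncate the last record instead is a design choice this sketch leaves open.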

Config fields used

teacher:
  model: "/path/to/teacher.apr"
  max_tokens: 256
  temperature: 0.8
  top_p: 0.95
  gpu: true

synthetic_data:
  prompts: "data/distill/prompts.jsonl"
  output: "data/distill/synthetic.jsonl"
  target_tokens: 500_000
  min_completion_tokens: 10
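
For illustration, the two sections above could be mirrored by typed records along these lines. This is a Python sketch only. The class names, the `parse_config` helper, and every validation rule are assumptions rather than part of the proposal, and the student section is omitted because its fields are not listed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeacherConfig:
    model: str
    max_tokens: int
    temperature: float
    top_p: float
    gpu: bool


@dataclass(frozen=True)
class SyntheticDataConfig:
    prompts: str
    output: str
    target_tokens: int
    min_completion_tokens: int


def parse_config(raw: dict) -> tuple[TeacherConfig, SyntheticDataConfig]:
    """Build typed configs from an already-parsed mapping and sanity-check them."""
    teacher = TeacherConfig(**raw["teacher"])
    synth = SyntheticDataConfig(**raw["synthetic_data"])
    # The checks below are assumed, not taken from the issue.
    if teacher.max_tokens <= 0:
        raise ValueError("teacher.max_tokens must be positive")
    if teacher.temperature < 0.0:
        raise ValueError("teacher.temperature must be non-negative")
    if not 0.0 < teacher.top_p <= 1.0:
        raise ValueError("teacher.top_p must be in (0, 1]")
    if synth.target_tokens <= 0:
        raise ValueError("synthetic_data.target_tokens must be positive")
    if synth.min_completion_tokens > teacher.max_tokens:
        raise ValueError("min_completion_tokens cannot exceed teacher.max_tokens")
    return teacher, synth
```

Rejecting a config where min_completion_tokens exceeds max_tokens is worth considering, since every completion would be filtered out and the target_tokens budget would never be reached.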

Acceptance criteria

  • apr distill --config distill-30b.yaml --stage generate produces JSONL with prompt+completion pairs
  • Progress logging every 10 prompts
  • Graceful shutdown of realizar server subprocess
  • Respects target_tokens budget
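
One way to meet the graceful-shutdown criterion is to tie the server's lifetime to a scope, as in this Python sketch. It is an illustration under assumptions: `managed_server` is a hypothetical helper, the 5-second grace period is arbitrary, and waiting for the server to become ready is not covered.

```python
import subprocess
from contextlib import contextmanager
from typing import Iterator, Sequence


@contextmanager
def managed_server(
    cmd: Sequence[str], grace_seconds: float = 5.0
) -> Iterator[subprocess.Popen]:
    """Run cmd as a child process and guarantee it is stopped on exit."""
    proc = subprocess.Popen(cmd)
    try:
        yield proc
    finally:
        if proc.poll() is None:  # still running
            proc.terminate()     # polite stop request first
            try:
                proc.wait(timeout=grace_seconds)
            except subprocess.TimeoutExpired:
                proc.kill()      # force stop after the grace period
                proc.wait()      # reap the child so no zombie remains
```

With the command from the design section, usage would look like `with managed_server(["realizar", "serve", "--model", teacher_path, "--gpu"]):` around the generation loop, so the server is stopped on normal completion, on an exception, and when the token budget ends the run early.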

Refs #61
