Model Lab: connect CodeWhale traces to open-weight fine-tuning/eval services #1977

@Hmbown

Goal

Let users turn their own CodeWhale coding sessions into curated datasets, eval suites, and optional fine-tuning jobs for open-weight models. CodeWhale becomes the harness where people can improve open models, not just consume them.

Where this fits in the stack

  • CodeWhale is the workbench — runs the agent, owns the traces, owns the consent surface.
  • Providers are where models run (DeepSeek, Hugging Face Inference Providers, OpenRouter, Novita, Fireworks, Together, Hyperbolic, SiliconFlow, NVIDIA NIM, plus local: vLLM, SGLang, Ollama, llama.cpp, TGI).
  • Worksets are curated optional capability packs — they extend the lab without bloating core.
  • Fin is the cheap evaluator / router / verifier seam (Goal mode in v0.8.43, generalized verifier preview in v0.8.46, eval pipeline in v0.9.0).
  • Goal mode (v0.8.43, #1976 "Goal mode — persistent objective/workflow surface") is the long-running loop that turns model changes into measured improvements.

Model Lab is the user-facing surface that sits on top of all of that. It turns the workbench's accumulated traces into datasets, runs eval replays, hands curated data to a fine-tuning backend, and integrates open-source worksets without vendoring them into core.

Non-goals

  • No automatic upload. Nothing leaves the user's machine unless they explicitly run an export command.
  • No hidden telemetry. Each workset declares its telemetry posture; off by default.
  • No silent hosted routing. Provider / backend selection is always explicit.
  • No "download random model and trust it." Every model installed to the lab goes through an explicit install with license + provenance shown.
  • No proprietary-model-first workflow. Open-weight models are the first-class target.
  • No provider lock-in. Hosted backends are pluggable; the export format is provider-neutral.
  • No vendoring of heavy GPU / Python / NVIDIA dependencies into CodeWhale core. Worksets install on-demand.

Provider vs Workset — the architectural split

  • Providers live in CodeWhale's provider abstraction (v0.8.47 work). They are where models run. They are not optional; they are how the agent talks to a model. Each one has a config block, an auth path, and parity coverage in v0.8.45.
  • Worksets live in Model Lab. They are optional capability packs that bring open-source ML tooling into the lab loop. Each one installs on-demand and declares license, telemetry posture, network egress, GPU / Python deps, and what data leaves the machine if any.

Hugging Face is BOTH — it is a first-class provider (Inference Providers / Router) AND the open-model registry workset (Hub API, model cards, datasets, adapters, Safetensors, Jobs). The two roles ship through different surfaces; they share auth (HF_TOKEN / HUGGINGFACE_API_KEY alias).

Architecture: Worksets as the integration shape

CodeWhale core stays lean. Worksets are curated optional packs.

/model-lab worksets list                # show installed + available
/model-lab worksets install hf
/model-lab worksets install unsloth
/model-lab worksets install nemo
/model-lab worksets install arcee
/model-lab worksets install serving
/model-lab worksets install eval
/model-lab worksets install observability
/model-lab worksets install training-infra
/model-lab worksets remove <pack>
/model-lab worksets info <pack>         # license, telemetry, network, deps
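One way to picture what `worksets info <pack>` would print is a small manifest record that each pack ships with. The sketch below is illustrative only: the `WorksetManifest` class, its field names, and the `example-pack` values are assumptions, not a format this issue defines.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class WorksetManifest:
    """Hypothetical per-pack declaration; every field name is an assumption."""

    name: str
    license: str
    telemetry: str = "off"                 # off by default, per the non-goals
    network_egress: tuple[str, ...] = ()   # hosts the pack may contact
    python_deps: tuple[str, ...] = ()
    needs_gpu: bool = False
    data_leaving_machine: str = "none"


def render_info(m: WorksetManifest) -> str:
    """Format the disclosure a user would read before installing a pack."""
    rows = [
        ("workset", m.name),
        ("license", m.license),
        ("telemetry", m.telemetry),
        ("network egress", ", ".join(m.network_egress) or "none"),
        ("python deps", ", ".join(m.python_deps) or "none"),
        ("needs gpu", "yes" if m.needs_gpu else "no"),
        ("data leaving machine", m.data_leaving_machine),
    ]
    return "\n".join(f"{key}: {value}" for key, value in rows)


if __name__ == "__main__":
    # Placeholder values for illustration, not a real pack.
    print(render_info(WorksetManifest(name="example-pack", license="Apache-2.0")))
```

Defaulting `telemetry` to "off" and `network_egress` to empty means a pack has to state any outbound behavior explicitly rather than inherit it.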

Initial worksets

Hugging Face Workset

The open-model operating system. Even though HF is also a first-class provider for inference (lands in v0.8.47), the workset is the registry / dataset / adapter / model-card layer.

  • Model discovery: search open-weight coding models, filter by license, base model, quantization, tool/reasoning support.
  • Model passport: show license, base model, context length, chat template, tool-call support, reasoning support, eval results, gated / private status. Pulls from HF Hub API + model cards.
  • Hub integration: private / public model repos, adapter upload, dataset upload, model-card generation.
  • Dataset path: export CodeWhale transcripts → redact → dataset → train / eval.
  • Runtime bridges: Transformers, PEFT, TRL, Accelerate, Safetensors, Inference Providers, LightEval.
  • HF Jobs (later): compute workflows for training/eval runs initiated from the lab.

Auth: HF_TOKEN (canonical) or HUGGINGFACE_API_KEY (alias). No upload without an explicit command. Boundary: gated models require explicit consent flow before any pull.
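The canonical-plus-alias rule can be sketched as a small lookup that prefers HF_TOKEN and falls back to HUGGINGFACE_API_KEY. The function name `resolve_hf_token` is hypothetical, and treating a blank value as unset is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the token from HF_TOKEN, else from HUGGINGFACE_API_KEY, else None."""
    env = os.environ if env is None else env
    for name in ("HF_TOKEN", "HUGGINGFACE_API_KEY"):
        value = env.get(name, "").strip()
        if value:  # assumption: blank or whitespace-only counts as unset
            return value
    return None
```

Passing the environment as a parameter keeps the lookup testable without touching the real process environment.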

Docs: https://huggingface.co/docs/inference-providers · https://huggingface.co/docs/hub/main/api · https://huggingface.co/docs/hub/model-cards · https://huggingface.co/docs/hub/jobs

Unsloth Workset

Local fine-tuning, oriented around the "make Brother Whale better at my workflow" path.

  • SFT / LoRA / QLoRA as the first simple path.
  • Later: DPO / GRPO-style preference-based improvement loops.
  • Fin as preflight: checks dataset shape, license / provenance, VRAM estimate, model compatibility, and whether the result is actually deployable.
  • Output adapters can be pushed to the user's HF repo via the HF Workset.
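As a rough illustration of the VRAM part of the preflight, the sketch below computes only the memory needed to hold the base weights at a given precision and adds a flat headroom factor. The function names and the 50% headroom default are assumptions. A real check would also need to account for activations, optimizer state, adapter weights, and sequence length.

```python
def weights_gib(n_params: float, bits_per_param: int) -> float:
    """Memory for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


def preflight_fits(
    n_params: float,
    bits_per_param: int,
    vram_gib: float,
    headroom: float = 0.5,  # assumed flat margin on top of the weights
) -> bool:
    """Crude go/no-go: weights plus headroom must fit in available VRAM."""
    required = weights_gib(n_params, bits_per_param) * (1.0 + headroom)
    return required <= vram_gib
```

For example, `weights_gib(8 * 2**30, 4)` is 4.0, so with the default headroom that model would pass on an 8 GiB card and fail on a 5 GiB one.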

Docs: https://unsloth.ai/docs

NeMo Workset (NVIDIA NeMo, Apache-2.0)

Boundary: no traces sent to NVIDIA Build or any hosted endpoint by default.

Arcee Workset (Arcee open-source catalog)

The "make your own model" path — merge, distill, evaluate, serve open-weight checkpoints.

  • Trinity model family — open-weight multi-turn / tool-use / structured-output (arcee-ai/Trinity-Large-Thinking, Trinity-Large-Preview, Trinity-Mini, Trinity-Nano-Preview).
  • MergeKit (https://github.com/arcee-ai/mergekit) — model-merging library. Build "my coding agent blend" from open checkpoints, run through CodeWhale evals.
  • DistillKit (https://github.com/arcee-ai/DistillKit) — distillation toolkit.
  • Spectrum + Arcee Fusion — advanced merge methods as recipes.
  • Coder / Caller / Maestro / Blitz lines — verify license + availability before first-class.

/model-lab arcee merge <recipe>
/model-lab arcee distill <recipe>
/model-lab arcee eval <model>
/model-lab arcee compare <model-a> <model-b>

Serving Workset

Local runtime recipes for self-hosted serving.

  • vLLM — high-throughput batched serving.
  • SGLang — structured generation, good for tool-use workloads.
  • TGI (Text Generation Inference) — Hugging Face's serving stack.
  • llama.cpp / Ollama — CPU / consumer-GPU path.

Each gets a recipe: install + config + smoke test + which CodeWhale provider config to use to talk to it.
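A recipe could reduce to a small record that maps a runtime to the endpoint CodeWhale should talk to and a URL to probe for the smoke test. In the sketch below, the `ServingRecipe` type and the shape of the returned config dict are hypothetical. The base URLs are the commonly documented local defaults for each runtime's OpenAI-compatible endpoint and should be verified against the installed version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingRecipe:
    """Hypothetical recipe record; field names are assumptions."""

    runtime: str
    base_url: str    # OpenAI-compatible endpoint root
    smoke_path: str  # path to GET as a liveness check


# Assumed upstream defaults; confirm against each runtime's docs.
RECIPES = {
    "vllm": ServingRecipe("vllm", "http://localhost:8000/v1", "/models"),
    "ollama": ServingRecipe("ollama", "http://localhost:11434/v1", "/models"),
    "llama.cpp": ServingRecipe("llama.cpp", "http://localhost:8080/v1", "/models"),
}


def get_recipe(runtime: str) -> ServingRecipe:
    if runtime not in RECIPES:
        raise ValueError(f"no serving recipe for {runtime!r}")
    return RECIPES[runtime]


def smoke_url(runtime: str) -> str:
    """URL a smoke test would request; this sketch makes no network call."""
    recipe = get_recipe(runtime)
    return recipe.base_url + recipe.smoke_path


def provider_config(runtime: str, model: str) -> dict:
    """Illustrative provider block; the key names are not CodeWhale's schema."""
    recipe = get_recipe(runtime)
    return {"provider": recipe.runtime, "base_url": recipe.base_url, "model": model}
```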

Eval Workset

Reproducible eval harnesses and CodeWhale-specific replay evals.

  • SWE-bench — software-engineering benchmark.
  • Terminal-Bench — terminal-agent benchmark.
  • BFCL — Berkeley Function-Calling Leaderboard.
  • LiveCodeBench — live coding benchmark.
  • CodeWhale replay evals — exported traces become eval suites; baseline-vs-finetune comparisons.
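A baseline-vs-finetune comparison over a replay suite might look like the sketch below, assuming each run reduces to a mapping from task id to pass/fail. The function names and the result keys are hypothetical. Only tasks present in both runs are compared.

```python
from __future__ import annotations


def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of tasks that passed; 0.0 for an empty run."""
    return sum(results.values()) / len(results) if results else 0.0


def compare(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare two replay runs on the tasks they have in common."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        "tasks": len(shared),
        "baseline_pass_rate": pass_rate({t: baseline[t] for t in shared}),
        "candidate_pass_rate": pass_rate({t: candidate[t] for t in shared}),
        # Tasks the baseline solved and the candidate no longer does.
        "regressions": [t for t in shared if baseline[t] and not candidate[t]],
        # Tasks the candidate newly solves.
        "fixes": [t for t in shared if not baseline[t] and candidate[t]],
    }
```

Reporting regressions separately matters because two runs can have the same pass rate while solving different tasks.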

Observability Workset

Trace export and analysis sinks (all opt-in).

  • Per-turn cost / latency / reasoning-token / tool-call / failure stats.
  • Phoenix-style local trace UI.
  • Opik / Langfuse-style export adapters (later).
  • Boundary: opt-in per sink, redaction review mandatory before any external sink.

Training Infra Workset

Hosted execution adapters for fine-tuning / eval runs.

  • Prime Intellect — distributed fine-tuning.
  • Tinker — managed fine-tuning service.
  • HF Jobs — compute workflows on the HF Hub.
  • RunPod / Lambda-style — on-demand GPU rental (later).

/model-lab finetune --provider prime-intellect
/model-lab finetune --provider tinker
/model-lab finetune --provider hf-jobs
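The "no silent hosted routing" rule suggests a registry where a backend is only ever chosen by name and a missing name is an error rather than a fallback. The sketch below uses a stub `dry-run` backend. The registry, the `submit` function, and the backend signature are all hypothetical, and no real hosted API is called.

```python
from typing import Callable, Dict, Optional

# A backend takes a dataset path and returns a job identifier.
Backend = Callable[[str], str]

_BACKENDS: Dict[str, Backend] = {}


def register(name: str) -> Callable[[Backend], Backend]:
    """Decorator that adds a backend under an explicit name."""
    def wrap(fn: Backend) -> Backend:
        _BACKENDS[name] = fn
        return fn
    return wrap


@register("dry-run")
def _dry_run(dataset_path: str) -> str:
    # Stub: a real adapter would submit a job to its service here.
    return f"dry-run:{dataset_path}"


def submit(provider: Optional[str], dataset_path: str) -> str:
    """Dispatch to the named backend; never pick one implicitly."""
    if not provider:
        raise ValueError("an explicit --provider is required")
    if provider not in _BACKENDS:
        raise ValueError(f"unknown provider {provider!r}")
    return _BACKENDS[provider](dataset_path)
```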

Core surface (CodeWhale-native, no workset required)

/model-lab capture                      # mark current session as a candidate trace
/model-lab export                       # selected sessions → redacted JSONL on disk
/model-lab redact                       # local redaction review before any upload
/model-lab dataset --from successful-turns
/model-lab eval --against current
/model-lab promote --if-better          # only swap default model if eval clears bar

Each session export includes prompts, tool calls, diffs, test results, approvals, failures, and final outcome labels — enough to reconstruct the task and grade it. Format is provider-neutral.
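To make the export shape concrete, here is a minimal sketch of one record per session serialized as JSONL, with a filter matching `--from successful-turns`. The `SessionTrace` class, its field names, and the "success" label value are assumptions drawn from the field list above, not a finalized schema.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass


@dataclass
class SessionTrace:
    """Hypothetical export record mirroring the fields listed above."""

    session_id: str
    prompts: list[str]
    tool_calls: list[dict]
    diffs: list[str]
    test_results: list[dict]
    approvals: list[str]
    failures: list[str]
    outcome: str  # assumed labels, e.g. "success" or "failure"


def to_jsonl(traces: list[SessionTrace]) -> str:
    """One JSON object per line, keys sorted for stable diffs."""
    return "\n".join(json.dumps(asdict(t), sort_keys=True) for t in traces)


def successful(traces: list[SessionTrace]) -> list[SessionTrace]:
    """Keep only sessions labeled as successful."""
    return [t for t in traces if t.outcome == "success"]
```

Plain JSONL with no provider-specific fields is one straightforward way to keep the format provider-neutral.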

Strong-shape future flow

/model-lab capture
/model-lab redact
/model-lab dataset --from successful-turns
/model-lab finetune --workset unsloth --base qwen-or-deepseek-or-glm
/model-lab eval --against current
/model-lab promote --if-better
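The ordering in this flow can be sketched as a tiny stage tracker that refuses to run a step out of sequence, ending in a promotion gate. The `Flow` class, the `should_promote` function, and the 0.02 minimum gain are hypothetical. The issue only says the eval must clear a bar, not what the bar is.

```python
STAGES = ("capture", "redact", "dataset", "finetune", "eval", "promote")


class Flow:
    """Tracks progress and rejects stages run out of order."""

    def __init__(self) -> None:
        self.done: list = []

    def run(self, stage: str) -> None:
        if len(self.done) == len(STAGES):
            raise RuntimeError("flow already complete")
        expected = STAGES[len(self.done)]
        if stage != expected:
            raise RuntimeError(f"expected {expected!r} before {stage!r}")
        self.done.append(stage)


def should_promote(
    current_score: float,
    candidate_score: float,
    min_gain: float = 0.02,  # assumed bar; not specified in the issue
) -> bool:
    """Swap the default model only if the candidate beats it by the margin."""
    return candidate_score - current_score >= min_gain
```

Enforcing the order in code means a dataset can never be built from traces that skipped redaction.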

Dependencies (what unblocks this work)

  • Goal mode + Fin wakeup (v0.8.43, #1976 "Goal mode — persistent objective/workflow surface") — the loop that produces measurable outcomes.
  • HF as first-class provider (v0.8.47) — the provider abstraction that makes HF Inference Providers / Router a real config block.
  • Verifier preview (v0.8.46) — generalizes Fin to all claim-of-done events.
  • Eval pipeline (v0.9.0 sub-project B) — the measurement substrate that grades outcomes.

Until those land, Model Lab is design-only. After they land, the export → redact → eval → finetune surface (plus worksets) becomes implementable.

Sequencing

  • v0.8.43 → v0.9.0: stabilize the substrate. No Model Lab implementation yet; this issue stays as living design.
  • Post-v0.9.0 (target v0.10.0): scope a dedicated Model Lab release. First wave: HF + Unsloth + NeMo + Arcee + Serving + Eval. Hosted backends: Prime Intellect + Tinker + HF Jobs.
  • v0.10.x: Observability + Training Infra worksets + community-contributed adapters.

Safety boundary recap

  • Exports require an explicit command — no background telemetry, no session shipping, no auto-fine-tune.
  • Redaction review is mandatory before any upload.
  • Provider adapters and worksets are opt-in per backend / per pack.
  • Each workset declares license, telemetry, network egress, and dependency footprint at install time.
  • Gated HF models require explicit consent before any pull.
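An automated first pass could run before the mandatory redaction review. The sketch below masks strings that look like Hugging Face tokens and simple `key=value` secrets, and reports how many substitutions it made. The patterns and the `redact` function are illustrative assumptions and far from exhaustive, so they would supplement the human review rather than replace it.

```python
import re
from typing import Tuple

# Illustrative patterns only; a real redactor needs a much broader set.
PATTERNS = [
    # Assumes Hugging Face tokens start with "hf_".
    (re.compile(r"hf_[A-Za-z0-9]{20,}"), "<HF_TOKEN>"),
    # Generic "password: value" or "api_key=value" pairs.
    (
        re.compile(r"(?i)\b(api[_-]?key|token|secret|password)\s*[=:]\s*\S+"),
        r"\1=<REDACTED>",
    ),
]


def redact(text: str) -> Tuple[str, int]:
    """Return the masked text and the number of substitutions made."""
    total = 0
    for pattern, replacement in PATTERNS:
        text, count = pattern.subn(replacement, text)
        total += count
    return text, total
```

Returning the substitution count gives the reviewer a quick signal of which exports need the closest look.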
