Model Lab: connect CodeWhale traces to open-weight fine-tuning/eval services

## Goal

Let users turn their own CodeWhale coding sessions into curated datasets, eval suites, and optional fine-tuning jobs for open-weight models. CodeWhale becomes the harness where people can improve open models, not just consume them.

## Where this fits in the stack

- **CodeWhale** is the workbench — runs the agent, owns the traces, owns the consent surface.
- **Providers** are where models run (DeepSeek, Hugging Face Inference Providers, OpenRouter, Novita, Fireworks, Together, Hyperbolic, SiliconFlow, NVIDIA NIM, plus local: vLLM, SGLang, Ollama, llama.cpp, TGI).
- **Worksets** are curated optional capability packs — they extend the lab without bloating core.
- **Fin** is the cheap evaluator / router / verifier seam (Goal mode in v0.8.43, generalized verifier preview in v0.8.46, eval pipeline in v0.9.0).
- **Goal mode** (v0.8.43, #1976) is the long-running loop that turns model changes into measured improvements.

Model Lab is the user-facing surface that sits on top of all of that. It turns the workbench's accumulated traces into datasets, runs eval replays, hands curated data to a fine-tuning backend, and integrates open-source worksets without vendoring them into core.

## Non-goals

- **No automatic upload.** Nothing leaves the user's machine unless they explicitly ran an export command.
- **No hidden telemetry.** Each workset declares its telemetry posture; off by default.
- **No silent hosted routing.** Provider / backend selection is always explicit.
- **No "download random model and trust it."** Every model installed to the lab goes through an explicit install with license + provenance shown.
- **No proprietary-model-first workflow.** Open-weight models are the first-class target.
- **No provider lock-in.** Hosted backends are pluggable; the export format is provider-neutral.
- **No vendoring of heavy GPU / Python / NVIDIA dependencies into CodeWhale core.** Worksets install on-demand.

## Provider vs Workset — the architectural split

- **Providers** live in CodeWhale's provider abstraction (v0.8.47 work). They are where models run. They are not optional; they are how the agent talks to a model. Each one has a config block, an auth path, and parity coverage in v0.8.45.
- **Worksets** live in Model Lab. They are optional capability packs that bring open-source ML tooling into the lab loop. Each one installs on-demand and declares license, telemetry posture, network egress, GPU / Python deps, and what data leaves the machine if any.

**Hugging Face is BOTH** — it is a first-class provider (Inference Providers / Router) AND the open-model registry workset (Hub API, model cards, datasets, adapters, Safetensors, Jobs). The two roles ship through different surfaces; they share auth (`HF_TOKEN` / `HUGGINGFACE_API_KEY` alias).

## Architecture: Worksets as the integration shape

CodeWhale core stays lean. Worksets are curated optional packs.

```
/model-lab worksets list                # show installed + available
/model-lab worksets install hf
/model-lab worksets install unsloth
/model-lab worksets install nemo
/model-lab worksets install arcee
/model-lab worksets install serving
/model-lab worksets install eval
/model-lab worksets install observability
/model-lab worksets install training-infra
/model-lab worksets remove <pack>
/model-lab worksets info <pack>         # license, telemetry, network, deps
```

## Initial worksets

### Hugging Face Workset

The open-model operating system. Even though HF is also a first-class **provider** for inference (lands in v0.8.47), the **workset** is the registry / dataset / adapter / model-card layer.

- **Model discovery**: search open-weight coding models, filter by license, base model, quantization, tool/reasoning support.
- **Model passport**: show license, base model, context length, chat template, tool-call support, reasoning support, eval results, gated / private status. Pulls from HF Hub API + model cards.
- **Hub integration**: private / public model repos, adapter upload, dataset upload, model-card generation.
- **Dataset path**: export CodeWhale transcripts → redact → dataset → train / eval.
- **Runtime bridges**: Transformers, PEFT, TRL, Accelerate, Safetensors, Inference Providers, LightEval.
- **HF Jobs** (later): compute workflows for training/eval runs initiated from the lab.

Auth: `HF_TOKEN` (canonical) or `HUGGINGFACE_API_KEY` (alias). No upload without an explicit command. Boundary: gated models require explicit consent flow before any pull.

Docs: https://huggingface.co/docs/inference-providers · https://huggingface.co/docs/hub/main/api · https://huggingface.co/docs/hub/model-cards · https://huggingface.co/docs/hub/jobs

### Unsloth Workset

Local fine-tuning, oriented around the "make Brother Whale better at my workflow" path.

- SFT / LoRA / QLoRA as the first simple path.
- Later: DPO / GRPO-style preference-based improvement loops.
- Fin as preflight: checks dataset shape, license / provenance, VRAM estimate, model compatibility, and whether the result is actually deployable.
- Output adapters can be pushed to the user's HF repo via the HF Workset.

Docs: https://unsloth.ai/docs

### NeMo Workset (NVIDIA NeMo, Apache-2.0)

- **Data Designer** — synthetic dataset generation from seed traces (https://github.com/NVIDIA-NeMo/DataDesigner). Telemetry opt-in per their README.
- **Curator** — PII / secret redaction (`PiiModifier` over email, person, phone, URL, location). Useful before any trace export leaves the machine.
- **Evaluator** — reproducible model evaluation; OpenAI-compatible API; coding benchmarks; function-calling; long-context; agentic suites (https://github.com/NVIDIA-NeMo/Evaluator).
- **Guardrails** — optional policy rails on prompt injection, tool-use boundaries, data export (https://github.com/NVIDIA-NeMo/Guardrails).
- **Aligner** — later-stage post-training path (https://github.com/NVIDIA/NeMo-Aligner). Not a first-run default.

Boundary: no traces sent to NVIDIA Build or any hosted endpoint by default.

### Arcee Workset (Arcee open-source catalog)

The "make your own model" path — merge, distill, evaluate, serve open-weight checkpoints.

- **Trinity model family** — open-weight multi-turn / tool-use / structured-output (`arcee-ai/Trinity-Large-Thinking`, `Trinity-Large-Preview`, `Trinity-Mini`, `Trinity-Nano-Preview`).
- **MergeKit** (https://github.com/arcee-ai/mergekit) — model-merging library. Build "my coding agent blend" from open checkpoints, run through CodeWhale evals.
- **DistillKit** (https://github.com/arcee-ai/DistillKit) — distillation toolkit.
- **Spectrum + Arcee Fusion** — advanced merge methods as recipes.
- **Coder / Caller / Maestro / Blitz lines** — verify license + availability before first-class.

```
/model-lab arcee merge <recipe>
/model-lab arcee distill <recipe>
/model-lab arcee eval <model>
/model-lab arcee compare <model-a> <model-b>
```

### Serving Workset

Local runtime recipes for self-hosted serving.

- **vLLM** — high-throughput batched serving.
- **SGLang** — structured generation, good for tool-use workloads.
- **TGI** (Text Generation Inference) — HuggingFace's serving stack.
- **llama.cpp / Ollama** — CPU / consumer-GPU path.

Each gets a recipe: install + config + smoke test + which CodeWhale provider config to use to talk to it.

### Eval Workset

Reproducible eval harnesses and CodeWhale-specific replay evals.

- **SWE-bench** — software-engineering benchmark.
- **Terminal-Bench** — terminal-agent benchmark.
- **BFCL** — Berkeley Function-Calling Leaderboard.
- **LiveCodeBench** — live coding benchmark.
- **CodeWhale replay evals** — exported traces become eval suites; baseline-vs-finetune comparisons.

### Observability Workset

Trace export and analysis sinks (all opt-in).

- Per-turn cost / latency / reasoning-token / tool-call / failure stats.
- Phoenix-style local trace UI.
- Opik / Langfuse-style export adapters (later).
- Boundary: opt-in per sink, redaction review mandatory before any external sink.

### Training Infra Workset

Hosted execution adapters for fine-tuning / eval runs.

- **Prime Intellect** — distributed fine-tuning.
- **Tinker** — managed fine-tuning service.
- **HF Jobs** — compute workflows on the HF Hub.
- **RunPod / Lambda-style** — on-demand GPU rental (later).

```
/model-lab finetune --provider prime-intellect
/model-lab finetune --provider tinker
/model-lab finetune --provider hf-jobs
```

## Core surface (CodeWhale-native, no workset required)

```
/model-lab capture                      # mark current session as a candidate trace
/model-lab export                       # selected sessions → redacted JSONL on disk
/model-lab redact                       # local redaction review before any upload
/model-lab dataset --from successful-turns
/model-lab eval --against current
/model-lab promote --if-better          # only swap default model if eval clears bar
```

Each session export includes prompts, tool calls, diffs, test results, approvals, failures, and final outcome labels — enough to reconstruct the task and grade it. Format is provider-neutral.

## Strong-shape future flow

```
/model-lab capture
/model-lab redact
/model-lab dataset --from successful-turns
/model-lab finetune --workset unsloth --base qwen-or-deepseek-or-glm
/model-lab eval --against current
/model-lab promote --if-better
```

## Dependencies (what unblocks this work)

- **Goal mode + Fin wakeup** (v0.8.43, #1976) — the loop that produces measurable outcomes.
- **HF as first-class provider** (v0.8.47) — the provider abstraction that makes HF Inference Providers / Router a real config block.
- **Verifier preview** (v0.8.46) — generalizes Fin to all claim-of-done events.
- **Eval pipeline** (v0.9.0 sub-project B) — the measurement substrate that grades outcomes.

Until those land, Model Lab is design-only. After they land, the export → redact → eval → finetune surface (plus worksets) becomes implementable.

## Sequencing

- **v0.8.43 → v0.9.0**: stabilize the substrate. No Model Lab implementation yet; this issue stays as living design.
- **Post-v0.9.0 (target v0.10.0)**: scope a dedicated Model Lab release. First wave: HF + Unsloth + NeMo + Arcee + Serving + Eval. Hosted backends: Prime Intellect + Tinker + HF Jobs.
- **v0.10.x**: Observability + Training Infra worksets + community-contributed adapters.

## Safety boundary recap

- Exports require an explicit command — no background telemetry, no session shipping, no auto-fine-tune.
- Redaction review is mandatory before any upload.
- Provider adapters and worksets are opt-in per backend / per pack.
- Each workset declares license, telemetry, network egress, and dependency footprint at install time.
- Gated HF models require explicit consent before any pull.

## Related

- #1976 (v0.8.43) Goal mode — the loop this depends on.
- v0.8.46 verifier preview — the generalized Fin path.
- v0.8.47 milestone — HF as first-class provider lands here.
- v0.9.0 milestone — the eval pipeline that grades the outcomes.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Model Lab: connect CodeWhale traces to open-weight fine-tuning/eval services #1977

Goal

Where this fits in the stack

Non-goals

Provider vs Workset — the architectural split

Architecture: Worksets as the integration shape

Initial worksets

Hugging Face Workset

Unsloth Workset

NeMo Workset (NVIDIA NeMo, Apache-2.0)

Arcee Workset (Arcee open-source catalog)

Serving Workset

Eval Workset

Observability Workset

Training Infra Workset

Core surface (CodeWhale-native, no workset required)

Strong-shape future flow

Dependencies (what unblocks this work)

Sequencing

Safety boundary recap

Related

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Model Lab: connect CodeWhale traces to open-weight fine-tuning/eval services #1977

Description

Goal

Where this fits in the stack

Non-goals

Provider vs Workset — the architectural split

Architecture: Worksets as the integration shape

Initial worksets

Hugging Face Workset

Unsloth Workset

NeMo Workset (NVIDIA NeMo, Apache-2.0)

Arcee Workset (Arcee open-source catalog)

Serving Workset

Eval Workset

Observability Workset

Training Infra Workset

Core surface (CodeWhale-native, no workset required)

Strong-shape future flow

Dependencies (what unblocks this work)

Sequencing

Safety boundary recap

Related

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions