Let users turn their own CodeWhale coding sessions into curated datasets, eval suites, and optional fine-tuning jobs for open-weight models. CodeWhale becomes the harness where people can improve open models, not just consume them.
Where this fits in the stack
CodeWhale is the workbench — runs the agent, owns the traces, owns the consent surface.
Providers are where models run (DeepSeek, Hugging Face Inference Providers, OpenRouter, Novita, Fireworks, Together, Hyperbolic, SiliconFlow, NVIDIA NIM, plus local: vLLM, SGLang, Ollama, llama.cpp, TGI).
Worksets are curated optional capability packs — they extend the lab without bloating core.
Fin is the cheap evaluator / router / verifier seam (Goal mode in v0.8.43, generalized verifier preview in v0.8.46, eval pipeline in v0.9.0).
Model Lab is the user-facing surface that sits on top of all of that. It turns the workbench's accumulated traces into datasets, runs eval replays, hands curated data to a fine-tuning backend, and integrates open-source worksets without vendoring them into core.
Non-goals
No automatic upload. Nothing leaves the user's machine unless they explicitly ran an export command.
No hidden telemetry. Each workset declares its telemetry posture; off by default.
No silent hosted routing. Provider / backend selection is always explicit.
No "download random model and trust it." Every model installed to the lab goes through an explicit install with license + provenance shown.
No proprietary-model-first workflow. Open-weight models are the first-class target.
No provider lock-in. Hosted backends are pluggable; the export format is provider-neutral.
No vendoring of heavy GPU / Python / NVIDIA dependencies into CodeWhale core. Worksets install on-demand.
Provider vs Workset — the architectural split
Providers live in CodeWhale's provider abstraction (v0.8.47 work). They are where models run. They are not optional; they are how the agent talks to a model. Each one has a config block, an auth path, and parity coverage in v0.8.45.
Worksets live in Model Lab. They are optional capability packs that bring open-source ML tooling into the lab loop. Each one installs on-demand and declares license, telemetry posture, network egress, GPU / Python deps, and what data leaves the machine if any.
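A minimal sketch of what such a declaration could look like, assuming a Python-side manifest. The class name, field names, and the `install_blockers` check are hypothetical illustrations of the properties listed above, not a settled CodeWhale schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorksetManifest:
    """Hypothetical declaration a workset ships with; field names are illustrative."""
    name: str
    license: str
    telemetry: str = "off"                      # off by default, per the non-goals
    network_egress: tuple[str, ...] = ()        # hosts the workset may contact
    gpu_required: bool = False
    python_deps: tuple[str, ...] = ()
    data_leaving_machine: tuple[str, ...] = ()  # empty means nothing leaves


def install_blockers(m: WorksetManifest) -> list[str]:
    """Return the reasons an install should pause for explicit user review."""
    blockers = []
    if not m.license:
        blockers.append("license is not declared")
    if m.telemetry != "off":
        blockers.append(f"telemetry is '{m.telemetry}', expected 'off' by default")
    if m.data_leaving_machine and not m.network_egress:
        blockers.append("data is declared to leave the machine but no egress host is listed")
    return blockers


example = WorksetManifest(name="example-workset", license="Apache-2.0")
print(install_blockers(example))  # prints []
```

Keeping the manifest declarative means the installer can show license, telemetry, and egress to the user before anything is downloaded.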
Hugging Face is BOTH — it is a first-class provider (Inference Providers / Router) AND the open-model registry workset (Hub API, model cards, datasets, adapters, Safetensors, Jobs). The two roles ship through different surfaces; they share auth (HF_TOKEN / HUGGINGFACE_API_KEY alias).
Architecture: Worksets as the integration shape
CodeWhale core stays lean. Worksets are curated optional packs.
The open-model operating system. Even though HF is also a first-class provider for inference (lands in v0.8.47), the workset is the registry / dataset / adapter / model-card layer.
Model discovery: search open-weight coding models, filter by license, base model, quantization, tool/reasoning support.
Model passport: show license, base model, context length, chat template, tool-call support, reasoning support, eval results, gated / private status. Pulls from HF Hub API + model cards.
Hub integration: private / public model repos, adapter upload, dataset upload, model-card generation.
HF Jobs (later): compute workflows for training/eval runs initiated from the lab.
Auth: HF_TOKEN (canonical) or HUGGINGFACE_API_KEY (alias). No upload without an explicit command. Boundary: gated models require explicit consent flow before any pull.
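The shared auth lookup could be sketched as below. The function name is hypothetical, and the precedence (canonical `HF_TOKEN` wins over the `HUGGINGFACE_API_KEY` alias when both are set) is an assumption; the text above only says which name is canonical.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the Hugging Face token, preferring the canonical variable.

    Assumption: HF_TOKEN takes precedence over the HUGGINGFACE_API_KEY alias.
    """
    env = os.environ if env is None else env
    for name in ("HF_TOKEN", "HUGGINGFACE_API_KEY"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None  # no token: callers should refuse uploads and gated pulls
```

Passing `env` explicitly keeps the lookup testable without touching the real environment.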
Boundary: no traces sent to NVIDIA Build or any hosted endpoint by default.
Arcee Workset (Arcee open-source catalog)
The "make your own model" path — merge, distill, evaluate, serve open-weight checkpoints.
Trinity model family — open-weight multi-turn / tool-use / structured-output (arcee-ai/Trinity-Large-Thinking, Trinity-Large-Preview, Trinity-Mini, Trinity-Nano-Preview).
MergeKit (https://github.com/arcee-ai/mergekit) — model-merging library. Build "my coding agent blend" from open checkpoints, run through CodeWhale evals.
Core surface (CodeWhale-native, no workset required)
/model-lab capture # mark current session as a candidate trace
/model-lab export # selected sessions → redacted JSONL on disk
/model-lab redact # local redaction review before any upload
/model-lab dataset --from successful-turns
/model-lab eval --against current
/model-lab promote --if-better # only swap default model if eval clears bar
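One way the `promote --if-better` gate might be expressed, as a sketch only. `EvalResult`, `should_promote`, and the `min_gain` threshold are hypothetical; the issue does not define what "clears bar" means, so a fixed pass-rate margin on the same task set is assumed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    model: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


def should_promote(current: EvalResult, candidate: EvalResult, min_gain: float = 0.02) -> bool:
    """Swap the default model only if the candidate beats it by at least min_gain.

    Assumption: both results come from the same replay suite, so the task
    counts must match; otherwise the comparison is refused.
    """
    if candidate.total == 0 or candidate.total != current.total:
        return False
    return candidate.pass_rate - current.pass_rate >= min_gain
```

Refusing to compare results from different suites keeps a smaller or easier eval run from triggering a promotion.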
Each session export includes prompts, tool calls, diffs, test results, approvals, failures, and final outcome labels — enough to reconstruct the task and grade it. Format is provider-neutral.
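As an illustration of that export shape, the sketch below models one session as a record and serializes a batch to JSONL. The field names mirror the list above, but the exact schema, the `SessionRecord` name, and the `"success"` outcome label are assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SessionRecord:
    """Hypothetical provider-neutral export record; one per session."""
    session_id: str
    prompts: list[str] = field(default_factory=list)
    tool_calls: list[dict] = field(default_factory=list)
    diffs: list[str] = field(default_factory=list)
    test_results: list[dict] = field(default_factory=list)
    approvals: list[dict] = field(default_factory=list)
    failures: list[str] = field(default_factory=list)
    outcome: str = "unknown"  # final outcome label, e.g. "success" or "failure"


def to_jsonl(records: list[SessionRecord]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(asdict(r), sort_keys=True) + "\n" for r in records)


def successful(records: list[SessionRecord]) -> list[SessionRecord]:
    """Keep only sessions labeled as successful (assumed label value)."""
    return [r for r in records if r.outcome == "success"]
```

A filter like `successful` is roughly what `/model-lab dataset --from successful-turns` would need, applied after redaction and before anything is written for upload.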
Initial worksets
Hugging Face Workset
The open-model operating system. Even though HF is also a first-class provider for inference (lands in v0.8.47), the workset is the registry / dataset / adapter / model-card layer.
Auth: HF_TOKEN (canonical) or HUGGINGFACE_API_KEY (alias). No upload without an explicit command.
Boundary: gated models require explicit consent flow before any pull.
Docs: https://huggingface.co/docs/inference-providers · https://huggingface.co/docs/hub/main/api · https://huggingface.co/docs/hub/model-cards · https://huggingface.co/docs/hub/jobs
Unsloth Workset
Local fine-tuning, oriented around the "make Brother Whale better at my workflow" path.
Docs: https://unsloth.ai/docs
NeMo Workset (NVIDIA NeMo, Apache-2.0)
PiiModifier over email, person, phone, URL, location). Useful before any trace export leaves the machine.
Boundary: no traces sent to NVIDIA Build or any hosted endpoint by default.
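To make the local redaction step concrete, here is a small standard-library stand-in. It is not NeMo's PiiModifier and covers only emails and URLs; the patterns, placeholder tokens, and the `redact` name are assumptions chosen for illustration. Person and location detection need a real entity recognizer and are out of scope for a regex sketch.

```python
import re

# Deliberately simple patterns; real PII detection needs a dedicated tool.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL = re.compile(r"https?://[^\s\"'<>]+")


def redact(text: str) -> tuple[str, dict[str, int]]:
    """Replace emails and URLs with placeholders and report how many were hit."""
    counts: dict[str, int] = {}
    # URLs first, so an address embedded in a URL is not half-replaced.
    text, counts["url"] = URL.subn("<URL>", text)
    text, counts["email"] = EMAIL.subn("<EMAIL>", text)
    return text, counts
```

Returning the counts alongside the text gives a review step something to show the user before an export is approved.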
Arcee Workset (Arcee open-source catalog)
The "make your own model" path — merge, distill, evaluate, serve open-weight checkpoints.
Trinity model family: open-weight multi-turn / tool-use / structured-output (arcee-ai/Trinity-Large-Thinking, Trinity-Large-Preview, Trinity-Mini, Trinity-Nano-Preview).
Serving Workset
Local runtime recipes for self-hosted serving.
Each gets a recipe: install + config + smoke test + which CodeWhale provider config to use to talk to it.
Eval Workset
Reproducible eval harnesses and CodeWhale-specific replay evals.
Observability Workset
Trace export and analysis sinks (all opt-in).
Training Infra Workset
Hosted execution adapters for fine-tuning / eval runs.
Strong-shape future flow
Dependencies (what unblocks this work)
Until those land, Model Lab is design-only. After they land, the export → redact → eval → finetune surface (plus worksets) becomes implementable.
Sequencing
Safety boundary recap
Related