## The problem.

Small and medium AI/ML research teams operate under constant pressure to ship new models on tight compute budgets and shared hardware. Inefficient experimentation (rediscovering known dead ends, running redundant ablations, retrying approaches that have already failed internally) burns iteration cycles and tens to hundreds of thousands of dollars of compute per quarter. The institutional knowledge of "what's already been tried" lives scattered across W&B runs, GitHub repos, Notion docs, Slack threads, and tribal memory ("I think we tried that?"). Newly published research that would have saved a week of work goes unread.

## The solution.

Optima is the experiment intelligence layer that lives in your terminal. You type a research question; a small team of Claude agents pulls the relevant published papers and your team's own past experiments, docs, and results, then returns one actionable recommendation: what to try next, why, a concrete experiment spec (model, method, key hyperparameters), a compute-cost estimate with savings vs. the naive approach, and per-claim confidence with citations.

## How it works.

  1. A cheap Claude Haiku 4.5 intent pass routes the query and decides which evidence agents to run.
  2. Two Claude Sonnet 4.6 agents run concurrently (via asyncio.gather):
    • A Research Agent searches arXiv + the Semantic Scholar API with a curated local cache.
    • A Context Agent searches the team's internal store of past enterprise-level experiments and documents using a compact cached index plus keyword + per-id lookup tools.
  3. A Synthesis Agent (also Sonnet 4.6) is forced to emit a structured recommendation via tool use, with a code-enforced citation firewall: any citation pointing at evidence the agents didn't actually gather is dropped before reaching the user. A hallucinated reference cannot reach the output.
  4. The result renders as: a Decision Summary, Ranked Evidence with clickable arXiv / Semantic Scholar links, an Experiment Spec with cost estimate, and Claims & Confidence tagged 🟢 High / 🟡 Medium / 🔴 Low.
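The evidence and firewall steps above can be sketched roughly as follows. This is a minimal illustration, not Optima's actual code: `Evidence`, `search_papers`, `search_experiments`, `gather_evidence`, `apply_citation_firewall`, and `answer` are hypothetical names, the two search functions are stubs that return placeholder IDs instead of calling Claude, arXiv, or Semantic Scholar, and the claim shape (a dict with a `citations` list) is an assumption.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    id: str       # e.g. a paper ID or an internal experiment ID
    source: str   # "paper" or "experiment"
    summary: str


def search_papers(query: str) -> list[Evidence]:
    # Hypothetical stand-in for the blocking paper lookup; the ID is a placeholder.
    return [Evidence("arxiv:0000.00001", "paper", f"placeholder paper for {query!r}")]


def search_experiments(query: str) -> list[Evidence]:
    # Hypothetical stand-in for the internal experiment-store lookup.
    return [Evidence("exp-001", "experiment", f"placeholder run for {query!r}")]


async def gather_evidence(query: str) -> list[Evidence]:
    # Run both blocking searches concurrently, each in a worker thread.
    papers, experiments = await asyncio.gather(
        asyncio.to_thread(search_papers, query),
        asyncio.to_thread(search_experiments, query),
    )
    return papers + experiments


def apply_citation_firewall(claims: list[dict], evidence: list[Evidence]) -> list[dict]:
    # Keep only citations whose ID matches evidence that was actually gathered.
    known_ids = {item.id for item in evidence}
    return [
        {**claim, "citations": [c for c in claim["citations"] if c in known_ids]}
        for claim in claims
    ]


async def answer(query: str, draft_claims: list[dict]) -> list[dict]:
    # draft_claims stands in for the synthesis agent's structured output.
    evidence = await gather_evidence(query)
    return apply_citation_firewall(draft_claims, evidence)
```

Calling `asyncio.run(answer("lr warmup", draft))` with a draft claim that cites both `exp-001` and an ID that was never gathered returns the claim with only `exp-001` left in its citations. Filtering in plain code after synthesis, rather than asking the model to self-check, is what makes the guarantee independent of model behavior.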

## What was built.

  • Five-agent Claude system: intent (Haiku), research (Sonnet), context (Sonnet), synthesis (Sonnet), and ingest (Haiku), sharing a single async tool-use loop with one cache_control prompt-cache breakpoint and blocking tool calls offloaded via asyncio.to_thread.
  • Code-enforced citation firewall that drops any reference pointing at an ID not actually in the gathered evidence, so a hallucinated paper or experiment ID can't reach the output.
  • Live paper search with offline cache fallback: tries arXiv + Semantic Scholar, then falls back to a curated local cache of ~24 real papers on any failure or 403 so the demo works anywhere.
  • Canonical relationship-aware schema for experiments, including parent / related experiment IDs so the context agent traces lineages rather than treating history as a bag of disconnected runs.
  • Haiku-powered CSV ingest that normalizes messy team CSVs (odd column names, free-text metrics, "55 dollars", mixed date formats) into the canonical schema via a forced tool call, validated by pydantic, idempotent on experiment_id.
  • CLI surface: optima "", optima init (industry + API keys), optima ingest, optima status, optima experiments, optima papers.
  • Onboarding flow with industry-tuned search injected at the paper-search wire level so it's guaranteed on every live AND cache query regardless of the agent's term choices.
  • Terminal UX: live spinner with elapsed time fed by orchestrator phase callbacks (intent → evidence → synthesis), and a dynamic ASCII banner with a query-seeded SHA-256 sparkline fingerprint that's unique per run.
  • 40 offline tests that pass with no API key and no network: mocked Anthropic client, simulated arXiv/S2 403s, citation-firewall verification, env-file upsert, and every new command.
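The live-search-with-cache-fallback behavior in the list above might look something like the sketch below. It is an assumption-laden illustration rather than the project's implementation: `search_with_fallback`, `LiveSearchError`, the dict-based paper shape, and the keyword match against cached titles are all hypothetical, and the live sources are passed in as plain callables so the example runs without a network.

```python
from typing import Callable, Iterable

Paper = dict  # minimal stand-in, e.g. {"id": "...", "title": "..."}


class LiveSearchError(Exception):
    """Hypothetical error a live source raises on HTTP failures such as 403."""


def search_with_fallback(
    query: str,
    live_sources: Iterable[Callable[[str], list]],
    cache: list,
) -> tuple:
    """Return (papers, origin), where origin is "live" or "cache"."""
    try:
        # Query every live source; any exception aborts the live path.
        results = [paper for source in live_sources for paper in source(query)]
        return results, "live"
    except Exception:
        # Any failure (network error, 403, bad payload) falls back to the
        # local cache, matched here by a simple keyword test on titles.
        terms = query.lower().split()
        hits = [p for p in cache if any(t in p["title"].lower() for t in terms)]
        return hits, "cache"
```

Because the sources are injected, an offline test can pass a callable that raises `LiveSearchError("403 Forbidden")` and assert that the result comes from the cache, which is one way the simulated-403 tests mentioned above could be structured.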

## What I'd add next

  • Have Mindfort's findings appear in Optima's context agent. This would streamline much of the work ML teams put toward security and compliance.
  • Add more migrations, such as Kubeflow, Azure ML, and Apache Airflow.

## Built With

  • anthropicsdk
  • pydantic
  • python
  • semanticscholar