A 0.5B model. A local-first framework. Specialised agents that fit on a Mac Mini.
Four agents, each tuned to a single cognitive task. Each handoff is a typed envelope, not a paragraph.
The Analyst synthesises market state, indicators, and external signals into a structured read of the setup.
A monolithic prompt that tries to analyse, evaluate risk, size, and execute conflates four different cognitive tasks. Each task has its own evaluation criteria; collapsing them produces decisions that are hard to inspect and harder to improve.
Specialised agents isolate failure modes. An over-confident Analyst is easy to detect when its output passes through a Risk agent that disagrees on a measurable axis. A monolithic model hides the same disagreement inside a single chain of thought.
Per-agent evaluation becomes a tractable subproblem rather than a holistic judgement.
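Concretely, the envelope contract might look like the following TypeScript sketch. Field names mirror the sample envelope below; the agent names beyond Analyst and Risk, the score ranges, and the boundary check are assumptions drawn from the surrounding copy, not the draft schema spec.

```typescript
// Illustrative sketch of the typed envelope, not the draft schema spec.
// Agent names beyond Analyst/Risk and the score ranges are assumptions.

type AgentName = "Analyst" | "Risk" | "Sizing" | "Execution";

interface Evidence {
  tool: string;   // tool call that produced the signal
  weight: number; // relative contribution to the payload scores
}

interface AnalystPayload {
  marketState: string;     // e.g. "contested_breakout"
  technicalScore: number;  // assumed range [-1, 1]
  sentimentScore: number;  // assumed range [-1, 1]
  primaryDriver: string;
  evidence: Evidence[];
}

interface Trace {
  agentTurnId: string;
  promptCharsIn: number;
  completionCharsOut: number;
  toolCallCount: number;
}

interface Envelope<P> {
  from: AgentName;
  to: AgentName;
  schemaVersion: string; // "1.0" in the sample below
  payload: P;
  trace: Trace;
}

// Minimal boundary check a downstream agent could run before consuming.
function isEnvelope(raw: unknown): raw is Envelope<unknown> {
  if (typeof raw !== "object" || raw === null) return false;
  const e = raw as Record<string, unknown>;
  return (
    e.schemaVersion === "1.0" && // reject unknown schema versions outright
    typeof e.from === "string" &&
    typeof e.to === "string" &&
    typeof e.payload === "object" && e.payload !== null &&
    typeof e.trace === "object" && e.trace !== null
  );
}
```

Version-gating at the boundary lets a downstream agent reject an envelope it was never built against instead of guessing.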
A sample envelope, Analyst to Risk:

```json
{
  "from": "Analyst",
  "to": "Risk",
  "schemaVersion": "1.0",
  "payload": {
    "marketState": "contested_breakout",
    "technicalScore": 0.71,
    "sentimentScore": -0.34,
    "primaryDriver": "regulatory_overhang",
    "evidence": [
      { "tool": "fetch_market_data", "weight": 0.4 },
      { "tool": "query_news_sentiment", "weight": 0.6 }
    ]
  },
  "trace": {
    "agentTurnId": "t_0001",
    "promptCharsIn": 4218,
    "completionCharsOut": 1794,
    "toolCallCount": 3
  }
}
```

Watch four agents reason through a representative scenario. Each turn streams thinking, tool calls, and structured output.
Mid-cap equity shows a textbook breakout on the daily chart while news flow is unambiguously negative following a regulatory inquiry.
Held-out scenarios across multiple seeds. The framework is one variable, the model is another, and the contribution is the combination. Higher is better.
The evaluation harness is still being built. Final scores will land alongside the methodology in the project's research write-up. The table below shows the comparison surface we plan to publish, not measurements.
| Scenario | Fin Nano + framework | Fin Nano alone | Claude Opus 4.7 + framework | GPT-5 + framework | Frontier (no framework) |
|---|---|---|---|---|---|
| Conflicting signals | tbd | tbd | tbd | tbd | tbd |
| Volatility spike | tbd | tbd | tbd | tbd | tbd |
| Low-conviction setup | tbd | tbd | tbd | tbd | tbd |
| Regime shift mid-trace | tbd | tbd | tbd | tbd | tbd |
| Tool-output ambiguity | tbd | tbd | tbd | tbd | tbd |
| Composite (held-out, n=240) | tbd | tbd | tbd | tbd | tbd |
Scenario list and comparison surface are illustrative of the planned harness; numbers will be published alongside the methodology in the project's research write-up.
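Once the harness lands, the composite column reduces to a loop over scenarios and seeds. A sketch under the design described above: `runPipeline` is a stub for the real entry point, and the seed count is illustrative (the write-up will state the real n).

```typescript
// Sketch of the planned harness loop: held-out scenarios, multiple
// seeds, scalar scores where higher is better. Scenario ids come from
// the table above; seed count and scoring are illustrative.

type Condition =
  | "fin-nano+framework"
  | "fin-nano-alone"
  | "frontier+framework"
  | "frontier-alone";

async function runPipeline(
  _scenario: string,
  _condition: Condition,
  _seed: number
): Promise<number> {
  return 0; // stub: wire up the real pipeline entry point here
}

const scenarios = [
  "conflicting_signals",
  "volatility_spike",
  "low_conviction_setup",
  "regime_shift_mid_trace",
  "tool_output_ambiguity",
];
const seeds = [0, 1, 2, 3, 4]; // illustrative, not the published n

async function composite(condition: Condition): Promise<number> {
  let total = 0;
  for (const scenario of scenarios) {
    for (const seed of seeds) {
      total += await runPipeline(scenario, condition, seed);
    }
  }
  // Unweighted mean over scenarios x seeds; the published composite
  // may weight scenarios differently.
  return total / (scenarios.length * seeds.length);
}
```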
Categories where we expect the system to underperform its frontier baselines. Verified results will be reported with the evaluation.
When the underlying regime changes between Analyst and Execution, the system may rely on stale framing. A regime-detection step before re-entry could close some of this gap.
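One shape that regime-detection step could take, as a sketch only; the `MarketSnapshot` fields and the drift threshold are assumptions, not framework API.

```typescript
// Sketch of the regime-detection guard proposed above: before the
// Execution agent acts, compare the snapshot the Analyst reasoned
// over against a fresh one, and re-enter at analysis on drift.
// Field choice and the 0.2 threshold are illustrative assumptions.

interface MarketSnapshot {
  marketState: string;    // e.g. "contested_breakout"
  technicalScore: number; // same scale as the Analyst payload
}

function regimeStillHolds(
  analystView: MarketSnapshot,
  current: MarketSnapshot,
  maxDrift = 0.2
): boolean {
  // A changed label or a large score drift means the Analyst's
  // framing is stale and Execution should not act on it.
  return (
    current.marketState === analystView.marketState &&
    Math.abs(current.technicalScore - analystView.technicalScore) <= maxDrift
  );
}
```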
Tool calls that return marginally usable data (sentiment scores in the noise band, low-volume indicators) can get over-weighted by downstream agents.
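A simple guard against that over-weighting, again only a sketch: the `score` field and the band width are illustrative, not part of today's envelope.

```typescript
// Sketch: zero out the weight of evidence whose signal magnitude sits
// inside the noise band, so marginal tool outputs cannot dominate a
// downstream agent's read. The band width (0.1) is an untuned example.

interface ScoredEvidence {
  tool: string;
  score: number;  // signed signal strength, assumed in [-1, 1]
  weight: number; // contribution assigned by the upstream agent
}

function attenuateNoise(
  evidence: ScoredEvidence[],
  band = 0.1
): ScoredEvidence[] {
  return evidence.map((e) =>
    Math.abs(e.score) < band ? { ...e, weight: 0 } : e
  );
}
```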
When the scenario lacks a clean prior — no analog in the historical search, no consensus across indicators — the pipeline may over-reach rather than standing aside.
Conditional add/exit rules are harder to calibrate: in evaluation, their conversion rate is harder to anchor than that of a single-step decision.
A 0.5B model fine-tuned for schema fluency may produce valid JSON in unfamiliar scenarios while the underlying reasoning is shallow. The framework's typed contracts protect downstream agents from malformed input but cannot detect this kind of confident-but-wrong output. Detection requires either a reward model (out of scope for v0.1) or human spot-checking.
Fin Nano and the BigBugAI Fin framework are in active development. The website shows the design; the artifacts are forthcoming.
| Component | Status | Target |
|---|---|---|
| Schema spec | in draft | this month |
| Fin Nano v0.1 | training pipeline ready | 2-3 weeks |
| BigBugAI Fin v0.1 | design locked | 6-8 weeks |
Targets are estimates. The site will be updated as components ship.