A 0.5B model. A local-first framework. Specialised agents that fit on a Mac Mini.
Four agents, each tuned to a single cognitive task. Each handoff is a typed envelope, not a paragraph.
The Analyst synthesises market state, indicators, and external signals into a structured read of the setup.
A monolithic prompt that tries to analyse, evaluate risk, size, and execute conflates four different cognitive tasks. Each task has its own evaluation criteria; collapsing them produces decisions that are hard to inspect and harder to improve.
Specialised agents isolate failure modes. An over-confident Analyst is easy to detect when its output passes through a Risk agent that disagrees on a measurable axis. A monolithic model hides the same disagreement inside a single chain of thought.
Per-agent evaluation becomes a tractable subproblem rather than a holistic judgement.
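Concretely, the envelope contract might look like the following TypeScript sketch. Field names mirror the sample envelope below; the agent names beyond Analyst and Risk, the score ranges, and the boundary check are assumptions drawn from the surrounding copy, not the draft schema spec.

```typescript
// Illustrative sketch of the typed envelope, not the draft schema spec.
// Agent names beyond Analyst/Risk and the score ranges are assumptions.

type AgentName = "Analyst" | "Risk" | "Sizing" | "Execution";

interface Evidence {
  tool: string;   // tool call that produced the signal
  weight: number; // relative contribution to the payload scores
}

interface AnalystPayload {
  marketState: string;     // e.g. "contested_breakout"
  technicalScore: number;  // assumed range [-1, 1]
  sentimentScore: number;  // assumed range [-1, 1]
  primaryDriver: string;
  evidence: Evidence[];
}

interface Trace {
  agentTurnId: string;
  promptCharsIn: number;
  completionCharsOut: number;
  toolCallCount: number;
}

interface Envelope<P> {
  from: AgentName;
  to: AgentName;
  schemaVersion: string; // "1.0" in the sample below
  payload: P;
  trace: Trace;
}

// Minimal boundary check a downstream agent could run before consuming.
function isEnvelope(raw: unknown): raw is Envelope<unknown> {
  if (typeof raw !== "object" || raw === null) return false;
  const e = raw as Record<string, unknown>;
  return (
    e.schemaVersion === "1.0" && // reject unknown schema versions outright
    typeof e.from === "string" &&
    typeof e.to === "string" &&
    typeof e.payload === "object" && e.payload !== null &&
    typeof e.trace === "object" && e.trace !== null
  );
}
```

Version-gating at the boundary lets a downstream agent reject an envelope it was never built against instead of guessing.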
A sample envelope, Analyst to Risk:

```json
{
  "from": "Analyst",
  "to": "Risk",
  "schemaVersion": "1.0",
  "payload": {
    "marketState": "contested_breakout",
    "technicalScore": 0.71,
    "sentimentScore": -0.34,
    "primaryDriver": "regulatory_overhang",
    "evidence": [
      { "tool": "fetch_market_data", "weight": 0.4 },
      { "tool": "query_news_sentiment", "weight": 0.6 }
    ]
  },
  "trace": {
    "agentTurnId": "t_0001",
    "promptCharsIn": 4218,
    "completionCharsOut": 1794,
    "toolCallCount": 3
  }
}
```

Watch four agents reason through a representative scenario. Each turn streams thinking, tool calls, and structured output.
Mid-cap equity shows a textbook breakout on the daily chart while news flow is unambiguously negative following a regulatory inquiry.
Held-out scenarios across multiple seeds. The framework is one variable, the model is another, and the contribution is the combination. Higher is better.
The evaluation harness is still being built. Final scores will land alongside the methodology in the project's research write-up. The table below shows the comparison surface we plan to publish, not measurements.
| Scenario | Fin Nano + framework | Fin Nano alone | Claude Opus 4.7 + framework | GPT-5 + framework | Frontier (no framework) |
|---|---|---|---|---|---|
| Conflicting signals | tbd | tbd | tbd | tbd | tbd |
| Volatility spike | tbd | tbd | tbd | tbd | tbd |
| Low-conviction setup | tbd | tbd | tbd | tbd | tbd |
| Regime shift mid-trace | tbd | tbd | tbd | tbd | tbd |
| Tool-output ambiguity | tbd | tbd | tbd | tbd | tbd |
| Composite (held-out, n=240) | tbd | tbd | tbd | tbd | tbd |
Scenario list and comparison surface are illustrative of the planned harness; numbers will be published alongside the methodology in the project's research write-up.
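Once the harness lands, the composite column reduces to a loop over scenarios and seeds. A sketch under the design described above: `runPipeline` is a stub for the real entry point, and the seed count is illustrative (the write-up will state the real n).

```typescript
// Sketch of the planned harness loop: held-out scenarios, multiple
// seeds, scalar scores where higher is better. Scenario ids come from
// the table above; seed count and scoring are illustrative.

type Condition =
  | "fin-nano+framework"
  | "fin-nano-alone"
  | "frontier+framework"
  | "frontier-alone";

async function runPipeline(
  _scenario: string,
  _condition: Condition,
  _seed: number
): Promise<number> {
  return 0; // stub: wire up the real pipeline entry point here
}

const scenarios = [
  "conflicting_signals",
  "volatility_spike",
  "low_conviction_setup",
  "regime_shift_mid_trace",
  "tool_output_ambiguity",
];
const seeds = [0, 1, 2, 3, 4]; // illustrative, not the published n

async function composite(condition: Condition): Promise<number> {
  let total = 0;
  for (const scenario of scenarios) {
    for (const seed of seeds) {
      total += await runPipeline(scenario, condition, seed);
    }
  }
  // Unweighted mean over scenarios x seeds; the published composite
  // may weight scenarios differently.
  return total / (scenarios.length * seeds.length);
}
```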
Categories where we expect the system to underperform its frontier baselines. Verified results will be reported with the evaluation.
When the underlying regime changes between Analyst and Execution, the system may rely on stale framing. A regime-detection step before re-entry could close some of this gap.
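One shape that regime-detection step could take, as a sketch only; the `MarketSnapshot` fields and the drift threshold are assumptions, not framework API.

```typescript
// Sketch of the regime-detection guard proposed above: before the
// Execution agent acts, compare the snapshot the Analyst reasoned
// over against a fresh one, and re-enter at analysis on drift.
// Field choice and the 0.2 threshold are illustrative assumptions.

interface MarketSnapshot {
  marketState: string;    // e.g. "contested_breakout"
  technicalScore: number; // same scale as the Analyst payload
}

function regimeStillHolds(
  analystView: MarketSnapshot,
  current: MarketSnapshot,
  maxDrift = 0.2
): boolean {
  // A changed label or a large score drift means the Analyst's
  // framing is stale and Execution should not act on it.
  return (
    current.marketState === analystView.marketState &&
    Math.abs(current.technicalScore - analystView.technicalScore) <= maxDrift
  );
}
```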
Tool calls that return marginally usable data (sentiment scores in the noise band, low-volume indicators) can get over-weighted by downstream agents.
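A simple guard against that over-weighting, again only a sketch: the `score` field and the band width are illustrative, not part of today's envelope.

```typescript
// Sketch: zero out the weight of evidence whose signal magnitude sits
// inside the noise band, so marginal tool outputs cannot dominate a
// downstream agent's read. The band width (0.1) is an untuned example.

interface ScoredEvidence {
  tool: string;
  score: number;  // signed signal strength, assumed in [-1, 1]
  weight: number; // contribution assigned by the upstream agent
}

function attenuateNoise(
  evidence: ScoredEvidence[],
  band = 0.1
): ScoredEvidence[] {
  return evidence.map((e) =>
    Math.abs(e.score) < band ? { ...e, weight: 0 } : e
  );
}
```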
When the scenario lacks a clean prior — no analog in the historical search, no consensus across indicators — the pipeline may over-reach rather than standing aside.
Conditional add/exit rules are harder to calibrate: in evaluation, their conversion rate is harder to anchor than that of a single-step decision.
A 0.5B model fine-tuned for schema fluency may produce valid JSON in unfamiliar scenarios while the underlying reasoning is shallow. The framework's typed contracts protect downstream agents from malformed input but cannot detect this kind of confident-but-wrong output. Detection requires either a reward model (out of scope for v0.1) or human spot-checking.
Fin Nano and the BigBugAI Fin framework are in active development. The website shows the design; the artifacts are forthcoming.
| Component | Status | Target |
|---|---|---|
| Schema spec | in draft | this month |
| Fin Nano v0.1 | training pipeline ready | 2-3 weeks |
| BigBugAI Fin v0.1 | design locked | 6-8 weeks |
Targets are estimates. The site will be updated as components ship.