Inspiration

Most AI trading bots are black boxes. They output BUY or SELL and ask for blind trust. We wanted to build the opposite: a system that argues with itself out loud, publishes its reasoning as short videos, and studies recordings of its own thought process to improve over time.

The Arize track's framing, "ship agents that can self improve," matched exactly what we wanted to explore: an agent whose observability data is not just for debugging, but becomes a runtime input to its own evolution.

What it does

TradeCast is an autonomous, self-improving trading and media engine. It scans a watchlist through Alpaca, then launches a multi-agent debate. Bull, Bear, and Risk agents, powered by Gemini on Google ADK, analyze the same market snapshot and form independent convictions.

A Referee agent measures their disagreement through a metric called the "Signal Split" and delivers a verdict, which a deterministic Executor submits as a fixed $100 paper trade.
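The exact Signal Split formula is not spelled out here, so the following is only a minimal sketch of one way such a disagreement metric could work. It assumes each agent reports a conviction score between -1.0 (strong sell) and +1.0 (strong buy); the `Conviction` type, the score range, and the spread-based formula are all illustrative assumptions, not TradeCast's actual definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conviction:
    """One agent's stance on the shared market snapshot (hypothetical shape)."""
    agent: str
    score: float  # assumed range: -1.0 (strong sell) to +1.0 (strong buy)


def signal_split(convictions: list[Conviction]) -> float:
    """Return the spread between the most bullish and most bearish agent.

    The result is scaled to [0, 1]: 0 means full agreement, 1 means the
    agents sit at opposite extremes. This is an assumed formula.
    """
    if not convictions:
        raise ValueError("need at least one conviction")
    scores = [c.score for c in convictions]
    if any(not -1.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores must lie in [-1.0, 1.0]")
    return (max(scores) - min(scores)) / 2.0


debate = [
    Conviction("bull", 0.75),
    Conviction("bear", -0.25),
    Conviction("risk", 0.0),
]
split = signal_split(debate)
```

A spread-based measure like this ignores how the middle agents are distributed; a standard deviation would be an alternative if that matters.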

Every cycle is traced end-to-end inside Arize Phoenix through OpenInference.

The system then turns its strongest cycle of the day into a narrated YouTube Short, built from Playwright-rendered slides, a Google Cloud TTS voiceover, and FFmpeg-based video assembly, and publishes it automatically.

Finally, a Reflector agent queries the system's own Phoenix traces through the Phoenix MCP server, combines them with YouTube engagement analytics, and proposes versioned prompt rewrites (vN+1.json) as reviewable diffs.
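To illustrate what a reviewable diff between two config versions might look like, here is a small sketch using Python's standard `difflib`. The config fields (`version`, `referee_policy`), the policy wording, and the file names are hypothetical placeholders; the real config schema is not shown in this writeup.

```python
import difflib
import json
from typing import Any, Mapping


def config_diff(
    old: Mapping[str, Any],
    new: Mapping[str, Any],
    old_name: str,
    new_name: str,
) -> str:
    """Render two JSON-serializable configs as a unified diff string."""
    # Sorted keys and fixed indentation keep the diff stable across runs.
    old_lines = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=2, sort_keys=True).splitlines()
    diff = difflib.unified_diff(
        old_lines, new_lines, fromfile=old_name, tofile=new_name, lineterm=""
    )
    return "\n".join(diff)


# Hypothetical configs, for illustration only.
v1 = {"version": 1, "referee_policy": "Prefer HOLD when agents disagree strongly."}
v2 = {"version": 2, "referee_policy": "On strong disagreement, weigh the Risk agent first."}

diff_text = config_diff(v1, v2, "v1.json", "v2.json")
print(diff_text)
```

Serializing with sorted keys means a reviewer only sees lines whose values actually changed, which keeps the approve/reject decision focused on the policy text.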

In our demo, the system notices it repeatedly became indecisive during high-conflict Bitcoin setups. After rewriting its Referee policy, the exact same market snapshot that previously produced HOLD now produces SELL.

Same market, different mind.

Read the full project story and architecture breakdown on Medium: https://medium.com/@michiyamamoto/i-built-an-ai-trading-system-that-argues-with-itself-then-rewrites-its-own-brain-e06853b22419

How we built it

The system evolved in layers:

  • We started with the ADK debate foundation using Bull, Bear, Risk, and Referee agents
  • Then added live market scanning and Alpaca paper-trading execution with a hard-locked $100 notional limit
  • Next came observability through Phoenix traces, OpenInference-instrumented ADK runners, and a custom MCP integration for direct trace introspection (get-spans)
  • After that, we built a closed-loop Reflector that evaluates trace history and proposes self-improving config diffs
  • We then added a multimedia pipeline using Jinja2, Playwright, matplotlib, Google Cloud TTS, and FFmpeg
  • Finally, we integrated automated YouTube publishing through the YouTube Data API with OAuth token rotation

A FastAPI + HTMX developer console ties everything together with a Reflection Lab for reviewing diffs, a cycle runner, video previews, and searchable debate history.

The codebase passes mypy --strict, ruff, and 126 hermetic offline tests.

Challenges we ran into

The hardest problem was making self-improvement legible.

A Reflector that vaguely "tunes prompts" is not convincing. We needed a before-and-after that even a non-engineer could immediately understand.

That led us to build a dedicated demo-evolution path where only the Referee's decision policy changes between configs, allowing the diff itself to tell the story.

Reliability for screen recording became another major challenge. LLM nondeterminism works against demos, so we pinned the market snapshot, ran the staged Referee at temperature 0, and added guardrails that cleanly fall back to fixtures if any API call fails.
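The fixture fallback can be sketched as a small wrapper around any live call. This is an assumed shape, not the project's actual guardrail code: the `with_fixture_fallback` helper and the snapshot dictionary below are hypothetical.

```python
import logging
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def with_fixture_fallback(live_call: Callable[[], T], fixture: T) -> Tuple[T, str]:
    """Run a live API call; on any failure, return the pinned fixture instead.

    The second element reports which source was used, so the caller can
    label the result honestly in logs or on screen.
    """
    try:
        return live_call(), "live"
    except Exception as exc:  # broad on purpose: a demo must not crash
        logger.warning("live call failed (%s); falling back to fixture", exc)
        return fixture, "fixture"


def failing_market_call() -> dict:
    """Stand-in for a live market-data request that is currently down."""
    raise ConnectionError("market data API unavailable")


# Placeholder snapshot; the values carry no meaning.
PINNED_SNAPSHOT = {"symbol": "BTC/USD", "source": "fixture"}

snapshot, origin = with_fixture_fallback(failing_market_call, PINNED_SNAPSHOT)
```

Returning the origin alongside the value makes it harder to mistake a fixture-backed run for a live one.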

Other challenges included:

  • keeping OpenInference spans coherent across both LLM agents and deterministic tools
  • handling OAuth token rotation for unattended YouTube uploads
  • enforcing paper-trading safety at multiple layers so no prompt rewrite could ever change order sizing
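One way to keep order sizing out of reach of prompt rewrites is to make the size a code-level constant and re-check it just before submission. The sketch below assumes this layering; `PaperOrder`, `build_paper_order`, and `assert_safe` are hypothetical names, and no real broker call is made.

```python
from dataclasses import dataclass
from typing import Any, Final, Mapping, Optional

# Hard-coded in source, never read from a rewritable config.
PAPER_NOTIONAL_USD: Final[float] = 100.0
TRADEABLE_VERDICTS: Final[frozenset] = frozenset({"BUY", "SELL"})


@dataclass(frozen=True)
class PaperOrder:
    symbol: str
    side: str
    notional_usd: float


def build_paper_order(
    symbol: str, verdict: str, prompt_config: Mapping[str, Any]
) -> Optional[PaperOrder]:
    """Layer 1: turn a verdict into an order without consulting the config."""
    del prompt_config  # accepted for interface parity, ignored for sizing
    side = verdict.strip().upper()
    if side not in TRADEABLE_VERDICTS:
        return None  # HOLD or anything unrecognized places no trade
    return PaperOrder(symbol=symbol, side=side.lower(), notional_usd=PAPER_NOTIONAL_USD)


def assert_safe(order: PaperOrder) -> None:
    """Layer 2: final check immediately before an order would be submitted."""
    if order.notional_usd != PAPER_NOTIONAL_USD:
        raise ValueError(f"refusing order with notional {order.notional_usd}")


# A rewritten config that tries to change the size has no effect.
tampered_config = {"notional_usd": 5000, "referee_policy": "size up"}
order = build_paper_order("BTC/USD", "SELL", tampered_config)
if order is not None:
    assert_safe(order)
```

The second check looks redundant next to the first, but it also covers any order constructed elsewhere in the codebase.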

Accomplishments that we're proud of

The full closed loop works end-to-end:

market → debate → trade → video → publish → introspect → rewrite → measurably different behavior

The Reflector produces real unified diffs with human approve/reject controls instead of vague self-modification claims.

The Signal Split metric gives the system a vocabulary for its own uncertainty.

We are also proud that the engineering quality held up under hackathon pressure:

  • 126 hermetic offline tests
  • strict typing
  • deterministic safety rails
  • persistent on-screen disclaimers
  • hard-coded paper-trading enforcement that survives every self-rewrite

What we learned

Observability becomes much more powerful when the agent itself is the consumer.

Integrating Phoenix for our own debugging took an afternoon. Integrating it as a runtime tool the agent could query through MCP fundamentally changed the architecture of the system.

We also learned that disagreement between agents is signal, not noise. The most valuable behavioral insight the Reflector surfaced concerned how the system handles its own internal conflict: in high-conflict Bitcoin setups, it repeatedly fell back to indecision.

Finally, deterministic guardrails and LLM creativity turned out to be complements, not rivals. The more freedom we gave the Reflector over prompts, the more important hard-coded execution limits became.

What's next for TradeCast

Next, we want longer-horizon reflection loops where the Reflector evaluates weeks of traces and engagement data, with LLM-as-a-judge evaluations deciding which config versions get promoted.

We also want:

  • richer audience feedback loops such as comment sentiment and retention curves
  • additional asset classes
  • a public changelog where every self-rewrite is published alongside the generated videos

The goal is to make the system's learning history as transparent as its trades.

Built With

  • alpaca-api
  • arize-phoenix
  • chromium
  • fastapi
  • ffmpeg
  • gemini
  • google-adk
  • google-cloud-text-to-speech
  • htmx
  • jinja
  • matplotlib
  • mcp
  • mypy
  • oauth2
  • openinference
  • opentelemetry
  • phoenix-mcp
  • playwright
  • pytest
  • python
  • ruff
  • uvicorn
  • youtube-data-api