## Inspiration
Most operational risk tools live behind enterprise contracts, six-month onboarding cycles, and teams of consultants. We wanted to know: what if you could hand any business description to an AI and get a structured, evidence-backed risk audit in the time it takes to make a coffee?
We were also drawn to the tokenisation metaphor. A modern LLM thinks in tokens — discrete units of meaning. We asked: can you apply the same idea to a business? Break it into steps, each step into risk factors, and price each factor like a financial instrument — with a failure rate, an exposure range, and a confidence interval?
That became World Token Factory: a multimodal uncertainty engine that decomposes any business into its constituent risk tokens and lets AI agents evaluate them at whatever depth the situation demands.
## What it does
World Token Factory takes a plain-English business description and returns a live, interactive risk audit in three stages:
Decompose — An AI planner breaks the business into an ordered chain of operational steps (sourcing → production → distribution → …), then further decomposes each step into individual risk factors with initial probability estimates.
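The decomposition output can be pictured as a small data model. This is a minimal sketch; the field and class names here are illustrative, not the project's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class RiskFactor:
    name: str
    failure_rate: float   # initial probability estimate, 0..1
    loss_low: float       # loss exposure range (low end)
    loss_high: float      # loss exposure range (high end)

@dataclass
class Step:
    name: str                                 # e.g. "sourcing", "production"
    factors: list[RiskFactor] = field(default_factory=list)

# An ordered chain of operational steps, each holding its risk tokens.
chain = [
    Step("sourcing", [RiskFactor("supplier default", 0.12, 50_000, 250_000)]),
    Step("production", [RiskFactor("equipment failure", 0.08, 20_000, 90_000)]),
]
```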
Analyse — The user selects a depth tier and a model, and agents run against each risk factor:
| Tier | Model sweet-spot | Token budget | What it does |
|------|------------------|--------------|--------------|
| D1 — Quick Scan | Haiku | ~350 tok | Filename-level triage |
| D2 — Research Brief | Sonnet | ~3 k tok | Reads files, embeds media |
| D3 — Deep Run | Opus | ~200 k tok | Parallel sub-agents, full synthesis |
Report — Every risk factor produces a Failure Rate (FR) and a loss exposure range. The portfolio rolls up to a headline:
$$ \text{Total Exposure} = \sum_{i} \left[ L_i^{\text{low}},\ L_i^{\text{high}} \right] $$
$$ \text{FR}_{\text{step}} = \frac{1}{|F|} \sum_{f \in F} \text{FR}_f $$
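A worked example of the roll-up: the step failure rate is the mean of its factors' FRs, and total exposure is the endpoint-wise sum of the loss intervals. The numbers below are illustrative:

```python
factors = [
    {"fr": 0.28, "loss": (1_000_000, 4_000_000)},   # e.g. hurricane strike
    {"fr": 0.80, "loss": (500_000, 2_500_000)},     # e.g. gas-price collapse
]

# Step failure rate: simple mean over its factors.
fr_step = sum(f["fr"] for f in factors) / len(factors)

# Total exposure: sum the interval endpoints separately.
total_low = sum(f["loss"][0] for f in factors)
total_high = sum(f["loss"][1] for f in factors)

print(round(fr_step, 2))        # 0.54
print((total_low, total_high))  # (1500000, 6500000)
```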
Results stream live to the UI — maps zoom to the relevant geography, satellite imagery and video evidence appear as agents discover files, and an Executive Report renders a full narrative with sourced metrics.
## How we built it
Backend — Python + FastAPI. Each analysis tier is an async streaming
endpoint that emits NDJSON events (step, file_found, signal,
complete, token_update). At D3, risk factors run as parallel
sub-agents orchestrated by a central planner.
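The NDJSON event stream can be sketched as an async generator: each event is one JSON object per line, which is what a FastAPI `StreamingResponse` would send with `media_type="application/x-ndjson"`. The event names come from the writeup; the payload fields are assumptions:

```python
import asyncio
import json

async def analysis_events():
    # Illustrative subset of the event types the endpoints emit.
    yield {"event": "step", "name": "sourcing"}
    yield {"event": "file_found", "path": "artifacts/site.tif"}
    yield {"event": "token_update", "used": 1200}
    yield {"event": "complete"}

async def ndjson_stream() -> str:
    # Serialise each event as one newline-terminated JSON line.
    lines = []
    async for ev in analysis_events():
        lines.append(json.dumps(ev) + "\n")
    return "".join(lines)

body = asyncio.run(ndjson_stream())
print(body, end="")
```

The one-event-per-line framing is what lets the frontend parse and dispatch each event the moment it arrives, rather than buffering a whole response.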
Frontend — React + TypeScript + Vite. The UI is entirely event-driven: a single NDJSON stream drives map focus changes, live artifact cards, token counters, and risk metric updates simultaneously. Leaflet handles the geographic layer with coloured bounding-box overlays pinned to real coordinates for each scenario.
Multimodal evidence layer — As agents scan a business's artifact
directory, every discovered file (file_found event) is immediately
classified by extension and surfaced as a typed card: satellite images
expand in-place, YouTube videos embed with a single click, documents open
in a viewer, audio files play inline. The evidence arrives while the
analysis is running, not after.
Model selection — Users choose between Claude Haiku 4.5, Sonnet 4.6, and Opus 4.6 at runtime. The depth tier and model are orthogonal controls, so you can run a D2 brief on Opus or a D3 deep-run on Haiku if token cost matters.
Sponsor integrations — We integrated seven sponsor tools across the stack: Railtracks, Senso, Nexla, DigitalOcean, Unkey, Augment, and Google AI.
## Challenges we ran into
Calibrating failure rates to reality. Our initial risk estimates looked plausible but systematically understated real-world base rates. After researching real operational data — ERCOT curtailment records, PHMSA pipeline incident databases, NOAA hurricane strike frequencies — we found the model's priors were often 2–3× too optimistic. For example, the Waha Hub gas-price collapse risk (an active constraint in 2023–24) should carry an FR above 0.80, not 0.68. We spent significant time grounding every figure against empirical base rates.
Streaming + map state coherence. The live map needs to respond to both
user selection and agent events simultaneously, without stale-closure
bugs or race conditions. Getting the React state model right — accumulating
liveArtifacts from file_found events while also updating step risk
scores — required careful use of functional updaters throughout.
Multimodal artifact routing. A single file_found event might refer to
a .mp4, a GeoTIFF, a .url file pointing at YouTube, or a Parquet
dataset. Each needs a completely different rendering component. Building a
type-inference pipeline from extension alone, with correct fallbacks, was
fiddlier than expected.
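The core of that pipeline is an extension-to-renderer lookup with a generic fallback. This is a simplified sketch; the mapping and category names are illustrative, though the categories mirror the card types described above:

```python
from pathlib import Path

# Hypothetical extension → renderer mapping.
RENDERERS = {
    ".tif": "satellite_image", ".tiff": "satellite_image",
    ".png": "image", ".jpg": "image",
    ".mp4": "video",
    ".url": "youtube_embed",        # .url files pointing at YouTube
    ".pdf": "document", ".md": "document",
    ".mp3": "audio", ".wav": "audio",
    ".parquet": "dataset", ".csv": "dataset",
}

def classify(path: str) -> str:
    # Lower-case the suffix so "SITE.TIF" routes the same as "site.tif";
    # anything unrecognised falls back to a generic file card.
    return RENDERERS.get(Path(path).suffix.lower(), "generic_file")

print(classify("artifacts/waha_hub.TIF"))  # satellite_image
print(classify("clip.mp4"))                # video
print(classify("notes/unknown.xyz"))       # generic_file
```

The fallback branch is what keeps an unexpected extension from breaking the stream: every `file_found` event still yields a renderable card.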
## What we learned
Token budgets are a UX primitive. Giving users visible control over depth (and therefore cost) dramatically changes how they interact with an AI system. The three-tier model is simple but powerful.
Streaming is a trust signal. Seeing evidence cards appear as the agent finds them — rather than waiting for a final report — made the system feel more legible and trustworthy during demos.
Failure rates are not vibes. Assigning a number like 0.28 to "hurricane strike probability" only means something if it's traceable to an empirical anchor (e.g., 4 Category-3+ GOM strikes in 13 years ≈ 31%). We built a habit of asking: what data source would a reinsurance actuary cite here?
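The anchor arithmetic in that example is deliberately simple: an observed count over an observation window gives an annual frequency.

```python
# Empirical anchor from the text: 4 Category-3+ Gulf of Mexico strikes
# observed over 13 years → annual strike frequency.
strikes, years = 4, 13
annual_rate = strikes / years
print(round(annual_rate, 2))   # 0.31
```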
## What's next
- Live business ingestion — replace pre-baked scenarios with a real-time document ingestion pipeline (Nexla integration) so any company can upload their own artifacts.
- D3 full parallel execution — the deep-run tier is currently stubbed; wiring up true concurrent sub-agent orchestration is the next milestone.
- Comparative benchmarking — run the same scenario on Haiku vs Opus and surface the divergence, giving users a calibration signal for when the cheap model is good enough.
## Built With
- javascript
- pnpm
- python