## Inspiration
Most operational risk tools live behind enterprise contracts, six-month onboarding cycles, and teams of consultants. We wanted to know: what if you could hand any business description to an AI and get a structured, evidence-backed risk audit in the time it takes to make a coffee?
We were also drawn to the tokenisation metaphor. A modern LLM thinks in tokens — discrete units of meaning. We asked: can you apply the same idea to a business? Break it into steps, each step into risk factors, and price each factor like a financial instrument — with a failure rate, an exposure range, and a confidence interval?
That became World Token Factory: a multimodal uncertainty engine that decomposes any business into its constituent risk tokens and lets AI agents evaluate them at whatever depth the situation demands.
## What it does
World Token Factory takes a plain-English business description and returns a live, interactive risk audit in three stages:
Decompose — An AI planner breaks the business into an ordered chain of operational steps (sourcing → production → distribution → …), then further decomposes each step into individual risk factors with initial probability estimates.
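The decomposition output can be pictured as a small data model. This is a minimal sketch; the field and class names here are illustrative, not the project's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class RiskFactor:
    name: str
    failure_rate: float   # initial probability estimate, 0..1
    loss_low: float       # loss exposure range (low end)
    loss_high: float      # loss exposure range (high end)

@dataclass
class Step:
    name: str                                 # e.g. "sourcing", "production"
    factors: list[RiskFactor] = field(default_factory=list)

# An ordered chain of operational steps, each holding its risk tokens.
chain = [
    Step("sourcing", [RiskFactor("supplier default", 0.12, 50_000, 250_000)]),
    Step("production", [RiskFactor("equipment failure", 0.08, 20_000, 90_000)]),
]
```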
Analyse — The user selects a depth tier and a model, and agents run against each risk factor:
| Tier | Model sweet-spot | Token budget | What it does |
|------|------------------|--------------|--------------|
| D1 — Quick Scan | Haiku | ~350 tok | Filename-level triage |
| D2 — Research Brief | Sonnet | ~3 k tok | Reads files, embeds media |
| D3 — Deep Run | Opus | ~200 k tok | Parallel sub-agents, full synthesis |
Report — Every risk factor produces a Failure Rate (FR) and a loss exposure range. The portfolio rolls up to a headline:
$$ \text{Total Exposure} = \sum_{i} \left[ L_i^{\text{low}},\ L_i^{\text{high}} \right] $$
$$ \text{FR}_{\text{step}} = \frac{1}{|F|} \sum_{f \in F} \text{FR}_f $$
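A worked example of the roll-up: the step failure rate is the mean of its factors' FRs, and total exposure is the endpoint-wise sum of the loss intervals. The numbers below are illustrative:

```python
factors = [
    {"fr": 0.28, "loss": (1_000_000, 4_000_000)},   # e.g. hurricane strike
    {"fr": 0.80, "loss": (500_000, 2_500_000)},     # e.g. gas-price collapse
]

# Step failure rate: simple mean over its factors.
fr_step = sum(f["fr"] for f in factors) / len(factors)

# Total exposure: sum the interval endpoints separately.
total_low = sum(f["loss"][0] for f in factors)
total_high = sum(f["loss"][1] for f in factors)

print(round(fr_step, 2))        # 0.54
print((total_low, total_high))  # (1500000, 6500000)
```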
Results stream live to the UI — maps zoom to the relevant geography, satellite imagery and video evidence appear as agents discover files, and an Executive Report renders a full narrative with sourced metrics.
## How we built it
Backend — Python + FastAPI. Each analysis tier is an async streaming
endpoint that emits NDJSON events (step, file_found, signal,
complete, token_update). At D3, risk factors run as parallel
sub-agents orchestrated by a central planner.
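The NDJSON event stream can be sketched as an async generator: each event is one JSON object per line, which is what a FastAPI `StreamingResponse` would send with `media_type="application/x-ndjson"`. The event names come from the writeup; the payload fields are assumptions:

```python
import asyncio
import json

async def analysis_events():
    # Illustrative subset of the event types the endpoints emit.
    yield {"event": "step", "name": "sourcing"}
    yield {"event": "file_found", "path": "artifacts/site.tif"}
    yield {"event": "token_update", "used": 1200}
    yield {"event": "complete"}

async def ndjson_stream() -> str:
    # Serialise each event as one newline-terminated JSON line.
    lines = []
    async for ev in analysis_events():
        lines.append(json.dumps(ev) + "\n")
    return "".join(lines)

body = asyncio.run(ndjson_stream())
print(body, end="")
```

The one-event-per-line framing is what lets the frontend parse and dispatch each event the moment it arrives, rather than buffering a whole response.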
Frontend — React + TypeScript + Vite. The UI is entirely event-driven: a single NDJSON stream drives map focus changes, live artifact cards, token counters, and risk metric updates simultaneously. Leaflet handles the geographic layer with coloured bounding-box overlays pinned to real coordinates for each scenario.
Multimodal evidence layer — As agents scan a business's artifact
directory, every discovered file (file_found event) is immediately
classified by extension and surfaced as a typed card: satellite images
expand in-place, YouTube videos embed with a single click, documents open
in a viewer, audio files play inline. The evidence arrives while the
analysis is running, not after.
Model selection — Users choose between Claude Haiku 4.5, Sonnet 4.6, and Opus 4.6 at runtime. The depth tier and model are orthogonal controls, so you can run a D2 brief on Opus or a D3 deep-run on Haiku if token cost matters.
Sponsor integrations — We integrated seven sponsor tools across the stack: Railtracks, Senso, Nexla, DigitalOcean, Unkey, Augment, and Google AI.
## Challenges we ran into
Calibrating failure rates to reality. Our initial risk estimates looked plausible but systematically understated real-world base rates. After researching real operational data — ERCOT curtailment records, PHMSA pipeline incident databases, NOAA hurricane strike frequencies — we found the model's priors were often 2–3× too optimistic. For example, the Waha Hub gas-price collapse risk (an active constraint in 2023–24) should carry an FR above 0.80, not 0.68. We spent significant time grounding every figure against empirical base rates.
Streaming + map state coherence. The live map needs to respond to both
user selection and agent events simultaneously, without stale-closure
bugs or race conditions. Getting the React state model right — accumulating
liveArtifacts from file_found events while also updating step risk
scores — required careful use of functional updaters throughout.
Multimodal artifact routing. A single file_found event might refer to
a .mp4, a GeoTIFF, a .url file pointing at YouTube, or a Parquet
dataset. Each needs a completely different rendering component. Building a
type-inference pipeline from extension alone, with correct fallbacks, was
fiddlier than expected.
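The core of that pipeline is an extension-to-renderer lookup with a generic fallback. This is a simplified sketch; the mapping and category names are illustrative, though the categories mirror the card types described above:

```python
from pathlib import Path

# Hypothetical extension → renderer mapping.
RENDERERS = {
    ".tif": "satellite_image", ".tiff": "satellite_image",
    ".png": "image", ".jpg": "image",
    ".mp4": "video",
    ".url": "youtube_embed",        # .url files pointing at YouTube
    ".pdf": "document", ".md": "document",
    ".mp3": "audio", ".wav": "audio",
    ".parquet": "dataset", ".csv": "dataset",
}

def classify(path: str) -> str:
    # Lower-case the suffix so "SITE.TIF" routes the same as "site.tif";
    # anything unrecognised falls back to a generic file card.
    return RENDERERS.get(Path(path).suffix.lower(), "generic_file")

print(classify("artifacts/waha_hub.TIF"))  # satellite_image
print(classify("clip.mp4"))                # video
print(classify("notes/unknown.xyz"))       # generic_file
```

The fallback branch is what keeps an unexpected extension from breaking the stream: every `file_found` event still yields a renderable card.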
## What we learned
Token budgets are a UX primitive. Giving users visible control over depth (and therefore cost) dramatically changes how they interact with an AI system. The three-tier model is simple but powerful.
Streaming is a trust signal. Seeing evidence cards appear as the agent finds them — rather than waiting for a final report — made the system feel more legible and trustworthy during demos.
Failure rates are not vibes. Assigning a number like 0.28 to "hurricane strike probability" only means something if it's traceable to an empirical anchor (e.g., 4 Category-3+ GOM strikes in 13 years ≈ 31%). We built a habit of asking: what data source would a reinsurance actuary cite here?
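The anchor arithmetic in that example is deliberately simple: an observed count over an observation window gives an annual frequency.

```python
# Empirical anchor from the text: 4 Category-3+ Gulf of Mexico strikes
# observed over 13 years → annual strike frequency.
strikes, years = 4, 13
annual_rate = strikes / years
print(round(annual_rate, 2))   # 0.31
```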
## What's next
- Live business ingestion — replace pre-baked scenarios with a real-time document ingestion pipeline (Nexla integration) so any company can upload their own artifacts.
- D3 full parallel execution — the deep-run tier is currently stubbed; wiring up true concurrent sub-agent orchestration is the next milestone.
- Comparative benchmarking — run the same scenario on Haiku vs Opus and surface the divergence, giving users a calibration signal for when the cheap model is good enough.
## Built With
- javascript
- pnpm
- python