
Quorum — the AML triage brain that knows when not to guess

Quorum (Formerly RingFence) — Calibrated Anti-Money-Laundering Triage

🙋 First, the honest part

We were handed a dataset and a dare: don't just run ETL and flag the outliers, put real agents on it and make them reason. So we did exactly that. We didn't build a "fraud score." We built five agents that argue a case from raw transactions to a regulator-ready filing, share one memory, and, on the one account where the evidence genuinely doesn't decide, refuse to guess and hand it to a human. Everything below is the polished version, but that was the whole idea: go past anomaly detection, into judgment.

https://github.com/Alred-79/nycTechWeekHack

⚡ TL;DR

Quorum ingests a community bank's raw transaction file and runs a five-agent Find → Rank → Act → Ground → Explain pipeline over Cognee. In under a second it turns ~294 accounts / 5,000 transactions into a triage queue of 14: 9 escalate · 1 routed to human review · 4 cleared — with ring exposure reconciled to the cent at \$161,750.90 and a one-click, regulator-ready SAR memo.

What makes it different from every threshold tool:

🧠 A real probabilistic brain (PyMC). Agent 2 is an unsupervised two-component Bayesian mixture sampled with NUTS — it returns a full posterior with credible intervals, which is what lets the system express honest uncertainty instead of a fake-precise score.
🤝 Provable collaboration (Cognee). Each agent accretes fields onto one shared Case node; the next agent raises an error if the upstream fields are missing. The handoff is a hard data dependency, not a narrative.
🌐 Grounded in the real market (Geodo). A Domain Expert agent queries the live geodo.ai MCP through a safety gate to source the cost matrix and match the ring to a real, dated enforcement action — the "why this matters."
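🙅 Calibrated abstention. When the credible interval straddles the decision threshold τ and the expected value of perfect information exceeds the cost of a review, Quorum routes the case to a human instead of guessing, because the math says the review pays for itself.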
🪤 Decoy resistance by construction. The planted "shared-device" trap that fools threshold tools cannot trigger a flag: the model gives that signal the same weak prior in both classes, so it carries no discriminating weight.

🎯 The Product Brief we built against (Step 0)

User: an AML analyst at a community bank with ~3 minutes per case. Job: find the coordinated ring hiding below every alert threshold, without drowning in false positives. Our own success bar: (1) recover all 9 ring accounts, (2) clear the planted decoy, (3) abstain on the one genuinely ambiguous account, (4) reconcile the dollars to the cent, (5) every decision carries a human-readable reason — no bare scores.

We hit all five, and a test suite enforces them on every commit, so "matches the brief" is verifiable rather than asserted.

🕵️ What it does

Crestline Community Bank hands you a 90-day file: ~5,000 transactions across ~294 accounts. Hidden inside is a layering ring engineered to never cross a monitoring threshold — your rules engine caught none of it.

The haystack — 5,000 transactions, the ring hidden in the noise

The haystack. 5,000 transactions · 294 accounts. The dim cloud is normal banking; the red chains are the laundering ring. Click any node and the case opens inline — fully interactive, no server, no internet.

Drop the CSV into Quorum. Five agents run over one shared Cognee memory and collapse the noise to a focused queue. Three screens tell the story:

The product — KPIs and a fully-reasoned case dossier

~294 accounts → a queue of 14 in under a second. Every verdict carries its posterior, its expected-loss ledger, its signed signal contributions, and a decision log. No bare score anywhere.

The Queue — 14 surfaced accounts, each with its mule probability and uncertainty band, color-coded ESCALATE / REVIEW / CLEAR. Everything else auto-cleared. 9 escalate · 1 review · 4 cleared · \$161,750.90 reconciled.
Case detail — open a relay: a posterior at p = 0.995 with a tight interval far above τ = 0.05, the signals that fired, the expected-loss arithmetic behind the ESCALATE, and the source→relay→sink chains it sits in. No bare score anywhere.
- The decoy beat: an account shares a device with three others — the obvious flag. Quorum holds it at p ≈ 0 and CLEARS it. It didn't fall for the trap.
- The abstention beat: one account is fresh-cohort like the ring but has no transfers — genuinely ambiguous. Its interval straddles τ, so Quorum computes that a review pays for itself and routes it to REVIEW instead of guessing.

The needle — AC-0005 escalated at 99.5%

The needle. Click into the ring: the source pumps \$53,896 across 83 transfers into two relays at p = 99.5% — role, posterior, dollars, and signals fired, all inline.

Pipeline / Graph view — the same Case object after each agent touches it; you literally watch fields accrete. Then ask the Cognee knowledge graph in plain English — "which accounts are relays in the ring?" — and it answers. One more click downloads the SAR memo, reconciled edge-by-edge to \$161,750.90, with the learned closing rule appended.

🏆 The sponsors — and exactly how each one is load-bearing

Every sponsor below is in the critical path. Remove any one and a headline capability disappears — none of this is decorative.

🧠 PyMC Labs — the inference engine that decides

The masked two-component Bayesian mixture, marginalized for NUTS

Agent 2 (the Estimator, agents/ranker.py) is a real generative model: an unsupervised two-component Bayesian mixture (latent classes legit vs mule), each with its own vector of signal fire-rates φ. There are no labels in the data, so we don't train on "known mules" — we let PyMC's NUTS sampler (nutpie, 4 chains) infer the rates, the mixing weight, and every account's posterior probability of being a mule.

Three pieces of genuine Bayesian engineering — each produces a demo moment:

logsumexp marginalizes the hidden class analytically. NUTS can't sample discrete latents, so the class is integrated out in log-space and fed to pm.Potential as an exact marginal log-likelihood — the textbook-correct way to fit a mixture under HMC. We report R-hat (1.004) and 0 divergences on every run.
A masked likelihood — missing ≠ innocent. A pure sink physically can't fire automation or fresh_cohort; an applicability mask M treats those as missing data, not evidence of innocence. This is why sinks stay confident and why the lone ambiguous account comes back honestly uncertain with a wide, τ-straddling interval.
A skeptical decoy prior. device_shared gets the same weak prior in both classes, making identity co-occurrence non-discriminating by construction. The decoy cannot be driven to a flag by the math.
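The three pieces above can be illustrated without a sampler. The following is a minimal sketch in plain Python, not the code in agents/ranker.py: it fixes the fire-rates and the mixing weight at made-up values (in Quorum they are inferred by NUTS), and the signal names beyond those mentioned in this writeup, the numbers, and the helper names are all assumptions. It shows the class being marginalized with logsumexp, the mask dropping structurally missing signals from both classes, and a signal whose rate is identical in both classes leaving the posterior unchanged.

```python
import math

SIGNALS = ("automation", "fresh_cohort", "structuring", "device_shared")

# Hypothetical fire-rates per class. In the real model these are sampled, not fixed.
# device_shared is given the SAME rate in both classes (assumption for illustration).
PHI_LEGIT = {"automation": 0.05, "fresh_cohort": 0.10, "structuring": 0.05, "device_shared": 0.20}
PHI_MULE = {"automation": 0.90, "fresh_cohort": 0.85, "structuring": 0.80, "device_shared": 0.20}
PI_MULE = 0.05  # hypothetical mixing weight


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def class_loglik(fired, mask, phi):
    """Bernoulli log-likelihood over applicable signals only."""
    total = 0.0
    for name in SIGNALS:
        if not mask[name]:
            continue  # structurally missing: contributes nothing to either class
        rate = phi[name]
        total += math.log(rate if fired[name] else 1.0 - rate)
    return total


def marginal_and_posterior(fired, mask):
    """Return (marginal log-likelihood, P(mule | signals)) for one account."""
    log_joint_legit = math.log(1.0 - PI_MULE) + class_loglik(fired, mask, PHI_LEGIT)
    log_joint_mule = math.log(PI_MULE) + class_loglik(fired, mask, PHI_MULE)
    # The discrete class is summed out here; this is the per-account term that a
    # pm.Potential would accumulate in a marginalized mixture.
    log_marginal = logsumexp([log_joint_legit, log_joint_mule])
    return log_marginal, math.exp(log_joint_mule - log_marginal)


ALL_APPLICABLE = {name: True for name in SIGNALS}
SINK_MASK = {"automation": False, "fresh_cohort": False, "structuring": True, "device_shared": True}

quiet = {name: False for name in SIGNALS}
decoy = dict(quiet, device_shared=True)   # only the shared-device signal fires
sink = dict(quiet, structuring=True)      # a sink that cannot fire two of the signals

_, p_quiet = marginal_and_posterior(quiet, ALL_APPLICABLE)
_, p_decoy = marginal_and_posterior(decoy, ALL_APPLICABLE)
_, p_sink_masked = marginal_and_posterior(sink, SINK_MASK)
_, p_sink_unmasked = marginal_and_posterior(sink, ALL_APPLICABLE)
```

With these made-up rates, `p_decoy` equals `p_quiet` up to rounding, because the shared-device term cancels in the class ratio, and `p_sink_masked` is higher than `p_sink_unmasked`, because silence on an inapplicable signal is no longer counted as evidence of innocence.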

Three learned fingerprints — automation, burst cohort, structuring

Every signal is learned from the data, not hardcoded — and the Detector's three independent fingerprints all isolate the same ring.

Abstention and decoy-resistance aren't if statements — they fall out of the posterior. Remove PyMC and Quorum loses its defining behavior. (This is our submission for the PyMC Special Prize — full writeup in PYMC_JUDGES.md.)

🕸️ Cognee — the memory that makes the collaboration real

Cognee isn't a logging sink we bolted on; it's the substrate the whole pipeline runs over (cognee_client.py), in two layers:

Operational handoff store. Every account is one Case::AC-#### node. As each agent runs, it accretes new fields onto that same node, and every read/write is logged with its entity id. The dependency is enforced: the Estimator raises if the Detector's signals/dist_stats are absent; the Adjudicator raises if the Estimator's posterior is absent; the Domain Expert raises if no Case has an action. Agent N+1 demonstrably uses Agent N's output — or it refuses to run. That's the rubric's "real collaboration" criterion, made literal.
Semantic knowledge graph. At the end of a run each agent contributes a tagged layer (add(node_set=["quorum", "agent:detector", …])), and a single cognify() builds a knowledge graph carrying full multi-agent provenance — not just the final report, but who-found-what at every stage. It's then queryable in natural language and persisted to cognee_graph.json so the UI renders it instantly. A regulator can trace any conclusion back through the exact agent layers that produced it.

It degrades gracefully to a fast local store when no key is present, so the pipeline is never blocked.
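The enforced handoff can be sketched with an in-memory stand-in for the shared store. This is an illustrative assumption, not the API of cognee_client.py: the class name, method names, and exception type are hypothetical, and the Estimator body writes a placeholder number rather than a real posterior. The point is only the shape of the dependency: a downstream agent asks for named upstream fields and fails loudly if they are absent.

```python
class MissingUpstreamField(RuntimeError):
    """Raised when an agent runs before its upstream fields exist."""


class CaseStore:
    """Hypothetical in-memory stand-in for the shared Case nodes."""

    def __init__(self):
        self._cases = {}
        self.log = []  # (agent, operation, case_id, field names)

    def write(self, agent, case_id, **fields):
        # Fields accrete onto the same node; earlier fields are kept.
        self._cases.setdefault(case_id, {}).update(fields)
        self.log.append((agent, "write", case_id, sorted(fields)))

    def require(self, agent, case_id, *names):
        case = self._cases.get(case_id, {})
        self.log.append((agent, "read", case_id, sorted(names)))
        missing = [name for name in names if name not in case]
        if missing:
            raise MissingUpstreamField(f"{agent} cannot run on {case_id}: missing {missing}")
        return {name: case[name] for name in names}

    def read(self, case_id):
        return dict(self._cases.get(case_id, {}))


def run_detector(store, case_id, signals, dist_stats):
    store.write("detector", case_id, signals=signals, dist_stats=dist_stats)


def run_estimator(store, case_id):
    upstream = store.require("estimator", case_id, "signals", "dist_stats")
    # Placeholder only: the real Estimator fits the PyMC mixture.
    fired_fraction = sum(upstream["signals"].values()) / len(upstream["signals"])
    store.write("estimator", case_id, p_mule=fired_fraction)


store = CaseStore()
try:
    run_estimator(store, "Case::AC-0005")  # Detector has not run yet
    estimator_ran_early = True
except MissingUpstreamField:
    estimator_ran_early = False

run_detector(store, "Case::AC-0005", {"automation": True, "structuring": True}, {"n": 294})
run_estimator(store, "Case::AC-0005")
```

After the second call the same node carries `signals`, `dist_stats`, and `p_mule`, and `store.log` records which agent read or wrote which fields.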

🌐 Geodo — grounding the verdict in the real market

geodo.ai Digital Twin — grounding and MCP tool calls

Agent 4 (the Domain Expert, agents/domain_expert.py) reads the adjudicated ring's decisive_signals from Cognee and queries the live geodo.ai MCP through a hard safety gate (geo_client.py — read-only/allowlisted tools only, real outreach permanently denied, responses cached for deterministic judged runs). From Geodo's GTM Researcher it pulls the segment, the buyer personas, and the analyst labor-cost rate that sources the cost matrix (C_FP / C_REV) — so τ isn't a magic number, it's grounded in a real cost. It then matches the ring to a real, dated SAR-failure enforcement action against a peer institution: the "why now." All of it is written into Cognee as one MarketContext entity that the Reporter consumes — a second provable Geodo→Reporter handoff.
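A safety gate of the kind described here (allowlisted read-only tools, everything else denied, responses cached so repeated runs are deterministic) might look like the sketch below. It is not the contents of geo_client.py: the tool names, the transport function, and the class are hypothetical, and the real MCP client is replaced by a local stub.

```python
import json


class ToolDenied(PermissionError):
    """Raised when a tool is outside the read-only allowlist."""


class SafetyGate:
    """Hypothetical wrapper: allowlist check first, then a response cache."""

    def __init__(self, transport, allowlist):
        self._transport = transport          # callable(tool, params) -> response
        self._allowlist = frozenset(allowlist)
        self._cache = {}

    def call(self, tool, **params):
        if tool not in self._allowlist:
            raise ToolDenied(f"tool {tool!r} is not allowlisted")
        # Stable cache key so identical requests replay the same response.
        key = json.dumps([tool, params], sort_keys=True)
        if key not in self._cache:
            self._cache[key] = self._transport(tool, params)
        return self._cache[key]


transport_calls = []


def stub_transport(tool, params):
    """Stand-in for the live MCP call; records that it was reached."""
    transport_calls.append(tool)
    return {"tool": tool, "params": params}


gate = SafetyGate(stub_transport, allowlist={"research_segment"})  # hypothetical tool name
first = gate.call("research_segment", segment="community banks")
second = gate.call("research_segment", segment="community banks")

try:
    gate.call("send_outreach", to="someone@example.com")  # hypothetical write tool
    outreach_allowed = True
except ToolDenied:
    outreach_allowed = False
```

The second identical request is served from the cache, so the transport is reached once, and the outreach call is rejected before it reaches the transport at all.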

Because those dollars decide real escalation tradeoffs, we didn't leave them to the model alone. Our teammate Luca, a data engineer at Deloitte who works on financial-crime / BSA data in consulting, pressure-tested the cost matrix and SAR-penalty assumptions against how a real BSA team actually spends per case, so the figures Geodo grounds the business case in reflect practice, not guesswork.

🎥 Trupeer — the demo, told in three minutes

We recorded the full walkthrough with Trupeer — CSV in, queue out, the decoy cleared, the abstention routed, the SAR memo downloaded, and the Cognee graph answering in plain English — so a judge sees the product operate cold without us narrating over their shoulder.

📊 Kaggle — the proving ground

The pipeline runs on the Kaggle-sourced Track 02 dataset (Crestline Community Bank: ~5,000 transactions, ~294 accounts). We keep the ground-truth oracle strictly separated from the detection code, and our tests prove the agents rediscover the answers from the data — not from the key.

🏗️ Architecture & the exact handoffs

A real Find → Rank → Act → Ground → Explain pipeline — five specialists, not one LLM in a loop. Data flows only through Cognee.

Crestline CSV → DuckDB (in-process) → COGNEE (shared Case nodes)
        │
        ▼
1. DETECTOR ─signals, dist_stats─▶ 2. ESTIMATOR ─p_mule, credible_interval─▶
3. ADJUDICATOR ─action, EVPI, decisive_signals─▶ 4. DOMAIN EXPERT ─MarketContext─▶
5. REPORTER ─SAR memo, closing_rule

Agent	Reads from Cognee	Computes	Writes back (the handoff)
1 · Detector	raw txns (DuckDB)	isolates the AC→AC transfer graph structurally; derives source/relay/sink roles; fires 7 signals; learns the fresh-cohort cutoff from the data (no magic 30 days)	`signals`, `dist_stats`
2 · Estimator	`signals`, `dist_stats`	PyMC mixture, NUTS	`p_mule`, `credible_interval`, `signal_contributions`
3 · Adjudicator	posterior + interval	Bayesian decision theory: τ = C_FP/(C_FP+C_FN); expected-loss argmin; abstains when the interval straddles τ and EVPI > review cost	`action`, `EVPI`, `decisive_signals`
4 · Domain Expert	adjudicated ring	queries geodo.ai MCP + FinCEN/OCC registry; sources the cost basis and the "why now"	`MarketContext`
5 · Reporter	the fully enriched Case set	edge-by-edge dollar reconciliation; FinCEN SAR memo; compiles the closing rule	`typology`, `closing_rule`, `memo_ref`
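The Reporter's edge-by-edge reconciliation can be sketched as a sum over transfer edges in exact decimal arithmetic, compared against the reported exposure. This is an assumption about the approach, not the Reporter's code: the function name, the edge tuples, and the amounts are made up for illustration, and `Decimal` is used here simply because binary floats cannot represent most cent values exactly.

```python
from decimal import Decimal


def reconcile(edges, reported_total):
    """Sum (src, dst, amount) edges exactly and compare with a reported total.

    Amounts are passed as strings so no float rounding enters the ledger.
    Returns (per-edge subtotals, grand total, difference from reported total).
    """
    by_edge = {}
    for src, dst, amount in edges:
        by_edge[(src, dst)] = by_edge.get((src, dst), Decimal("0")) + Decimal(amount)
    total = sum(by_edge.values(), Decimal("0"))
    return by_edge, total, total - Decimal(reported_total)


# Made-up edges for illustration only.
example_edges = [
    ("AC-0001", "AC-0005", "1200.10"),
    ("AC-0001", "AC-0005", "799.95"),
    ("AC-0005", "AC-0009", "1950.05"),
]

subtotals, total, difference = reconcile(example_edges, "3950.10")
```

A difference of exactly zero is what "reconciled to the cent" means in this sketch; any nonzero value would point at a missing or double-counted edge.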

Decisions are deterministic (fixed seed + argmin), so the system is reproducible and never says "the model said so."
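The Adjudicator's rule from the table can be written out as a short sketch. The threshold formula and the abstention condition come from the writeup; everything else is an assumption: the cost values are invented (chosen so that τ works out to 0.05), correct decisions are assumed to cost nothing, and EVPI is taken as the expected loss of the best action, which is what perfect information would save under that assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Costs:
    c_fp: float   # cost of escalating a legitimate account
    c_fn: float   # cost of clearing a mule
    c_rev: float  # cost of one human review


def threshold(costs):
    """tau = C_FP / (C_FP + C_FN): escalate is the cheaper action when p > tau."""
    return costs.c_fp / (costs.c_fp + costs.c_fn)


def adjudicate(p_mule, interval, costs):
    low, high = interval
    tau = threshold(costs)
    expected_loss = {
        "ESCALATE": (1.0 - p_mule) * costs.c_fp,
        "CLEAR": p_mule * costs.c_fn,
    }
    best = min(expected_loss, key=expected_loss.get)  # argmin, ties go to ESCALATE
    # Assumption: with perfect information the loss is zero, so EVPI is the
    # expected loss of the best action under current uncertainty.
    evpi = expected_loss[best]
    straddles = low < tau < high
    action = "REVIEW" if straddles and evpi > costs.c_rev else best
    return {"action": action, "tau": tau, "EVPI": evpi, "expected_loss": expected_loss}


COSTS = Costs(c_fp=50.0, c_fn=950.0, c_rev=10.0)  # hypothetical cost matrix

relay = adjudicate(0.995, (0.98, 0.999), COSTS)
decoy = adjudicate(0.0002, (0.0, 0.001), COSTS)
ambiguous = adjudicate(0.04, (0.005, 0.30), COSTS)
```

With these invented inputs the relay is escalated, the decoy is cleared, and the ambiguous account is routed to review, because its interval straddles τ and the expected loss of guessing exceeds the cost of a review.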

✅ How we map to the 25-point rubric

Criterion (5 pts)	How Quorum nails it
Functional agents on real data	Runs on the real Crestline CSV via DuckDB; a ground-truth oracle is kept separate and tests prove the pipeline rediscovers the answers from data.
Collaboration via Cognee	Field-accretion on shared `Case` nodes; downstream agents raise if upstream fields are missing. Provable, not narrated.
Matches the brief	All five Step-0 success conditions met and enforced by a green test suite.
Usable by non-technical operators	Three-screen Streamlit product + one-click downloadable SAR memo + English-language graph search. Drop a CSV, get a queue.
Transparent reasoning	Posterior + credible interval + expected-loss arithmetic + decisive signals on every case; the memo phrases (never invents) the computed facts.

🛠️ How we built it

Python 3.14, uv-managed. Pipeline: uv run main.py data/track02_fraud_watch.csv. Product: uv run streamlit run ui/app.py.
DuckDB — in-process analytical SQL over the transactions; zero server.
PyMC + nutpie + ArviZ — the Bayesian mixture, NUTS sampling, convergence diagnostics.
Cognee — shared memory + the cognified knowledge graph and NL search.
Geodo (geodo.ai MCP) — live GTM/market grounding behind a read-only safety gate.
Streamlit — the three-screen analyst product.
Bring-your-own LLM key (Anthropic / OpenAI / Groq) — every external layer degrades gracefully without one; keys never touch the repo.

🚧 Challenges we solved

Fitting a mixture under NUTS: discrete latent classes break HMC, so we marginalized the class with logsumexp and passed the result to pm.Potential, which made it sample cleanly. We also removed a hard ordering constraint that had injected ~280 divergences; informative priors pin the labels without it, giving 0 divergences and R-hat 1.004.
Structurally-missing signals: the masked likelihood was the unlock for both confident sinks and honest abstention.
Keeping agents honest: separating the ground-truth oracle from the detection code so the pipeline earns its answers and the tests can prove it.
One async event loop for Cognee: buffering each agent's layer and flushing in a single loop with one cognify(), so async connections never bind to dead loops.
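The single-loop flush in the last item can be sketched with asyncio and stub coroutines. This is a hedged illustration, not the project's client: `LayerBuffer`, `stage`, and `flush` are hypothetical names, and `fake_add` / `fake_cognify` stand in for the real Cognee calls so the sketch runs without the library or a key. Agents stage their layers synchronously, and one `asyncio.run` drains everything inside a single event loop.

```python
import asyncio


class LayerBuffer:
    """Hypothetical buffer: agents stage layers synchronously, one loop flushes them."""

    def __init__(self):
        self._pending = []

    def stage(self, agent, text):
        # No event loop is touched here, so agents can stay plain synchronous code.
        self._pending.append((agent, text))

    def flush(self, add_fn, cognify_fn):
        async def _drain():
            for agent, text in self._pending:
                await add_fn(text, node_set=["quorum", f"agent:{agent}"])
            await cognify_fn()  # one graph build after all layers are added
            self._pending.clear()

        # A single asyncio.run means every connection is created and used
        # inside the same loop, then closed with it.
        asyncio.run(_drain())


calls = []


async def fake_add(text, node_set):
    """Stub standing in for the store's add call."""
    calls.append(("add", node_set[-1]))


async def fake_cognify():
    """Stub standing in for the graph-building call."""
    calls.append(("cognify", None))


buffer = LayerBuffer()
buffer.stage("detector", "signals and dist_stats for the run")
buffer.stage("estimator", "posteriors and credible intervals")
buffer.flush(fake_add, fake_cognify)
```

The alternative, calling `asyncio.run` once per agent, creates and closes a loop each time, which is the situation where a connection opened in one loop ends up being reused after that loop is gone.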

🔮 What's next

Streaming ingestion for live monitoring · the closed analyst-feedback learning loop (resolved dispositions become labels that update the priors and cost matrix online) · graph-native typology detection that surfaces emerging rings before they complete · multi-institution federated typology sharing via Cognee.

Every other tool ranks risk — and most just flag the decoy. Quorum surfaces the nine, refuses to guess on the tenth, ignores the trap, and shows you the math behind every call.