Quorum — the AML triage brain that knows when not to guess

🙋 First, the honest part
We were handed a dataset and a dare: don't just run ETL and flag the outliers, put real agents on it and make them reason. So we did exactly that. We didn't build a "fraud score." We built five agents that argue a case from raw transactions to a regulator-ready filing, share one memory, and, on the one account where the evidence genuinely doesn't decide, refuse to guess and hand it to a human. Everything below is the polished version, but that was the whole idea: go past anomaly detection, into judgment.
https://github.com/Alred-79/nycTechWeekHack
⚡ TL;DR
Quorum ingests a community bank's raw transaction file and runs a five-agent Find → Rank → Act → Ground → Explain pipeline over Cognee. In under a second it turns ~294 accounts / 5,000 transactions into a triage queue of 14: 9 escalate · 1 routed to human review · 4 cleared — with ring exposure reconciled to the cent at \$161,750.90 and a one-click, regulator-ready SAR memo.
What makes it different from every threshold tool:
- 🧠 A real probabilistic brain (PyMC). Agent 2 is an unsupervised two-component Bayesian mixture sampled with NUTS — it returns a full posterior with credible intervals, which is what lets the system express honest uncertainty instead of a fake-precise score.
- 🤝 Provable collaboration (Cognee). Each agent accretes fields onto one shared `Case` node; the next agent raises an error if the upstream fields are missing. The handoff is a hard data dependency, not a narrative.
- 🌐 Grounded in the real market (Geodo). A Domain Expert agent queries the live geodo.ai MCP through a safety gate to source the cost matrix and match the ring to a real, dated enforcement action: the "why this matters."
- 🙅 Calibrated abstention. When the credible interval straddles the decision threshold, Quorum routes to a human because the math says the review pays for itself.
- 🪤 Decoy resistance by construction. The planted "shared-device" trap that fools threshold tools is mathematically prevented from flagging.
🎯 The Product Brief we built against (Step 0)
User: an AML analyst at a community bank with ~3 minutes per case. Job: find the coordinated ring hiding below every alert threshold, without drowning in false positives. Our own success bar: (1) recover all 9 ring accounts, (2) clear the planted decoy, (3) abstain on the one genuinely ambiguous account, (4) reconcile the dollars to the cent, (5) every decision carries a human-readable reason — no bare scores.
We hit all five — and we wrote a test suite that enforces them on every commit. "Matches the brief," verifiable.
🕵️ What it does
Crestline Community Bank hands you a 90-day file: ~5,000 transactions across ~294 accounts. Hidden inside is a layering ring engineered to never cross a monitoring threshold — your rules engine caught none of it.

The haystack. 5,000 transactions · 294 accounts. The dim cloud is normal banking; the red chains are the laundering ring. Click any node and the case opens inline — fully interactive, no server, no internet.
Drop the CSV into Quorum. Five agents run over one shared Cognee memory and collapse the noise to a focused queue. Three screens tell the story:

~294 accounts → a queue of 14 in under a second. Every verdict carries its posterior, its expected-loss ledger, its signed signal contributions, and a decision log. No bare score anywhere.
- The Queue — 14 surfaced accounts, each with its mule probability and uncertainty band, color-coded ESCALATE / REVIEW / CLEAR. Everything else auto-cleared. 9 escalate · 1 review · 4 cleared · \$161,750.90 reconciled.
- Case detail — open a relay: a posterior at p = 0.995 with a tight interval far above τ = 0.05, the signals that fired, the expected-loss arithmetic behind the ESCALATE, and the source→relay→sink chains it sits in. No bare score anywhere.
- The decoy beat: an account shares a device with three others — the obvious flag. Quorum holds it at p ≈ 0 and CLEARS it. It didn't fall for the trap.
- The abstention beat: one account is fresh-cohort like the ring but has no transfers — genuinely ambiguous. Its interval straddles τ, so Quorum computes that a review pays for itself and routes it to REVIEW instead of guessing.

The needle. Click into the ring: the source pumps \$53,896 across 83 transfers into two relays at p = 99.5% — role, posterior, dollars, and signals fired, all inline.
- Pipeline / Graph view — the same Case object after each agent touches it; you literally watch fields accrete. Then ask the Cognee knowledge graph in plain English — "which accounts are relays in the ring?" — and it answers. One more click downloads the SAR memo, reconciled edge-by-edge to \$161,750.90, with the learned closing rule appended.
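
Reconciling "to the cent" is easy to get wrong with binary floats, so here is a minimal sketch of an edge-by-edge check using exact decimal arithmetic. It assumes edges arrive as (source, destination, amount-string) tuples; the function name, account ids, and amounts are made up for illustration and are not Crestline data or Quorum's actual Reporter code.

```python
from decimal import Decimal

CENT = Decimal("0.01")

def reconcile(edges, reported_total):
    """Sum transfer edges exactly and compare with a reported total.

    edges: iterable of (src, dst, amount) where amount is a decimal string.
    Returns (total, matches) with total rounded to whole cents.
    """
    total = sum((Decimal(amount) for _src, _dst, amount in edges), Decimal("0"))
    total = total.quantize(CENT)
    return total, total == Decimal(reported_total).quantize(CENT)

# Hypothetical source -> relay -> sink chain, illustrative amounts only.
toy_edges = [
    ("AC-0001", "AC-0002", "1200.10"),
    ("AC-0002", "AC-0003", "800.20"),
    ("AC-0002", "AC-0004", "349.95"),
]
total, ok = reconcile(toy_edges, "2350.25")
```

Keeping amounts as strings until they become `Decimal` avoids the rounding drift that accumulates when thousands of float amounts are added.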
🏆 The sponsors — and exactly how each one is load-bearing
Every sponsor below is in the critical path. Remove any one and a headline capability disappears — none of this is decorative.
🧠 PyMC Labs — the inference engine that decides

Agent 2 (the Estimator, agents/ranker.py) is a real generative model: an unsupervised two-component Bayesian mixture (latent classes legit vs mule), each with its own vector of signal fire-rates φ. There are no labels in the data, so we don't train on "known mules" — we let PyMC's NUTS sampler (nutpie, 4 chains) infer the rates, the mixing weight, and every account's posterior probability of being a mule.
Three pieces of genuine Bayesian engineering — each produces a demo moment:
- `logsumexp` marginalizes the hidden class analytically. NUTS can't sample discrete latents, so the class is integrated out in log-space and fed to `pm.Potential` as an exact marginal log-likelihood, the textbook-correct way to fit a mixture under HMC. We report R-hat (1.004) and 0 divergences on every run.
- A masked likelihood: missing ≠ innocent. A pure sink physically can't fire `automation` or `fresh_cohort`; an applicability mask `M` treats those as missing data, not evidence of innocence. This is why sinks stay confident and why the lone ambiguous account comes back honestly uncertain with a wide, τ-straddling interval.
- A skeptical decoy prior. `device_shared` gets the same weak prior in both classes, making identity co-occurrence non-discriminating by construction. The decoy cannot be driven to a flag by the math.

Every signal is learned from the data, not hardcoded — and the Detector's three independent fingerprints all isolate the same ring.
Abstention and decoy-resistance aren't if statements — they fall out of the posterior. Remove PyMC and Quorum loses its defining behavior. (This is our submission for the PyMC Special Prize — full writeup in PYMC_JUDGES.md.)
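
A minimal, dependency-free sketch of the marginalization and mask described in this section. The three-signal vector, the rates `phi_legit` / `phi_mule`, and the weight `w` are illustrative assumptions, not values from Quorum's fitted model. In the pipeline these quantities are sampled by NUTS, and the per-account marginal computed here is the kind of term that would be summed and handed to `pm.Potential`.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def class_loglik(signals, mask, phi):
    """Masked Bernoulli log-likelihood of one account under one class.

    signals[j] is 1 if signal j fired. mask[j] is 0 where signal j is
    structurally inapplicable, so that term is skipped (missing, not innocent).
    """
    total = 0.0
    for fired, applicable, rate in zip(signals, mask, phi):
        if applicable:
            total += math.log(rate) if fired else math.log(1.0 - rate)
    return total

def account_terms(signals, mask, w, phi_legit, phi_mule):
    """Return (marginal log-likelihood, posterior P(mule)) for one account."""
    log_legit = math.log(1.0 - w) + class_loglik(signals, mask, phi_legit)
    log_mule = math.log(w) + class_loglik(signals, mask, phi_mule)
    marginal = logsumexp([log_legit, log_mule])  # class integrated out
    return marginal, math.exp(log_mule - marginal)

# Hypothetical parameters. The third signal has the same rate in both
# classes, mimicking a non-discriminating signal such as device_shared.
w = 0.05
phi_legit = [0.02, 0.05, 0.10]
phi_mule = [0.90, 0.80, 0.10]

_, p_quiet = account_terms([0, 0, 0], [1, 1, 1], w, phi_legit, phi_mule)
_, p_decoy = account_terms([0, 0, 1], [1, 1, 1], w, phi_legit, phi_mule)
_, p_sink_masked = account_terms([1, 0, 0], [1, 0, 0], w, phi_legit, phi_mule)
_, p_sink_unmasked = account_terms([1, 0, 0], [1, 1, 1], w, phi_legit, phi_mule)
```

With equal rates in both classes, the decoy-style signal cancels out of the posterior ratio, so `p_decoy` matches `p_quiet`. Masking the inapplicable signals keeps the sink's probability higher than it would be if their absence were counted as evidence of innocence.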
🕸️ Cognee — the memory that makes the collaboration real
Cognee isn't a logging sink we bolted on; it's the substrate the whole pipeline runs over (cognee_client.py), in two layers:
- Operational handoff store. Every account is one `Case::AC-####` node. As each agent runs, it accretes new fields onto that same node, and every read/write is logged with its entity id. The dependency is enforced: the Estimator raises if the Detector's `signals`/`dist_stats` are absent; the Adjudicator raises if the Estimator's posterior is absent; the Domain Expert raises if no Case has an `action`. Agent N+1 demonstrably uses Agent N's output, or it refuses to run. That's the rubric's "real collaboration" criterion, made literal.
- Semantic knowledge graph. At the end of a run each agent contributes a tagged layer (`add(node_set=["quorum", "agent:detector", …])`), and a single `cognify()` builds a knowledge graph carrying full multi-agent provenance: not just the final report, but who-found-what at every stage. It's then queryable in natural language and persisted to `cognee_graph.json` so the UI renders it instantly. A regulator can trace any conclusion back through the exact agent layers that produced it.
It degrades gracefully to a fast local store when no key is present, so the pipeline is never blocked.
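
The enforced handoff can be sketched with a plain in-memory store. This is an illustration of the pattern only: `CaseStore`, `MissingUpstreamError`, and the two agent functions are hypothetical names, the store is a local dict rather than Cognee, and the values the agents write are placeholders.

```python
class MissingUpstreamError(RuntimeError):
    """Raised when an agent runs before its upstream fields exist."""

class CaseStore:
    """Hypothetical local stand-in for the shared Case nodes."""

    def __init__(self):
        self._cases = {}
        self.log = []  # (agent, operation, case_id, field names)

    def write(self, agent, case_id, **fields):
        # Fields accrete onto the same node; nothing is replaced wholesale.
        self._cases.setdefault(case_id, {}).update(fields)
        self.log.append((agent, "write", case_id, sorted(fields)))

    def read(self, agent, case_id, required):
        case = self._cases.get(case_id, {})
        missing = [name for name in required if name not in case]
        if missing:
            raise MissingUpstreamError(f"{agent}: {case_id} is missing {missing}")
        self.log.append((agent, "read", case_id, sorted(required)))
        return {name: case[name] for name in required}

    def fields(self, case_id):
        return sorted(self._cases.get(case_id, {}))

def detector(store, case_id):
    # Placeholder outputs, not real detections.
    store.write("detector", case_id,
                signals={"fresh_cohort": 1}, dist_stats={"n_transfers": 3})

def estimator(store, case_id):
    store.read("estimator", case_id, ["signals", "dist_stats"])
    # Placeholder posterior; the real one comes from the mixture model.
    store.write("estimator", case_id, p_mule=0.5, credible_interval=(0.0, 1.0))

store = CaseStore()
detector(store, "Case::AC-0001")
estimator(store, "Case::AC-0001")
```

Calling `estimator` on a case the Detector has not touched raises `MissingUpstreamError`, which is the "refuses to run" behavior in miniature.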
🌐 Geodo — grounding the verdict in the real market

Agent 4 (the Domain Expert, agents/domain_expert.py) reads the adjudicated ring's decisive_signals from Cognee and queries the live geodo.ai MCP through a hard safety gate (geo_client.py — read-only/allowlisted tools only, real outreach permanently denied, responses cached for deterministic judged runs). From Geodo's GTM Researcher it pulls the segment, the buyer personas, and the analyst labor-cost rate that sources the cost matrix (C_FP / C_REV) — so τ isn't a magic number, it's grounded in a real cost. It then matches the ring to a real, dated SAR-failure enforcement action against a peer institution: the "why now." All of it is written into Cognee as one MarketContext entity that the Reporter consumes — a second provable Geodo→Reporter handoff.
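
A minimal sketch of what an allowlist-plus-cache gate of this kind could look like. `SafetyGate`, `ToolDenied`, the tool names, and the fake transport are all hypothetical; this is not the contents of `geo_client.py` and it does not call the real geodo.ai MCP.

```python
import json

class ToolDenied(PermissionError):
    """Raised when a tool outside the read-only allowlist is requested."""

class SafetyGate:
    def __init__(self, call_tool, allowlist):
        self._call_tool = call_tool          # transport: (tool, params) -> response
        self._allowlist = frozenset(allowlist)
        self._cache = {}

    def call(self, tool, **params):
        if tool not in self._allowlist:
            raise ToolDenied(f"tool {tool!r} is not on the read-only allowlist")
        # Stable cache key so repeated runs return the identical response.
        key = (tool, json.dumps(params, sort_keys=True))
        if key not in self._cache:
            self._cache[key] = self._call_tool(tool, params)
        return self._cache[key]

# Hypothetical transport that records how often the upstream is hit.
upstream_calls = []

def fake_transport(tool, params):
    upstream_calls.append(tool)
    return {"tool": tool, "params": params}

gate = SafetyGate(fake_transport, allowlist={"research_segment"})
first = gate.call("research_segment", segment="community banks")
second = gate.call("research_segment", segment="community banks")
```

The second call is served from the cache, and any tool that is not allowlisted (for example a hypothetical `send_outreach`) fails before the transport is ever reached.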
Because those dollars decide real escalation tradeoffs, we didn't leave them to the model alone: our teammate Luca, a data engineer at Deloitte who works on financial-crime / BSA data in consulting, pressure-tested the cost matrix and SAR-penalty assumptions against how a real BSA team actually spends per case — so the figures Geodo grounds the business case in reflect practice, not guesswork.
🎥 Trupeer — the demo, told in three minutes
We recorded the full walkthrough with Trupeer — CSV in, queue out, the decoy cleared, the abstention routed, the SAR memo downloaded, and the Cognee graph answering in plain English — so a judge sees the product operate cold without us narrating over their shoulder.
📊 Kaggle — the proving ground
The pipeline runs on the Kaggle-sourced Track 02 dataset (Crestline Community Bank: ~5,000 transactions, ~294 accounts). We keep the ground-truth oracle strictly separated from the detection code, and our tests prove the agents rediscover the answers from the data — not from the key.
🏗️ Architecture & the exact handoffs
A real Find → Rank → Act → Ground → Explain pipeline — five specialists, not one LLM in a loop. Data flows only through Cognee.
Crestline CSV → DuckDB (in-process) → COGNEE (shared Case nodes)
│
▼
1. DETECTOR ─signals, dist_stats─▶ 2. ESTIMATOR ─p_mule, credible_interval─▶
3. ADJUDICATOR ─action, EVPI, decisive_signals─▶ 4. DOMAIN EXPERT ─MarketContext─▶
5. REPORTER ─SAR memo, closing_rule
| Agent | Reads from Cognee | Computes | Writes back (the handoff) |
|---|---|---|---|
| 1 · Detector | raw txns (DuckDB) | isolates the AC→AC transfer graph structurally; derives source/relay/sink roles; fires 7 signals; learns the fresh-cohort cutoff from the data (no magic 30 days) | signals, dist_stats |
| 2 · Estimator | signals, dist_stats | PyMC mixture, NUTS | p_mule, credible_interval, signal_contributions |
| 3 · Adjudicator | posterior + interval | Bayesian decision theory: τ = C_FP/(C_FP+C_FN); expected-loss argmin; abstains when the interval straddles τ and EVPI > review cost | action, EVPI, decisive_signals |
| 4 · Domain Expert | adjudicated ring | queries geodo.ai MCP + FinCEN/OCC registry; sources the cost basis and the "why now" | MarketContext |
| 5 · Reporter | the fully enriched Case set | edge-by-edge dollar reconciliation; FinCEN SAR memo; compiles the closing rule | typology, closing_rule, memo_ref |
Decisions are deterministic (fixed seed + argmin), so the system is reproducible and never says "the model said so."
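
The Adjudicator's rule from the table can be sketched as follows. The cost figures are hypothetical, chosen only so that τ comes out at the 0.05 quoted earlier. The EVPI formula is a simplifying assumption: it uses the point posterior and treats a perfectly informed decision as costing nothing, so it may differ from how Quorum computes EVPI over posterior draws.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Costs:
    c_fp: float   # cost of escalating a legitimate account
    c_fn: float   # cost of clearing a true mule
    c_rev: float  # cost of one human review

def threshold(costs):
    """tau = C_FP / (C_FP + C_FN)."""
    return costs.c_fp / (costs.c_fp + costs.c_fn)

def adjudicate(p_mule, interval, costs):
    lo, hi = interval
    tau = threshold(costs)
    loss_escalate = (1.0 - p_mule) * costs.c_fp
    loss_clear = p_mule * costs.c_fn
    # Assumption: perfect information would bring the loss to zero, so
    # EVPI equals the expected loss of the best action available now.
    evpi = min(loss_escalate, loss_clear)
    if lo < tau < hi and evpi > costs.c_rev:
        action = "REVIEW"
    elif loss_escalate <= loss_clear:   # equivalent to p_mule >= tau
        action = "ESCALATE"
    else:
        action = "CLEAR"
    return {"action": action, "tau": tau, "EVPI": evpi}

costs = Costs(c_fp=100.0, c_fn=1900.0, c_rev=5.0)   # hypothetical dollars
relay = adjudicate(0.995, (0.97, 1.00), costs)
decoy = adjudicate(0.001, (0.00, 0.01), costs)
ambiguous = adjudicate(0.06, (0.01, 0.30), costs)
```

Because the rule is a pure function of the posterior summary and the cost matrix, the same inputs always produce the same action.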
✅ How we map to the 25-point rubric
| Criterion (5 pts) | How Quorum nails it |
|---|---|
| Functional agents on real data | Runs on the real Crestline CSV via DuckDB; a ground-truth oracle is kept separate and tests prove the pipeline rediscovers the answers from data. |
| Collaboration via Cognee | Field-accretion on shared Case nodes; downstream agents raise if upstream fields are missing. Provable, not narrated. |
| Matches the brief | All five Step-0 success conditions met and enforced by a green test suite. |
| Usable by non-technical operators | Three-screen Streamlit product + one-click downloadable SAR memo + English-language graph search. Drop a CSV, get a queue. |
| Transparent reasoning | Posterior + credible interval + expected-loss arithmetic + decisive signals on every case; the memo phrases (never invents) the computed facts. |
🛠️ How we built it
- Python 3.14, `uv`-managed. Pipeline: `uv run main.py data/track02_fraud_watch.csv`. Product: `uv run streamlit run ui/app.py`.
- DuckDB: in-process analytical SQL over the transactions; zero server.
- PyMC + nutpie + ArviZ — the Bayesian mixture, NUTS sampling, convergence diagnostics.
- Cognee — shared memory + the cognified knowledge graph and NL search.
- Geodo (geodo.ai MCP) — live GTM/market grounding behind a read-only safety gate.
- Streamlit — the three-screen analyst product.
- Bring-your-own LLM key (Anthropic / OpenAI / Groq) — every external layer degrades gracefully without one; keys never touch the repo.
🚧 Challenges we solved
- Fitting a mixture under NUTS: discrete latent classes break HMC; the `logsumexp` marginalization + `pm.Potential` formulation made it sample cleanly (we removed a hard ordering constraint that injected ~280 divergences; informative priors pin the labels without it → 0 divergences, R-hat 1.004).
- Structurally-missing signals: the masked likelihood was the unlock for both confident sinks and honest abstention.
- Keeping agents honest: separating the ground-truth oracle from the detection code so the pipeline earns its answers, and the tests can prove it.
- One async event loop for Cognee: buffering each agent's layer and flushing in a single loop with one `cognify()`, so async connections never bind to dead loops.
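
The buffer-then-flush pattern from the last challenge can be sketched with the standard library alone. `FakeGraphClient` is a hypothetical stand-in with `add` and `cognify` coroutines; it is not the Cognee client and its signatures are assumptions made for the example.

```python
import asyncio

class FakeGraphClient:
    """Hypothetical async memory client used only for this sketch."""

    def __init__(self):
        self.added = []
        self.cognify_calls = 0

    async def add(self, text, node_set):
        self.added.append((text, tuple(node_set)))

    async def cognify(self):
        self.cognify_calls += 1

class LayerBuffer:
    """Agents stage layers synchronously; one event loop writes them all."""

    def __init__(self):
        self._pending = []

    def stage(self, agent, text):
        self._pending.append((text, ["quorum", f"agent:{agent}"]))

    async def _flush(self, client):
        for text, node_set in self._pending:
            await client.add(text, node_set)
        await client.cognify()      # a single graph build for the whole run
        self._pending.clear()

    def flush(self, client):
        # One asyncio.run call means every connection lives in one loop.
        asyncio.run(self._flush(client))

buffer = LayerBuffer()
buffer.stage("detector", "signals and dist_stats summary")
buffer.stage("estimator", "posterior summary")
client = FakeGraphClient()
buffer.flush(client)
```

Staging is synchronous, so agents never need their own event loop, and nothing created in a closed loop is reused later.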
🔮 What's next
Streaming ingestion for live monitoring · the closed analyst-feedback learning loop (resolved dispositions become labels that update the priors and cost matrix online) · graph-native typology detection that surfaces emerging rings before they complete · multi-institution federated typology sharing via Cognee.
Every other tool ranks risk — and most just flag the decoy. Quorum surfaces the nine, refuses to guess on the tenth, ignores the trap, and shows you the math behind every call.
Built With
- cognee
- geodo
- pymc
- python