Quorum — the AML triage brain that knows when not to guess

Quorum (Formerly RingFence) — Calibrated Anti-Money-Laundering Triage

🙋 First, the honest part

We were handed a dataset and a dare: don't just run ETL and flag the outliers; put real agents on it and make them reason. So we did exactly that. We didn't build a "fraud score." We built five agents that argue a case from raw transactions to a regulator-ready filing, share one memory, and, on the one account where the evidence genuinely doesn't decide, refuse to guess and hand it to a human. Everything below is the polished version, but that was the whole idea: go past anomaly detection, into judgment.


https://github.com/Alred-79/nycTechWeekHack

⚡ TL;DR

Quorum ingests a community bank's raw transaction file and runs a five-agent Find → Rank → Act → Ground → Explain pipeline over Cognee. In under a second it turns ~294 accounts / 5,000 transactions into a triage queue of 14: 9 escalate · 1 routed to human review · 4 cleared — with ring exposure reconciled to the cent at \$161,750.90 and a one-click, regulator-ready SAR memo.

What makes it different from every threshold tool:

  • 🧠 A real probabilistic brain (PyMC). Agent 2 is an unsupervised two-component Bayesian mixture sampled with NUTS — it returns a full posterior with credible intervals, which is what lets the system express honest uncertainty instead of a fake-precise score.
  • 🤝 Provable collaboration (Cognee). Each agent accretes fields onto one shared Case node; the next agent raises an error if the upstream fields are missing. The handoff is a hard data dependency, not a narrative.
  • 🌐 Grounded in the real market (Geodo). A Domain Expert agent queries the live geodo.ai MCP through a safety gate to source the cost matrix and match the ring to a real, dated enforcement action — the "why this matters."
  • 🙅 Calibrated abstention. When the credible interval straddles the decision threshold, Quorum routes to a human because the math says the review pays for itself.
  • 🪤 Decoy resistance by construction. The planted "shared-device" trap that fools threshold tools is mathematically prevented from flagging.

🎯 The Product Brief we built against (Step 0)

User: an AML analyst at a community bank with ~3 minutes per case. Job: find the coordinated ring hiding below every alert threshold, without drowning in false positives. Our own success bar: (1) recover all 9 ring accounts, (2) clear the planted decoy, (3) abstain on the one genuinely ambiguous account, (4) reconcile the dollars to the cent, (5) every decision carries a human-readable reason — no bare scores.

We hit all five, and we wrote a test suite that enforces them on every commit. That makes "matches the brief" verifiable rather than asserted.


🕵️ What it does

Crestline Community Bank hands you a 90-day file: ~5,000 transactions across ~294 accounts. Hidden inside is a layering ring engineered to never cross a monitoring threshold — your rules engine caught none of it.

The haystack — 5,000 transactions, the ring hidden in the noise

The haystack. 5,000 transactions · 294 accounts. The dim cloud is normal banking; the red chains are the laundering ring. Click any node and the case opens inline — fully interactive, no server, no internet.

Drop the CSV into Quorum. Five agents run over one shared Cognee memory and collapse the noise to a focused queue. Three screens tell the story:

The product — KPIs and a fully-reasoned case dossier

~294 accounts → a queue of 14 in under a second. Every verdict carries its posterior, its expected-loss ledger, its signed signal contributions, and a decision log. No bare score anywhere.

  • The Queue — 14 surfaced accounts, each with its mule probability and uncertainty band, color-coded ESCALATE / REVIEW / CLEAR. Everything else auto-cleared. 9 escalate · 1 review · 4 cleared · \$161,750.90 reconciled.
  • Case detail: open a relay and you see a posterior at p = 0.995 with a tight interval far above τ = 0.05, the signals that fired, the expected-loss arithmetic behind the ESCALATE, and the source→relay→sink chains it sits in.
    • The decoy beat: an account shares a device with three others — the obvious flag. Quorum holds it at p ≈ 0 and CLEARS it. It didn't fall for the trap.
    • The abstention beat: one account is fresh-cohort like the ring but has no transfers — genuinely ambiguous. Its interval straddles τ, so Quorum computes that a review pays for itself and routes it to REVIEW instead of guessing.

The needle — AC-0005 escalated at 99.5%

The needle. Click into the ring: the source pumps \$53,896 across 83 transfers into two relays at p = 99.5% — role, posterior, dollars, and signals fired, all inline.

  • Pipeline / Graph view — the same Case object after each agent touches it; you literally watch fields accrete. Then ask the Cognee knowledge graph in plain English — "which accounts are relays in the ring?" — and it answers. One more click downloads the SAR memo, reconciled edge-by-edge to \$161,750.90, with the learned closing rule appended.
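
To make "reconciled to the cent" concrete, here is a minimal sketch of an edge-by-edge reconciliation check. It assumes exposure is the exact sum of the transfer edges in the ring; the account ids, amounts, and the `reconcile` helper are hypothetical illustrations, not the Crestline data or the Reporter's actual code.

```python
from decimal import Decimal

# Toy transfer edges: (source, destination, amount as a string).
# Hypothetical ids and amounts, used only to illustrate the check.
edges = [
    ("AC-S", "AC-R1", "1200.45"),
    ("AC-S", "AC-R2", "830.10"),
    ("AC-R1", "AC-K", "1150.00"),
]


def reconcile(edges, reported_total):
    """Sum every edge exactly and compare against the reported figure."""
    total = sum((Decimal(amount) for _, _, amount in edges), Decimal("0.00"))
    return total, total == Decimal(reported_total)


total, ok = reconcile(edges, "3180.55")
```

Using `Decimal` rather than floats keeps the sum exact, so a one-cent discrepancy fails the check instead of disappearing into rounding error.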

🏆 The sponsors — and exactly how each one is load-bearing

Every sponsor below is in the critical path. Remove any one and a headline capability disappears — none of this is decorative.

🧠 PyMC Labs — the inference engine that decides

The masked two-component Bayesian mixture, marginalized for NUTS

Agent 2 (the Estimator, agents/ranker.py) is a real generative model: an unsupervised two-component Bayesian mixture (latent classes legit vs mule), each with its own vector of signal fire-rates φ. There are no labels in the data, so we don't train on "known mules" — we let PyMC's NUTS sampler (nutpie, 4 chains) infer the rates, the mixing weight, and every account's posterior probability of being a mule.

Three pieces of genuine Bayesian engineering — each produces a demo moment:

  1. logsumexp marginalizes the hidden class analytically. NUTS can't sample discrete latents, so the class is integrated out in log-space and fed to pm.Potential as an exact marginal log-likelihood — the textbook-correct way to fit a mixture under HMC. We report R-hat (1.004) and 0 divergences on every run.
  2. A masked likelihood — missing ≠ innocent. A pure sink physically can't fire automation or fresh_cohort; an applicability mask M treats those as missing data, not evidence of innocence. This is why sinks stay confident and why the lone ambiguous account comes back honestly uncertain with a wide, τ-straddling interval.
  3. A skeptical decoy prior. device_shared gets the same weak prior in both classes, making identity co-occurrence non-discriminating by construction. The decoy cannot be driven to a flag by the math.
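
The three ideas above can be illustrated for a single account in plain Python. This is a sketch under simplifying assumptions, not the PyMC model: the fire-rates and mixing weight are fixed hypothetical numbers (in the real model they are sampled), the signal order is our own choice, and `device_shared` is given the same rate in both classes to mimic a non-discriminating signal.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def class_loglik(fired, mask, phi):
    """Bernoulli log-likelihood over applicable signals only."""
    total = 0.0
    for x, applicable, rate in zip(fired, mask, phi):
        if not applicable:
            continue  # structurally missing: contributes no evidence either way
        total += math.log(rate) if x else math.log1p(-rate)
    return total


def mule_posterior(fired, mask, phi_legit, phi_mule, pi_mule):
    """Integrate out the hidden class and return P(mule | signals)."""
    log_legit = math.log1p(-pi_mule) + class_loglik(fired, mask, phi_legit)
    log_mule = math.log(pi_mule) + class_loglik(fired, mask, phi_mule)
    log_marginal = logsumexp([log_legit, log_mule])  # exact marginal log-likelihood
    return math.exp(log_mule - log_marginal)


# Signal order (assumed): automation, fresh_cohort, structuring, device_shared.
PHI_LEGIT = [0.02, 0.05, 0.03, 0.10]  # hypothetical rates
PHI_MULE = [0.80, 0.70, 0.60, 0.10]   # device_shared rate equals the legit rate
PI_MULE = 0.05                        # hypothetical mixing weight

sink_fired = [0, 0, 1, 0]
masked = mule_posterior(sink_fired, [False, False, True, True], PHI_LEGIT, PHI_MULE, PI_MULE)
unmasked = mule_posterior(sink_fired, [True, True, True, True], PHI_LEGIT, PHI_MULE, PI_MULE)

decoy_on = mule_posterior([0, 0, 0, 1], [True] * 4, PHI_LEGIT, PHI_MULE, PI_MULE)
decoy_off = mule_posterior([0, 0, 0, 0], [True] * 4, PHI_LEGIT, PHI_MULE, PI_MULE)
```

With these toy numbers, masking the two inapplicable signals leaves the sink with a higher mule probability than counting their absence against it, and firing `device_shared` leaves the posterior unchanged because its likelihood term cancels between the two classes.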

Three learned fingerprints — automation, burst cohort, structuring

Every signal is learned from the data, not hardcoded — and the Detector's three independent fingerprints all isolate the same ring.

Abstention and decoy-resistance aren't if statements — they fall out of the posterior. Remove PyMC and Quorum loses its defining behavior. (This is our submission for the PyMC Special Prize — full writeup in PYMC_JUDGES.md.)

🕸️ Cognee — the memory that makes the collaboration real

Cognee isn't a logging sink we bolted on; it's the substrate the whole pipeline runs over (cognee_client.py), in two layers:

  • Operational handoff store. Every account is one Case::AC-#### node. As each agent runs, it accretes new fields onto that same node, and every read/write is logged with its entity id. The dependency is enforced: the Estimator raises if the Detector's signals/dist_stats are absent; the Adjudicator raises if the Estimator's posterior is absent; the Domain Expert raises if no Case has an action. Agent N+1 demonstrably uses Agent N's output — or it refuses to run. That's the rubric's "real collaboration" criterion, made literal.
  • Semantic knowledge graph. At the end of a run each agent contributes a tagged layer (add(node_set=["quorum", "agent:detector", …])), and a single cognify() builds a knowledge graph carrying full multi-agent provenance — not just the final report, but who-found-what at every stage. It's then queryable in natural language and persisted to cognee_graph.json so the UI renders it instantly. A regulator can trace any conclusion back through the exact agent layers that produced it.

It degrades gracefully to a fast local store when no key is present, so the pipeline is never blocked.
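
The enforced handoff can be sketched with an in-memory stand-in for the shared Case nodes. This is not the Cognee API: `CaseStore`, `require`, and the placeholder values are hypothetical, and only the field names (`signals`, `dist_stats`, `p_mule`, `credible_interval`) come from the description above.

```python
class CaseStore:
    """In-memory stand-in for the shared Case nodes (not the Cognee API)."""

    def __init__(self):
        self.cases = {}
        self.log = []

    def write(self, agent, case_id, **fields):
        # Accrete new fields onto the same node and log the write.
        self.cases.setdefault(case_id, {}).update(fields)
        self.log.append((agent, "write", case_id, sorted(fields)))

    def require(self, agent, case_id, *names):
        # Refuse to proceed if any upstream field is absent.
        case = self.cases.get(case_id, {})
        missing = [n for n in names if n not in case]
        if missing:
            raise KeyError(f"{agent}: {case_id} is missing upstream fields {missing}")
        self.log.append((agent, "read", case_id, sorted(names)))
        return {n: case[n] for n in names}


def detector(store, case_id):
    # Placeholder outputs; the real Detector derives these from the transactions.
    store.write("detector", case_id, signals={"structuring": 1}, dist_stats={"n_txns": 12})


def estimator(store, case_id):
    upstream = store.require("estimator", case_id, "signals", "dist_stats")
    # Placeholder posterior; the real Estimator fits the mixture here.
    store.write("estimator", case_id, p_mule=0.5, credible_interval=(0.1, 0.9))
    return upstream
```

Calling `estimator` before `detector` raises, which is the property that makes the collaboration checkable rather than narrated.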

🌐 Geodo — grounding the verdict in the real market

geodo.ai Digital Twin — grounding and MCP tool calls

Agent 4 (the Domain Expert, agents/domain_expert.py) reads the adjudicated ring's decisive_signals from Cognee and queries the live geodo.ai MCP through a hard safety gate (geo_client.py — read-only/allowlisted tools only, real outreach permanently denied, responses cached for deterministic judged runs). From Geodo's GTM Researcher it pulls the segment, the buyer personas, and the analyst labor-cost rate that sources the cost matrix (C_FP / C_REV) — so τ isn't a magic number, it's grounded in a real cost. It then matches the ring to a real, dated SAR-failure enforcement action against a peer institution: the "why now." All of it is written into Cognee as one MarketContext entity that the Reporter consumes — a second provable Geodo→Reporter handoff.
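
A safety gate of this shape can be sketched as an allowlist check plus a response cache in front of whatever transport actually talks to the MCP server. The tool names, the `SafetyGate` class, and the `transport` callable below are hypothetical; the real allowlist lives in geo_client.py and is not reproduced here.

```python
import json

# Hypothetical read-only tool names; anything else is refused.
READ_ONLY_TOOLS = {"gtm_research", "segment_lookup"}


class ToolDenied(Exception):
    """Raised when a caller asks for a tool outside the allowlist."""


class SafetyGate:
    def __init__(self, transport):
        self.transport = transport  # callable(tool, args) -> response
        self.cache = {}

    def call(self, tool, **args):
        if tool not in READ_ONLY_TOOLS:
            raise ToolDenied(f"tool {tool!r} is not on the read-only allowlist")
        # Cache on tool + canonicalized arguments so repeat runs are deterministic.
        key = (tool, json.dumps(args, sort_keys=True))
        if key not in self.cache:
            self.cache[key] = self.transport(tool, args)
        return self.cache[key]
```

The deny path runs before any network call, so an outreach-style tool can never reach the transport, and a cached response means a second identical query returns the same answer without touching the live service.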

Because those dollars decide real escalation tradeoffs, we didn't leave them to the model alone. Our teammate Luca, a data engineer at Deloitte who works on financial-crime / BSA data in consulting, pressure-tested the cost matrix and SAR-penalty assumptions against how a real BSA team actually spends per case, so the figures Geodo grounds the business case in reflect practice, not guesswork.

🎥 Trupeer — the demo, told in three minutes

We recorded the full walkthrough with Trupeer — CSV in, queue out, the decoy cleared, the abstention routed, the SAR memo downloaded, and the Cognee graph answering in plain English — so a judge sees the product operate cold without us narrating over their shoulder.

📊 Kaggle — the proving ground

The pipeline runs on the Kaggle-sourced Track 02 dataset (Crestline Community Bank: ~5,000 transactions, ~294 accounts). We keep the ground-truth oracle strictly separated from the detection code, and our tests prove the agents rediscover the answers from the data — not from the key.


🏗️ Architecture & the exact handoffs

A real Find → Rank → Act → Ground → Explain pipeline — five specialists, not one LLM in a loop. Data flows only through Cognee.

Crestline CSV → DuckDB (in-process) → COGNEE (shared Case nodes)
        │
        ▼
1. DETECTOR ─signals, dist_stats─▶ 2. ESTIMATOR ─p_mule, credible_interval─▶
3. ADJUDICATOR ─action, EVPI, decisive_signals─▶ 4. DOMAIN EXPERT ─MarketContext─▶
5. REPORTER ─SAR memo, closing_rule

| Agent | Reads from Cognee | Computes | Writes back (the handoff) |
| --- | --- | --- | --- |
| 1 · Detector | raw txns (DuckDB) | isolates the AC→AC transfer graph structurally; derives source/relay/sink roles; fires 7 signals; learns the fresh-cohort cutoff from the data (no magic 30 days) | signals, dist_stats |
| 2 · Estimator | signals, dist_stats | PyMC mixture, NUTS | p_mule, credible_interval, signal_contributions |
| 3 · Adjudicator | posterior + interval | Bayesian decision theory: τ = C_FP/(C_FP+C_FN); expected-loss argmin; abstains when the interval straddles τ and EVPI > review cost | action, EVPI, decisive_signals |
| 4 · Domain Expert | adjudicated ring | queries geodo.ai MCP + FinCEN/OCC registry; sources the cost basis and the "why now" | MarketContext |
| 5 · Reporter | the fully enriched Case set | edge-by-edge dollar reconciliation; FinCEN SAR memo; compiles the closing rule | typology, closing_rule, memo_ref |
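
As an illustration of the Detector's structural step, roles can be read off the direction of transfers alone. This sketch assumes the simplest possible rule (sends only = source, receives only = sink, both = relay); the actual Detector may use additional criteria, and `derive_roles` and the account ids are hypothetical.

```python
from collections import Counter


def derive_roles(edges):
    """Label each account by the direction of its account-to-account transfers."""
    sent = Counter(src for src, _ in edges)
    received = Counter(dst for _, dst in edges)
    roles = {}
    for account in sent.keys() | received.keys():
        if sent[account] and received[account]:
            roles[account] = "relay"
        elif sent[account]:
            roles[account] = "source"
        else:
            roles[account] = "sink"
    return roles


# Toy ring: one source feeds two relays, which both pay one sink.
toy_edges = [("AC-S", "AC-R1"), ("AC-S", "AC-R2"), ("AC-R1", "AC-K"), ("AC-R2", "AC-K")]
roles = derive_roles(toy_edges)
```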

Decisions are deterministic (fixed seed + argmin), so the system is reproducible and never says "the model said so."
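
The Adjudicator's rule can be written down in a few lines. This is a sketch under stated assumptions: the three cost values are hypothetical (C_FP and C_FN are chosen only so that τ comes out at the 0.05 quoted earlier), and EVPI is computed in its simplest form, as the expected loss of the best unreviewed action, on the assumption that a review reveals the true class.

```python
C_FP = 1.0    # hypothetical cost of escalating a legitimate account
C_FN = 19.0   # hypothetical cost of clearing a mule (gives tau = 0.05)
C_REV = 0.5   # hypothetical cost of one human review

TAU = C_FP / (C_FP + C_FN)


def adjudicate(p_mule, interval):
    """Return (action, evpi) for one account's posterior summary."""
    lo, hi = interval
    losses = {
        "ESCALATE": (1.0 - p_mule) * C_FP,  # wrong only if the account is legitimate
        "CLEAR": p_mule * C_FN,             # wrong only if the account is a mule
    }
    best = min(losses, key=losses.get)      # expected-loss argmin
    evpi = losses[best]                     # perfect information would cost nothing
    if lo < TAU < hi and evpi > C_REV:
        return "REVIEW", evpi               # the review pays for itself
    return best, evpi


confident_relay = adjudicate(0.995, (0.98, 0.999))
cleared_decoy = adjudicate(0.001, (0.0, 0.004))
ambiguous = adjudicate(0.08, (0.01, 0.30))
```

Because the rule is a pure function of the posterior summary and fixed costs, the same inputs always give the same action.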


✅ How we map to the 25-point rubric

| Criterion (5 pts) | How Quorum nails it |
| --- | --- |
| Functional agents on real data | Runs on the real Crestline CSV via DuckDB; a ground-truth oracle is kept separate and tests prove the pipeline rediscovers the answers from data. |
| Collaboration via Cognee | Field-accretion on shared Case nodes; downstream agents raise if upstream fields are missing. Provable, not narrated. |
| Matches the brief | All five Step-0 success conditions met and enforced by a green test suite. |
| Usable by non-technical operators | Three-screen Streamlit product + one-click downloadable SAR memo + English-language graph search. Drop a CSV, get a queue. |
| Transparent reasoning | Posterior + credible interval + expected-loss arithmetic + decisive signals on every case; the memo phrases (never invents) the computed facts. |

🛠️ How we built it

  • Python 3.14, uv-managed. Pipeline: uv run main.py data/track02_fraud_watch.csv. Product: uv run streamlit run ui/app.py.
  • DuckDB — in-process analytical SQL over the transactions; zero server.
  • PyMC + nutpie + ArviZ — the Bayesian mixture, NUTS sampling, convergence diagnostics.
  • Cognee — shared memory + the cognified knowledge graph and NL search.
  • Geodo (geodo.ai MCP) — live GTM/market grounding behind a read-only safety gate.
  • Streamlit — the three-screen analyst product.
  • Bring-your-own LLM key (Anthropic / OpenAI / Groq) — every external layer degrades gracefully without one; keys never touch the repo.

🚧 Challenges we solved

  • Fitting a mixture under NUTS: discrete latent classes break HMC, so we marginalized the class with logsumexp and passed the result to pm.Potential, which made it sample cleanly. We also removed a hard ordering constraint that injected ~280 divergences; informative priors pin the labels without it, giving 0 divergences and R-hat 1.004.
  • Structurally-missing signals — the masked likelihood was the unlock for both confident sinks and honest abstention.
  • Keeping agents honest — separating the ground-truth oracle from the detection code so the pipeline earns its answers, and the tests can prove it.
  • One async event loop for Cognee — buffering each agent's layer and flushing in a single loop with one cognify(), so async connections never bind to dead loops.
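
The single-loop pattern from the last bullet can be sketched with stand-in coroutines. `fake_add` and `fake_cognify` are placeholders we made up so the example runs without any external service; they are not Cognee's functions, and only the `node_set` tag shape follows the description earlier in this write-up.

```python
import asyncio

calls = []


async def fake_add(text, node_set):
    # Stand-in for an async "add this layer" call.
    await asyncio.sleep(0)
    calls.append(("add", tuple(node_set)))


async def fake_cognify():
    # Stand-in for the single graph-building call at the end of a run.
    await asyncio.sleep(0)
    calls.append(("cognify",))


class LayerBuffer:
    """Collect each agent's layer synchronously, then flush in one event loop."""

    def __init__(self):
        self.pending = []

    def stage(self, agent, text):
        self.pending.append((text, ["quorum", f"agent:{agent}"]))

    async def _flush(self):
        for text, node_set in self.pending:
            await fake_add(text, node_set)
        await fake_cognify()  # exactly one graph build per run
        self.pending.clear()

    def flush(self):
        # One asyncio.run call means every connection lives and dies in the same loop.
        asyncio.run(self._flush())


buffer = LayerBuffer()
buffer.stage("detector", "signals and dist_stats summary")
buffer.stage("estimator", "posterior summary")
buffer.flush()
```

Staging is synchronous, so agents never need their own event loop; only the final flush enters async code.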

🔮 What's next

Streaming ingestion for live monitoring · the closed analyst-feedback learning loop (resolved dispositions become labels that update the priors and cost matrix online) · graph-native typology detection that surfaces emerging rings before they complete · multi-institution federated typology sharing via Cognee.


Every other tool ranks risk — and most just flag the decoy. Quorum surfaces the nine, refuses to guess on the tenth, ignores the trap, and shows you the math behind every call.
