SynthGuard

Inspiration

Synthetic financial data generation is becoming the norm. Quants routinely use generative AI to create synthetic equity returns for backtesting, stress testing, and privacy-preserving data sharing. But there is a massive flaw in how we evaluate these models: we test them on averages.

Real markets aren't average. They operate in distinct regimes, and market physics change depending on the state of the world:

  • Volatility clustering dominates crisis periods, but vanishes in calm markets.
  • Fat tails (black swan risks) grow significantly heavier under stress.
  • Leverage effects (the asymmetric panic response) emerge strongly in high-volatility regimes, but disappear when markets are quiet.

The Problem: Current benchmarks test data unconditionally—collapsing all these regimes together. A generator might output sequences that look mathematically flawless on average, but completely fail to simulate the physics of a real market crash.

Furthermore, when existing tools do flag a failure, they just spit out a single pass/fail score. They offer zero insight into what structural property broke or where it happened. Neither approach gives a practitioner the actual information needed to fix their model.

We built SynthGuard to close that gap.

The Core Innovation: Synthetic Footprint Index (SFI)

The central contribution of SynthGuard is a novel metric we developed for this project: the Synthetic Footprint Index (SFI).

The SFI is a calibrated score in [0, 1] that quantifies how much of an artificial footprint a synthetic return sequence leaves behind under adversarial scrutiny:

  • SFI → 0: The sequence is indistinguishable from real market data. No detectable synthetic footprint.
  • SFI → 1: The sequence carries clear, learnable artefacts. A trained adversary can reliably identify it as machine-generated.

The SFI is produced by a five-member ensemble of Temporal Convolutional Networks (TCNs), each trained independently on 20 years of real equity return data. Raw ensemble probabilities are post-hoc calibrated via Platt scaling on a held-out validation set, converting an uncalibrated logit into a metrically grounded index. An SFI of 0.12, for example, genuinely means the sequence passes adversarial scrutiny ~88% of the time under the discriminator's learned representation of real market dynamics.

Critically, the SFI is regime-conditioned throughout. Each TCN receives a two-channel input — the return sequence and an integer-encoded regime label — so the discriminator judges realism relative to the market state the generator claims to be representing, not in aggregate. This is what separates the SFI from a plain discriminator score: a generator that produces realistic low-volatility sequences is not penalised for behaviour in high-volatility regimes it was never trained on, and a generator that looks good on average but collapses structurally under stress receives a high SFI in that regime rather than having its failure averaged away.

The SFI is the primary verdict the system returns. Every other component — the statistical test suite, the Claude diagnostic — exists to explain what is driving it.

What SynthGuard Does

A user uploads a batch of synthetic return sequences (252 trading days each) alongside regime labels, and receives three independent signals synthesised into one diagnostic workflow:

  1. The SFI — a per-sequence Synthetic Footprint Index score and a batch-level summary, rendered immediately as a verdict banner and probability histogram.
  2. A 7-test statistical suite — run per regime, covering the full canonical set of financial stylised facts, displayed as a regime × test heatmap.
  3. A Claude-generated diagnosis — a streamed natural language report that interprets both signals, identifies dominant failure modes by regime, ranks them by severity, and provides specific generator improvement recommendations. Users can ask follow-up questions in a persistent chat session.

SynthGuard also ships a built-in generator panel (six models, from GBM to Score Diffusion) and a filter that extracts the subset of uploaded sequences with the lowest SFI scores — those that fool the discriminator — for use as hard negatives in adversarial training.
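The filter step reduces to sorting by SFI and keeping the lowest-scoring sequences. A minimal sketch (the helper name is hypothetical, assuming numpy arrays of per-sequence scores):

```python
import numpy as np

def hardest_negatives(sequences, sfi_scores, k=10):
    """Return the k sequences with the lowest SFI, i.e. the ones the
    discriminator found most convincing, for adversarial retraining.

    sequences:  (N, 252) array of synthetic return paths
    sfi_scores: (N,) array of calibrated SFI values in [0, 1]
    """
    order = np.argsort(sfi_scores)          # ascending: lowest SFI first
    idx = order[:k]
    return sequences[idx], sfi_scores[idx]
```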

Technical Architecture

┌─────────────────────────────────────────────────────────┐
│                     React 18 SPA                        │
│   Audit Tab │ Generate Tab │ Filter Tab │ Settings      │
│   CSV Upload (Web Worker) │ SSE Stream │ Dark Theme UI  │
└────────────────────────┬────────────────────────────────┘
                         │  HTTP REST + SSE
┌────────────────────────▼────────────────────────────────┐
│                  FastAPI 0.110 Backend                  │
│  /api/audit/batch  │  /api/generate/* │  /api/filter/* │
│  /api/diagnosis/* │  /api/stream/{id} │  /api/files/* │
├──────────┬──────────┬──────────┬───────┴─────┬──────────┤
│ Model    │ Stat     │ TCN /    │ Claude      │ Job /    │
│ Registry │ Tests    │ SFI      │ Proxy       │ File Mgr │
│ (startup)│ (7 tests)│ Ensemble │ (streaming) │          │
└──────────┴──────────┴──────────┴─────────────┴──────────┘
                         │
┌────────────────────────▼────────────────────────────────┐
│              Trained Artifact Store (16 files)          │
│  GMM · Markov · GBM/GARCH/Heston/TFT/CVAE/Diffusion   │
│  TCN ×5 · Platt scaler · Regime benchmark stats         │
└─────────────────────────────────────────────────────────┘

GMM Regime Segmenter

The regime model is a K=3 Gaussian Mixture Model fit on a feature vector derived from 20 years of VIX and equity data. Features include exponentially-weighted VIX, 21-day percentage change in VIX EMA, and 21-day change in the 10-year Treasury yield. Components are deterministically labelled by ascending mean VIX level: low_vol, mid_vol, high_vol. A first-order Markov chain is estimated from the label sequence and stored as a transition matrix, used by all six built-in generators to produce regime-consistent synthetic sequences.
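A condensed sketch of this fit with scikit-learn (feature construction omitted; column 0 stands in for the exponentially-weighted VIX feature, and all names are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_regime_model(features, vix_col=0, n_regimes=3, seed=0):
    """Fit a K-component GMM plus a first-order Markov chain on its labels.

    features: (T, F) feature matrix; column `vix_col` orders the
    components as low_vol (0) < mid_vol (1) < high_vol (2).
    """
    gmm = GaussianMixture(n_components=n_regimes, random_state=seed).fit(features)
    # Deterministic labelling: sort components by ascending mean VIX feature
    order = np.argsort(gmm.means_[:, vix_col])
    relabel = np.empty(n_regimes, dtype=int)
    relabel[order] = np.arange(n_regimes)
    labels = relabel[gmm.predict(features)]
    # First-order Markov transition matrix estimated from the label sequence
    trans = np.zeros((n_regimes, n_regimes))
    for a, b in zip(labels[:-1], labels[1:]):
        trans[a, b] += 1
    trans /= trans.sum(axis=1, keepdims=True)
    return gmm, labels, trans
```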

SFI — TCN Ensemble

Each of the five TCN ensemble members uses the following architecture:

  • Input: (T=252, C=2) — returns channel + integer-encoded regime channel
  • Backbone: 6 dilated causal convolutional blocks, dilation schedule [1, 2, 4, 8, 16, 32], giving a receptive field of 184 timesteps
  • Block structure: CausalConv1d → GroupNorm → ReLU → Dropout(0.1) → CausalConv1d → GroupNorm → ReLU → residual connection
  • Hidden channels: 64
  • Head: AdaptiveAvgPool1d(1) → Linear(64, 1) — binary logit
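One such block can be sketched in PyTorch as follows; kernel size 3 and 8 normalisation groups are assumptions, since the writeup does not state them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalBlock(nn.Module):
    """One dilated causal convolutional block (hidden width 64).

    Left-padding by (kernel - 1) * dilation keeps each convolution
    causal: the output at time t only sees inputs at times <= t.
    """

    def __init__(self, channels=64, dilation=1, kernel=3, p_drop=0.1):
        super().__init__()
        self.pad = (kernel - 1) * dilation
        self.conv1 = nn.Conv1d(channels, channels, kernel, dilation=dilation)
        self.norm1 = nn.GroupNorm(8, channels)
        self.drop = nn.Dropout(p_drop)
        self.conv2 = nn.Conv1d(channels, channels, kernel, dilation=dilation)
        self.norm2 = nn.GroupNorm(8, channels)

    def forward(self, x):                        # x: (B, C, T)
        h = F.pad(x, (self.pad, 0))              # causal left padding
        h = self.drop(torch.relu(self.norm1(self.conv1(h))))
        h = F.pad(h, (self.pad, 0))
        h = torch.relu(self.norm2(self.conv2(h)))
        return x + h                             # residual connection
```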

The five member logits are averaged after sigmoid activation. The resulting ensemble probability is passed through a Platt scaler (logistic regression fit on a held-out calibration set) to produce the final SFI. Training uses binary cross-entropy on a balanced dataset: 50,000 real sequences against 50,000 synthetic sequences generated by the six built-in models plus 12 degenerate generators.
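The calibration step can be sketched with scikit-learn's LogisticRegression standing in as the Platt scaler (illustrative, not the exact training code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(ensemble_probs, is_synthetic):
    """Fit a Platt scaler on held-out ensemble probabilities.

    ensemble_probs: (N,) mean of the five members' sigmoid outputs
    is_synthetic:   (N,) binary labels (1 = synthetic)
    """
    # Platt scaling = logistic regression on the logit of the raw score
    z = np.log(ensemble_probs / (1 - ensemble_probs)).reshape(-1, 1)
    return LogisticRegression().fit(z, is_synthetic)

def sfi(scaler, ensemble_probs):
    """Calibrated Synthetic Footprint Index in [0, 1]."""
    z = np.log(ensemble_probs / (1 - ensemble_probs)).reshape(-1, 1)
    return scaler.predict_proba(z)[:, 1]
```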

Statistical Test Suite

The test suite runs 7 tests per regime, comparing each sequence's statistics against benchmarks derived from real data in that regime:

| # | Test | Stylised Fact Probed |
|---|------|----------------------|
| 1 | Hill Tail Index | Heavy tails — real returns have Hill indices in [2, 5] |
| 2 | Jarque-Bera | Departure from Gaussianity (skewness + excess kurtosis) |
| 3 | Ljung-Box (raw) | Absence of linear autocorrelation in returns |
| 4 | Ljung-Box (squared) | Volatility clustering — autocorrelation in r² |
| 5 | ARCH-LM | Conditional heteroskedasticity (Engle 1982) |
| 6 | Leverage Effect | Asymmetric vol response to negative vs. positive returns |
| 7 | ACF of Absolute Returns | Long-memory in volatility — slow decay of autocorrelation |

Each test returns a pass/fail verdict and a quantitative deviation from the regime-specific benchmark. Results populate a regime × test heatmap in the UI. The leverage effect test uses a regime-stratified permutation bootstrap rather than a parametric null, because the empirical distribution of the leverage correlation differs substantially across regimes.
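As an example of what one of these tests computes, a standard Hill estimator for Test 1 might look like this (k, the number of tail order statistics, is a tuning choice; this is a sketch, not SynthGuard's exact implementation):

```python
import numpy as np

def hill_index(returns, k=50):
    """Hill estimator of the tail index over the top-k absolute returns.

    Real equity returns typically show values in roughly [2, 5];
    markedly higher values suggest the generator misses fat tails.
    """
    mags = np.sort(np.abs(returns))[::-1]    # largest magnitudes first
    top = mags[:k]                           # X_(1) >= ... >= X_(k)
    return 1.0 / np.mean(np.log(top[:-1] / top[-1]))
```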

Claude Integration

Claude (claude-sonnet-4-5) receives a structured context object containing the SFI distribution, the per-regime per-test statistical results, and the real-data benchmark baselines. A strict epistemic hierarchy is enforced in the system prompt:

  1. The SFI is the primary verdict. Claude does not override it.
  2. Statistical tests are explanatory evidence for what is driving the SFI — not co-equal signals.
  3. Claude's role is to synthesise and translate, not to independently adjudicate realism.

The five-section structured diagnosis covers: executive summary → dominant failure modes ranked by severity → SFI-statistics alignment discussion → regime-level breakdown → generator improvement recommendations. The diagnosis streams live to the UI via SSE, and a persistent follow-up chat session retains full diagnostic context.

Built-in Generators

| Generator | Class | Key Properties |
|-----------|-------|----------------|
| GBM | Classical SDE | Regime-conditional drift and vol; Gaussian; no memory |
| GARCH(1,1) | Classical time-series | Volatility clustering via ARCH dynamics |
| Heston | Stochastic vol SDE | Mean-reverting variance; built-in leverage effect |
| TFT | Deep learning | Attention-based; Mixture-of-Logistics output head |
| CVAE | Deep generative | Regime-conditioned latent space; non-Gaussian distributions |
| Score Diffusion | Deep generative | DDPM with classifier-free guidance on regime |

All six use the same autoregressive wrapper: a rolling 30-day window drives step-by-step generation, with regime sequences drawn from the trained Markov chain.
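Given the stored transition matrix, drawing a regime path for one synthetic sequence can be sketched as:

```python
import numpy as np

def sample_regime_path(trans, T=252, start=0, seed=0):
    """Draw a regime label path from the fitted first-order Markov chain.

    trans: (K, K) row-stochastic transition matrix (rows sum to 1)
    """
    rng = np.random.default_rng(seed)
    path = np.empty(T, dtype=int)
    path[0] = start
    for t in range(1, T):
        # Next regime depends only on the current one (first-order chain)
        path[t] = rng.choice(len(trans), p=trans[path[t - 1]])
    return path
```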

How We Built It

The full training pipeline is a single script that runs end-to-end in approximately 30–60 minutes. It fetches 20 years of S&P 500 constituent returns and VIX data from Yahoo Finance, fits the GMM regime segmenter and Markov chain, trains all six generators, builds a 100,000-sequence discriminator training set, trains five independent TCN ensemble members, fits the Platt calibration scaler on a held-out split, computes regime-stratified benchmark statistics from real data, and serialises all 16 artefacts.

The backend is FastAPI with a two-step POST→SSE pattern: every long-running operation returns a job_id immediately, and the client opens a native EventSource on /api/stream/{job_id} to receive typed events. PyTorch inference runs in a thread pool executor, with asyncio.run_coroutine_threadsafe providing thread-safe writes back to the async event loop.

The frontend uses three-wave progressive rendering in the audit view — SFI verdict first, then the statistical heatmap, then the streaming Claude diagnosis — so the most important result is never blocked on the slowest computation. Large CSV uploads are parsed off the main thread by a dedicated Web Worker.
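The thread-to-event-loop hand-off reduces to the following asyncio pattern (a toy stand-in for the real inference job and SSE plumbing):

```python
import asyncio

async def main():
    loop = asyncio.get_running_loop()
    events: asyncio.Queue = asyncio.Queue()

    def inference_job():
        # Runs in the default thread pool (stand-in for PyTorch inference).
        result = {"type": "sfi", "value": 0.12}
        # Thread-safe hand-off back into the async event loop's queue;
        # .result() blocks the worker until the put has completed.
        asyncio.run_coroutine_threadsafe(events.put(result), loop).result()

    loop.run_in_executor(None, inference_job)
    # In the real app, the SSE endpoint would stream events like this one.
    return await events.get()
```

Calling `asyncio.run(main())` returns the event dict produced by the worker thread.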

Challenges

  • Thread safety: PyTorch inference in a thread pool executor needs to write results back to async queues. Getting asyncio.run_coroutine_threadsafe right across concurrent jobs, with a multiplexed SSE endpoint routing events to the correct client connection, required careful design of the job store from the start.
  • SFI calibration: Raw ensemble probabilities from a TCN trained on real-world imbalanced data are systematically biased. Platt scaling corrected this, but required a calibration set constructed carefully to avoid distribution leakage from training. We also found that regime-conditioning the TCN input improved SFI reliability specifically in the high_vol regime — without it, the discriminator was too optimistic about stress-regime sequences from generators that looked good in low_vol.
  • Leverage effect test: No standard parametric null distribution fits the leverage correlation across all three regimes simultaneously. A regime-stratified permutation bootstrap was the only statistically defensible option, which made the leverage test the bottleneck in the statistical suite — directly motivating the three-wave UI design.

What's Next

The most important near-term extension is automatic regime labelling at inference time: currently users supply regime labels alongside their sequences. Auto-labelling via the trained GMM would remove this friction entirely for users whose generators do not natively produce regime metadata.

We also want to expose the SFI threshold as a configurable parameter in the filter tab, so users can tune the precision-recall trade-off when extracting low-SFI sequences for downstream use.

Longer term, the SFI framework generalises to other asset classes — FX, rates, credit spreads — and to conditional generation tasks beyond regime, such as conditioning on macro state or yield curve shape. The TCN architecture and Platt calibration pipeline are asset-class agnostic; what changes is the training data and the regime segmenter features. Extending SynthGuard to these settings would make the SFI a general-purpose realism benchmark for any domain where temporal sequences must pass both statistical and adversarial scrutiny simultaneously.
