SynthGuard
Inspiration
Synthetic financial data generation is becoming the norm. Quants routinely use generative AI to create synthetic equity returns for backtesting, stress testing, and privacy-preserving data sharing. But there is a massive flaw in how we evaluate these models: we test them on averages.
Real markets aren't average. They operate in distinct regimes, and market physics change depending on the state of the world:
- Volatility clustering dominates crisis periods, but vanishes in calm markets.
- Fat tails (black swan risks) grow significantly heavier under stress.
- Leverage effects (the asymmetric panic response) emerge strongly in high-volatility regimes, but disappear when markets are quiet.
The Problem: Current benchmarks test data unconditionally—collapsing all these regimes together. A generator might output sequences that look mathematically flawless on average, but completely fail to simulate the physics of a real market crash.
Furthermore, when existing tools do flag a failure, they just spit out a single pass/fail score. They offer zero insight into what structural property broke or where it happened. Neither approach gives a practitioner the actual information needed to fix their model.
We built SynthGuard to close that gap.
The Core Innovation: Synthetic Footprint Index (SFI)
The central contribution of SynthGuard is a novel metric we developed for this project: the Synthetic Footprint Index (SFI).
The SFI is a calibrated score in [0, 1] that quantifies how much of an artificial footprint a synthetic return sequence leaves behind under adversarial scrutiny:
- SFI → 0: The sequence is indistinguishable from real market data. No detectable synthetic footprint.
- SFI → 1: The sequence carries clear, learnable artefacts. A trained adversary can reliably identify it as machine-generated.
The SFI is produced by a five-member ensemble of Temporal Convolutional Networks (TCNs), each trained independently on 20 years of real equity return data. Raw ensemble probabilities are post-hoc calibrated via Platt scaling on a held-out validation set, converting an uncalibrated logit into a metrically grounded index. An SFI of 0.12, for example, genuinely means the sequence passes adversarial scrutiny ~88% of the time under the discriminator's learned representation of real market dynamics.
Critically, the SFI is regime-conditioned throughout. Each TCN receives a two-channel input — the return sequence and an integer-encoded regime label — so the discriminator judges realism relative to the market state the generator claims to be representing, not in aggregate. This is what separates the SFI from a plain discriminator score: a generator that produces realistic low-volatility sequences is not penalised for behaviour in high-volatility regimes it was never trained on, and a generator that looks good on average but collapses structurally under stress receives a high SFI in that regime rather than having its failure averaged away.
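The probability-to-SFI calibration step can be sketched as follows. This is a minimal illustration, not the project's code: the Beta-distributed "raw probabilities" stand in for a real held-out calibration set, and the function name `sfi` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a held-out calibration set: raw ensemble
# probabilities with labels 1 = synthetic sequence, 0 = real sequence.
rng = np.random.default_rng(0)
real_probs = rng.beta(2, 5, size=500)    # discriminator leans low on real data
synth_probs = rng.beta(5, 2, size=500)   # and high on synthetic data
probs = np.concatenate([real_probs, synth_probs])
labels = np.concatenate([np.zeros(500), np.ones(500)])

# Platt scaling: a 1-D logistic regression fit on the logit of the
# ensemble probability, mapping it to a calibrated score in [0, 1].
eps = 1e-6
clipped = np.clip(probs, eps, 1 - eps)
logits = np.log(clipped / (1 - clipped))
platt = LogisticRegression().fit(logits.reshape(-1, 1), labels)

def sfi(raw_prob: float) -> float:
    """Map a raw ensemble probability to a calibrated SFI in [0, 1]."""
    p = np.clip(raw_prob, eps, 1 - eps)
    return float(platt.predict_proba([[np.log(p / (1 - p))]])[0, 1])
```

Under this scheme a calibrated SFI of 0.12 really does correspond to a 12% chance the discriminator ensemble flags the sequence as synthetic, which is what gives the index its metric interpretation.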
The SFI is the primary verdict the system returns. Every other component — the statistical test suite, the Claude diagnostic — exists to explain what is driving it.
What SynthGuard Does
A user uploads a batch of synthetic return sequences (252 trading days each) alongside regime labels, and receives three independent signals synthesised into one diagnostic workflow:
- The SFI — a per-sequence Synthetic Footprint Index score and a batch-level summary, rendered immediately as a verdict banner and probability histogram.
- A 7-test statistical suite — run per regime, covering the full canonical set of financial stylised facts, displayed as a regime × test heatmap.
- A Claude-generated diagnosis — a streamed natural language report that interprets both signals, identifies dominant failure modes by regime, ranks them by severity, and provides specific generator improvement recommendations. Users can ask follow-up questions in a persistent chat session.
SynthGuard also ships a built-in generator panel (six models, from GBM to Score Diffusion) and a filter that extracts the subset of uploaded sequences with the lowest SFI scores — those that fool the discriminator — for use as hard negatives in adversarial training.
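The hard-negative extraction reduces to a few lines of numpy; the function name and the `k` parameter below are illustrative, not SynthGuard's actual API:

```python
import numpy as np

def hard_negatives(sequences: np.ndarray, sfi: np.ndarray, k: int = 100):
    """Keep the k sequences with the lowest SFI scores -- the ones that
    fool the discriminator -- for use as adversarial hard negatives."""
    idx = np.argsort(sfi)[:k]          # ascending SFI: most realistic first
    return sequences[idx], sfi[idx]
```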
Technical Architecture
┌─────────────────────────────────────────────────────────┐
│ React 18 SPA │
│ Audit Tab │ Generate Tab │ Filter Tab │ Settings │
│ CSV Upload (Web Worker) │ SSE Stream │ Dark Theme UI │
└────────────────────────┬────────────────────────────────┘
│ HTTP REST + SSE
┌────────────────────────▼────────────────────────────────┐
│ FastAPI 0.110 Backend │
│ /api/audit/batch │ /api/generate/* │ /api/filter/* │
│ /api/diagnosis/* │ /api/stream/{id} │ /api/files/* │
├──────────┬──────────┬──────────┬───────┴─────┬──────────┤
│ Model │ Stat │ TCN / │ Claude │ Job / │
│ Registry │ Tests │ SFI │ Proxy │ File Mgr │
│ (startup)│ (7 tests)│ Ensemble │ (streaming) │ │
└──────────┴──────────┴──────────┴─────────────┴──────────┘
│
┌────────────────────────▼────────────────────────────────┐
│ Trained Artifact Store (16 files) │
│ GMM · Markov · GBM/GARCH/Heston/TFT/CVAE/Diffusion │
│ TCN ×5 · Platt scaler · Regime benchmark stats │
└─────────────────────────────────────────────────────────┘
GMM Regime Segmenter
The regime model is a K=3 Gaussian Mixture Model fit on a feature vector derived from 20 years of VIX and equity data. Features include exponentially-weighted VIX, 21-day percentage change in VIX EMA, and 21-day change in the 10-year Treasury yield. Components are deterministically labelled by ascending mean VIX level: low_vol, mid_vol, high_vol. A first-order Markov chain is estimated from the label sequence and stored as a transition matrix, used by all six built-in generators to produce regime-consistent synthetic sequences.
SFI — TCN Ensemble
Each of the five TCN ensemble members uses the following architecture:
- Input: `(T=252, C=2)` — returns channel + integer-encoded regime channel
- Backbone: 6 dilated causal convolutional blocks, dilation schedule `[1, 2, 4, 8, 16, 32]`, giving a receptive field of 184 timesteps
- Block structure: `CausalConv1d → GroupNorm → ReLU → Dropout(0.1) → CausalConv1d → GroupNorm → ReLU → residual`
- Hidden channels: 64
- Head: `AdaptiveAvgPool1d(1) → Linear(64, 1)` — binary logit
The five member logits are averaged after sigmoid activation. The resulting ensemble probability is passed through a Platt scaler (logistic regression fit on a held-out calibration set) to produce the final SFI. Training uses binary cross-entropy on a balanced dataset: 50,000 real sequences against 50,000 synthetic sequences generated by the six built-in models plus 12 degenerate generators.
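A minimal PyTorch sketch of one ensemble member follows. The kernel size (3) and the 1×1 stem convolution are assumptions not stated above, so this sketch's receptive field will not match the production model's 184 timesteps; the block structure and head follow the list above:

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    """Conv1d that left-pads so output[t] depends only on inputs <= t."""
    def __init__(self, c_in, c_out, kernel_size, dilation=1):
        super().__init__(c_in, c_out, kernel_size, dilation=dilation)
        self._left_pad = (kernel_size - 1) * dilation

    def forward(self, x):
        return super().forward(nn.functional.pad(x, (self._left_pad, 0)))

class TCNBlock(nn.Module):
    """(CausalConv1d -> GroupNorm -> ReLU -> Dropout) x2 + residual skip."""
    def __init__(self, channels, dilation, kernel_size=3, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(channels, channels, kernel_size, dilation),
            nn.GroupNorm(8, channels), nn.ReLU(), nn.Dropout(dropout),
            CausalConv1d(channels, channels, kernel_size, dilation),
            nn.GroupNorm(8, channels), nn.ReLU(),
        )

    def forward(self, x):
        return x + self.net(x)

class TCNDiscriminator(nn.Module):
    """One ensemble member: 2-channel sequence in, binary logit out."""
    def __init__(self, hidden=64, dilations=(1, 2, 4, 8, 16, 32)):
        super().__init__()
        self.stem = nn.Conv1d(2, hidden, kernel_size=1)   # lift 2 -> 64 channels
        self.blocks = nn.Sequential(*[TCNBlock(hidden, d) for d in dilations])
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(hidden, 1))

    def forward(self, x):            # x: (batch, 2, 252)
        return self.head(self.blocks(self.stem(x)))
```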
Statistical Test Suite
The test suite runs 7 tests per regime, comparing each sequence's statistics against benchmarks derived from real data in that regime:
| # | Test | Stylised Fact Probed |
|---|---|---|
| 1 | Hill Tail Index | Heavy tails — real returns have Hill indices in [2, 5] |
| 2 | Jarque-Bera | Departure from Gaussianity (skewness + excess kurtosis) |
| 3 | Ljung-Box (raw) | Absence of linear autocorrelation in returns |
| 4 | Ljung-Box (squared) | Volatility clustering — autocorrelation in r² |
| 5 | ARCH-LM | Conditional heteroskedasticity (Engle 1982) |
| 6 | Leverage Effect | Asymmetric vol response to negative vs. positive returns |
| 7 | ACF of Absolute Returns | Long-memory in volatility — slow decay of autocorrelation |
Each test returns a pass/fail verdict and a quantitative deviation from the regime-specific benchmark. Results populate a regime × test heatmap in the UI. The leverage effect test uses a regime-stratified permutation bootstrap rather than a parametric null, because the empirical distribution of the leverage correlation differs substantially across regimes.
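To make the first row of the table concrete, here is a minimal Hill estimator sketch. The 2% tail fraction is an illustrative choice, not the project's setting; the point is that heavy-tailed (Student-t) samples land in the realistic band while Gaussian samples drift higher:

```python
import numpy as np

def hill_index(returns: np.ndarray, tail_frac: float = 0.02) -> float:
    """Hill estimator of the power-law tail index on the largest |returns|.
    Real equity returns typically score in roughly [2, 5]; lighter-tailed
    (Gaussian-like) data produces systematically larger estimates."""
    x = np.sort(np.abs(returns))[::-1]       # descending order statistics
    k = max(int(tail_frac * len(x)), 2)      # number of tail observations
    tail, threshold = x[:k], x[k]
    return 1.0 / np.mean(np.log(tail / threshold))

rng = np.random.default_rng(2)
heavy = rng.standard_t(df=3, size=100_000)   # true tail index ~3
light = rng.standard_normal(100_000)         # no power-law tail
```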
Claude Integration
Claude (claude-sonnet-4-5) receives a structured context object containing the SFI distribution, the per-regime per-test statistical results, and the real-data benchmark baselines. A strict epistemic hierarchy is enforced in the system prompt:
- The SFI is the primary verdict. Claude does not override it.
- Statistical tests are explanatory evidence for what is driving the SFI — not co-equal signals.
- Claude's role is to synthesise and translate, not to independently adjudicate realism.
The five-section structured diagnosis covers: executive summary → dominant failure modes ranked by severity → SFI-statistics alignment discussion → regime-level breakdown → generator improvement recommendations. The diagnosis streams live to the UI via SSE, and a persistent follow-up chat session retains full diagnostic context.
Built-in Generators
| Generator | Class | Key Properties |
|---|---|---|
| GBM | Classical SDE | Regime-conditional drift and vol; Gaussian; no memory |
| GARCH(1,1) | Classical time-series | Volatility clustering via ARCH dynamics |
| Heston | Stochastic vol SDE | Mean-reverting variance; built-in leverage effect |
| TFT | Deep learning | Attention-based; Mixture-of-Logistics output head |
| CVAE | Deep generative | Regime-conditioned latent space; non-Gaussian distributions |
| Score Diffusion | Deep generative | DDPM with classifier-free guidance on regime |
All six use the same autoregressive wrapper: a rolling 30-day window drives step-by-step generation, with regime sequences drawn from the trained Markov chain.
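A hypothetical sketch of that shared wrapper; the `step_fn` callback, its signature, and the zero-filled seed window are illustrative assumptions, not SynthGuard's internal API:

```python
import numpy as np

def generate_sequence(step_fn, markov_T, init_window, horizon=252, rng=None):
    """Autoregressive wrapper: a rolling 30-day window drives step-by-step
    generation, with the regime path drawn from the Markov transition matrix.
    step_fn(window, regime) -> next return."""
    rng = rng or np.random.default_rng()
    window = list(init_window[-30:])
    regime = int(rng.integers(0, 3))
    returns, regimes = [], []
    for _ in range(horizon):
        regime = int(rng.choice(3, p=markov_T[regime]))  # regime-consistent draw
        r = float(step_fn(np.asarray(window), regime))
        returns.append(r)
        regimes.append(regime)
        window = window[1:] + [r]                        # slide the 30-day window
    return np.array(returns), np.array(regimes)

# Example: a GBM-ish step that ignores the window, with regime-dependent vol.
uniform_T = np.full((3, 3), 1 / 3)
vols = [0.005, 0.01, 0.03]
gbm_step = lambda window, regime: np.random.default_rng().normal(0, vols[regime])
```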
How We Built It
The full training pipeline is a single script that runs end-to-end in approximately 30–60 minutes. It fetches 20 years of S&P 500 constituent returns and VIX data from Yahoo Finance, fits the GMM regime segmenter and Markov chain, trains all six generators, builds a 100,000-sequence discriminator training set, trains five independent TCN ensemble members, fits the Platt calibration scaler on a held-out split, computes regime-stratified benchmark statistics from real data, and serialises all 16 artefacts.

The backend is FastAPI with a two-step POST→SSE pattern: every long-running operation returns a `job_id` immediately, and the client opens a native EventSource on `/api/stream/{job_id}` to receive typed events. PyTorch inference runs in a thread pool executor, with `asyncio.run_coroutine_threadsafe` handling thread-safe writes back to the async event loop.

The frontend uses three-wave progressive rendering in the audit view — SFI verdict first, then the statistical heatmap, then the streaming Claude diagnosis — so the most important result is never blocked on the slowest computation. Large CSV uploads are parsed off the main thread by a dedicated Web Worker.
Challenges
- Thread safety: PyTorch inference in a thread pool executor needs to write results back to async queues. Getting `asyncio.run_coroutine_threadsafe` right across concurrent jobs, with a multiplexed SSE endpoint routing events to the correct client connection, required careful design of the job store from the start.
- SFI calibration: Raw ensemble probabilities from a TCN trained on real-world imbalanced data are systematically biased. Platt scaling corrected this, but required a calibration set constructed carefully to avoid distribution leakage from training. We also found that regime-conditioning the TCN input improved SFI reliability specifically in the `high_vol` regime — without it, the discriminator was too optimistic about stress-regime sequences from generators that looked good in `low_vol`.
- Leverage effect test: No standard parametric null distribution fits the leverage correlation across all three regimes simultaneously. A regime-stratified permutation bootstrap was the only statistically defensible option, which made the leverage test the bottleneck in the statistical suite — directly motivating the three-wave UI design.
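The thread-safety pattern above, reduced to its core: blocking inference runs off the event loop, and `asyncio.run_coroutine_threadsafe` schedules the write-back onto the loop's queue. `run_job` and the single queue are simplified stand-ins for the real multiplexed job store:

```python
import asyncio
import concurrent.futures

async def run_job(blocking_infer, payload):
    """Run blocking work in a thread pool; the worker thread pushes its
    result back onto the loop's async queue via run_coroutine_threadsafe."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def worker():
        result = blocking_infer(payload)     # e.g. PyTorch inference, off-loop
        future = asyncio.run_coroutine_threadsafe(
            queue.put({"type": "result", "data": result}), loop)
        future.result()                      # block until the event is enqueued

    with concurrent.futures.ThreadPoolExecutor() as pool:
        await loop.run_in_executor(pool, worker)
    return await queue.get()
```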
What's Next
The most important near-term extension is automatic regime labelling at inference time: currently users supply regime labels alongside their sequences. Auto-labelling via the trained GMM would remove this friction entirely for users whose generators do not natively produce regime metadata.
We also want to expose the SFI threshold as a configurable parameter in the filter tab, so users can tune the precision-recall trade-off when extracting low-SFI sequences for downstream use.
Longer term, the SFI framework generalises to other asset classes — FX, rates, credit spreads — and to conditional generation tasks beyond regime, such as conditioning on macro state or yield curve shape. The TCN architecture and Platt calibration pipeline are asset-class agnostic; what changes is the training data and the regime segmenter features. Extending SynthGuard to these settings would make the SFI a general-purpose realism benchmark for any domain where temporal sequences must pass both statistical and adversarial scrutiny simultaneously.
Built With
- claude-sonnet-4-5
- fastapi
- python
- pytorch
- react
- scikit-learn
- scipy
- statsmodels
- vite
- yfinance