A two-stage transformer pipeline for predicting price trends in futures candle data. A classifier detects whether the market is trending, then a regressor predicts the magnitude of the next high/low/close deltas during those windows -- filtering noise by only predicting when the market is moving with conviction.
- Deep Learning: PyTorch, transformer encoder architecture, custom positional encoding, learnable CLS token
- Optimization: AdamW, linear warmup, cosine annealing with warm restarts, early stopping
- Signal Processing: EWMA velocity/acceleration, asinh normalization, cyclical time encoding
- Infrastructure: Multi-GPU grid search and seed search via `multiprocessing.Pool` with per-process `CUDA_VISIBLE_DEVICES` isolation
- Data: Sliding-window dataset design, window-relative normalization, custom trend labeling heuristics
- Architecture: Transformer encoder with learnable CLS token projected through a linear head. 8 layers, 8 heads, 320-dim embeddings, 768-dim FFN. Same backbone for both tasks; output head differs (1 logit for classification, 3 for regression).
- Preprocessing: Custom trend detection via EWMA velocity/acceleration heuristics. Features include asinh-normalized diffs, cyclical time encoding, log-normalized raw values, and EWMA momentum signals. 15-dimensional feature vector per timestep.
- Training: AdamW with linear warmup + cosine annealing with warm restarts. Multi-GPU grid search and seed search for hyperparameter tuning.
- Data: Futures OHLCV, 90/10 train/test split, 480-step sliding windows with OHLC normalized relative to window start.
- Results: Classifier achieves 76.8% accuracy with a 13.3% false positive rate. Loss converges from ~0.75 to ~0.45 BCE.
main_interface.py -- Interactive entry point. Trains the full pipeline (p), classifier only (c), or regressor only (r). Debug mode enables NaN/Inf validation at every transformer layer. Graph mode generates trend overlays and loss curves.
main_finetune.py -- Multi-GPU hyperparameter search:
- `grid_search()` -- Sweeps LR ($5 \times 10^{-6}$ to $1 \times 10^{-5}$), weight decay ($0.003$ to $0.01$), and warmup steps ($7000$ to $12000$), dispatching each combination to a separate GPU.
- `seed_search()` -- Runs 10 random seeds for both tasks to measure initialization variance.
hyperparams.py -- Single source of truth for all hyperparameters. Finetuning scripts mutate this module at runtime before launching training.
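
The dispatch pattern described for main_finetune.py can be sketched as follows. This is a minimal illustration, not the project's code: `GRID`, `expand_grid`, `init_worker`, `run_job`, and `grid_search_sketch` are hypothetical names, the sweep values are just the endpoints of the quoted ranges, and the training call is a placeholder. It assumes each pool worker claims one GPU id at startup and keeps it for its lifetime.

```python
import itertools
import os
from multiprocessing import get_context

# Hypothetical sweep: endpoints of the ranges quoted in this README.
GRID = {
    "LEARNING_RATE": [5e-6, 1e-5],
    "WEIGHT_DECAY": [0.003, 0.01],
    "WARMUP": [7000, 12000],
}


def expand_grid(grid):
    """Return one {name: value} dict per combination in the grid."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in itertools.product(*(grid[k] for k in keys))]


def init_worker(gpu_queue):
    """Pool initializer: claim one GPU id for the lifetime of this worker.

    The variable must be set before CUDA is initialised in the process.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_queue.get())


def run_job(overrides):
    """Placeholder for the real work (assumption).

    A real worker would apply each override to the hyperparams module,
    e.g. setattr(hyperparams, name, value), then train and return a score.
    """
    return os.environ.get("CUDA_VISIBLE_DEVICES"), overrides


def grid_search_sketch(grid, n_gpus):
    # "spawn" starts each worker in a fresh interpreter, so no CUDA state
    # is inherited from the parent process.
    ctx = get_context("spawn")
    gpu_queue = ctx.Queue()
    for gpu_id in range(n_gpus):
        gpu_queue.put(gpu_id)
    with ctx.Pool(processes=n_gpus, initializer=init_worker,
                  initargs=(gpu_queue,)) as pool:
        return pool.map(run_job, expand_grid(grid), chunksize=1)
```

Call `grid_search_sketch(GRID, n_gpus=2)` from under an `if __name__ == "__main__":` guard, which the spawn start method requires. Binding the GPU to the worker rather than to the job keeps two concurrent jobs from landing on the same device.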
Raw OHLCV + datetime data is expanded into a 15-dimensional feature vector per timestep:
- OHLC diffs: First-order differences for open, high, low, close
- Asinh-normalized diffs: $\text{asinh}(\Delta h), \text{asinh}(\Delta l), \text{asinh}(\Delta c)$ -- compresses large moves without killing the signal the way standard normalization can
- Log-normalized raw values: $\log(1 + \text{open}), \log(1 + \text{volume})$ -- scale-invariant price and volume representations
- Volume percent change: Relative volume shift between consecutive candles
- Cyclical time encoding: $\sin(2\pi t / 86400), \cos(2\pi t / 86400)$ -- encodes time-of-day without midnight discontinuity
- EWMA velocity and acceleration: Smoothed first and second derivatives of close price diffs ($\alpha = 0.3$)
- Trend mask: Binary label from the custom trend detection algorithm
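
A few of these features can be sketched in plain Python. The helper names (`diffs`, `ewma`, `time_of_day`) and the sample prices are hypothetical. The sketch assumes $t$ is seconds since midnight and that acceleration is the EWMA of the differenced velocity, which is one reading of "smoothed second derivative".

```python
import math


def diffs(values):
    """First-order differences; the first element has no predecessor, so use 0."""
    return [0.0] + [b - a for a, b in zip(values, values[1:])]


def ewma(values, alpha=0.3):
    """Exponentially weighted moving average, seeded with the first value."""
    out, state = [], values[0]
    for v in values:
        state = alpha * v + (1 - alpha) * state
        out.append(state)
    return out


def time_of_day(seconds_since_midnight):
    """Map time-of-day onto the unit circle so 23:59 and 00:00 are neighbours."""
    angle = 2 * math.pi * seconds_since_midnight / 86400
    return math.sin(angle), math.cos(angle)


close = [100.0, 100.5, 100.25, 103.0, 102.0]   # made-up prices
d_close = diffs(close)
asinh_close = [math.asinh(d) for d in d_close]  # compresses the 2.75 jump
velocity = ewma(d_close)                        # smoothed first derivative
acceleration = ewma(diffs(velocity))            # smoothed second derivative (assumed form)
```

Because asinh is invertible, `math.sinh(asinh_close[i])` recovers the original diff, so no information is lost by the compression.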
Each 480-step window has its OHLC values normalized relative to the window's first candle.
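
The exact normalization formula is not reproduced here, so the following is one plausible form and not necessarily the project's: every OHLC value becomes a fractional change from the first candle's open. `sliding_windows` and `normalize_window` are hypothetical helpers, and candles are assumed to be `(open, high, low, close)` tuples.

```python
def sliding_windows(candles, window_size=480, stride=1):
    """Yield every full window of `window_size` consecutive candles."""
    for start in range(0, len(candles) - window_size + 1, stride):
        yield candles[start:start + window_size]


def normalize_window(window):
    """Express each OHLC value as a fractional change from the first open.

    Assumed form: price / first_open - 1.
    """
    base = window[0][0]
    return [tuple(price / base - 1.0 for price in candle) for candle in window]


# Tiny made-up example with a 3-candle window instead of 480.
candles = [
    (100.0, 101.0, 99.0, 100.5),
    (100.5, 102.0, 100.0, 101.0),
    (101.0, 103.0, 100.5, 102.0),
    (102.0, 104.0, 101.0, 103.0),
]
windows = list(sliding_windows(candles, window_size=3))
normalized = normalize_window(windows[0])
```

Under this form, the same price pattern at ten times the price level yields the same normalized window, which is the property that keeps the model from keying on absolute price.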
The trend classifier in preprocess/classifier.py is a hand-crafted heuristic -- no ML:
- Compute non-overlapping 5-candle average velocities
- Roll a 24-label window ($120$ candles) of cumulative absolute movement
- Track EWMA velocity/acceleration across label windows
- Trend starts when velocity-to-window ratio exceeds gate ($0.012$) or velocity/acceleration disagree in sign
- Trend confirmed after 3 steps only if acceleration aligns with velocity and exceeds $12\%$ of velocity magnitude
- Trend ends when both velocity and acceleration flip sign
This produces a binary mask that the classifier learns to replicate and the regressor uses to select training windows.
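
The building blocks of the heuristic can be sketched as below. This is a partial illustration under stated assumptions, not the code in preprocess/classifier.py: the function names are hypothetical, a block's velocity is assumed to be the mean close diff over 5 candles, and the start gate and state machine are left out because their exact definitions are not given here.

```python
def block_velocities(close_diffs, block=5):
    """Mean close diff over consecutive non-overlapping blocks of `block` candles.

    Trailing candles that do not fill a whole block are dropped.
    """
    n_blocks = len(close_diffs) // block
    return [sum(close_diffs[i * block:(i + 1) * block]) / block
            for i in range(n_blocks)]


def rolling_abs_movement(velocities, span=24):
    """Sum of |velocity| over the trailing `span` labels (fewer at the start).

    With 5-candle blocks, span=24 covers 24 * 5 = 120 candles.
    """
    return [sum(abs(v) for v in velocities[max(0, i - span + 1):i + 1])
            for i in range(len(velocities))]


def confirms(velocity, acceleration, min_ratio=0.12):
    """Confirmation test: acceleration shares the velocity's sign and
    exceeds 12% of the velocity's magnitude."""
    same_sign = velocity * acceleration > 0
    return same_sign and abs(acceleration) > min_ratio * abs(velocity)
```

For example, `confirms(1.0, 0.2)` passes, while `confirms(1.0, 0.1)` fails the magnitude test and `confirms(1.0, -0.5)` fails the sign test.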
| Component | Specification |
|---|---|
| Input projection | `Linear(15, 320)` with `LayerNorm(15)` |
| Positional encoding | Sinusoidal, max length 481 (window + CLS) |
| CLS token | Learnable, truncated normal init |
| Transformer encoder | 8 layers, 8 heads, 768 FFN dim, 0.1 dropout |
| Output head | `Linear(320, 1)` for classifier, `Linear(320, 3)` for regressor |
| Loss | `BCEWithLogitsLoss` (classifier), `MSELoss` (regressor) |
The regressor predicts asinh-scaled high/low/close deltas, inverted with $\sinh$.

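
A compact PyTorch sketch of a model with these dimensions follows. It is an illustration, not the contents of pipeline/model_t.py: the class name `CLSTransformerSketch` is hypothetical, applying `LayerNorm` before the projection is an assumption, and the CLS init standard deviation of 0.02 is a placeholder because the value is not given here.

```python
import math

import torch
import torch.nn as nn


class CLSTransformerSketch(nn.Module):
    def __init__(self, n_features=15, d_model=320, n_heads=8, n_layers=8,
                 d_ff=768, dropout=0.1, window=480, n_outputs=1):
        super().__init__()
        # Assumed order: normalize the raw features, then project to d_model.
        self.input_norm = nn.LayerNorm(n_features)
        self.input_proj = nn.Linear(n_features, d_model)
        # Learnable CLS token; std=0.02 is a placeholder (assumption).
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        nn.init.trunc_normal_(self.cls, std=0.02)
        # Fixed sinusoidal table for window + 1 positions (CLS included).
        pos = torch.arange(window + 1, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                        * (-math.log(10000.0) / d_model))
        pe = torch.zeros(window + 1, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=d_ff,
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # n_outputs=1 for the classifier logit, 3 for the regressor.
        self.head = nn.Linear(d_model, n_outputs)

    def forward(self, x):
        # x: (batch, window, n_features)
        h = self.input_proj(self.input_norm(x))
        cls = self.cls.expand(h.size(0), -1, -1)
        h = torch.cat([cls, h], dim=1)           # prepend CLS
        h = h + self.pe[:, :h.size(1)]
        h = self.encoder(h)
        return self.head(h[:, 0])                # read out the CLS position
```

With `n_outputs=1` the output is a single logit suited to `BCEWithLogitsLoss`. With `n_outputs=3` it gives three values suited to `MSELoss`, which can be mapped back to raw deltas with `torch.sinh`.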
Linear warmup for 9,000 steps, then cosine annealing with warm restarts ($T_0 = 10000$, $T_{\text{mult}} = 2$).
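
The shape of this schedule can be written as a pure function of the step count. This is a sketch under assumptions the README does not confirm: warmup starts from zero, the minimum learning rate is zero, and the first cosine period begins when warmup ends. `lr_at` is a hypothetical name; the defaults come from hyperparams.py.

```python
import math


def lr_at(step, peak_lr=6e-6, warmup=9000, t0=10000, t_mult=2):
    """Learning rate at a given optimizer step."""
    if step < warmup:
        # Linear ramp from 0 to peak_lr (assumed to start at zero).
        return peak_lr * step / warmup
    # Position within the current cosine period, counted from end of warmup.
    t = step - warmup
    period = t0
    while t >= period:
        t -= period
        period *= t_mult   # each restart lengthens the period
    # Cosine decay from peak_lr toward 0 (assumed minimum) within the period.
    return 0.5 * peak_lr * (1 + math.cos(math.pi * t / period))
```

Under these assumptions the rate peaks at step 9,000, decays to near zero just before step 19,000, restarts at the peak, and then decays over a 20,000-step period. PyTorch provides `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts` with `T_0` and `T_mult` arguments. Whether the project uses it or a custom scheduler is not stated here.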
- Accuracy: 76.8% on the held-out test set
- False positive rate: 13.3% -- rarely calls "trending" when flat; critical for downstream use since false positives trigger the regressor on noise
- False negative rate: 37.5% -- misses more than a third of trends, but missing a trade is cheaper than entering a bad one
- Validation loss: 0.481 (BCE)
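
For reference, the three rates relate to confusion-matrix counts as shown below. The function names are hypothetical. The last line is a consistency check, not a reported result: it solves accuracy $= p(1 - \text{FNR}) + (1 - p)(1 - \text{FPR})$ for the share $p$ of trending windows implied by the figures above.

```python
def rates(tp, fp, tn, fn):
    """Accuracy, false positive rate, and false negative rate from counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "fpr": fp / (fp + tn),   # flat windows wrongly called trending
        "fnr": fn / (fn + tp),   # trending windows that were missed
    }


def implied_positive_share(accuracy, fpr, fnr):
    """Share of trending windows p implied by
    accuracy = p * (1 - fnr) + (1 - p) * (1 - fpr)."""
    return ((1 - fpr) - accuracy) / ((1 - fpr) - (1 - fnr))


share = implied_positive_share(accuracy=0.768, fpr=0.133, fnr=0.375)
```

With the reported numbers, `share` comes out near 0.41. If the three metrics are computed on the same test set, roughly two in five test windows carry the trend label.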
- Two-stage over direct prediction: A single model predicting every tick learns to output "roughly the same as last close" since that minimizes MSE in a noisy series. The classifier/regressor split forces the system to first identify signal, then predict magnitude.
- Asinh over standard normalization: Under standard normalization, a few large price moves inflate the scale and squash typical moves toward zero. Asinh compresses outliers while preserving their relative magnitude.
- Window-relative OHLC normalization: Without it, the model overfits to absolute price levels -- performs well on one price range, fails on another.
- Warm restarts for non-stationary data: Financial data changes regime. Periodic LR resets prevent the model from getting stuck in optima that were good for earlier market conditions.
- Seed search validation: 10-seed sweep confirmed classifier results are stable ($\pm 2\%$ accuracy variance).
- Heuristic labels: The classifier learns from hand-tuned labels, not ground truth. A hidden Markov model or regime detection approach would be more principled.
- No feature ablation: The 15 features were chosen via intuition and iteration. Likely redundant features exist, and useful ones are missing (order book data, inter-market correlations).
- 480-step window cost: Large context window limits batch size on smaller GPUs. A hierarchical architecture compressing local patterns before full-window attention would be more efficient.
All hyperparameters live in hyperparams.py:

```python
WINDOW_SIZE = 480      # Sliding window length
BATCH_SIZE = 64
LEARNING_RATE = 6e-6   # Peak LR
WEIGHT_DECAY = 0.05    # AdamW decoupled weight decay
WARMUP = 9000          # Linear warmup steps
T0 = 10000             # Initial cosine period
T_MULT = 2             # Period multiplier per restart
VEL_ALPHA = 0.3        # EWMA smoothing for velocity
ACCEL_ALPHA = 0.3      # EWMA smoothing for acceleration
```

```
Candle-Predictor/
    main_interface.py      # Interactive training entry point
    main_pipeline.py       # Core train/eval orchestration
    main_preprocess.py     # Raw data -> feature-engineered CSV
    main_finetune.py       # Multi-GPU grid search and seed search
    hyperparams.py         # Central hyperparameter config
    pipeline/
        model_t.py         # TransformerBCE architecture
        train.py           # Training loop, evaluation, early stopping
        loader.py          # WindowLoader sliding-window dataset
    preprocess/
        classifier.py      # EWMA trend detection heuristic
    graphing/
        graph.py           # Visualization utilities
    graphs/                # Generated training plots
    docs/
        images/            # README figures
```






