Stock Forecasting Training Workspace

This repository contains a stock forecasting experimentation workspace built around a reusable Python training pipeline. It supports multiple model families, time-based evaluation, experiment tracking, champion model retraining, and live scoring against the latest available data snapshot.

The code is organized under stock_forecasting/, with supporting data files in data/, generated outputs in artifacts/, and analysis notebooks in the repository root.

Supported Models

The training pipeline currently supports the following model options:

lightgbm: Gradient-boosted decision tree baseline for structured lagged features.
lstm: Recurrent neural network with separate price and news feature streams.
timexer: Transformer-style sequence model for price data with optional exogenous news features.
hf_patchtst: Hugging Face PatchTST-based time-series model.

Supported Tasks and Horizons

Tasks:
- regression
- classification

Default forecast horizons:
- 1 trading day
- 5 trading days
- 21 trading days

The default training configuration is oriented toward regression experiments with walk-forward evaluation.

Data Inputs

By default, the workspace expects the following input files:

Price data: data/dates_on_left_stock_data.csv
News and sentiment data: data/news_all_sentiment.csv

When price_news mode is used, news features are aggregated by ticker and date, then shifted by --news-lag-days to reduce leakage risk when precise article timestamps are not available.
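The lagging step can be sketched in plain Python as follows. This is an illustration only, not the repository's code: the function name lag_news_features, the (ticker, date) key layout, and the choice to count the lag in rows of available dates per ticker (rather than calendar days) are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Hypothetical layout: one aggregated feature value per (ticker, ISO date).
NewsTable = Dict[Tuple[str, str], float]


def lag_news_features(news: NewsTable, lag_days: int) -> NewsTable:
    """Shift each ticker's aggregated news feature forward by `lag_days` rows.

    Assumption: the lag counts rows of available dates per ticker, so the
    value observed on date t is attached to the date `lag_days` rows later.
    """
    if lag_days < 0:
        raise ValueError("lag_days must be non-negative")

    by_ticker = defaultdict(list)
    for (ticker, date), value in news.items():
        by_ticker[ticker].append((date, value))

    lagged: NewsTable = {}
    for ticker, rows in by_ticker.items():
        rows.sort()  # ISO dates sort chronologically as strings
        dates = [d for d, _ in rows]
        values = [v for _, v in rows]
        # The first `lag_days` dates have no lagged value and are dropped.
        for i in range(lag_days, len(rows)):
            lagged[(ticker, dates[i])] = values[i - lag_days]
    return lagged
```

With a lag of 1, a model predicting from date t sees only news aggregated up to the previous available date, which is the leakage guard described above.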

Evaluation Approach

The training pipeline uses time-based splits only. Random shuffling is not used for model evaluation.

Supported evaluation modes:

walkforward (default)
holdout
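As a rough illustration of what a walk-forward split looks like, the sketch below yields consecutive test blocks that each follow their training dates. The function name, the n_folds and test_size parameters, and the expanding (rather than rolling) training window are assumptions; the repository's own fold construction may differ.

```python
from typing import Iterator, List, Sequence, Tuple


def walk_forward_splits(
    dates: Sequence[str], n_folds: int, test_size: int
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train_dates, test_dates) pairs with an expanding training window.

    Every test block lies strictly after its training dates, so a model is
    never evaluated on observations older than the data it was fitted on.
    """
    ordered = sorted(set(dates))  # ISO date strings sort chronologically
    first_test_start = len(ordered) - n_folds * test_size
    if first_test_start <= 0:
        raise ValueError("not enough dates for the requested folds")
    for fold in range(n_folds):
        start = first_test_start + fold * test_size
        yield ordered[:start], ordered[start:start + test_size]
```

A holdout split corresponds to the special case of a single fold at the end of the date range.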

For regression runs, the reported metrics include:

MAE
RMSE
directional accuracy
Spearman information coefficient
Pearson correlation
mean daily Spearman information coefficient
top-bottom decile spread
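For readers unfamiliar with the ranking-oriented metrics, here is a minimal pure-Python sketch of three of them. The exact definitions used by the pipeline are not stated in this README, so the conventions below (zero counted as a non-positive direction, average ranks for ties, deciles of size len // 10) are assumptions.

```python
import math
from typing import List, Sequence


def directional_accuracy(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Fraction of rows where prediction and realized value share a direction.

    Assumption: a value of exactly zero is grouped with negative values.
    """
    hits = sum(1 for t, p in zip(y_true, y_pred) if (t > 0) == (p > 0))
    return hits / len(y_true)


def _average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def _pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_ic(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Spearman information coefficient: Pearson correlation of the ranks."""
    return _pearson(_average_ranks(y_true), _average_ranks(y_pred))


def top_bottom_decile_spread(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean realized value of the top-predicted decile minus the bottom decile."""
    n = max(1, len(y_pred) // 10)
    order = sorted(range(len(y_pred)), key=lambda i: y_pred[i])
    bottom = [y_true[i] for i in order[:n]]
    top = [y_true[i] for i in order[-n:]]
    return sum(top) / n - sum(bottom) / n
```

Under the same assumptions, the mean daily Spearman information coefficient would be obtained by calling spearman_ic once per date across tickers and averaging the results.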

For classification runs, the reported metrics include:

accuracy
balanced accuracy
precision
recall
F1 score
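Balanced accuracy is worth a short illustration because it differs from plain accuracy when classes are imbalanced. The sketch below uses the standard definition (mean of per-class recall); the function name and the integer labels in the usage note are illustrative assumptions, not taken from the repository.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recall, so every class carries equal weight."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)
```

For example, predicting class 1 for every row of y_true = [1, 1, 1, 1, 0] gives a plain accuracy of 0.8 but a balanced accuracy of 0.5, because recall on class 0 is zero.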

Installation

Install the required Python packages with:

pip install -r requirements-training.txt

The repository does not currently define a packaged installation step; the command-line examples below assume you are running from the repository root.

Training a Single Experiment

Examples:

python -m stock_forecasting.train --model lightgbm --modalities price_news --horizon 1
python -m stock_forecasting.train --model lstm --modalities price_news --horizon 5
python -m stock_forecasting.train --model timexer --modalities price_news --horizon 21
python -m stock_forecasting.train --model hf_patchtst --modalities price_news --horizon 21 --lookback 60

Useful training arguments include:

--model: lightgbm, lstm, timexer, or hf_patchtst
--task: regression or classification
--modalities: price or price_news
--horizon: 1, 5, or 21
--lookback: sequence length used for feature construction
--news-lag-days: lag applied to aggregated news features
--eval-mode: walkforward or holdout

Running Experiment Grids

To run a Python-managed experiment grid:

python -m stock_forecasting.run_experiments --task regression

This command writes per-run outputs under artifacts/ and also produces consolidated experiment summaries.

Shell-Based Training Sweeps

The repository also includes Bash helpers for broader experiment sweeps:

run_training_suite.sh
run_transformer_suite.sh

Example usage on a Unix-like shell:

bash run_training_suite.sh
PROFILE=quick bash run_training_suite.sh
PROFILE=transformers bash run_training_suite.sh
bash run_transformer_suite.sh

The default run_training_suite.sh profile is competition, which expands to a broader sweep across:

models
horizons
modalities
news lags
seeds
horizon-specific lookback windows

Additional environment variables supported by the script include:

MODELS
HORIZONS
MODALITIES
TASKS
NEWS_LAGS
SEEDS
LOOKBACKS
SKIP_EXISTING
INSTALL_DEPS

On Windows, the Python module entry points are the more portable option unless you are running these scripts inside a Bash-compatible environment.
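Where Bash is not available, a sweep can be approximated from Python by enumerating combinations and calling the training module once per combination. The sketch below only builds the argument lists and uses only flags documented in this README; the function name build_sweep_commands is hypothetical, and seeds are left out because no seed flag is listed here.

```python
import itertools
import sys
from typing import List, Sequence


def build_sweep_commands(
    models: Sequence[str],
    horizons: Sequence[int],
    modalities: Sequence[str],
    news_lags: Sequence[int],
) -> List[List[str]]:
    """Return one argv list per combination of the given sweep dimensions."""
    commands = []
    for model, horizon, modality, lag in itertools.product(
        models, horizons, modalities, news_lags
    ):
        commands.append([
            sys.executable, "-m", "stock_forecasting.train",
            "--model", model,
            "--horizon", str(horizon),
            "--modalities", modality,
            "--news-lag-days", str(lag),
        ])
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True) from the repository root. This does not reproduce the profiles or SKIP_EXISTING behavior of run_training_suite.sh.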

Experiment Artifacts

Each experiment run is written to its own directory under artifacts/. Depending on model type, outputs include:

config.json
feature_metadata.json
summary.csv
summary.json
per-fold test_predictions.csv
model files:
- model.txt for LightGBM
- model.pt and standardizer.npz for neural models

The run naming convention is derived from model, horizon, modality, task, lookback, news lag, evaluation mode, and seed.
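The pattern below is inferred from the single example run name that appears in this README (lightgbm_h1_price_news_regression_lb30_lag1_walkforward_s7). It is a sketch for locating run directories by hand; the function name and the exact prefixes (h, lb, lag, s) are assumptions drawn from that one example, and the pipeline's own formatting code may differ.

```python
def build_run_name(
    model: str,
    horizon: int,
    modalities: str,
    task: str,
    lookback: int,
    news_lag_days: int,
    eval_mode: str,
    seed: int,
) -> str:
    """Compose a run name in the order: model, horizon, modality, task,
    lookback, news lag, evaluation mode, seed (pattern inferred, not confirmed)."""
    return (
        f"{model}_h{horizon}_{modalities}_{task}"
        f"_lb{lookback}_lag{news_lag_days}_{eval_mode}_s{seed}"
    )
```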

Champion Model Retraining

After reviewing experiment results, you can retrain a selected configuration into a deployable artifact under champions/.

Examples:

python -m stock_forecasting.train_champion --run-name lightgbm_h1_price_news_regression_lb30_lag1_walkforward_s7
python -m stock_forecasting.train_champion --task regression

Champion artifacts include:

config.json
feature_metadata.json
selection.json
champion_summary.json
validation_predictions.csv
model file outputs appropriate to the selected model family

Live Scoring

Use a saved champion artifact to score the latest available data snapshot:

python -m stock_forecasting.predict_live --artifact-dir champions/YOUR_CHAMPION_NAME

Optional date override:

python -m stock_forecasting.predict_live --artifact-dir champions/YOUR_CHAMPION_NAME --as-of-date 2024-11-20

The scoring command writes a ranked CSV file inside the selected champion directory by default.

Notebooks

The repository includes analysis notebooks in the project root, including:

compare_results.ipynb
horizon_evaluation_report.ipynb

These notebooks are intended for post-training comparison and reporting based on the contents of artifacts/.
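A comparison notebook typically starts by gathering per-run summaries. The sketch below shows one way to do that, assuming each run directory sits directly under artifacts/ and contains a summary.json file as listed under Experiment Artifacts. The function name is hypothetical, and the structure of summary.json is not documented here, so the contents are returned as parsed without interpretation.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_run_summaries(artifacts_dir: str = "artifacts") -> Dict[str, Any]:
    """Map each run directory name to the parsed contents of its summary.json.

    Run directories without a summary.json file are skipped.
    """
    summaries: Dict[str, Any] = {}
    for path in sorted(Path(artifacts_dir).glob("*/summary.json")):
        with path.open(encoding="utf-8") as handle:
            summaries[path.parent.name] = json.load(handle)
    return summaries
```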

Repository Structure

FinCoders/
|-- artifacts/
|-- data/
|-- stock_forecasting/
|-- compare_results.ipynb
|-- horizon_evaluation_report.ipynb
|-- requirements-training.txt
|-- run_training_suite.sh
|-- run_transformer_suite.sh
`-- README.md

Notes

The current implementation uses aggregated news and sentiment features rather than raw-text language model embeddings.
GPU execution is used when --device cuda is set (the command-line default) and torch detects CUDA; otherwise the training code automatically falls back to CPU.
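The device fallback can be pictured with the sketch below. It is not the repository's code: resolve_device is a hypothetical helper, and the only library call it relies on is torch.cuda.is_available(). The torch import is done lazily so the sketch also runs where torch is not installed.

```python
def resolve_device(requested: str = "cuda") -> str:
    """Return "cuda" only when it is requested and usable; otherwise fall back.

    Any request other than "cuda" is returned unchanged.
    """
    if requested != "cuda":
        return requested
    try:
        import torch  # imported lazily; absence of torch means CPU only
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```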
