This repository contains a stock forecasting experimentation workspace built around a reusable Python training pipeline. It supports multiple model families, time-based evaluation, experiment tracking, champion model retraining, and live scoring against the latest available data snapshot.
The code is organized under stock_forecasting/, with supporting data files in data/, generated outputs in artifacts/, and analysis notebooks in the repository root.
The training pipeline currently supports the following model options:
- lightgbm: Gradient-boosted decision tree baseline for structured lagged features.
- lstm: Recurrent neural network with separate price and news feature streams.
- timexer: Transformer-style sequence model for price data with optional exogenous news features.
- hf_patchtst: Hugging Face PatchTST-based time-series model.
- Tasks:
  - regression
  - classification
- Default forecast horizons:
  - 1 trading day
  - 5 trading days
  - 21 trading days
The default training configuration is oriented toward regression experiments with walk-forward evaluation.
By default, the workspace expects the following input files:
- Price data: data/dates_on_left_stock_data.csv
- News and sentiment data: data/news_all_sentiment.csv
When price_news mode is used, news features are aggregated by ticker and date, then shifted by --news-lag-days to reduce leakage risk when precise article timestamps are not available.
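The aggregate-then-lag step can be sketched in plain Python. This is an illustration only, not the pipeline's code: the function name `aggregate_and_lag_news`, the `(ticker, date, sentiment)` record layout, the use of a mean as the aggregate, and the calendar-day shift are all assumptions. The real pipeline may aggregate differently or shift by trading days.

```python
from collections import defaultdict
from datetime import date, timedelta


def aggregate_and_lag_news(records, news_lag_days=1):
    """Average sentiment per (ticker, date), then shift it forward by the lag.

    `records` is an iterable of (ticker, date, sentiment) tuples; this layout
    is hypothetical. The result maps (ticker, feature_date) to the mean
    sentiment of articles dated `news_lag_days` calendar days earlier.
    """
    buckets = defaultdict(list)
    for ticker, day, sentiment in records:
        buckets[(ticker, day)].append(sentiment)

    lagged = {}
    for (ticker, day), values in buckets.items():
        # The feature only becomes visible `news_lag_days` after publication.
        feature_date = day + timedelta(days=news_lag_days)
        lagged[(ticker, feature_date)] = sum(values) / len(values)
    return lagged


records = [
    ("AAPL", date(2024, 11, 18), 0.5),
    ("AAPL", date(2024, 11, 18), -0.25),
    ("MSFT", date(2024, 11, 18), 0.75),
]
features = aggregate_and_lag_news(records, news_lag_days=1)
```

With a lag of one day, news dated 2024-11-18 only appears as a feature for 2024-11-19, so a model predicting from the 2024-11-18 row cannot see articles that may have been published after that day's close.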
The training pipeline uses time-based splits only. Random shuffling is not used for model evaluation.
Supported evaluation modes:
- walkforward (default)
- holdout
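The two evaluation modes can be illustrated with a small date-splitting sketch. The helper names, the expanding training window, and the fold and test sizes below are assumptions chosen for illustration; the pipeline's actual fold construction may differ (for example, it could use a rolling window or a gap between train and test).

```python
def walk_forward_splits(dates, n_folds=3, test_size=2):
    """Yield (train_dates, test_dates) pairs with an expanding train window.

    Every test block comes strictly after its training window, so no future
    observation is used for fitting.
    """
    ordered = sorted(set(dates))
    first_test_start = len(ordered) - n_folds * test_size
    if first_test_start <= 0:
        raise ValueError("not enough dates for the requested folds")
    for fold in range(n_folds):
        start = first_test_start + fold * test_size
        yield ordered[:start], ordered[start:start + test_size]


def holdout_split(dates, test_fraction=0.2):
    """Split once: the most recent dates form the test set."""
    ordered = sorted(set(dates))
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]


# ISO-formatted date strings sort chronologically.
dates = [f"2024-01-{day:02d}" for day in range(1, 11)]
splits = list(walk_forward_splits(dates, n_folds=3, test_size=2))
holdout_train, holdout_test = holdout_split(dates, test_fraction=0.2)
```

In both modes the split is made on dates rather than rows, so all tickers observed on the same day land on the same side of the boundary.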
For regression runs, the reported metrics include:
- MAE
- RMSE
- directional accuracy
- Spearman information coefficient
- Pearson correlation
- mean daily Spearman information coefficient
- top-bottom decile spread
For classification runs, the reported metrics include:
- accuracy
- balanced accuracy
- precision
- recall
- F1 score
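The classification metrics can be sketched in the same spirit. The example below assumes binary labels with 1 as the positive class (for instance, an up move); the function name and that label encoding are hypothetical, and the pipeline may rely on a library implementation instead.

```python
def binary_classification_metrics(y_true, y_pred):
    """Compute the listed metrics for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Balanced accuracy averages the recall of both classes, so it is
        # not inflated when one class dominates.
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = binary_classification_metrics(y_true, y_pred)
```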
Install the required Python packages with:
pip install -r requirements-training.txt
The repository does not currently define a packaged installation step; the command-line examples below assume you are running from the repository root.
Examples:
python -m stock_forecasting.train --model lightgbm --modalities price_news --horizon 1
python -m stock_forecasting.train --model lstm --modalities price_news --horizon 5
python -m stock_forecasting.train --model timexer --modalities price_news --horizon 21
python -m stock_forecasting.train --model hf_patchtst --modalities price_news --horizon 21 --lookback 60
Useful training arguments include:
- --model: lightgbm, lstm, timexer, or hf_patchtst
- --task: regression or classification
- --modalities: price or price_news
- --horizon: 1, 5, or 21
- --lookback: sequence length used for feature construction
- --news-lag-days: lag applied to aggregated news features
- --eval-mode: walkforward or holdout
To run a Python-managed experiment grid:
python -m stock_forecasting.run_experiments --task regression
This command writes per-run outputs under artifacts/ and also produces consolidated experiment summaries.
The repository also includes Bash helpers for broader experiment sweeps:
- run_training_suite.sh
- run_transformer_suite.sh
Example usage on a Unix-like shell:
bash run_training_suite.sh
PROFILE=quick bash run_training_suite.sh
PROFILE=transformers bash run_training_suite.sh
bash run_transformer_suite.sh
The default run_training_suite.sh profile is competition, which expands to a broader sweep across:
- models
- horizons
- modalities
- news lags
- seeds
- horizon-specific lookback windows
Additional environment variables supported by the script include:
- MODELS
- HORIZONS
- MODALITIES
- TASKS
- NEWS_LAGS
- SEEDS
- LOOKBACKS
- SKIP_EXISTING
- INSTALL_DEPS
On Windows, the Python module entry points are the more portable option unless you are running these scripts inside a Bash-compatible environment.
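Where Bash is not available, a small sweep can be driven from Python by calling the module entry point directly. The sketch below is not a port of the Bash suites: the helper names `build_train_commands` and `run_all` are hypothetical, and it only uses flags documented in this README. It defaults to a dry run that prints the commands without executing them.

```python
import itertools
import subprocess
import sys


def build_train_commands(models, horizons, modalities="price_news"):
    """Build one training command per (model, horizon) combination."""
    commands = []
    for model, horizon in itertools.product(models, horizons):
        commands.append([
            sys.executable, "-m", "stock_forecasting.train",
            "--model", model,
            "--modalities", modalities,
            "--horizon", str(horizon),
        ])
    return commands


def run_all(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sweep on the first failing run.
            subprocess.run(cmd, check=True)


commands = build_train_commands(["lightgbm", "lstm"], [1, 5])
run_all(commands, dry_run=True)
```

Passing `dry_run=False` from the repository root would launch the runs one after another using the same interpreter that runs the script.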
Each experiment run is written to its own directory under artifacts/. Depending on model type, outputs include:
- config.json
- feature_metadata.json
- summary.csv
- summary.json
- per-fold test_predictions.csv
- model files:
  - model.txt for LightGBM
  - model.pt and standardizer.npz for neural models
The run naming convention is derived from model, horizon, modality, task, lookback, news lag, evaluation mode, and seed.
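As an illustration, a run name can be assembled from those fields as shown below. The exact pattern is an assumption inferred from the single example run name used with train_champion in this README (lightgbm_h1_price_news_regression_lb30_lag1_walkforward_s7); the helper `build_run_name` is hypothetical and the real code may format some fields differently.

```python
def build_run_name(model, horizon, modalities, task, lookback,
                   news_lag_days, eval_mode, seed):
    """Compose a run directory name from the experiment settings.

    The field order and prefixes (h, lb, lag, s) are inferred from one
    example name and are not confirmed against the training code.
    """
    return (
        f"{model}_h{horizon}_{modalities}_{task}"
        f"_lb{lookback}_lag{news_lag_days}_{eval_mode}_s{seed}"
    )


run_name = build_run_name(
    model="lightgbm",
    horizon=1,
    modalities="price_news",
    task="regression",
    lookback=30,
    news_lag_days=1,
    eval_mode="walkforward",
    seed=7,
)
```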
After reviewing experiment results, you can retrain a selected configuration into a deployable artifact under champions/.
Examples:
python -m stock_forecasting.train_champion --run-name lightgbm_h1_price_news_regression_lb30_lag1_walkforward_s7
python -m stock_forecasting.train_champion --task regression
Champion artifacts include:
- config.json
- feature_metadata.json
- selection.json
- champion_summary.json
- validation_predictions.csv
- model file outputs appropriate to the selected model family
Use a saved champion artifact to score the latest available data snapshot:
python -m stock_forecasting.predict_live --artifact-dir champions/YOUR_CHAMPION_NAME
Optional date override:
python -m stock_forecasting.predict_live --artifact-dir champions/YOUR_CHAMPION_NAME --as-of-date 2024-11-20
The scoring command writes a ranked CSV file inside the selected champion directory by default.
The repository includes analysis notebooks in the project root, including:
- compare_results.ipynb
- horizon_evaluation_report.ipynb
These notebooks are intended for post-training comparison and reporting based on the contents of artifacts/.
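Outside the notebooks, per-run summaries can also be gathered with a few lines of Python. The sketch below assumes only that each run directory sits directly under artifacts/ and contains a valid summary.json; it makes no assumption about the keys inside each file, and the helper name `collect_run_summaries` is hypothetical.

```python
import json
from pathlib import Path


def collect_run_summaries(artifacts_dir="artifacts"):
    """Return {run_name: parsed summary.json} for each run that has one."""
    summaries = {}
    for path in sorted(Path(artifacts_dir).glob("*/summary.json")):
        with path.open(encoding="utf-8") as handle:
            # The run name is taken from the directory holding the file.
            summaries[path.parent.name] = json.load(handle)
    return summaries


summaries = collect_run_summaries("artifacts")
for run_name in summaries:
    print(run_name)
```

Runs without a summary.json are skipped silently, which keeps the helper usable while a sweep is still in progress.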
FinCoders/
|-- artifacts/
|-- data/
|-- stock_forecasting/
|-- compare_results.ipynb
|-- horizon_evaluation_report.ipynb
|-- requirements-training.txt
|-- run_training_suite.sh
|-- run_transformer_suite.sh
`-- README.md
- The current implementation uses aggregated news and sentiment features rather than raw-text language model embeddings.
- GPU execution is supported when torch detects CUDA and --device cuda is used; otherwise training falls back to CPU.
- The default command-line device setting is cuda, but the training code automatically uses CPU when CUDA is unavailable.
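The fallback behaviour described in the last two notes amounts to a small decision rule, sketched below. The helper `resolve_device` is hypothetical and takes the CUDA availability as an argument so that it runs without PyTorch installed; in practice that flag would come from `torch.cuda.is_available()`.

```python
def resolve_device(requested, cuda_available):
    """Return the device to train on, falling back to CPU if needed."""
    if requested == "cuda" and not cuda_available:
        return "cpu"
    return requested


# In a real run, the availability flag would come from PyTorch:
#   import torch
#   device = resolve_device("cuda", torch.cuda.is_available())
device = resolve_device("cuda", cuda_available=False)
```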