Skip to content

Kapil-Iyer/PlainCents

Repository files navigation

PlainCents

Personal Finance Analytics Dashboard

A Python ML pipeline that ingests raw bank CSV exports, categorizes transactions via K-Means clustering, forecasts 3-month spending via Random Forest regression, and presents analytics through a 6-table SQLite data warehouse and PowerBI dashboard.

Python 3.11+ · scikit-learn · SQLite · PowerBI


What It Does

PlainCents loads bank CSV exports (TD, RBC, Scotiabank), normalizes dates and columns, and assigns each transaction to one of eight expense categories using unsupervised K-Means clustering. A Random Forest model then forecasts spending for the next three months per category using walk-forward validation. Results are written to a local SQLite warehouse; portfolio holdings are tracked with cache-first yfinance price lookups. The same data supports an automated Matplotlib PDF report and a PowerBI dashboard for interactive exploration.


Architecture

Raw Bank CSV
  → ingest.py       (clean, normalize dates)
  → cluster.py      (K-Means, 8 categories)
  → forecast.py     (Random Forest, 3-month)
  → portfolio.py    (yfinance, cache-first TTL)
  → database.py     (6-table SQLite warehouse)
  → report.py       (Matplotlib PDF — Phase 7)
  → powerbi_export  (CSVs → PowerBI — Phase 8)

Tech Stack

Tool Role
Python Pipeline, ML, DB operations
pandas / NumPy Ingestion, feature engineering
scikit-learn K-Means clustering, Random Forest
SQLite 6-table local data warehouse
yfinance Portfolio price fetching (cache-first)
Matplotlib PDF report — Phase 7
PowerBI Interactive dashboard — Phase 8

Model Performance

  • K-Means categorization: 90% accuracy on 40-transaction held-out set (ARI = 0.81 on 779 transactions).
  • Pipeline MAPE: 29.4% (predictions use K-Means-assigned labels).
  • Forecast model MAPE: 15.7% (ground-truth labels) — 13.7pp gap is quantified clustering contamination, not model error.
  • Walk-forward validation: Expanding window, no shuffle, 24-month dataset.
  • GridSearchCV best params: max_depth=5, n_estimators=50, min_samples_leaf=3.

Pipeline MAPE reflects real-world K-Means label noise. The true forecast model MAPE of 15.7% was measured by separating clustering contamination from model error (diagnostic in forecast.py).


Challenges & Measures (Phases 1–6)

  • Clustering (Phase 2): Initial accuracy ~22%. Improved via L2-normalized TF-IDF, category-distinct synthetic merchants, bigrams, and n_clusters=12. ARI (0.81 on 779 rows) added for full-dataset validation.
  • Forecast (Phase 3): 68.8% MAPE on 12 months from data scarcity. Addressed by 24-month synthetic data with temporal structure, lag_1_spend feature, and tuned RF in walk-forward. Test-row extraction rewritten to compute features from training history only (no leakage).
  • Database (Phase 4): price_cache needed UNIQUE(ticker) for UPSERT; monthly_summary UPSERT key is month only.
  • Portfolio (Phase 5): Cache-first with 1-hour TTL; defensive yfinance handling (NaN, None, <=0); datetime.now() only for consistency.
  • Orchestration (Phase 6): Single conn and SESSION_ID; 5c one row per month with total next-month forecast; 5d join on (category, forecast_month) with cold-start skip when no prior predictions.

Setup

1. Install dependencies

pip install -r requirements.txt

2. Seed demo data

python db/seed_synthetic_data.py

Populates all 6 SQLite tables with synthetic demo data. Run before connecting PowerBI if you want pre-filled tables. All synthetic data is clearly labelled and must not be reported as real model performance metrics.

First-time only: Run python -m pipeline.cluster once to train and save the K-Means model (or use an existing models/kmeans_model.pkl). Then python main.py can run the full pipeline.

3. Run the pipeline

python main.py

File Structure

PlainCents/
├── main.py
├── config.py
├── requirements.txt
├── README.md
├── data/
│   ├── raw/                 # Bank CSV exports; synthetic_24mo.csv here
│   ├── processed/
│   └── exports/             # PowerBI CSVs + PDF reports (Phase 7/8)
├── pipeline/
│   ├── ingest.py
│   ├── features.py
│   ├── cluster.py
│   ├── forecast.py
│   └── portfolio.py
├── db/
│   ├── schema.sql
│   ├── database.py
│   └── seed_synthetic_data.py
├── models/
│   ├── kmeans_model.pkl     # Generated by pipeline.cluster
│   └── rf_model.pkl         # Generated by pipeline.forecast / main.py
├── viz/
│   ├── report.py            # Matplotlib PDF (Phase 7)
│   └── powerbi_export.py    # Phase 8
├── powerbi/                 # PowerBI assets
├── scripts/
│   ├── generate_synthetic_24mo.py
│   └── generate_synthetic_12mo.py
└── tests/
    └── test_pipeline.py

Database Schema

6-table SQLite schema (plaincents.db):

  • transactions — Categorized bank rows (session_id, date, merchant, amount, category, cluster_id).
  • predictions — RF 3-month forecasts per category (session_id, category, month_offset, forecast_month, predicted_amount).
  • portfolio — Holdings snapshots with live P&L (session_id, ticker, shares, avg_cost, current_price, pnl).
  • price_cache — yfinance TTL cache (ticker UNIQUE, current_price, fetched_at; 1-hour TTL).
  • monthly_summary — Monthly spend + portfolio value (month UNIQUE; UPSERT on re-run).
  • forecast_vs_actual — Monitoring: predicted vs actual per category per month.

Data Privacy

⚠️ Data Privacy
Never commit real bank data to this repo. data/raw/ is gitignored. Only synthetic data (e.g. synthetic_24mo.csv) may be committed — for demo use only. plaincents.db is gitignored. Seed script data is synthetic and must not be reported as real model performance metrics.


Build Status

Phase Status
Phase 1: Ingest ✅ Complete
Phase 2: K-Means Clustering ✅ Complete
Phase 3: RF Forecasting ✅ Complete
Phase 4: SQLite Schema + Seed ✅ Complete
Phase 5: Portfolio + Cache ✅ Complete
Phase 6: main.py Orchestration ✅ Complete
Phase 7: Matplotlib PDF Report 🔄 In Progress
Phase 8: PowerBI Dashboard 🔄 In Progress
Phase 9: Tests + Deploy 🔄 In Progress

Author

Kapil Iyer
Bachelor of Honors Mathematics, University of Waterloo Applied Mathematics (Scientific ML) and Statistics, Computing Minor


GitHub · LinkedIn · Portfolio


About

Personal Finance and Investment Analytics PowerBI Dashboard powered by classical ML: K-Means Clustering and Radom Forest Models

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages