Personal Finance Analytics Dashboard
A Python ML pipeline that ingests raw bank CSV exports, categorizes transactions via K-Means clustering, forecasts 3-month spending via Random Forest regression, and presents analytics through a 6-table SQLite data warehouse and PowerBI dashboard.
Python 3.11+ · scikit-learn · SQLite · PowerBI
PlainCents loads bank CSV exports (TD, RBC, Scotiabank), normalizes dates and columns, and assigns each transaction to one of eight expense categories using unsupervised K-Means clustering. A Random Forest model then forecasts spending for the next three months per category using walk-forward validation. Results are written to a local SQLite warehouse; portfolio holdings are tracked with cache-first yfinance price lookups. The same data supports an automated Matplotlib PDF report and a PowerBI dashboard for interactive exploration.
Raw Bank CSV
→ ingest.py (clean, normalize dates)
→ cluster.py (K-Means, 8 categories)
→ forecast.py (Random Forest, 3-month)
→ portfolio.py (yfinance, cache-first TTL)
→ database.py (6-table SQLite warehouse)
→ report.py (Matplotlib PDF — Phase 7)
→ powerbi_export (CSVs → PowerBI — Phase 8)
| Tool | Role |
|---|---|
| Python | Pipeline, ML, DB operations |
| pandas / NumPy | Ingestion, feature engineering |
| scikit-learn | K-Means clustering, Random Forest |
| SQLite | 6-table local data warehouse |
| yfinance | Portfolio price fetching (cache-first) |
| Matplotlib | PDF report — Phase 7 |
| PowerBI | Interactive dashboard — Phase 8 |
- K-Means categorization: 90% accuracy on 40-transaction held-out set (ARI = 0.81 on 779 transactions).
- Pipeline MAPE: 29.4% (predictions use K-Means-assigned labels).
- Forecast model MAPE: 15.7% (ground-truth labels) — 13.7pp gap is quantified clustering contamination, not model error.
- Walk-forward validation: Expanding window, no shuffle, 24-month dataset.
- GridSearchCV best params: max_depth=5, n_estimators=50, min_samples_leaf=3.
Pipeline MAPE reflects real-world K-Means label noise. The true forecast model MAPE of 15.7% was measured by separating clustering contamination from model error (diagnostic in forecast.py).
- Clustering (Phase 2): Initial accuracy ~22%. Improved via L2-normalized TF-IDF, category-distinct synthetic merchants, bigrams, and n_clusters=12. ARI (0.81 on 779 rows) added for full-dataset validation.
- Forecast (Phase 3): 68.8% MAPE on 12 months from data scarcity. Addressed by 24-month synthetic data with temporal structure, lag_1_spend feature, and tuned RF in walk-forward. Test-row extraction rewritten to compute features from training history only (no leakage).
- Database (Phase 4): price_cache needed UNIQUE(ticker) for UPSERT; monthly_summary UPSERT key is month only.
- Portfolio (Phase 5): Cache-first with 1-hour TTL; defensive yfinance handling (NaN, None, <=0); datetime.now() only for consistency.
- Orchestration (Phase 6): Single conn and SESSION_ID; 5c one row per month with total next-month forecast; 5d join on (category, forecast_month) with cold-start skip when no prior predictions.
pip install -r requirements.txt
python db/seed_synthetic_data.py
Populates all 6 SQLite tables with synthetic demo data. Run before connecting PowerBI if you want pre-filled tables. All synthetic data is clearly labelled and must not be reported as real model performance metrics.
First-time only: Run python -m pipeline.cluster once to train and save the K-Means model (or use an existing models/kmeans_model.pkl). Then python main.py can run the full pipeline.
python main.py
PlainCents/
├── main.py
├── config.py
├── requirements.txt
├── README.md
├── data/
│ ├── raw/ # Bank CSV exports; synthetic_24mo.csv here
│ ├── processed/
│ └── exports/ # PowerBI CSVs + PDF reports (Phase 7/8)
├── pipeline/
│ ├── ingest.py
│ ├── features.py
│ ├── cluster.py
│ ├── forecast.py
│ └── portfolio.py
├── db/
│ ├── schema.sql
│ ├── database.py
│ └── seed_synthetic_data.py
├── models/
│ ├── kmeans_model.pkl # Generated by pipeline.cluster
│ └── rf_model.pkl # Generated by pipeline.forecast / main.py
├── viz/
│ ├── report.py # Matplotlib PDF (Phase 7)
│ └── powerbi_export.py # Phase 8
├── powerbi/ # PowerBI assets
├── scripts/
│ ├── generate_synthetic_24mo.py
│ └── generate_synthetic_12mo.py
└── tests/
└── test_pipeline.py
6-table SQLite schema (plaincents.db):
- transactions — Categorized bank rows (session_id, date, merchant, amount, category, cluster_id).
- predictions — RF 3-month forecasts per category (session_id, category, month_offset, forecast_month, predicted_amount).
- portfolio — Holdings snapshots with live P&L (session_id, ticker, shares, avg_cost, current_price, pnl).
- price_cache — yfinance TTL cache (ticker UNIQUE, current_price, fetched_at; 1-hour TTL).
- monthly_summary — Monthly spend + portfolio value (month UNIQUE; UPSERT on re-run).
- forecast_vs_actual — Monitoring: predicted vs actual per category per month.
⚠️ Data Privacy
Never commit real bank data to this repo.data/raw/is gitignored. Only synthetic data (e.g.synthetic_24mo.csv) may be committed — for demo use only.plaincents.dbis gitignored. Seed script data is synthetic and must not be reported as real model performance metrics.
| Phase | Status |
|---|---|
| Phase 1: Ingest | ✅ Complete |
| Phase 2: K-Means Clustering | ✅ Complete |
| Phase 3: RF Forecasting | ✅ Complete |
| Phase 4: SQLite Schema + Seed | ✅ Complete |
| Phase 5: Portfolio + Cache | ✅ Complete |
| Phase 6: main.py Orchestration | ✅ Complete |
| Phase 7: Matplotlib PDF Report | 🔄 In Progress |
| Phase 8: PowerBI Dashboard | 🔄 In Progress |
| Phase 9: Tests + Deploy | 🔄 In Progress |
Kapil Iyer
Bachelor of Honors Mathematics, University of Waterloo
Applied Mathematics (Scientific ML) and Statistics, Computing Minor