PlainCents

Personal Finance Analytics Dashboard

A Python ML pipeline that ingests raw bank CSV exports, categorizes transactions via K-Means clustering, forecasts 3-month spending via Random Forest regression, and presents analytics through a 6-table SQLite data warehouse and PowerBI dashboard.

Python 3.11+ · scikit-learn · SQLite · PowerBI

What It Does

PlainCents loads bank CSV exports (TD, RBC, Scotiabank), normalizes dates and columns, and assigns each transaction to one of eight expense categories using unsupervised K-Means clustering. A Random Forest model then forecasts spending for the next three months per category using walk-forward validation. Results are written to a local SQLite warehouse; portfolio holdings are tracked with cache-first yfinance price lookups. The same data supports an automated Matplotlib PDF report and a PowerBI dashboard for interactive exploration.

Architecture

Raw Bank CSV
  → ingest.py       (clean, normalize dates)
  → cluster.py      (K-Means, 8 categories)
  → forecast.py     (Random Forest, 3-month)
  → portfolio.py    (yfinance, cache-first TTL)
  → database.py     (6-table SQLite warehouse)
  → report.py       (Matplotlib PDF — Phase 7)
  → powerbi_export  (CSVs → PowerBI — Phase 8)

Tech Stack

Tool	Role
Python	Pipeline, ML, DB operations
pandas / NumPy	Ingestion, feature engineering
scikit-learn	K-Means clustering, Random Forest
SQLite	6-table local data warehouse
yfinance	Portfolio price fetching (cache-first)
Matplotlib	PDF report — Phase 7
PowerBI	Interactive dashboard — Phase 8

Model Performance

K-Means categorization: 90% accuracy on 40-transaction held-out set (ARI = 0.81 on 779 transactions).
Pipeline MAPE: 29.4% (predictions use K-Means-assigned labels).
Forecast model MAPE: 15.7% (ground-truth labels) — 13.7pp gap is quantified clustering contamination, not model error.
Walk-forward validation: Expanding window, no shuffle, 24-month dataset.
GridSearchCV best params: max_depth=5, n_estimators=50, min_samples_leaf=3.

Pipeline MAPE reflects real-world K-Means label noise. The true forecast model MAPE of 15.7% was measured by separating clustering contamination from model error (diagnostic in forecast.py).

Challenges & Measures (Phases 1–6)

Clustering (Phase 2): Initial accuracy ~22%. Improved via L2-normalized TF-IDF, category-distinct synthetic merchants, bigrams, and n_clusters=12. ARI (0.81 on 779 rows) added for full-dataset validation.
Forecast (Phase 3): 68.8% MAPE on 12 months from data scarcity. Addressed by 24-month synthetic data with temporal structure, lag_1_spend feature, and tuned RF in walk-forward. Test-row extraction rewritten to compute features from training history only (no leakage).
Database (Phase 4): price_cache needed UNIQUE(ticker) for UPSERT; monthly_summary UPSERT key is month only.
Portfolio (Phase 5): Cache-first with 1-hour TTL; defensive yfinance handling (NaN, None, <=0); datetime.now() only for consistency.
Orchestration (Phase 6): Single conn and SESSION_ID; 5c one row per month with total next-month forecast; 5d join on (category, forecast_month) with cold-start skip when no prior predictions.

Setup

1. Install dependencies

pip install -r requirements.txt

2. Seed demo data

python db/seed_synthetic_data.py

Populates all 6 SQLite tables with synthetic demo data. Run before connecting PowerBI if you want pre-filled tables. All synthetic data is clearly labelled and must not be reported as real model performance metrics.

First-time only: Run python -m pipeline.cluster once to train and save the K-Means model (or use an existing models/kmeans_model.pkl). Then python main.py can run the full pipeline.

3. Run the pipeline

python main.py

File Structure

PlainCents/
├── main.py
├── config.py
├── requirements.txt
├── README.md
├── data/
│   ├── raw/                 # Bank CSV exports; synthetic_24mo.csv here
│   ├── processed/
│   └── exports/             # PowerBI CSVs + PDF reports (Phase 7/8)
├── pipeline/
│   ├── ingest.py
│   ├── features.py
│   ├── cluster.py
│   ├── forecast.py
│   └── portfolio.py
├── db/
│   ├── schema.sql
│   ├── database.py
│   └── seed_synthetic_data.py
├── models/
│   ├── kmeans_model.pkl     # Generated by pipeline.cluster
│   └── rf_model.pkl         # Generated by pipeline.forecast / main.py
├── viz/
│   ├── report.py            # Matplotlib PDF (Phase 7)
│   └── powerbi_export.py    # Phase 8
├── powerbi/                 # PowerBI assets
├── scripts/
│   ├── generate_synthetic_24mo.py
│   └── generate_synthetic_12mo.py
└── tests/
    └── test_pipeline.py

Database Schema

6-table SQLite schema (plaincents.db):

transactions — Categorized bank rows (session_id, date, merchant, amount, category, cluster_id).
predictions — RF 3-month forecasts per category (session_id, category, month_offset, forecast_month, predicted_amount).
portfolio — Holdings snapshots with live P&L (session_id, ticker, shares, avg_cost, current_price, pnl).
price_cache — yfinance TTL cache (ticker UNIQUE, current_price, fetched_at; 1-hour TTL).
monthly_summary — Monthly spend + portfolio value (month UNIQUE; UPSERT on re-run).
forecast_vs_actual — Monitoring: predicted vs actual per category per month.

Data Privacy

⚠️ Data Privacy
Never commit real bank data to this repo. data/raw/ is gitignored. Only synthetic data (e.g. synthetic_24mo.csv) may be committed — for demo use only. plaincents.db is gitignored. Seed script data is synthetic and must not be reported as real model performance metrics.

Build Status

Phase	Status
Phase 1: Ingest	✅ Complete
Phase 2: K-Means Clustering	✅ Complete
Phase 3: RF Forecasting	✅ Complete
Phase 4: SQLite Schema + Seed	✅ Complete
Phase 5: Portfolio + Cache	✅ Complete
Phase 6: main.py Orchestration	✅ Complete
Phase 7: Matplotlib PDF Report	🔄 In Progress
Phase 8: PowerBI Dashboard	🔄 In Progress
Phase 9: Tests + Deploy	🔄 In Progress

Author

Kapil Iyer
Bachelor of Honors Mathematics, University of Waterloo Applied Mathematics (Scientific ML) and Statistics, Computing Minor

GitHub · LinkedIn · Portfolio

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PlainCents

What It Does

Architecture

Tech Stack

Model Performance

Challenges & Measures (Phases 1–6)

Setup

1. Install dependencies

2. Seed demo data

3. Run the pipeline

File Structure

Database Schema

Data Privacy

Build Status

Author

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
.cursor/rules		.cursor/rules
PRD_and_Buildplan		PRD_and_Buildplan
data		data
db		db
docs		docs
models		models
pipeline		pipeline
powerbi		powerbi
scripts		scripts
tests		tests
viz		viz
.gitignore		.gitignore
README.md		README.md
config.py		config.py
main.py		main.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

PlainCents

What It Does

Architecture

Tech Stack

Model Performance

Challenges & Measures (Phases 1–6)

Setup

1. Install dependencies

2. Seed demo data

3. Run the pipeline

File Structure

Database Schema

Data Privacy

Build Status

Author

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages