A living archive for project reasoning, model evaluation, system design,
risk workflows, reproducible experiments, and technical notes behind my data science portfolio.
$ list experiments --status published --tag evaluation
Can an underwriting model know when to defer uncertain decisions to human review?
Method
Train a probability model, evaluate calibration, select abstention thresholds, validate inputs, and expose review decisions through generated artifacts and a dashboard.
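A minimal sketch of the abstention step, assuming the model already outputs a calibrated default probability. The function names and the 0.2 / 0.8 band are hypothetical placeholders, not the project's selected thresholds.

```python
def brier_score(probs, labels):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def review_decision(p_default, low=0.2, high=0.8):
    """Map a probability to approve / decline / defer.

    `low` and `high` are illustrative thresholds (assumed, not taken
    from the project); scores between them go to human review.
    """
    if not 0.0 <= p_default <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    if p_default <= low:
        return "approve"
    if p_default >= high:
        return "decline"
    return "defer_to_review"
```

Widening the band sends more cases to reviewers and should lower the error rate on automated decisions, which is the trade-off threshold selection has to balance.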
How can fraud-risk scores become analyst-review decisions instead of just model outputs?
Method
Generate realistic synthetic transactions, train a fraud-risk pipeline, search cost-sensitive thresholds, produce policy artifacts, and score transactions for analyst review.
Evidence
ROC/PR curves, Brier score, threshold policy files, scored CSVs, reason codes, dashboard helpers, unit tests, and CI smoke workflows.
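The cost-sensitive threshold search can be sketched as below. The cost values (a missed fraud costing ten times a false alert) and the function names are assumptions for illustration only.

```python
def expected_cost(scores, labels, threshold, cost_fp=1.0, cost_fn=10.0):
    """Total cost of flagging every transaction with score >= threshold.

    cost_fp: analyst time wasted on a legitimate transaction (assumed value).
    cost_fn: loss from a fraud that was not flagged (assumed value).
    """
    cost = 0.0
    for score, is_fraud in zip(scores, labels):
        flagged = score >= threshold
        if flagged and not is_fraud:
            cost += cost_fp
        elif not flagged and is_fraud:
            cost += cost_fn
    return cost


def best_threshold(scores, labels, candidates=None, **costs):
    """Grid-search the candidate thresholds and return the cheapest one."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    return min(candidates, key=lambda t: expected_cost(scores, labels, t, **costs))
```

The chosen threshold would then be written to a policy file so that scoring and review use the same cut-off.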
Can retrieval paths make RAG answers more explainable, grounded, and easier to inspect?
Method
Chunk documents, embed text, build an in-memory knowledge graph, retrieve with vector and graph signals, and expose inspectable answers through API/UI layers.
Evidence
Retrieval metrics, golden-query evaluation, citations, graph paths, API contracts, tests, CI, and optional LLM fallback behavior.
What it proves
Applied AI architecture, retrieval evaluation, system thinking, and explainable AI output design.
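One way to combine vector and graph signals, shown as a rough sketch: cosine similarity is blended with the share of a chunk's entities that sit within one hop of the query's entities. The data layout, the 0.7 weight, and all names here are hypothetical, not the project's API.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_rank(query_vec, chunks, graph, seed_entities, weight=0.7):
    """Rank chunk ids by a weighted mix of vector and graph scores.

    chunks: {chunk_id: (embedding, set_of_entities)}  (assumed layout)
    graph:  {entity: set_of_neighbor_entities}        (assumed layout)
    """
    # Entities reachable within one hop of the query's entities.
    reachable = set(seed_entities)
    for entity in seed_entities:
        reachable |= graph.get(entity, set())

    scored = []
    for chunk_id, (vec, entities) in chunks.items():
        vec_score = cosine(query_vec, vec)
        graph_score = len(entities & reachable) / len(entities) if entities else 0.0
        scored.append((weight * vec_score + (1 - weight) * graph_score, chunk_id))
    return [chunk_id for _, chunk_id in sorted(scored, reverse=True)]
```

Keeping the two scores separate until the final blend makes it possible to report each one alongside an answer, which is what makes the retrieval path inspectable.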
Which synthetic data method better preserves the behavior of real tabular data?
Method
Generate synthetic data with Copula and VAE methods, then compare outputs against real data using statistical, visual, privacy, and ML-utility diagnostics.
Evidence
Distribution overlap, categorical similarity, correlation difference, boundary violations, nearest-neighbor privacy proxy, PCA plots, pairplots, and quality summaries.
What it proves
Model comparison, evaluation design, data quality reasoning, CLI workflows, and transparent reporting.
real data → schema detection → generator → synthetic data → quality metrics → plots → report
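Two of the diagnostics listed above, sketched with the standard library only: the correlation difference for one column pair, and a nearest-neighbor distance as a privacy proxy. Function names are hypothetical, and the project may compute these differently.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def correlation_gap(real_x, real_y, synth_x, synth_y):
    """Absolute difference between real and synthetic correlation (0 is best)."""
    return abs(pearson(real_x, real_y) - pearson(synth_x, synth_y))


def nearest_real_distance(synthetic_row, real_rows):
    """Euclidean distance from a synthetic row to its closest real row.

    Very small distances suggest the generator may be copying real records.
    """
    return min(math.dist(synthetic_row, row) for row in real_rows)
```

In practice the columns would be scaled before computing distances so that no single feature dominates the proxy.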
When does a hybrid recommender actually beat simple recommendation baselines?
Method
Generate deterministic synthetic movies and ratings, train content/collaborative/hybrid recommenders, compare against baselines, and tune hybrid alpha by NDCG.
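Tuning the hybrid alpha by NDCG could look like the sketch below, assuming the hybrid score is a convex blend of content and collaborative scores for a single user. The blend formula, the alpha grid, and the names are assumptions.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain for relevances listed in rank order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(ranked_items, relevance, k):
    """NDCG@k for one user; `relevance` maps item -> graded relevance."""
    gains = [relevance.get(item, 0.0) for item in ranked_items[:k]]
    ideal = dcg(sorted(relevance.values(), reverse=True)[:k])
    return dcg(gains) / ideal if ideal > 0 else 0.0


def tune_alpha(content, collab, relevance, k=3, alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Pick the alpha whose blended ranking scores the highest NDCG@k.

    Hybrid score (assumed form): alpha * content + (1 - alpha) * collaborative.
    """
    def score(alpha):
        blended = {
            item: alpha * content[item] + (1 - alpha) * collab.get(item, 0.0)
            for item in content
        }
        ranking = sorted(blended, key=blended.get, reverse=True)
        return ndcg_at_k(ranking, relevance, k)

    return max(alphas, key=score)
```

Alpha values of 0.0 and 1.0 reduce to the pure collaborative and pure content models, so the same loop also produces two of the comparison points for the hybrid.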
Can SQL features and interpretable regression support site-selection decisions?
Method
Build location-level features in SQLite, train regression models, compare against baselines, validate performance, and score candidate locations with risk notes.
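A small sketch of building location-level features in SQLite with Python's built-in `sqlite3` module. The `visits` table, its columns, and the chosen aggregates are hypothetical stand-ins for the project's schema.

```python
import sqlite3


def build_location_features(rows):
    """Aggregate raw observations into one feature row per location.

    rows: iterable of (location_id, daily_visits, competitor_count) tuples
    (an assumed input shape for illustration).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE visits ("
        "location_id TEXT, daily_visits REAL, competitor_count INTEGER)"
    )
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)
    features = conn.execute(
        """
        SELECT location_id,
               AVG(daily_visits)     AS avg_visits,
               MAX(competitor_count) AS competitors,
               COUNT(*)              AS n_obs
        FROM visits
        GROUP BY location_id
        ORDER BY location_id
        """
    ).fetchall()
    conn.close()
    return features
```

Each returned tuple is one row of the feature matrix; the observation count can back a risk note when a candidate location has too little data to trust its score.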