Inspiration
Many people have data but get stuck at “what do I do next?”—especially when tools feel either too technical (Python notebooks) or too limited (basic spreadsheet filters). We wanted a single app that makes analysis conversational and visual, without needing to leave your machine.
What it does
- Accepts CSV/Excel dataset uploads and previews them in a table.
- Generates summaries and quick statistics.
- Suggests and renders charts.
- Detects anomalies/outliers.
- Cleans data (missing values, formats, etc.).
- Trains ML models for regression/classification and shows results.
- Lets you ask questions in natural language and answers using RAG grounded in your uploaded dataset.
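The "ask questions, answered with RAG" step boils down to retrieving the dataset chunks most similar to the question and prepending them to the LLM prompt. Here is a minimal, self-contained sketch of that retrieval step — it uses a toy bag-of-words "embedding" purely for illustration, where the real pipeline would call an embedding model (e.g. via Ollama); all names and sample data are hypothetical:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real pipeline would call an
    # embedding model instead of counting tokens.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(chunks: list[str], question: str, k: int = 2) -> list[str]:
    # Rank dataset chunks by similarity to the question, keep top-k.
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)
    return ranked[:k]

chunks = [
    "North region sales were 120 in January",
    "South region sales were 95 in January",
    "Customer complaints were logged in February",
]
# The retrieved chunk(s) get prepended to the LLM prompt as grounding context.
context = retrieve(chunks, "what were sales in the north region", k=1)
```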
How we built it
- Frontend: React + Vite inside an Electron desktop shell.
- Backend: Python FastAPI server that handles file ingest, pandas-based analysis, chart/anomaly/cleaning services, and ML training.
- LLM layer: integrates a local LLM via Ollama and the Gemini 3 API for chat.
- RAG: Builds embeddings from dataset chunks and retrieves the most relevant parts for each question to ground responses.
- Packaging: electron-builder generates a Windows NSIS installer and bundles backend resources.
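To give a feel for the ingest/summary side of the backend, here is a stripped-down sketch of a "quick statistics" service using only the standard library as a stand-in for the actual pandas-based code; function and column names are illustrative, not DataSage's real API:

```python
import csv
import io
import statistics

def quick_stats(csv_text: str) -> dict:
    # Stdlib stand-in for a pandas-based summary service: for each
    # column that parses as numeric, report count / mean / min / max.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    stats = {}
    for col in rows[0]:
        values = []
        for row in rows:
            try:
                values.append(float(row[col]))
            except (TypeError, ValueError):
                pass  # skip missing or non-numeric cells
        if values:
            stats[col] = {
                "count": len(values),
                "mean": statistics.fmean(values),
                "min": min(values),
                "max": max(values),
            }
    return stats

sample = "name,age\nAda,36\nGrace,45\nAlan,\n"
summary = quick_stats(sample)  # only "age" yields numeric stats
```

In the real backend this would be one service behind a FastAPI route, with pandas doing the heavy lifting.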
Challenges we ran into
- Getting “chat with data” to be reliable: session state, grounding context, and keeping responses relevant.
- Aligning backend response shapes with frontend rendering for training results.
- Handling mixed data types (numeric + categorical) for ML training without crashes.
- Development workflow issues like backend port conflicts during Electron dev runs.
- Packaging constraints: shipping a desktop app that depends on Python/LLM components cleanly on Windows.
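The mixed-type crashes above come down to ML libraries expecting purely numeric input. A minimal sketch of the kind of preprocessing that fixes this (mean-fill missing numerics, integer-code categoricals) — column names and data are hypothetical, and a real pipeline would use pandas/scikit-learn encoders instead:

```python
def preprocess(rows: list[dict]) -> list[list[float]]:
    # Classify each column as numeric or categorical by probing values.
    cols = list(rows[0])
    numeric, codes = set(), {}
    for col in cols:
        try:
            for row in rows:
                if row[col] is not None:
                    float(row[col])
            numeric.add(col)
        except (TypeError, ValueError):
            codes[col] = {}  # categorical: map each value to an int code
    out = []
    for row in rows:
        feats = []
        for col in cols:
            if col in numeric:
                # Fill missing numeric cells with the column mean.
                vals = [float(r[col]) for r in rows if r[col] is not None]
                mean = sum(vals) / len(vals)
                feats.append(float(row[col]) if row[col] is not None else mean)
            else:
                # Stable integer code per category ("label encoding").
                feats.append(float(codes[col].setdefault(row[col], len(codes[col]))))
        out.append(feats)
    return out

rows = [
    {"age": 30, "city": "Oslo"},
    {"age": None, "city": "Bergen"},
    {"age": 50, "city": "Oslo"},
]
X = preprocess(rows)  # every cell is now a float the model can consume
```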
Accomplishments that we're proud of
- A complete end-to-end desktop experience: upload → explore → visualize → train → chat.
- True RAG grounding so answers come from the dataset instead of generic responses.
- A modular backend (controllers/services/models) that’s easy to extend.
- ML results surfaced in the UI with clearer “score + accuracy-style percentage” display.
What we learned
- RAG quality is mostly about good chunking + retrieval, not just calling an LLM.
- “Works on my machine” isn’t enough for desktop: process management, ports, and packaging matter a lot.
- ML pipelines need preprocessing (encoding/NA handling) to be usable on real-world datasets.
- Tight contracts between backend JSON and frontend UI prevent silent failures.
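The "tight contracts" lesson can be made concrete with an explicit response schema that validates on the backend rather than failing silently in the UI. This sketch uses a stdlib dataclass; field names are hypothetical, and in a FastAPI backend this role is typically played by Pydantic response models:

```python
from dataclasses import dataclass, asdict

@dataclass
class TrainingResult:
    model_type: str       # e.g. "regression" or "classification"
    score: float          # raw metric in [0, 1]
    score_percent: float  # accuracy-style percentage shown in the UI

    def __post_init__(self):
        # Fail loudly on the backend instead of rendering garbage up front.
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score out of range: {self.score}")

result = TrainingResult("classification", score=0.87, score_percent=87.0)
payload = asdict(result)  # the exact JSON shape the frontend renders
```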
What's next for DataSage
- Make the installer fully self-contained by bundling a Python runtime + dependencies (no manual setup).
- Improve RAG with persistent vector stores (disk-backed) and dataset versioning.
- Add better evaluation and explainability for models (feature importance, confusion matrix).
- Add dataset provenance + “analysis report export” for sharing insights.
- Optional cloud/edge mode: keep local-first but allow users to plug in cloud compute including Gemini for higher-quality reasoning when desired.
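As one data point on the explainability item, a confusion matrix is small enough to sketch in a few lines (labels and predictions here are made up; scikit-learn's `confusion_matrix` would be the obvious real choice):

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    # Rows are true labels, columns are predicted labels.
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

m = confusion_matrix(
    ["cat", "dog", "cat", "dog"],
    ["cat", "cat", "cat", "dog"],
    labels=["cat", "dog"],
)
```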