Inspiration
Many people have data but get stuck at “what do I do next?”—especially when tools feel either too technical (Python notebooks) or too limited (basic spreadsheet filters). We wanted a single app that makes analysis conversational and visual, without needing to leave your machine.
What it does
- Accepts CSV/Excel dataset uploads and previews them in a table.
- Generates summaries and quick statistics.
- Suggests and renders charts.
- Detects anomalies/outliers.
- Cleans data (missing values, formats, etc.).
- Trains ML models for regression/classification and shows results.
- Lets you ask questions in natural language and answers using RAG grounded in your uploaded dataset.
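The "ask questions, answered with RAG" step boils down to retrieving the dataset chunks most similar to the question and prepending them to the LLM prompt. Here is a minimal, self-contained sketch of that retrieval step — it uses a toy bag-of-words "embedding" purely for illustration, where the real pipeline would call an embedding model (e.g. via Ollama); all names and sample data are hypothetical:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real pipeline would call an
    # embedding model instead of counting tokens.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(chunks: list[str], question: str, k: int = 2) -> list[str]:
    # Rank dataset chunks by similarity to the question, keep top-k.
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)
    return ranked[:k]

chunks = [
    "North region sales were 120 in January",
    "South region sales were 95 in January",
    "Customer complaints were logged in February",
]
# The retrieved chunk(s) get prepended to the LLM prompt as grounding context.
context = retrieve(chunks, "what were sales in the north region", k=1)
```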
How we built it
- Frontend: React + Vite inside an Electron desktop shell.
- Backend: Python FastAPI server that handles file ingest, pandas-based analysis, chart/anomaly/cleaning services, and ML training.
- LLM layer: integrates a local LLM via Ollama and the Gemini 3 API for chat.
- RAG: Builds embeddings from dataset chunks and retrieves the most relevant parts for each question to ground responses.
- Packaging: electron-builder generates a Windows NSIS installer and bundles backend resources.
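To give a feel for the ingest/summary side of the backend, here is a stripped-down sketch of a "quick statistics" service using only the standard library as a stand-in for the actual pandas-based code; function and column names are illustrative, not DataSage's real API:

```python
import csv
import io
import statistics

def quick_stats(csv_text: str) -> dict:
    # Stdlib stand-in for a pandas-based summary service: for each
    # column that parses as numeric, report count / mean / min / max.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    stats = {}
    for col in rows[0]:
        values = []
        for row in rows:
            try:
                values.append(float(row[col]))
            except (TypeError, ValueError):
                pass  # skip missing or non-numeric cells
        if values:
            stats[col] = {
                "count": len(values),
                "mean": statistics.fmean(values),
                "min": min(values),
                "max": max(values),
            }
    return stats

sample = "name,age\nAda,36\nGrace,45\nAlan,\n"
summary = quick_stats(sample)  # only "age" yields numeric stats
```

In the real backend this would be one service behind a FastAPI route, with pandas doing the heavy lifting.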
Challenges we ran into
- Getting “chat with data” to be reliable: session state, grounding context, and keeping responses relevant.
- Aligning backend response shapes with frontend rendering for training results.
- Handling mixed data types (numeric + categorical) for ML training without crashes.
- Development workflow issues like backend port conflicts during Electron dev runs.
- Packaging constraints: shipping a desktop app that depends on Python/LLM components cleanly on Windows.
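The mixed-type crashes above come down to ML libraries expecting purely numeric input. A minimal sketch of the kind of preprocessing that fixes this (mean-fill missing numerics, integer-code categoricals) — column names and data are hypothetical, and a real pipeline would use pandas/scikit-learn encoders instead:

```python
def preprocess(rows: list[dict]) -> list[list[float]]:
    # Classify each column as numeric or categorical by probing values.
    cols = list(rows[0])
    numeric, codes = set(), {}
    for col in cols:
        try:
            for row in rows:
                if row[col] is not None:
                    float(row[col])
            numeric.add(col)
        except (TypeError, ValueError):
            codes[col] = {}  # categorical: map each value to an int code
    out = []
    for row in rows:
        feats = []
        for col in cols:
            if col in numeric:
                # Fill missing numeric cells with the column mean.
                vals = [float(r[col]) for r in rows if r[col] is not None]
                mean = sum(vals) / len(vals)
                feats.append(float(row[col]) if row[col] is not None else mean)
            else:
                # Stable integer code per category ("label encoding").
                feats.append(float(codes[col].setdefault(row[col], len(codes[col]))))
        out.append(feats)
    return out

rows = [
    {"age": 30, "city": "Oslo"},
    {"age": None, "city": "Bergen"},
    {"age": 50, "city": "Oslo"},
]
X = preprocess(rows)  # every cell is now a float the model can consume
```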
Accomplishments that we're proud of
- A complete end-to-end desktop experience: upload → explore → visualize → train → chat.
- True RAG grounding so answers come from the dataset instead of generic responses.
- A modular backend (controllers/services/models) that’s easy to extend.
- ML results surfaced in the UI with clearer “score + accuracy-style percentage” display.
What we learned
- RAG quality is mostly about good chunking + retrieval, not just calling an LLM.
- “Works on my machine” isn’t enough for desktop: process management, ports, and packaging matter a lot.
- ML pipelines need preprocessing (encoding/NA handling) to be usable on real-world datasets.
- Tight contracts between backend JSON and frontend UI prevent silent failures.
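The "tight contracts" lesson can be made concrete with an explicit response schema that validates on the backend rather than failing silently in the UI. This sketch uses a stdlib dataclass; field names are hypothetical, and in a FastAPI backend this role is typically played by Pydantic response models:

```python
from dataclasses import dataclass, asdict

@dataclass
class TrainingResult:
    model_type: str       # e.g. "regression" or "classification"
    score: float          # raw metric in [0, 1]
    score_percent: float  # accuracy-style percentage shown in the UI

    def __post_init__(self):
        # Fail loudly on the backend instead of rendering garbage up front.
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score out of range: {self.score}")

result = TrainingResult("classification", score=0.87, score_percent=87.0)
payload = asdict(result)  # the exact JSON shape the frontend renders
```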
What's next for DataSage
- Make the installer fully self-contained by bundling a Python runtime + dependencies (no manual setup).
- Improve RAG with persistent vector stores (disk-backed) and dataset versioning.
- Add better evaluation and explainability for models (feature importance, confusion matrix).
- Add dataset provenance + “analysis report export” for sharing insights.
- Optional cloud/edge mode: keep local-first but allow users to plug in cloud compute including Gemini for higher-quality reasoning when desired.
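As one data point on the explainability item, a confusion matrix is small enough to sketch in a few lines (labels and predictions here are made up; scikit-learn's `confusion_matrix` would be the obvious real choice):

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    # Rows are true labels, columns are predicted labels.
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

m = confusion_matrix(
    ["cat", "dog", "cat", "dog"],
    ["cat", "cat", "cat", "dog"],
    labels=["cat", "dog"],
)
```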