**SYNTHIA** – Autonomous Multi-Agent Framework for Data Generation, Cleansing, and Exploratory Analysis
Inspiration
Data preparation continues to be one of the most time-consuming and labor-intensive stages in the machine learning lifecycle. Manual data cleaning, exploratory analysis, and synthetic data generation often require specialized skills, significant engineering effort, and repetitive coding.
Our inspiration stemmed from a single question:
"Can we fully automate the process of converting a user intent or prompt into a high-quality, clean, and insight-ready dataset?"
SYNTHIA was born out of the ambition to create a fully autonomous system that bridges the gap between raw business requirements and analysis-ready tabular data—driven by LLMs and generative modeling techniques.
What It Does
SYNTHIA is a modular, autonomous, multi-agent framework that performs end-to-end processing of structured tabular data:
- Generates realistic tabular datasets from natural language prompts using **Groq's LLaMA 3-70B via LangChain**.
- Cleans datasets by imputing missing values, removing duplicates, and filtering outliers using Isolation Forest.
- Synthesizes new data samples using **CTGAN (Conditional Tabular GAN)**, enabling privacy-preserving augmentation.
- Performs **prompt-driven Exploratory Data Analysis (EDA)**, automatically generating valid Python code for analysis and visualization.
- Produces visual plots and statistical summaries in **HTML format (base64-encoded)**, ready for integration into dashboards or reporting tools.
- Includes a feedback agent that evaluates insight quality and autonomously determines whether regeneration is required.
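The cleaning step described above (imputation, deduplication, and Isolation Forest outlier filtering) can be sketched roughly as follows. Function and parameter names here are illustrative, not SYNTHIA's actual API; the imputation strategy (median/mode) and the contamination rate are assumptions.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

def clean_dataset(df: pd.DataFrame, contamination: float = 0.05) -> pd.DataFrame:
    """Illustrative cleaning pass: deduplicate, impute, then filter outliers."""
    df = df.drop_duplicates()
    # Impute numeric columns with the median, categorical columns with the mode
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            mode = df[col].mode()
            df[col] = df[col].fillna(mode.iloc[0] if not mode.empty else "")
    # Isolation Forest labels inliers as 1 and outliers as -1; keep inliers only
    numeric = df.select_dtypes("number")
    if not numeric.empty:
        labels = IsolationForest(
            contamination=contamination, random_state=42
        ).fit_predict(numeric)
        df = df[labels == 1]
    return df
```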
How We Built It
We developed SYNTHIA using a service-oriented architecture composed of five Python-based agents:
- **sample_gen_agent**: Uses LLMs to convert user prompts or CSV files into structured pandas DataFrames.
- **gap_det_agent**: Applies statistical imputation, deduplication, and anomaly filtering for data quality enhancement.
- **gan_agent**: Leverages the SDV library’s CTGAN to generate synthetic records from preprocessed datasets.
- **eda_agent**: Interprets user queries using LLMs to generate executable EDA code with pandas, seaborn, and matplotlib.
- **feed_agent**: Evaluates the analytical relevance and correctness of insights, deciding whether to re-trigger the generation pipeline.
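The hand-off between these agents, including the feedback-driven regeneration loop, can be expressed as a simple control flow. This is a minimal sketch with stand-in callables, not SYNTHIA's actual orchestration code; the `max_retries` cap is an assumption.

```python
from typing import Any, Callable, Optional

def run_pipeline(
    generate: Callable[[], Any],       # sample_gen_agent: prompt/CSV -> DataFrame
    clean: Callable[[Any], Any],       # gap_det_agent: imputation, dedup, filtering
    analyze: Callable[[Any], Any],     # eda_agent: LLM-generated EDA
    evaluate: Callable[[Any], bool],   # feed_agent: accept or reject the insight
    max_retries: int = 3,
) -> Optional[Any]:
    """Chain the agents; if the feedback agent rejects the insight, regenerate."""
    for _ in range(max_retries):
        data = clean(generate())
        insight = analyze(data)
        if evaluate(insight):
            return insight
    return None  # every attempt was rejected by the feedback agent
```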
Technologies used include:
- **LangChain + Groq API (LLaMA 3-70B)** for natural language to data/code generation
- **SDV’s CTGAN** for synthetic tabular modeling
- matplotlib/seaborn for data visualization
- Python’s exec with isolated context and stdout capture for safe code execution
- HTML rendering using base64 encoding for portability
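The `exec`-with-isolated-context pattern mentioned above can be sketched as follows: LLM-generated code runs in a fresh namespace while its stdout is captured for the report. This is a simplified illustration, not SYNTHIA's actual sandbox (a real deployment would also restrict builtins and imports).

```python
import io
from contextlib import redirect_stdout
from typing import Optional, Tuple

def run_generated_code(code: str, context: Optional[dict] = None) -> Tuple[str, dict]:
    """Execute generated code in an isolated namespace, capturing printed output."""
    namespace = dict(context or {})      # fresh copy: caller's globals stay untouched
    buffer = io.StringIO()
    try:
        with redirect_stdout(buffer):    # capture stdout instead of leaking it
            exec(code, namespace)
    except Exception as exc:             # surface failures to the feedback agent
        return f"ERROR: {exc}", namespace
    return buffer.getvalue(), namespace
```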
Challenges We Faced
- Ensuring syntactically valid and contextually accurate Python code from LLM-generated responses
- Handling schema mismatch and column-case sensitivity between prompts and DataFrame structures
- Managing non-determinism in generative models while ensuring analytical reproducibility
- Designing a robust feedback loop to handle null outputs, empty plots, or failed analysis
- Balancing computational efficiency with flexibility across varying dataset sizes
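The schema-mismatch and column-case issue above is typically handled by normalizing column names before matching prompt references against the DataFrame. A minimal sketch (the normalization rules are assumptions, not SYNTHIA's exact logic):

```python
import pandas as pd

def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Lowercase and snake_case column names so prompt references match the schema."""
    df = df.copy()
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df
```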
Accomplishments We're Proud Of
- Delivered a fully autonomous, modular pipeline from user prompt to high-quality, visualized insights
- Successfully integrated classical statistical methods with LLM-driven generative workflows
- Enabled plot and summary rendering without reliance on front-end libraries
- Achieved robust execution through exception-handling, sandboxed environments, and feedback-based regeneration
- Built an architecture that is production-ready, portable, and extensible across industries and datasets
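The front-end-free plot rendering above relies on base64-encoding matplotlib figures into self-contained HTML. A sketch of that idea, with an illustrative function name:

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless backend: render without a display server
import matplotlib.pyplot as plt

def figure_to_html(fig) -> str:
    """Encode a matplotlib figure as a self-contained HTML <img> tag."""
    buffer = io.BytesIO()
    fig.savefig(buffer, format="png", bbox_inches="tight")
    plt.close(fig)  # free the figure once it has been serialized
    encoded = base64.b64encode(buffer.getvalue()).decode("ascii")
    return f'<img src="data:image/png;base64,{encoded}"/>'
```

Because the image bytes are embedded in the tag itself, the resulting HTML needs no static-file hosting or JavaScript to display.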
What We Learned
- Advanced prompt engineering for structured and analytical outputs
- Best practices in synthetic data modeling with **CTGAN and class balancing**
- Multi-agent orchestration design for AI-driven data workflows
- Secure sandboxing for AI-generated Python execution
- Practical trade-offs between LLM creativity and deterministic code generation
What's Next for SYNTHIA
- Extend support to multimodal datasets (e.g., time series, geospatial, image-derived tabular data)
- Deploy as an interactive web application via Streamlit Cloud or Hugging Face Spaces
- Integrate with enterprise data lakes and cloud platforms (e.g., BigQuery, AWS S3, Snowflake)
- Enable auto-generation of shareable PDF reports and dashboard-ready insights
- Implement reinforcement-based feedback for adaptive insight refinement
**NOTE:** We had initially requested 300 AWS credits as per the hackathon guidelines and submitted the request form through the official link. However, due to the early closure of the credit distribution and the absence of a confirmation, we were unable to proceed with AWS-hosted deployment.

As a result, we deployed SYNTHIA on Hugging Face Spaces (free tier), ensuring continued public accessibility, reproducibility, and live demonstration capability without compromising performance or functionality.
Built by Team **x2X Coders (SYNTHIA)** for the AWS & Impetus GenAI Hackathon 2025, accelerating the future of data science automation.
Built With
- base64
- ctgan-(sdv)
- dotenv
- hugging-face-spaces
- io
- javascript
- langchain
- llama-3-(groq-api)
- matplotlib
- pandas
- python
- re
- scikit-learn
- seaborn
- streamlit
- streamlit-cloud