**SYNTHIA** – Autonomous Multi-Agent Framework for Data Generation, Cleansing, and Exploratory Analysis
Inspiration
Data preparation continues to be one of the most time-consuming and labor-intensive stages in the machine learning lifecycle. Manual data cleaning, exploratory analysis, and synthetic data generation often require specialized skills, significant engineering effort, and repetitive coding.
Our inspiration stemmed from a single question:
"Can we fully automate the process of converting a user intent or prompt into a high-quality, clean, and insight-ready dataset?"
SYNTHIA was born out of the ambition to create a fully autonomous system that bridges the gap between raw business requirements and analysis-ready tabular data—driven by LLMs and generative modeling techniques.
What It Does
SYNTHIA is a modular, autonomous, multi-agent framework that performs end-to-end processing of structured tabular data:
- Generates realistic tabular datasets from natural language prompts using **Groq's LLaMA 3-70B via LangChain**.
- Cleans datasets by imputing missing values, removing duplicates, and filtering outliers using Isolation Forest.
- Synthesizes new data samples using **CTGAN (Conditional Tabular GAN)**, enabling privacy-preserving augmentation.
- Performs **prompt-driven Exploratory Data Analysis (EDA)**, automatically generating valid Python code for analysis and visualization.
- Produces visual plots and statistical summaries in **HTML format (base64-encoded)**, ready for integration into dashboards or reporting tools.
- Includes a feedback agent that evaluates insight quality and autonomously determines whether regeneration is required.
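The cleaning step described above (imputation, deduplication, and Isolation Forest outlier filtering) can be sketched roughly as follows. Function and parameter names here are illustrative, not SYNTHIA's actual API; the imputation strategy (median/mode) and the contamination rate are assumptions.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

def clean_dataset(df: pd.DataFrame, contamination: float = 0.05) -> pd.DataFrame:
    """Illustrative cleaning pass: deduplicate, impute, then filter outliers."""
    df = df.drop_duplicates()
    # Impute numeric columns with the median, categorical columns with the mode
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            mode = df[col].mode()
            df[col] = df[col].fillna(mode.iloc[0] if not mode.empty else "")
    # Isolation Forest labels inliers as 1 and outliers as -1; keep inliers only
    numeric = df.select_dtypes("number")
    if not numeric.empty:
        labels = IsolationForest(
            contamination=contamination, random_state=42
        ).fit_predict(numeric)
        df = df[labels == 1]
    return df
```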
How We Built It
We developed SYNTHIA using a service-oriented architecture composed of five Python-based agents:
- **sample_gen_agent**: Uses LLMs to convert user prompts or CSV files into structured pandas DataFrames.
- **gap_det_agent**: Applies statistical imputation, deduplication, and anomaly filtering for data quality enhancement.
- **gan_agent**: Leverages the SDV library’s CTGAN to generate synthetic records from preprocessed datasets.
- **eda_agent**: Interprets user queries using LLMs to generate executable EDA code with pandas, seaborn, and matplotlib.
- **feed_agent**: Evaluates the analytical relevance and correctness of insights, deciding whether to re-trigger the generation pipeline.
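The hand-off between these agents, including the feedback-driven regeneration loop, can be expressed as a simple control flow. This is a minimal sketch with stand-in callables, not SYNTHIA's actual orchestration code; the `max_retries` cap is an assumption.

```python
from typing import Any, Callable, Optional

def run_pipeline(
    generate: Callable[[], Any],       # sample_gen_agent: prompt/CSV -> DataFrame
    clean: Callable[[Any], Any],       # gap_det_agent: imputation, dedup, filtering
    analyze: Callable[[Any], Any],     # eda_agent: LLM-generated EDA
    evaluate: Callable[[Any], bool],   # feed_agent: accept or reject the insight
    max_retries: int = 3,
) -> Optional[Any]:
    """Chain the agents; if the feedback agent rejects the insight, regenerate."""
    for _ in range(max_retries):
        data = clean(generate())
        insight = analyze(data)
        if evaluate(insight):
            return insight
    return None  # every attempt was rejected by the feedback agent
```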
Technologies used include:
- **LangChain + Groq API (LLaMA 3-70B)** for natural language to data/code generation
- **SDV’s CTGAN** for synthetic tabular modeling
- matplotlib/seaborn for data visualization
- Python’s exec with isolated context and stdout capture for safe code execution
- HTML rendering using base64 encoding for portability
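The `exec`-with-isolated-context pattern mentioned above can be sketched as follows: LLM-generated code runs in a fresh namespace while its stdout is captured for the report. This is a simplified illustration, not SYNTHIA's actual sandbox (a real deployment would also restrict builtins and imports).

```python
import io
from contextlib import redirect_stdout
from typing import Optional, Tuple

def run_generated_code(code: str, context: Optional[dict] = None) -> Tuple[str, dict]:
    """Execute generated code in an isolated namespace, capturing printed output."""
    namespace = dict(context or {})      # fresh copy: caller's globals stay untouched
    buffer = io.StringIO()
    try:
        with redirect_stdout(buffer):    # capture stdout instead of leaking it
            exec(code, namespace)
    except Exception as exc:             # surface failures to the feedback agent
        return f"ERROR: {exc}", namespace
    return buffer.getvalue(), namespace
```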
Challenges We Faced
- Ensuring syntactically valid and contextually accurate Python code from LLM-generated responses
- Handling schema mismatch and column-case sensitivity between prompts and DataFrame structures
- Managing non-determinism in generative models while ensuring analytical reproducibility
- Designing a robust feedback loop to handle null outputs, empty plots, or failed analysis
- Balancing computational efficiency with flexibility across varying dataset sizes
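The schema-mismatch and column-case issue above is typically handled by normalizing column names before matching prompt references against the DataFrame. A minimal sketch (the normalization rules are assumptions, not SYNTHIA's exact logic):

```python
import pandas as pd

def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Lowercase and snake_case column names so prompt references match the schema."""
    df = df.copy()
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df
```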
Accomplishments We're Proud Of
- Delivered a fully autonomous, modular pipeline from user prompt to high-quality, visualized insights
- Successfully integrated classical statistical methods with LLM-driven generative workflows
- Enabled plot and summary rendering without reliance on front-end libraries
- Achieved robust execution through exception-handling, sandboxed environments, and feedback-based regeneration
- Built an architecture that is production-ready, portable, and extensible across industries and datasets
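The front-end-free plot rendering above relies on base64-encoding matplotlib figures into self-contained HTML. A sketch of that idea, with an illustrative function name:

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless backend: render without a display server
import matplotlib.pyplot as plt

def figure_to_html(fig) -> str:
    """Encode a matplotlib figure as a self-contained HTML <img> tag."""
    buffer = io.BytesIO()
    fig.savefig(buffer, format="png", bbox_inches="tight")
    plt.close(fig)  # free the figure once it has been serialized
    encoded = base64.b64encode(buffer.getvalue()).decode("ascii")
    return f'<img src="data:image/png;base64,{encoded}"/>'
```

Because the image bytes are embedded in the tag itself, the resulting HTML needs no static-file hosting or JavaScript to display.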
What We Learned
- Advanced prompt engineering for structured and analytical outputs
- Best practices in synthetic data modeling with **CTGAN and class balancing**
- Multi-agent orchestration design for AI-driven data workflows
- Secure sandboxing for AI-generated Python execution
- Practical trade-offs between LLM creativity and deterministic code generation
What's Next for SYNTHIA
- Extend support to multimodal datasets (e.g., time series, geospatial, image-derived tabular data)
- Deploy as an interactive web application via Streamlit Cloud or Hugging Face Spaces
- Integrate with enterprise data lakes and cloud platforms (e.g., BigQuery, AWS S3, Snowflake)
- Enable auto-generation of shareable PDF reports and dashboard-ready insights
- Implement reinforcement-based feedback for adaptive insight refinement
**NOTE:** We had initially requested 300 AWS credits as per the hackathon guidelines and submitted the request form through the official link. However, due to the early closure of the credit distribution and the absence of a confirmation, we were unable to proceed with AWS-hosted deployment.

As a result, we deployed SYNTHIA on Hugging Face Spaces (free tier), ensuring continued public accessibility, reproducibility, and live demonstration capability without compromising performance or functionality.
Built by Team **x2X Coders (SYNTHIA)** for the AWS & Impetus GenAI Hackathon 2025, accelerating the future of data science automation.
Built With
- base64
- ctgan-(sdv)
- dotenv
- hugging-face-spaces
- io
- javascript
- langchain
- llama-3-(groq-api)
- matplotlib
- pandas
- python
- re
- scikit-learn
- seaborn
- streamlit
- streamlit-cloud