Inspiration
Most people spend more than half of their time just finding the right data – "The Data Problem No One Discusses." This problem arises because:
- Data is trapped in disorganized PDFs.
- Important information is missing from public datasets.
- Coding skills are needed to create synthetic data.
NexusCore transforms "I can't find the right data" into "I made it in 60 seconds."
The Pain Is Real:
- Need something specific? It doesn't exist.
- Found something "close"? Cleaning it takes weeks.
- Tried collecting it yourself? Error-prone and frustrating.
NexusCore is designed for anyone who has ever:
- Spent hours scraping websites.
- Given up on a concept due to the lack of a suitable dataset.
- Struggled to extract tables from research papers.
Our Big Idea:
What if you didn't just search for data? What if you could generate the perfect dataset in an instant? NexusCore is your shortcut to clean, usable, custom data.
What it does
AI-powered dataset generation:
- Prompt-to-CSV: Describe your needs → Get structured data (e.g., "100 rows of smartphone sales in Asia 2023").
- Document Intelligence: Upload PDFs → Extract tables → Expand datasets.
- Prompt-to-CSV + Document Intelligence: Describe your needs → Upload PDFs → Get structured data.
- One-Click Export: Clean CSVs ready for analysis in Pandas/Excel.
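The prompt-to-CSV flow above can be sketched roughly as follows. This is a minimal illustration, not NexusCore's actual code: `generate_csv` and the stubbed model function are hypothetical names, and a real deployment would pass in a wrapper around the Groq chat-completion call.

```python
import csv
import io

def generate_csv(prompt, model_call):
    """Hypothetical prompt-to-CSV flow: ask the model for raw CSV text,
    then parse it into rows for preview and export."""
    instruction = (
        "Return ONLY valid CSV (header row first), no markdown fences, "
        "for this request: " + prompt
    )
    raw = model_call(instruction)  # a real app would call the LLM here
    return list(csv.DictReader(io.StringIO(raw)))

# Stub standing in for the model, so the sketch runs without an API key.
def fake_model(_instruction):
    return "region,units_sold\nAsia,120\nEurope,95"

rows = generate_csv("smartphone sales by region", fake_model)
print(rows[0]["region"])  # → Asia
```

Parsing the model output through `csv.DictReader` (rather than splitting strings by hand) is what makes the one-click export step trivial: the rows are already structured for Pandas or Excel.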
How we built it
NexusCore runs on a straightforward but effective tech stack. The frontend, built with HTML, CSS, and JavaScript, offers a seamless drag-and-drop interface with live preview, making dataset generation quick and easy. On the backend, a session-based processing pipeline built with Flask and Python keeps each task isolated and responsive. The AI engine integrates Groq and LLaMA3-70B, using prompt chaining to generate structured CSV files dynamically from user input. For documents, we combined PyPDF2 with custom layout-aware logic to accurately extract text and tables from complex PDFs.
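The prompt chaining mentioned above might look something like this two-step sketch: first ask the model to commit to a schema, then ask it to generate rows against that schema. The function names and prompts are illustrative assumptions, and a stub stands in for the Groq/LLaMA3 call.

```python
def chain_prompts(request, llm):
    """Illustrative two-step prompt chain: pin down a column schema first,
    then generate data rows that conform to it."""
    schema = llm(
        f"List the CSV column names for: {request}. "
        "Reply with a comma-separated header only."
    )
    rows = llm(
        f"Using exactly these columns: {schema}\n"
        f"Generate 3 CSV data rows for: {request}. No header, no markdown."
    )
    return schema + "\n" + rows

# Stub standing in for the real LLM call, keyed on the prompt wording.
def stub_llm(prompt):
    if "column names" in prompt:
        return "model,price_usd"
    return "Phone A,299\nPhone B,499\nPhone C,699"

output = chain_prompts("smartphone prices", stub_llm)
print(output.splitlines()[0])  # → model,price_usd
```

Splitting schema and data into separate calls keeps each prompt small and gives the second call a fixed header to conform to, which is one plausible reason chaining improves CSV structure.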
Challenges we ran into
- PDF Complexity: Extracting text from research papers (math symbols, multi-column layouts).
- AI Sanitization: Stripping markdown artifacts from CSV output.
- Error Recovery: Gracefully handling Groq API timeouts.
- UI Sync: Preserving state across the upload, generate, and preview steps.
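The AI-sanitization challenge is concrete enough to sketch: LLMs often wrap CSV in code fences, chatty preambles, or markdown table pipes. The rules below are illustrative, not NexusCore's exact cleanup code.

```python
def sanitize_llm_csv(raw):
    """Strip common markdown artifacts that LLMs wrap around CSV output.
    Assumes every real data line contains at least one comma."""
    lines = []
    for ln in raw.strip().splitlines():
        ln = ln.strip()
        if ln.startswith("```"):   # markdown code fence (``` or ```csv)
            continue
        if "," not in ln:          # preamble like "Here is your CSV:"
            continue
        lines.append(ln.strip("| "))  # stray markdown table pipes
    return "\n".join(lines)

messy = "Sure! Here you go:\n```csv\na,b\n1,2\n```"
print(sanitize_llm_csv(messy))
```

The comma heuristic would misfire on single-column data, so a production version would likely key off the requested schema instead.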
Accomplishments that we're proud of
- Core Functionality Achieved: Built a working prompt-to-CSV engine that turns natural language queries into structured datasets.
- PDF Innovation: Successfully extracted complex tabular data from research papers, including multi-column layouts and mathematical symbols.
- Smooth Integration: Built a unified pipeline from UI → Groq API → CSV generation with real-time preview.
- Zero Leak Architecture: Implemented secure session-based processing to isolate each user's data.
- User-Centric Design: Created an easy-to-use interface that requires no technical expertise.
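The session-isolation idea behind the "zero leak" claim can be sketched simply: give every session its own throwaway directory keyed by an unguessable id. This is an assumption about the general pattern, not NexusCore's actual implementation.

```python
import os
import tempfile
import uuid

def session_workspace(base=None):
    """Create an isolated per-session working directory (illustrative).
    All of a session's uploads and generated files would live here, and
    the whole directory can be deleted when the session ends."""
    base = base or tempfile.mkdtemp(prefix="nexuscore_")
    sid = uuid.uuid4().hex            # unguessable 32-char session id
    path = os.path.join(base, sid)
    os.makedirs(path)
    return sid, path

sid, path = session_workspace()
```

Keying filesystem state on a random session id means one user's files are never visible under another user's paths, which is the essence of the isolation described above.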
What we learned
- PDFs Are Surprisingly Hard: Basic text extraction fails on academic papers; layout awareness is essential.
- Real-Time AI Constraints: Groq's speed-versus-timeout tradeoffs demand careful prompt engineering and error handling.
- Prompt Chaining Magic: Learned how multi-step LLM instructions dramatically improve CSV structure.
- User Pain Validated: Testing confirmed that "dataset frustration" is felt across every domain.
What's next for NexusCore
- Domain-Specific Engines: Industry-tailored templates (medical, finance, research).
- Smart Cleaning Suite: Automatically fix missing values, outliers, and formatting errors in generated datasets.
- Collaboration Hub: Shared workspaces with dataset versioning and commenting.
- Dataset Marketplace: Share and remix community-created datasets.
- AI Validation Assistant: Automatic quality checks on generated data.
- Multi-Modal Inputs: Support for web scraping, spreadsheets, and images.
The ultimate goal is to democratize the creation of datasets so that any professional, from scientists to journalists, can produce flawless data in less than 60 seconds without knowing how to write code.