Inspiration
DataCrawl was inspired by a simple frustration: people often know exactly what dataset they want, but turning that idea into a clean, usable file still takes too much manual work. Financial data is especially difficult because it can come from APIs, websites, gated dashboards, or paid providers, each with different rules, formats, and limitations. We wanted to build a system where a user could describe the dataset they needed in plain English, approve a clear plan, and then let the platform handle the rest as autonomously as possible.
What it does
DataCrawl is an AI-powered financial-data pipeline. A user asks for a dataset in natural language, and the system plans how to obtain it, evaluates source feasibility and budget, and then executes the collection flow. It can work across APIs, browser-based sources, and account-gated services, while surfacing compliance concerns, costs, and required user approvals. The platform also validates that the final dataset actually matches what was requested, with a strong focus on schema accuracy and expected data volume.
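The final validation step can be sketched roughly like this. This is an illustrative, stdlib-only sketch, not DataCrawl's actual validator: the function name, `expected_schema` mapping, and `min_rows` threshold are all assumptions standing in for the real checks on schema accuracy and data volume.

```python
# Illustrative sketch of a dataset validator: check row count first,
# then verify every row carries the requested columns with usable types.
def validate_dataset(rows, expected_schema, min_rows):
    errors = []

    # Volume check: flag short deliveries before inspecting rows.
    if len(rows) < min_rows:
        errors.append(f"expected at least {min_rows} rows, got {len(rows)}")

    # Schema check: exact columns, values of the declared types.
    for i, row in enumerate(rows):
        missing = expected_schema.keys() - row.keys()
        if missing:
            errors.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        for col, typ in expected_schema.items():
            if not isinstance(row[col], typ):
                errors.append(f"row {i}: column {col!r} is not {typ.__name__}")

    return (len(errors) == 0, errors)


rows = [
    {"ticker": "AAPL", "close": 227.5},
    {"ticker": "MSFT", "close": "n/a"},  # wrong type: should be flagged
]
ok, errors = validate_dataset(rows, {"ticker": str, "close": float}, min_rows=2)
```

A failed check like the one above is what lets the system report a schema mismatch instead of silently handing the user a malformed file.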
How we built it
We built DataCrawl as a multi-agent system. A Gemini-based orchestrator handles the planning layer: interpreting the request, identifying candidate sources, checking feasibility, and coordinating execution. Specialized subagents handle compliance review, script generation, browser crawling, normalization, and validation. The backend uses FastAPI and LangGraph to manage stateful execution, while Firebase stores run state and generated artifacts. On the frontend, we built a chat-first interface that shows planning, approvals, live reasoning, progress, and execution status in real time.
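Stripped of the LangGraph and Firebase machinery, the orchestration flow reduces to stages passing a shared state forward. The sketch below is a plain-Python approximation under assumed names (`plan`, `compliance_review`, and the state keys are illustrative, not our actual node definitions):

```python
# Minimal sketch of the plan -> compliance -> execute -> validate flow.
# Each stage reads and extends a shared state dict, the same pattern
# LangGraph manages with typed state and checkpointing.

def plan(state):
    # Orchestrator interprets the request and lists candidate sources.
    state["sources"] = ["provider_api", "browser_crawl"]
    return state

def compliance_review(state):
    # A compliance subagent filters out sources the user has not cleared.
    state["approved_sources"] = [
        s for s in state["sources"]
        if s != "browser_crawl" or state.get("crawl_allowed")
    ]
    return state

def execute(state):
    # Execution subagents collect rows from the approved sources.
    state["rows"] = (
        [{"ticker": "AAPL", "close": 227.5}] if state["approved_sources"] else []
    )
    return state

def validate(state):
    # Validator confirms the delivery before surfacing it to the user.
    state["valid"] = len(state["rows"]) > 0
    return state

PIPELINE = [plan, compliance_review, execute, validate]

def run(request):
    state = {"request": request}
    for stage in PIPELINE:
        state = stage(state)
    return state

result = run("daily closes for AAPL")
```

In the real system each stage is a graph node with its own retries and streaming output, which is why stateful execution (rather than a one-shot prompt) is the backbone of the design.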
Challenges we ran into
The hardest part was not generating plans, but making the whole pipeline reliable. We had to fix orchestration stalls, make generated scripts actually execute and return datasets, and build a strict validator that could detect schema mismatches or row-count failures. Another major challenge was handling expensive or risky actions safely, especially paid provider signups. That required explicit approval gates, budget tracking, manual checkout pauses, and tight control over what the autonomous agents were allowed to do.
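The approval-gate idea is easiest to see in miniature. The class, action names, and dollar figures below are hypothetical, a sketch of the pattern rather than our implementation:

```python
# Sketch of an approval gate with budget tracking: cheap, pre-approved
# actions run autonomously; paid actions pause the run until the user
# approves; anything over budget is refused outright.

class BudgetExceeded(Exception):
    pass

class ApprovalGate:
    def __init__(self, budget, approved_actions=None):
        self.budget = budget
        self.spent = 0.0
        self.approved = set(approved_actions or [])

    def request(self, action, cost):
        """True = action may run now; False = pause and ask the user."""
        if self.spent + cost > self.budget:
            raise BudgetExceeded(f"{action!r} would exceed budget {self.budget}")
        if cost > 0 and action not in self.approved:
            return False  # pause run, surface an approval prompt in the UI
        self.spent += cost
        return True

gate = ApprovalGate(budget=10.0, approved_actions={"api_call"})
assert gate.request("api_call", 2.0) is True      # pre-approved, runs
assert gate.request("paid_signup", 5.0) is False  # pauses for approval
```

The "pause and ask" return path is what makes manual checkout possible: the run freezes at the gate instead of letting an agent complete a paid signup on its own.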
Accomplishments
What we are most proud of is turning the project into more than a demo chatbot. DataCrawl can now plan in detail, explain costs, coordinate multiple specialized agents, stream reasoning to the UI, enforce validation, and pause safely when user approval is required. Making the system observable, controllable, and budget-aware also made it far more trustworthy, which is essential for any real autonomous data workflow.
What we learned
We learned that agent systems are mostly systems-engineering problems, not just prompt-engineering problems. Reliability comes from good state management, strong tool contracts, observability, and careful failure handling. We also learned that “autonomy” only works when paired with trust: users need visibility into what the system is doing, why it chose a source, how much it may cost, and when it needs permission to proceed.
What's next for DataCrawl
The next step is making the system even more robust and production-ready. We want to improve source coverage across more financial providers, strengthen retries and fallback logic, and make paid-provider flows safer and smoother. We also want better artifact previews, deeper lineage tracking, and stronger cost controls so users can understand not just the dataset they received, but exactly how it was created.
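The retry-and-fallback behavior we want to strengthen follows a simple shape: retry a source through transient failures, then fall back to the next one. A hypothetical sketch (source names and fetch callables are invented for illustration):

```python
# Sketch of retry-with-fallback across data sources: each source gets a
# few attempts with exponential backoff before the next source is tried.
import time

def fetch_with_fallback(sources, attempts=3, delay=0.0):
    """sources: list of (name, fetch_fn). Returns (name, data) from the
    first source that succeeds; raises if every source fails."""
    last_error = None
    for name, fetch in sources:
        for attempt in range(attempts):
            try:
                return name, fetch()
            except Exception as exc:          # transient failure
                last_error = exc
                time.sleep(delay * (2 ** attempt))  # backoff before retry
    raise RuntimeError(f"all sources failed: {last_error}")

calls = {"n": 0}

def flaky_primary():
    calls["n"] += 1
    raise TimeoutError("provider timed out")

def stable_backup():
    return [{"ticker": "AAPL", "close": 227.5}]

source, rows = fetch_with_fallback(
    [("primary", flaky_primary), ("backup", stable_backup)]
)
```

Layering lineage on top of this, recording which source ultimately served the data and after how many attempts, is what would let users see exactly how their dataset was created.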
Extended DataCrawl Demo: https://www.youtube.com/watch?v=3mwebD-jOb4
Built With
- auth0
- beautiful-soup
- css
- fastapi
- firebase
- firestore
- gemini
- html
- langgraph
- pydantic
- python
- qwen
- react
- solana
- stripe
- typescript
- uvicorn