Inspiration
The genesis of GreenTrace emerged from a stark realization: according to the European Commission, 53% of green claims in the EU are vague, misleading, or unfounded — yet companies continue publishing sustainability reports with minimal accountability. Corporate "greenwashing" is rampant, and regulators like the Swiss Federal Council are scrambling to catch up with EU frameworks.
We asked ourselves: Can we build a system that does what humans do manually — cross-reference what companies claim against what NGOs, journalists, and regulators actually report — but do it instantly, at scale, and with every claim traced to a real source?
The Apify Challenge gave us the framework we needed. What started as an idea on day one became a fully functional end-to-end pipeline by day two.
What it does
GreenTrace is an agentic ESG scrutiny engine that replaces hallucination-prone AI analysis with verdicts grounded in real evidence.
User experience:
- You type a company name (e.g., "H&M", "Nestlé", "UBS").
- The system scrapes live news, NGO reports, regulatory decisions, and investigative journalism from the internet in real-time.
- Evidence is normalized, chunked semantically, and embedded into a vector database.
- An LLM agent retrieves the most contradictory and supportive claims, then synthesizes a highly structured verdict.
- The user sees: Verdict → Supporting Evidence → Contradicting Evidence → Source URLs + Publication Dates.
Every claim is grounded. No hallucinations. No training data regurgitation. Just facts pulled from documents scraped minutes ago.
How we built it
The Architecture
We built a complete, distributed system across three major components:
1. Data Ingestion Pipeline (Apify + FastAPI Backend)
- A custom Apify Actor (sama4/greentrace-scrapper) scrapes ESG-related search results, news articles, and regulatory documents using proxy rotation and anti-bot bypass.
- Raw HTML/JSON flows through a normalization layer that strips boilerplate and scores source reliability.
- A chunking service segments content using a sliding-window algorithm: 180-word chunks with 40-word overlap to preserve semantic context.
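Roughly, the chunker boils down to this (a minimal sketch of the sliding-window idea; names are illustrative, and the real service also keeps source metadata attached to each chunk):

```python
def chunk_text(text: str, chunk_words: int = 180, overlap_words: int = 40) -> list[str]:
    """Split text into overlapping word windows so context survives the cut points."""
    words = text.split()
    step = chunk_words - overlap_words  # advance 140 words per window
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_words]
        if not window:
            break
        chunks.append(" ".join(window))
        if start + chunk_words >= len(words):
            break  # the last window already reached the end of the document
    return chunks
```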
2. Vector Storage & Retrieval (Qdrant + FastEmbed)
- We chose Qdrant Cloud for its speed and scalability in dense vector search.
- Embeddings are generated on the fly using FastEmbed (BAAI/bge-small-en), a CPU-native embedding model that requires zero GPU overhead.
- A database adapter layer handles collection initialization and vector upserts.
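Condensed, the adapter layer looks roughly like this (collection name, payload fields, and the ID scheme are illustrative; the production adapter adds batching and retries):

```python
from fastembed import TextEmbedding
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

embedder = TextEmbedding(model_name="BAAI/bge-small-en")  # 384-dim vectors, CPU-only
client = QdrantClient(url="https://YOUR-CLUSTER.qdrant.io", api_key="...")

COLLECTION = "greentrace_evidence"  # illustrative name

def init_collection() -> None:
    # Create the collection once; bge-small-en produces 384-dimensional vectors.
    if not client.collection_exists(COLLECTION):
        client.create_collection(
            collection_name=COLLECTION,
            vectors_config=VectorParams(size=384, distance=Distance.COSINE),
        )

def upsert_chunks(chunks: list[str], company: str) -> None:
    # Embed each chunk and store it alongside its payload (text + company tag).
    vectors = list(embedder.embed(chunks))
    points = [
        PointStruct(id=i, vector=vec.tolist(), payload={"text": txt, "company": company})
        for i, (txt, vec) in enumerate(zip(chunks, vectors))
    ]
    client.upsert(collection_name=COLLECTION, points=points)
```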
3. LLM Orchestration & Analysis (PydanticAI + Groq)
- An orchestration layer defines a strict ESGAnalysis JSON schema with fields like verdict, supporting_evidence, contradicting_evidence, and sources_cited.
- By enforcing Pydantic models at the interface level, we eliminated the risk of LLM format drift or hallucination outside the bounded schema.
- The Groq API (llama-3.3-70b-versatile) provides sub-second LPU-accelerated inference.
- The agent prompt explicitly mandates: "Your ONLY source of truth is the provided chunked text. If evidence does not support a conclusion, mark it 'Unknown'. Do not hallucinate."
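Stripped down, the agent wiring looks like this (a sketch; exact PydanticAI keyword names differ slightly across versions, and the prompt assembly is simplified):

```python
from pydantic import BaseModel
from pydantic_ai import Agent

class ESGAnalysis(BaseModel):
    verdict: str                      # e.g. "Greenwashing risk: high" or "Unknown"
    supporting_evidence: list[str]
    contradicting_evidence: list[str]
    sources_cited: list[str]          # URLs taken verbatim from retrieved chunks

agent = Agent(
    "groq:llama-3.3-70b-versatile",   # requires GROQ_API_KEY in the environment
    output_type=ESGAnalysis,          # called result_type in older PydanticAI releases
    system_prompt=(
        "Your ONLY source of truth is the provided chunked text. "
        "If evidence does not support a conclusion, mark it 'Unknown'. Do not hallucinate."
    ),
)

def analyse(company: str, chunks: list[str]) -> ESGAnalysis:
    prompt = f"Company: {company}\n\nEvidence chunks:\n" + "\n---\n".join(chunks)
    result = agent.run_sync(prompt)
    return result.output              # .data in older PydanticAI releases
```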
4. Frontend (Next.js + Vercel)
- A dynamic Next.js application renders company dossiers server-side and hydrates client-side data flows.
- Real-time loading states show users when Apify is scraping, when chunks are being embedded, and when the LLM is analyzing.
- Evidence carousel allows users to click through the exact text chunks that fed the verdict.
- A mixed-content security proxy (via Vercel's Edge infrastructure) bridges the HTTPS frontend to the HTTP backend without tripping the browser's mixed-content restrictions.
Tech Stack (Why Each Choice)
| Layer | Technology | Rationale |
|---|---|---|
| Web Scraping | Apify (sama4/greentrace-scrapper) | Out-of-the-box stealth crawling; handles anti-bot protection and proxy rotation seamlessly. |
| Vector DB | Qdrant Cloud | Highly optimized for dense semantic search at scale; managed infrastructure scales without operational burden. |
| Embeddings | FastEmbed (BAAI/bge-small-en) | CPU-native; zero GPU overhead; fast inference without monthly GPU bills. |
| Agent Design | PydanticAI | Forces structured JSON output; eliminates LLM format unpredictability. |
| LLM | Groq (llama-3.3-70b-versatile) | LPU-accelerated; sub-second latency; enterprise-grade reliability. |
| API Framework | FastAPI (Python 3.11) | Async-native; typed endpoints; auto-generated Swagger UI for debugging. |
| Frontend | Next.js (Vercel) | Hybrid SSR/CSR rendering; edge CDN; zero-config deployments. |
Challenges we ran into
1. Apify Startup Latency — The Killer Bottleneck ⏱️
This was our biggest learning: Apify Actors take 1.5–2 minutes just to boot and begin scraping. Our entire pipeline end-to-end (scrape → normalize → chunk → embed → ingest into Qdrant) takes 2–3 minutes minimum, even for a single company query.
Why?
- Apify must spin up a sandboxed environment, initialize the browser, rotate proxies, and load the DOM.
- Every Actor invocation is essentially a cold-start problem.
- This latency is not negotiable; it's baked into the Apify platform's execution model.
The user impact:
- Queries feel slow. Users expect search results in sub-seconds; they get 2 minutes of "Please wait" screens.
- For a production system serving millions of companies, caching and batch ingestion become non-negotiable.
How we're thinking about mitigation:
- Batch ingestion: Pre-emptively ingest evidence for the 500 most-searched companies monthly.
- Async webhooks: Display results to users as chunks arrive, rather than blocking on the full pipeline.
- Background jobs: Trigger Apify on a schedule (nightly, weekly) so the data is always "warm" in Qdrant when users query.
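As an illustration, a nightly warm-up job could start the Actor without blocking the request path (a sketch using the apify-client library; the input shape is illustrative):

```python
from apify_client import ApifyClient

apify = ApifyClient(token="APIFY_TOKEN")

def warm_company(company: str) -> str:
    # .start() returns immediately with run metadata, unlike .call() which blocks
    # until the Actor finishes, so a scheduled job can fan out many companies
    # and let a webhook or poller ingest the results when they land.
    run = apify.actor("sama4/greentrace-scrapper").start(
        run_input={"query": f"{company} ESG sustainability claims"}  # illustrative input
    )
    return run["id"]  # poll this run (or register a webhook) and ingest on completion
```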
2. Multiple Actors vs. Single Monolithic Actor
We initially envisioned a multi-actor architecture:
- Actor A: Google search.
- Actor B: Fast crawler for URLs discovered by A.
- Actor C: Jina.ai content extraction for edge cases.
This would allow us to stream partial results to the user while downstream actors continue work—a much better UX.
Reality: Apify Actor invocation overhead is so high that spinning up 3 sequential Actors adds 2-3 minutes of pure overhead, not counting the actual work. The sama4/greentrace-scrapper Actor we use already bundles Google search + fast crawler + Jina, which is why we stick with one.
Lesson: For Apify-heavy workflows, consolidate operations into fewer, fatter Actors rather than orchestrating many small ones. The startup tax is too expensive.
3. Learning Vector Databases & Embeddings from Scratch
This was our first time working with vector databases and embeddings. The entire workflow—understanding how to chunk data semantically, generate embeddings, store them efficiently, and retrieve them semantically—was completely new territory.
The challenge: We had to learn:
- How chunking strategies affect retrieval quality (180-word chunks with 40-word overlap).
- How embedding models work and why dense vectors capture semantic meaning better than keywords.
- How to structure queries and think in terms of vector similarity rather than keyword matching.
- How to index and search across millions of vectors at scale.
The win: Once we understood the fundamentals and got our first successful semantic retrieval working, it clicked. Qdrant itself performed flawlessly—it was our learning curve that was steep, not the platform.
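To make that shift concrete, retrieval ends up looking roughly like this (reusing the embedder, client, and collection from the ingestion sketch above; the filter and limit are illustrative):

```python
from qdrant_client.models import Filter, FieldCondition, MatchValue

def retrieve_evidence(question: str, company: str, top_k: int = 8):
    # Embed the natural-language question and let Qdrant rank stored chunks by
    # cosine similarity, restricted to the company being scrutinised.
    query_vector = next(embedder.embed([question])).tolist()
    return client.search(  # newer qdrant-client releases also expose query_points
        collection_name=COLLECTION,
        query_vector=query_vector,
        query_filter=Filter(
            must=[FieldCondition(key="company", match=MatchValue(value=company))]
        ),
        limit=top_k,
    )
```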
4. Embedding Model Selection
We tested several embedding models:
- sentence-transformers/all-MiniLM-L6-v2 (fast but lower quality).
- openai/text-embedding-3-small (high quality but requires API calls).
- BAAI/bge-small-en (FastEmbed): the sweet spot for our use case.
The challenge: Finding a model that balances semantic quality, inference speed, and operational cost. FastEmbed's CPU-native design was a game-changer.
5. LLM Hallucination & Schema Enforcement
Early iterations used raw string prompting. The LLM would:
- Invent URLs that sounded credible but didn't actually exist.
- Make up publication dates.
- Claim evidence existed when the retrieved chunks didn't support it.
Solution: PydanticAI. By defining a strict ESGAnalysis schema and forcing the LLM to emit valid JSON against that schema, we eliminated roughly 95% of hallucinations. The remaining cases come from the LLM subtly misreading retrieved evidence rather than inventing it, which is a different failure mode: the model is still working from the data, not making things up.
6. Mixed Content Security (HTTPS Frontend → HTTP Backend)
Next.js is deployed on Vercel (HTTPS). Our API runs on AWS EC2 (HTTP). Modern browsers block this by default.
Initial instinct: Set up SSL certificates on EC2. Too much operational overhead for a hackathon project.
Smart move: Use Next.js's native rewrites() to proxy requests through Vercel's Edge infrastructure. The browser sees HTTPS requests to vercel.app, and the Edge securely forwards to http://ec2-backend.com. Zero SSL certificates needed.
Accomplishments that we're proud of
1. End-to-End, Fully Functional System Built in 48 Hours
We took an ambitious idea and shipped a complete, production-grade pipeline from data ingestion to LLM-powered synthesis, all deployed live on Vercel and AWS. That's not trivial.
2. Hallucination-Free LLM Output
Most GenAI applications ship with disclaimers like "The AI may make mistakes." We designed a system where the model cannot assert a conclusion the evidence does not support: the LLM is forced to say "Unknown" rather than invent facts.
3. Blazing-Fast Retrieval Once Data is Ingested ⚡
This is where our architecture shines. After the initial 2–3 minute Apify scrape:
- Semantic retrieval from Qdrant: <200ms
- LLM inference (Groq): <500ms
- Total analysis latency: <1 second
Users see verdicts instantly after the first wait. The UI makes this transparent: "Apify is scraping (2 mins) → chunks are here, analyzing now (1 sec)."
4. Modular, Reusable Components
Each service layer (normalization, chunking, vector storage, LLM orchestration) is independently testable and swappable:
- Want to switch from Qdrant to Pinecone? Swap the vector database adapter.
- Want a different chunking strategy? Replace the chunking logic without touching the ingest pipeline.
- Want Claude instead of Groq? Swap the LLM provider in the orchestration layer.
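In code, that boundary is just a small interface, roughly like this (a simplified sketch; the real adapters carry more configuration):

```python
from typing import Protocol

class VectorStore(Protocol):
    """Minimal contract every vector-database adapter must satisfy."""

    def init_collection(self, name: str, dim: int) -> None: ...
    def upsert(self, name: str, texts: list[str], vectors: list[list[float]]) -> None: ...
    def search(self, name: str, vector: list[float], top_k: int) -> list[str]: ...

# The ingest pipeline only ever talks to a VectorStore, so swapping Qdrant for
# Pinecone means writing one new adapter class, not touching the pipeline.
```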
5. Live Deployment & User-Facing Product
We didn't just code in isolation. The system is live at https://greentrace-tau.vercel.app. Anyone can query it. That's a huge confidence boost.
6. Transparent Data Flow UX
The frontend shows users:
- Real-time ingestion progress ("Scraped 12 articles").
- Evidence carousel so users can audit exactly which chunks informed the verdict.
- Source attribution with URLs and publication dates.
This transparency builds trust in an era of AI skepticism.
What we learned
1. Apify is Powerful but Expensive (in Time and Money)
Apify abstracts away 90% of web scraping pain (proxies, anti-bot, browser management). But the startup latency is unavoidable. For future projects, we'll:
- Pre-compute or cache aggressively.
- Use Apify for batch jobs, not real-time queries.
2. Vector Databases Aren't the Bottleneck
Once data is chunked and embedded, Qdrant is incredibly fast. The bottleneck is always upstream (Apify) or downstream (LLM). This is crucial for scaling.
3. PydanticAI Changes Everything
Traditional free-form LLM prompting is a gamble. PydanticAI enforces structure at the interface boundary, validating every response against a schema, which:
- Eliminates format mismatch bugs.
- Makes it trivial to validate that the LLM output is what you actually requested.
- Simplifies testing (you test against a schema, not fragile string parsing).
For any production LLM work, Pydantic models should be non-negotiable.
4. RAG (Retrieval-Augmented Generation) Solves Hallucination, Not Laziness
RAG doesn't stop LLMs from being lazy or making errors. It stops them from inventing facts. If you ask for reasoning, the LLM will still rationalize based on weak evidence. The key is a strong retrieval signal and strict output validation.
5. Monolithic Actors > Distributed Actor Orchestration (for Apify)
The temptation to split Apify work into 5 parallel Actors is strong. Don't do it. The startup cost is murderous. Consolidate into fewer, larger Actors. Parallelization gains don't offset cold-start overhead until you're running at massive scale.
6. Schema-Driven Development Scales Better Than Prompt Engineering
The difference between:
- Prompt version 1: "Summarize the evidence."
- Prompt version 27: "Summarize the evidence, including exactly 3 supporting claims and 2 contradicting claims, formatted as JSON..."
is massive. By defining the output schema first (ESGAnalysis with explicit fields), we forced clarity at design time. No amount of prompt fiddling beats a well-defined schema.
What's next for GreenTrace — Agentic ESG Scrutiny Engine
Phase 2: Production Hardening
Caching & Batch Ingestion
- Ingest the top 1,000 European public companies nightly.
- Serve real-time queries against a warm Qdrant database.
- Fall back to on-demand Apify scraping for unknown companies.
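A sketch of the intended fallback (reusing the Qdrant client and the warm_company helper from the earlier sketches; the freshness threshold is illustrative):

```python
from qdrant_client.models import Filter, FieldCondition, MatchValue

def get_or_ingest(company: str) -> None:
    # Count how much evidence we already hold for this company before paying
    # the Apify cold-start cost.
    hits = client.count(
        collection_name=COLLECTION,
        count_filter=Filter(
            must=[FieldCondition(key="company", match=MatchValue(value=company))]
        ),
        exact=True,
    ).count
    if hits < 20:  # illustrative freshness threshold
        warm_company(company)  # fall back to on-demand Apify scraping
```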
Streaming Results
- Use Server-Sent Events (SSE) to stream partial results to the frontend as they arrive.
- Show "Apify found 5 articles... now analyzing... verdict ready!"
- This masks the 2–3 minute Apify latency psychologically by keeping users engaged.
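With FastAPI, the streaming endpoint could look roughly like this (stage names are illustrative placeholders for the real pipeline hooks):

```python
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/analyze/{company}/stream")
async def stream_analysis(company: str):
    async def events():
        # Emit one SSE message per pipeline stage so the UI can narrate progress.
        yield f"data: {json.dumps({'stage': 'scraping', 'company': company})}\n\n"
        await asyncio.sleep(0)  # in reality: await the Apify run and ingestion tasks
        yield f"data: {json.dumps({'stage': 'analyzing'})}\n\n"
        yield f"data: {json.dumps({'stage': 'verdict_ready'})}\n\n"

    return StreamingResponse(events(), media_type="text/event-stream")
```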
LLM Fine-Tuning
- Fine-tune Groq or open-source LLMs specifically for ESG analysis.
- Reduce hallucination risk further.
Phase 3: Enterprise & Regulatory Integration
API for Compliance Teams
- Expose GreenTrace as a B2B API.
- Regulators (CSRD auditors, SEC reviewers) use it to cross-check corporate ESG claims.
Multi-Source Evidence Weighting
- Don't treat all sources equally. NGO reports > Twitter takes.
- Implement a source credibility model trained on historical accuracy.
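One simple starting point (hypothetical hard-coded weights; the learned credibility model would eventually replace them):

```python
# Hypothetical per-source-type credibility weights.
SOURCE_WEIGHTS = {"regulator": 1.0, "ngo_report": 0.9, "news": 0.7, "social": 0.3}

def weighted_score(similarity: float, source_type: str) -> float:
    # Down-weight low-credibility sources before ranking evidence for the LLM.
    return similarity * SOURCE_WEIGHTS.get(source_type, 0.5)
```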
Continuous Monitoring
- Monitor each company's claims over time.
- Alert regulators and journalists when contradictions emerge.
- "Company X claimed net-zero by 2030 last year but just admitted 10% increase in emissions."
Phase 4: Scaling to All Companies
Localization
- Extend beyond European companies to global coverage.
- Support queries in multiple languages.
Custom Verdicts per Stakeholder
- Board members care about risk; investors care about greenwashing patterns; journalists care about exclusive stories.
- The same evidence, different narratives.
Factual Dispute Resolution
- When a company contests a finding, surface both sides with evidence.
- Let readers decide who's more credible.
Closing Reflection
GreenTrace is proof that AI can be grounded, transparent, and accountable.
Over the course of the hackathon, we:
- Integrated Apify (web scraping at scale).
- Mastered Qdrant (vector databases for semantic search).
- Shipped with PydanticAI (structured LLM outputs).
- Deployed a full-stack system from Python backend to Next.js frontend.
The biggest surprise wasn't how hard it was—it was how intuitive the architecture became once we committed to schema-driven design and modular components.
Apify's startup latency is the only real constraint we'll fight in production. But that's a problem to solve with caching and batch work, not with better code.
We're proud of GreenTrace. It's not vaporware; it's a live product that demonstrates that corporate accountability at scale is possible.
And we learned that hackathons aren't about shipping perfect code—they're about learning fast, shipping bold ideas, and trusting your team. We did all three.