A prototype system for transforming real-world enterprise SQL logs into high-quality Text-to-SQL benchmarks using human-in-the-loop annotation workflows.
Enterprise Text-to-SQL addresses the challenge of building realistic, domain-specific Text-to-SQL datasets by combining:
- SQL log mining
- Human-in-the-loop annotation
- LLM-assisted generation and validation
This system was developed as part of the BENCHPRESS project and supports benchmark creation from internal SQL query logs.
- Live Demo: Coming soon
- Video Walkthrough: Coming soon (YouTube)
- Poster Presentation: View the NEDB 2025 Poster (PDF)
- ✅ Upload and parse enterprise SQL logs
- ✅ Auto-cluster similar queries using LLM embeddings
- ✅ Generate natural language annotations with prompt-based LLMs
- ✅ Verify and edit annotations via an easy-to-use UI
- ✅ Export clean Text-to-SQL benchmark datasets
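The auto-clustering feature groups similar queries by embedding similarity. The following is a minimal sketch of that idea, not the project's actual implementation: `toy_embed` is a hypothetical stand-in that uses a bag of SQL tokens instead of LLM embeddings, and the greedy strategy and the `threshold` value are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(sql: str) -> Counter:
    """Hypothetical stand-in for an LLM embedding: a bag of lowercase SQL tokens."""
    return Counter(re.findall(r"[a-z_]+", sql.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_queries(queries, threshold=0.8):
    """Greedy clustering: a query joins the first cluster whose
    representative (its first member) is at least `threshold` similar."""
    clusters = []  # list of (representative_vector, member_queries)
    for q in queries:
        vec = toy_embed(q)
        for rep, members in clusters:
            if cosine(rep, vec) >= threshold:
                members.append(q)
                break
        else:
            clusters.append((vec, [q]))
    return [members for _, members in clusters]


# Made-up log entries for illustration only.
log = [
    "SELECT name FROM customers WHERE id = 1",
    "SELECT name FROM customers WHERE id = 2",
    "SELECT SUM(revenue) FROM sales GROUP BY region",
]
groups = cluster_queries(log)
print(len(groups), "clusters")
```

With real embeddings, only `toy_embed` and `cosine` would need to change; queries that differ only in literal values would still be expected to land in the same cluster.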
git clone https://github.com/fabian-wenz/enterprise-txt2sql.git
cd enterprise-txt2sql
pip install -r requirements.txt
Requirements:
- Python 3.9+
- OpenAI API Key (or compatible LLM provider)
- Optional: pgvector for clustering via vector search
python website/app.py
Then open your browser and go to:
http://localhost:8000
- Project Setup: Create a new annotation project for a specific enterprise workload.
- Data Ingestion: Upload SQL logs and schema files, or select a public benchmark (Bird, FIBEN, Spider, Beaver).
- Task Configuration: Select annotation direction (SQL→NL) and a language model (e.g., GPT-4o, GPT-3.5, DeepSeek).
- (Optional) Decomposition: Split nested SQL into simpler subqueries using CTEs.
- Context Retrieval: Retrieve similar annotated examples and relevant tables using dense embeddings.
- Candidate Generation: The LLM generates four NL candidates using retrieval-augmented few-shot prompting.
- (Optional) Recomposition: Merge subquery descriptions into a single coherent explanation.
- Human Feedback: Annotators rank, edit, or discard LLM outputs.
- Review & Export: Export final annotations for training or evaluation; optionally auto-evaluate if ground truth exists.
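The context retrieval and candidate generation steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the system's code: `toy_embed` replaces dense embeddings with token counts, `retrieve_examples` and `build_prompt` are hypothetical helper names, the prompt wording is invented, and the annotated pairs (apart from the first, which reuses the sample record from this README) are made up. The sketch stops at prompt assembly and does not call an LLM.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for a dense embedding: lowercase token counts."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_examples(sql, annotated, k=2):
    """Return the k annotated (sql, question) pairs most similar to `sql`."""
    target = toy_embed(sql)
    ranked = sorted(
        annotated,
        key=lambda pair: cosine(target, toy_embed(pair[0])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(sql, examples, n_candidates=4):
    """Assemble a few-shot prompt: retrieved pairs first, then the new query."""
    lines = [
        f"Write {n_candidates} alternative natural-language questions "
        "for the final SQL query."
    ]
    for ex_sql, ex_question in examples:
        lines.append(f"SQL: {ex_sql}")
        lines.append(f"Question: {ex_question}")
    lines.append(f"SQL: {sql}")
    lines.append("Questions:")
    return "\n".join(lines)


annotated = [
    ("SELECT customer_name FROM sales ORDER BY revenue DESC LIMIT 10",
     "Show the top 10 customers by revenue."),
    ("SELECT COUNT(*) FROM employees",
     "How many employees are there?"),
    ("SELECT customer_name FROM sales ORDER BY revenue ASC LIMIT 5",
     "Show the 5 customers with the lowest revenue."),
]
target = "SELECT customer_name FROM sales ORDER BY revenue DESC LIMIT 3"
examples = retrieve_examples(target, annotated, k=2)
prompt = build_prompt(target, examples)
print(prompt)
```

The resulting prompt would then be sent to the selected model, and the returned candidates passed to annotators for ranking, editing, or discarding.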
.
├── demo/ # Screenshots and videos for README
├── website/ # Preprocessing, clustering, and evaluation scripts
├── data/ # Sample SQL logs and generated benchmark data
├── templates/ # HTML templates for rendering the website
├── app.py # Main entry point for the UI
├── config.py # Prompts and LLM interaction
├── requirements.txt # Python dependencies
└── README.md # This file
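Exported benchmark records pair a natural-language `question` with its SQL `query`. A minimal sketch for sanity-checking an export before using it for training or evaluation is shown here; it assumes the records are stored as a JSON array, and `validate_records` is a hypothetical helper, not part of this repository.

```python
import json

REQUIRED_KEYS = ("question", "query")


def validate_records(records):
    """Return (index, problem) tuples for records that fail basic checks."""
    problems = []
    for i, rec in enumerate(records):
        for key in REQUIRED_KEYS:
            value = rec.get(key) if isinstance(rec, dict) else None
            # Each required field must be a non-empty string.
            if not isinstance(value, str) or not value.strip():
                problems.append((i, f"missing or empty '{key}'"))
    return problems


# Illustrative input: the second record is deliberately invalid.
raw = """[
  {"question": "Show the top 10 customers by revenue.",
   "query": "SELECT customer_name FROM sales ORDER BY revenue DESC LIMIT 10"},
  {"question": "", "query": "SELECT 1"}
]"""
records = json.loads(raw)
problems = validate_records(records)
print(problems)
```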
{
"question": "Show the top 10 customers by revenue.",
"query": "SELECT customer_name FROM sales ORDER BY revenue DESC LIMIT 10"
}
BENCHPRESS: An Annotation System for Rapid Text-to-SQL Benchmark Curation
Fabian Wenz*, Peter Baile Chen, Moe Kayali, Michael Stonebraker, Cagatay Demiralp
Submitted to CIDR 2026
📄 coming soon on
This project was developed during Fabian Wenz’s time at MIT CSAIL with the support of:
- Prof. Michael Stonebraker
- Dr. Cagatay Demiralp
- Peter Baile Chen
- Dr. Nesime Tatbul
We welcome contributions from the community!
If you encounter bugs, want to request features, or contribute code, please:
- Submit an issue
- Fork the repo and open a pull request
This project is licensed under the MIT License. See LICENSE for more details.

