EducationQ Framework is a comprehensive multi-agent educational framework that evaluates LLMs' teaching capabilities through simulated dynamic educational scenarios. Grounded in pedagogical theories (Zone of Proximal Development, Scaffolding, Informal Formative Assessment, Bloom's Taxonomy), EducationQ enables LLMs to consolidate their foundational abilities (knowledge retrieval, generation, and reasoning) into comprehensive teaching effectiveness.
EducationQ Benchmark is a quantitative and qualitative mixed-method evaluation of LLMs' teaching capabilities based on the EducationQ Framework.
Studying 14 LLMs from major AI organizations (OpenAI, Meta, Google, Anthropic, etc.) through a mixed-methods approach, we observe LLMs' distinct teaching behaviors and strategies, benchmark their capabilities via student learning gains, and reveal the need for pedagogical optimization beyond knowledge scaling.
Our research reveals several important findings:
- Teaching effectiveness does not correlate linearly with model scale or general reasoning capabilities - some smaller open-source models outperform larger commercial counterparts in teaching contexts
- 78% agreement between human expert evaluations and our automated qualitative analysis of effective teaching behaviors
- LLMs-as-Teachers require specialized optimization beyond simple scaling, suggesting next-generation educational AI should prioritize targeted enhancement of pedagogical capabilities
```bibtex
@inproceedings{shi-etal-2025-educationq,
    title = "{E}ducation{Q}: Evaluating {LLM}s' Teaching Capabilities Through Multi-Agent Dialogue Framework",
    author = "Shi, Yao and
      Liang, Rongkeng and
      Xu, Yong",
    editor = "Che, Wanxiang and
      Nabende, Joyce and
      Shutova, Ekaterina and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.1576/",
    doi = "10.18653/v1/2025.acl-long.1576",
    pages = "32799--32828",
    ISBN = "979-8-89176-251-0",
}
```

EducationQ is grounded in established educational theories:
- Zone of Proximal Development (ZPD) (Vygotsky, 1978): defines the ideal learning space where effective teaching must occur
- Scaffolding (Wood, Bruner, & Ross, 1976): temporary, adaptive support that gives hints, not answers
- Strategic Questioning & Bloom's Taxonomy (Bloom, 1956): a measurable framework to assess teaching quality
- Informal Formative Assessment (IFA) (Ruiz-Primo & Furtak, 2007): the engine that drives dynamic, personalized teaching
EducationQ employs three specialized agents:
- Student Agent: Takes tests, reflects on feedback and questions, answers with understanding
- Teacher Agent: Evaluates responses, adjusts strategy, provides feedback, teaches by questioning
- Evaluator Agent: Calculates accuracy, verifies compliance, analyzes effectiveness, measures learning gains
The framework operates through three phases:
- Pre-Test: Student takes initial test; Evaluator calculates baseline accuracy
- Interaction: Multi-turn dialogue where Teacher guides Student through questioning and feedback
- Post-Test: Student retakes test; Evaluator measures learning gains
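The three phases above can be sketched as a minimal driver loop. This is an illustrative sketch only: the `student`, `teacher`, and `evaluator` objects and their methods (`answer`, `ask`, `respond`, `accuracy`) are hypothetical stand-ins, not the framework's actual classes (see `src/run/main.py` for the real pipeline).

```python
def run_session(student, teacher, evaluator, questions, turns=3):
    """Pre-Test -> Interaction -> Post-Test; returns (pre, post) accuracy."""
    # Phase 1: Pre-Test -- establish baseline accuracy
    pre_answers = [student.answer(q) for q in questions]
    acc_pre = evaluator.accuracy(pre_answers, questions)

    # Phase 2: Interaction -- multi-turn dialogue where the teacher
    # guides the student through questioning and feedback
    for q in questions:
        for _ in range(turns):
            prompt = teacher.ask(q)
            student.respond(prompt)

    # Phase 3: Post-Test -- retake the test and measure learning gain
    post_answers = [student.answer(q) for q in questions]
    acc_post = evaluator.accuracy(post_answers, questions)
    return acc_pre, acc_post  # learning gain = acc_post - acc_pre
```

The learning gain reported by the Evaluator is simply the post-test accuracy minus the pre-test accuracy under this layout.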
EducationQ supports multiple benchmark datasets and user-defined datasets:
| Dataset | Questions | Domains | Description |
|---|---|---|---|
| MMLU-Pro | 12,032 | 14 subjects | Enhanced MMLU with 10 options per question |
| GPQA | 448 | Science | Graduate-level science questions |
| AGIEval | Varies | Multiple | Human-centric benchmark tasks |
We also construct MMLU-Pro Stratified + GPQA Diamond, a high-quality, balanced, teaching-oriented testbed for evaluating LLMs' teaching capabilities.
1,498 teaching tasks cover 13 disciplines and 10 difficulty levels, meticulously curated from two elite benchmark data sources.
Pre-filtered stratified subsets are available:
- mmlu_pro_stratified.json: 1,300 stratified questions from MMLU-Pro
- gpqa_diamond.csv: Diamond subset of GPQA
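A local subset like mmlu_pro_stratified.json can be loaded without any network dependency. This is a minimal sketch: the record field `category` is an assumption about the JSON schema, not a documented part of the file format.

```python
import json

def load_local_questions(path, categories=None):
    """Load questions from a local JSON file, optionally filtering by
    category. The `category` field is an assumed schema detail."""
    with open(path, encoding="utf-8") as f:
        questions = json.load(f)
    if categories:
        questions = [q for q in questions if q.get("category") in categories]
    return questions
```

In the framework itself, local loading is configured by pointing DATASET_NAME at the local file instead of a Hugging Face dataset ID.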
Our goal: to construct a robust evaluation dataset that moves beyond simple knowledge recall to assess deep pedagogical interaction, isolating and measuring genuine pedagogical skills.
EducationQ provides both quantitative and qualitative evaluation:
| Metric | Description | Formula |
|---|---|---|
| ALG (Absolute Learning Gain) | Direct improvement in student performance | ACC_post - ACC_pre |
| PNIR (Positive-Negative Impact Ratio) | Consistency of teaching impact: ratio of negatively to positively affected questions | N_neg / N_pos |
| CSS (Cross-Subject Stability) | Standard deviation of learning gains across subjects | σ(SLGPD) |
| UIC (Unique Improvement Count) | Questions where only one specific teacher model achieved improvement | Count(QUI) |
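The quantitative metrics in the table above can be sketched as small helper functions. The function names and the per-teacher data layout here are illustrative assumptions, not the framework's actual API:

```python
from statistics import pstdev

def alg(acc_pre, acc_post):
    """Absolute Learning Gain: ACC_post - ACC_pre."""
    return acc_post - acc_pre

def pnir(n_neg, n_pos):
    """Positive-Negative Impact Ratio: N_neg / N_pos (lower = more consistent)."""
    return n_neg / n_pos

def css(subject_gains):
    """Cross-Subject Stability: std dev of learning gains across subjects."""
    return pstdev(subject_gains)

def uic(improvements_by_teacher, teacher):
    """Unique Improvement Count: questions improved only by `teacher`.
    `improvements_by_teacher` maps teacher name -> set of question IDs."""
    others = [qs for t, qs in improvements_by_teacher.items() if t != teacher]
    union_others = set().union(*others) if others else set()
    return len(improvements_by_teacher[teacher] - union_others)
```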
Holistic Interaction Analysis:
- Assessment Effectiveness
- Questioning Effectiveness
- Feedback Effectiveness
- Instructional Adaptation Effectiveness
- Learning Objective Achievement Effectiveness
Teacher-Centric Question Analysis:
- Question Relevance
- Cognitive Level
- Knowledge Dimension
- Question Diversity
- Scaffolding Progression
- Metacognitive Promotion
Student-Centric Response Analysis:
- Response Relevance
- Cognitive Level Demonstration
- Knowledge Dimension Integration
- Response Diversity
- Elaboration Progression
- Metacognitive Reflection
- Multi-Agent Architecture: Teacher and Student agents powered by different LLMs
- Multiple Datasets: Support for MMLU-Pro, GPQA, and AGIEval datasets
- Flexible Evaluation: Both quantitative (accuracy-based) and qualitative (interaction-based) analysis
- Resume Capability: Can continue from any stage using saved results
- Comprehensive Analysis: Multiple evaluation perspectives (interaction, teacher questions, student responses)
- Local Dataset Support: Load datasets from local JSON files without network dependency
```shell
# Clone the repository
git clone https://github.com/SunriserFuture/EducationQ.git
cd EducationQ/EducationQ_Framework

# Install dependencies
pip install -r requirements.txt
```

Run the complete evaluation pipeline:

```shell
python src/run/main.py
```

Use a custom configuration file:

```shell
python src/run/main.py --config ../data/input/my_config.yaml
```

Load existing pretest results and continue:

```shell
python src/run/main.py --mode load_pretest --input pretest_results.json
```

Load existing interaction results and continue:

```shell
python src/run/main.py --mode load_interaction --input interaction_results.json
```

Run comprehensive evaluation on existing results:

```shell
python src/run/main.py --mode evaluation --posttest posttest.json --csv evaluation_tasks.csv --eval-type comprehensive
```

The framework uses YAML configuration files. See src/data/input/config_template.yaml for a complete example.
```yaml
DATASET_TYPE: "mmlu-pro"            # Options: "gpqa", "mmlu-pro", "agieval"
DATASET_NAME: "TIGER-Lab/MMLU-Pro"  # Or local file: "mmlu_pro_stratified.json"
SELECTED_CATEGORIES: []             # Empty for all categories
SELECTED_QUESTION_ID: []            # Empty for all questions

TEACHER_CONFIGS:
  - name: "Teacher1"
    model: "meta-llama/llama-3.1-70b-instruct"
    api_key: "your_api_key"
    base_url: "your_api_url"
    temperature: 0.0
    max_tokens: 1024
    use_few_shot: false
    recommended_question_token_limit: 150

STUDENT_CONFIGS:
  - name: "Student1"
    model: "meta-llama/llama-3.1-70b-instruct"
    api_key: "your_api_key"
    base_url: "your_api_url"
    temperature: 0.0
    answer_max_tokens: 1024
    test_max_tokens: 2048
    include_pretest_info: true

EVALUATOR_CONFIG:
  name: "Evaluator"
  model: "openai/gpt-4o-mini"
  api_key: "your_openai_api_key"
  base_url: "your_api_url"
  temperature: 0.0
  max_tokens: 4096
```

```shell
python src/run/main.py [OPTIONS]
```

Options:

- `--config PATH`: Configuration YAML file path (default: `../data/input/config_template.yaml`)
- `--mode MODE`: Execution mode:
  - `complete`: Full pipeline (default)
  - `load_pretest`: Load pretest results and continue
  - `load_interaction`: Load interaction results and continue
  - `evaluation`: Run specific evaluation on existing results
- `--input PATH`: Input JSON file for `load_pretest` or `load_interaction` modes
- `--posttest PATH`: Posttest results JSON file for evaluation mode
- `--csv PATH`: CSV file with evaluation tasks for evaluation mode
- `--eval-type TYPE`: Evaluation type for evaluation mode:
  - `interaction`: Analyze conversation process
  - `teacher_questions`: Analyze teacher questions only
  - `student_responses`: Analyze student responses only
  - `comprehensive`: All three analyses (default)
The framework generates several types of output files:

- Pretest Results: `pretest_results_{version}_{timestamp}.json`
- Interaction Results: `pretest_interaction_results_{version}_{timestamp}.json`
- Posttest Results: `pretest_interaction_posttest_results_{version}_{timestamp}.json`
- Evaluation Results: `evaluation_results_{version}_{timestamp}.json`
- Specialized Evaluations:
  - `interaction_evaluation_results_{version}_{timestamp}.json`
  - `teacher_questions_evaluation_results_{version}_{timestamp}.json`
  - `student_responses_evaluation_results_{version}_{timestamp}.json`
  - `comprehensive_evaluation_results_{version}_{timestamp}.json`
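Since results carry `{version}_{timestamp}` suffixes, resuming a run means picking the newest matching file. A minimal sketch, assuming results live under `src/data/output/` (the output directory shown in the project structure):

```python
import glob
import os

def latest(pattern="src/data/output/pretest_results_*.json"):
    """Return the most recently modified file matching `pattern`, or None."""
    files = glob.glob(pattern)
    return max(files, key=os.path.getmtime) if files else None
```

The returned path could then be passed to the CLI, e.g. `python src/run/main.py --mode load_pretest --input <path>`.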
```
EducationQ_Framework/
├── docs/
│   ├── figures/                 # Framework diagrams and result visualizations
│   └── 2025.acl-long.1576.pdf   # ACL 2025 paper
├── src/
│   ├── data/
│   │   ├── input/               # Configuration files
│   │   │   ├── config_template.yaml
│   │   │   ├── config_teacher0shot_gpqa_diamond.yaml
│   │   │   └── config_teacher0shot_mmlupro_stratified.yaml
│   │   ├── dataset/             # Dataset files
│   │   │   ├── gpqa/            # GPQA dataset
│   │   │   ├── AGIEval/         # AGIEval dataset
│   │   │   └── mmlu-pro/        # MMLU-Pro dataset
│   │   └── output/              # Experiment results
│   └── run/
│       └── main.py              # Main entry point
├── README.md
└── requirements.txt
```
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
See docs/contributing.md for detailed guidelines.
MIT License
For questions and support, please contact: educationq@sunriser.org
EducationQ: Evaluating LLMs' Teaching Capabilities Through Multi-Agent Dialogue Framework
ACL 2025 | Paper | Code






