M-MAD is a systematic multi-agent framework for advanced LLM-as-a-judge machine translation (MT) evaluation. It operates in three stages:
- Dimension Partition: Decomposing the heuristic MQM annotation guideline into distinct dimensions for independent LLM-as-a-judge assessments.
- Multi-Agent Debate: Conducting multi-agent debates within each dimension, harnessing LLMs' inherent knowledge, reasoning, and collaborative abilities.
- Final Judgment: Synthesizing the debated outcomes through a final judge agent to produce a comprehensive evaluation judgment.
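The three stages above can be pictured as a simple pipeline. Below is a minimal, hypothetical sketch assuming an OpenAI-style chat API; the model name, prompts, and dimension list are illustrative placeholders, not the repository's actual code or prompts.

```python
# Hypothetical sketch of the M-MAD pipeline (illustrative only).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Stage 1: Dimension Partition -- MQM-style dimensions judged independently.
# (This list is an illustrative assumption, not the paper's exact partition.)
DIMENSIONS = ["accuracy", "fluency", "terminology", "style", "locale"]

def ask(prompt: str) -> str:
    """One LLM call; the model choice is an illustrative assumption."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def debate(src: str, hyp: str, dim: str, rounds: int = 2) -> list[str]:
    """Stage 2: two agents debate the translation within one dimension."""
    history: list[str] = []
    for _ in range(rounds):
        for side in ("affirmative", "negative"):
            history.append(ask(
                f"As the {side} debater, assess the '{dim}' dimension.\n"
                f"Source: {src}\nTranslation: {hyp}\n"
                f"Debate so far: {history}"
            ))
    return history

def final_judgment(src: str, hyp: str, debates: dict[str, list[str]]) -> str:
    """Stage 3: a judge agent synthesizes the per-dimension debates."""
    return ask(
        "Synthesize the per-dimension debates into one overall judgment.\n"
        f"Source: {src}\nTranslation: {hyp}\nDebates: {debates}"
    )

src, hyp = "Der Hund bellt.", "The dog is barking."
print(final_judgment(src, hyp, {d: debate(src, hyp, d) for d in DIMENSIONS}))
```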
For a detailed explanation of the M-MAD framework, please refer to the paper: [Multidimensional Multi-Agent Debate for Advanced Machine Translation Evaluation](https://arxiv.org/abs/2412.20127).
Repository structure:

- `code/`: Code and prompts for all stages.
- `data/`: Input data and output-annotated data. Our input data is sourced from the WMT-23 Metrics Shared Task; you can also download it from https://github.com/google-research/mt-metrics-eval or https://wmt-metrics-task.github.io/.
- `metrics_scores/`: Meta-evaluation results.
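If you fetch the data through mt-metrics-eval, that repository ships a downloader for the WMT database. A sketch, with commands taken from its README (worth checking against the current version):

```bash
pip install git+https://github.com/google-research/mt-metrics-eval.git
python3 -m mt_metrics_eval.mtme --download  # downloads the WMT metrics database
```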
```bash
git clone https://github.com/SU-JIAYUAN/M-MAD.git
cd M-MAD
conda create -n MMMD python=3.10
conda activate MMMD
pip install -r requirements.txt
```

Run Stage 1:

```bash
sh run_stage1.sh
```

Run Stages 2 and 3:

```bash
sh run_stage2_3.sh
```

To run the meta-evaluation for the metrics, open the notebook `wmt23_metrics.ipynb`, which uses the evaluation tool from https://github.com/google-research/mt-metrics-eval.
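For intuition about what the notebook computes, here is a minimal sketch of a system-level correlation using mt-metrics-eval's Python API (`EvalSet`, `Scores`, `Correlation`); the metric name `M-MAD-refA` is a hypothetical placeholder for scores you would supply yourself:

```python
# Sketch of system-level meta-evaluation with mt-metrics-eval.
# API usage follows https://github.com/google-research/mt-metrics-eval;
# 'M-MAD-refA' is a placeholder, not a score file shipped with WMT-23.
from mt_metrics_eval import data

evs = data.EvalSet('wmt23', 'en-de')  # WMT-23 Metrics Shared Task data

# Gold human scores at the system level (e.g., MQM where available).
gold = evs.Scores('sys', evs.StdHumanScoreName('sys'))

# Metric scores to meta-evaluate (hypothetical name).
metric = evs.Scores('sys', 'M-MAD-refA')

# Correlate over systems scored by both, excluding human "systems".
sys_names = (set(gold) & set(metric)) - evs.human_sys_names
corr = evs.Correlation(gold, metric, sys_names)
print(f'System-level Pearson: {corr.Pearson()[0]:.3f}')
```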
Citation:

```bibtex
@article{feng2024mmad,
title={M-MAD: Multidimensional Multi-Agent Debate Framework for Fine-grained Machine Translation Evaluation},
author={Feng, Zhaopeng and Su, Jiayuan and Zheng, Jiamei and Ren, Jiahan and Zhang, Yan and Wu, Jian and Wang, Hongwei and Liu, Zuozhu},
journal={arXiv preprint arXiv:2412.20127},
year={2024}
}
```
