AL-Bench includes a high-quality dataset and a novel dynamic evaluation method focused on runtime logs, addressing key limitations of prior studies and bridging the gap between real-world requirements and existing evaluation frameworks.
.
├── Static_Evaluation/              # Scripts and results for static evaluation
│   ├── eval/                       # Evaluation scripts for each logging tool
│   └── data/                       # Evaluation result data
└── Dynamic_Evaluation/             # Scripts and results for dynamic evaluation
    ├── dynamic_evaluation/         # Core scripts for dynamic evaluation
    └── init_dynamic_evaluation/    # Dataset construction scripts
The complete evaluation dataset can be accessed at: https://drive.google.com/drive/u/1/folders/1eoK7SaYTuwqcAe9T3ddjeU5oGLRDX2Ps
Static evaluation focuses on the following aspects:
- Log Level Accuracy (LA)
- Log Position Accuracy (PA)
- Log Message Accuracy (MA)
- Average Level Distance (ALD)
- Dynamic Expression Accuracy (DEA)
- Static Text BLEU/ROUGE Score (STS)
Figure 2: Static evaluation process and metrics calculation
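As a rough illustration of how two of the level metrics above could be computed, the following minimal sketch derives LA and ALD from paired lists of predicted and reference log levels. The level ordering in `LEVEL_RANK` and both function names are assumptions made for this example; they are not taken from the AL-Bench scripts, which may define the metrics differently.

```python
# Assumed ordinal ranking of log levels (hypothetical; AL-Bench may use another).
LEVEL_RANK = {"trace": 0, "debug": 1, "info": 2, "warn": 3, "error": 4, "fatal": 5}


def level_accuracy(predicted, reference):
    """Fraction of predictions whose level exactly matches the reference."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def average_level_distance(predicted, reference):
    """Mean absolute gap between predicted and reference level ranks."""
    gaps = [abs(LEVEL_RANK[p] - LEVEL_RANK[r]) for p, r in zip(predicted, reference)]
    return sum(gaps) / len(gaps)


predicted = ["info", "warn", "debug", "error"]
reference = ["info", "error", "info", "error"]
print(level_accuracy(predicted, reference))          # two of four levels match
print(average_level_distance(predicted, reference))  # the two misses are one rank off
```

Under this reading, LA rewards only exact matches, while ALD shows how far off the mismatches are, so a tool that predicts `warn` for `error` is penalized less than one that predicts `trace`.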
Dynamic evaluation assesses the performance of logging tools in actual runtime environments:
- Compilation Success Rate
- Log Similarity
Figure 3: Dynamic evaluation process
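The two dynamic metrics can be pictured with a small sketch. Here the compilation success rate is the share of predictions that compiled, and log similarity is approximated with a line-level `difflib.SequenceMatcher` ratio. That similarity measure is a stand-in chosen for this example, not necessarily the one AL-Bench uses, and both function names are hypothetical.

```python
import difflib


def compilation_success_rate(compiled_flags):
    """Share of predictions that compiled (True) out of all attempts."""
    return sum(compiled_flags) / len(compiled_flags)


def log_similarity(predicted_log, reference_log):
    """Line-level similarity in [0, 1]; a stand-in for the benchmark's own measure."""
    matcher = difflib.SequenceMatcher(
        None, predicted_log.splitlines(), reference_log.splitlines()
    )
    return matcher.ratio()


reference_log = "INFO starting\nINFO loading block\nINFO done"
predicted_log = "INFO starting\nDEBUG loading block\nINFO done"

print(compilation_success_rate([True, True, False, True]))
print(log_similarity(predicted_log, reference_log))
```

Comparing whole lines keeps the sketch simple; a token-level or embedding-based comparison would be more forgiving of small wording differences in the runtime logs.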
- Enter the Static_Evaluation directory:
    cd Static_Evaluation
- Run the evaluation script:
    python eval/[tool_name]/run_eval.py
We strongly recommend using Docker to run the dynamic evaluation.
- Pull the Docker image:
    docker pull boyintan/al-bench:hadoop-build
- Run the Docker container:
    docker run -it -v $(pwd):/home/al-bench boyintan/al-bench:hadoop-build /bin/bash
- Enter the Dynamic_Evaluation directory:
    cd Dynamic_Evaluation
- Run the evaluation script:
    python Dynamic_Evaluation/get_logs_output/execute_unittest.py --execute_id [execute_id] --results_dir [results_dir] --json_path [json_path] --use_catch_point [use_catch_point] --record_error [record_error] --num_thread [num_thread]
Note:
Prepare the data for dynamic evaluation in the following format:
[{
    "uuid": "uuid",
    "prediction": "prediction",
    "predicted_log_statement": {
        "log_statement": "log_statement",
        "log_position": "log_position"
    }
}]
"prediction" should be the code in standard format, with '\n' as the line break. "predicted_log_statement" should be the log statement in the code. "log_position" should be the line number of the log statement in the code.
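A minimal sketch of producing one entry in this format is shown below. The helper `make_entry`, the sample identifier, and the Java snippet are hypothetical. Treating `log_position` as a 1-based integer line number is also an assumption, so check the evaluation scripts for the exact type and indexing they expect.

```python
import json


def make_entry(sample_uuid, code_lines, log_statement):
    """Build one entry; log_position is assumed to be a 1-based line number."""
    position = next(
        i for i, line in enumerate(code_lines, start=1) if log_statement in line
    )
    return {
        "uuid": sample_uuid,
        # Join the code lines with newline characters, as the format requires.
        "prediction": "\n".join(code_lines),
        "predicted_log_statement": {
            "log_statement": log_statement,
            "log_position": position,
        },
    }


code_lines = [
    "public void start() {",
    '    LOG.info("starting");',
    "}",
]
entry = make_entry("sample-0", code_lines, 'LOG.info("starting");')

# The evaluation input is a JSON list of such entries.
payload = json.dumps([entry], indent=2)
print(payload)
```

Writing `payload` to a file gives something that can be passed through the `--json_path` argument, provided the `uuid` values match those in the dataset.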
- FastLog
- UniLog
- LANCE
- LEONID
If you use AL-Bench in your research, please cite our paper:
@misc{tan2025albenchbenchmarkautomaticlogging,
    title={AL-Bench: A Benchmark for Automatic Logging},
    author={Boyin Tan and Junjielong Xu and Zhouruixing Zhu and Pinjia He},
    year={2025},
    eprint={2502.03160},
    archivePrefix={arXiv},
    primaryClass={cs.SE},
    url={https://arxiv.org/abs/2502.03160},
}
This project is licensed under the MIT License. See the LICENSE file for details.




