PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems
This repository contains the code used to run the experiments in the PhysicsEval paper. In the paper, we explore various inference-time techniques for improving the performance of LLMs on physics problems and evaluate the results.
To run our problem-solving pipeline, go to the BASE SOLUTION directory.
To evaluate the solutions generated by the pipeline, go to the EVALUATIONS directory.
To enable large-scale evaluation and training of reasoning-capable language models in physics, we curated a comprehensive dataset of 19,609 annotated problems, sourced from 20 authoritative physics textbooks and verified educational websites.
The dataset spans 19 different categories, including Mechanics, Thermodynamics, Electromagnetism, Waves, Optics, Relativity, and Quantum Physics.
It is available at https://huggingface.co/datasets/IUTVanguard/PhysicsEval
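A minimal sketch of loading the dataset with the Hugging Face `datasets` library, assuming the library is installed (`pip install datasets`) and that the repository loads with its default configuration. The helper name `load_physicseval` is hypothetical, and no split names are assumed: the function returns whatever `load_dataset` provides.

```python
DATASET_ID = "IUTVanguard/PhysicsEval"  # repository ID taken from the URL above


def load_physicseval():
    """Download the dataset from the Hugging Face Hub (needs network access)."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the default configuration is the one you want.
    return load_dataset(DATASET_ID)
```

Calling `load_physicseval()` and printing the result should list the available splits and their columns.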
Each problem is processed through the following pipeline:
- Data Cleaning: Raw content is cleaned to remove noise and inconsistencies.
- LaTeX Annotation: All equations are converted into LaTeX for structured mathematical representation.
- Step-Wise Elaboration: Using Gemini 2.5 Pro in “Think” mode, solutions are decomposed into logically coherent steps to enhance interpretability for LLMs.
- Metadata Tagging: Each problem is annotated with topic category, difficulty level, and key physical principles.
- Train-Test Split: We apply a 90:10 split, resulting in 17,647 training and 1,962 test samples, supporting generalization across diverse reasoning tasks.
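The split sizes can be checked with a short sketch. The procedure and seed actually used to build the split are not stated here, so the seeded shuffle and the helper name `split_indices` below are assumptions for illustration only; the sketch will not reproduce the published split membership.

```python
import random

TOTAL = 19_609  # problems in the dataset
N_TEST = 1_962  # test samples reported above


def split_indices(n_total, n_test, seed=0):
    """Shuffle indices 0..n_total-1 and hold out the first n_test as the test set."""
    if not 0 <= n_test <= n_total:
        raise ValueError("n_test must be between 0 and n_total")
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    test = sorted(indices[:n_test])
    train = sorted(indices[n_test:])
    return train, test


train_idx, test_idx = split_indices(TOTAL, N_TEST)
print(len(train_idx), len(test_idx))   # 17647 1962
print(f"{len(test_idx) / TOTAL:.1%}")  # 10.0%
```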
Each problem is assigned a difficulty score from 1 to 10. The number of steps in the elaborated solution is stored, and in some cases, alternative solution methods are also suggested.
Each entry in the dataset includes the following fields:
- Problem_ID: Unique identifier for the problem instance
- problem: Original, full problem text from source material
- simplified_problem_statement: Paraphrased version, stripped of complexity
- category: Topical category (e.g., Mechanics, Optics)
- soft_labels: Tags like numerical, conceptual, multi-step, diagram
- elaborated_solution_steps: Step-by-step reasoning to the correct answer
- alternative_solutions: Different valid solution methods
- problem_difficulty: Difficulty rating (1–10)
- final_answers_in_brief: Final answer(s) only, no reasoning
- steps: Number of steps in main solution
- source: The source of the problem
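A minimal sketch of a sanity check for one entry, using the field names listed above. The stored types of the fields are not specified here, so the check only tests that each field is present and that the difficulty parses as a number in the range 1 to 10. The helper `validate_entry` and the example entry are hypothetical and not taken from the dataset.

```python
REQUIRED_FIELDS = (
    "Problem_ID",
    "problem",
    "simplified_problem_statement",
    "category",
    "soft_labels",
    "elaborated_solution_steps",
    "alternative_solutions",
    "problem_difficulty",
    "final_answers_in_brief",
    "steps",
    "source",
)


def validate_entry(entry):
    """Return a list of issues found in one entry; an empty list means it passed."""
    issues = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    difficulty = entry.get("problem_difficulty")
    if difficulty is not None:
        try:
            value = float(difficulty)  # accept ints, floats, or numeric strings
        except (TypeError, ValueError):
            issues.append("problem_difficulty is not numeric")
        else:
            if not 1 <= value <= 10:
                issues.append("problem_difficulty outside 1-10")
    return issues


# Made-up placeholder entry, for illustration only.
example = {name: "" for name in REQUIRED_FIELDS}
example["problem_difficulty"] = 7
print(validate_entry(example))  # []
```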
If you find this work useful, please cite our paper:
@inproceedings{siddique-etal-2025-physicseval,
title = "{P}hysics{E}val: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems",
author = "Siddique, Oshayer and
Alam, J. M Areeb Uzair and
Rafy, Md Jobayer Rahman and
Raiyan, Syed Rifat and
Mahmud, Hasan and
Hasan, Md Kamrul",
booktitle = "Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics",
month = dec,
year = "2025",
address = "Mumbai, India",
publisher = "The Asian Federation of Natural Language Processing and The Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-ijcnlp.43/",
pages = "738--760",
ISBN = "979-8-89176-303-6",
abstract = "The discipline of physics stands as a cornerstone of human intellect, driving the evolution of technology and deepening our understanding of the fundamental principles of the cosmos. Contemporary literature includes some works centered on the task of solving physics problems{---}a crucial domain of natural language reasoning. In this paper, we evaluate the performance of frontier LLMs in solving physics problems, both mathematical and descriptive. We also employ a plethora of inference-time techniques and agentic frameworks to improve the performance of the models. This includes the verification of proposed solutions in a cumulative fashion by other, smaller LLM agents, and we perform a comparative analysis of the performance that the techniques entail. There are significant improvements when the multi-agent framework is applied to problems that the models initially perform poorly on. Furthermore, we introduce a new evaluation benchmark for physics problems, PhysicsEval, consisting of 19,609 problems sourced from various physics textbooks and their corresponding correct solutions scraped from physics forums and educational websites. Our code and data are publicly available at https://github.com/areebuzair/PhysicsEval."
}

@article{siddique2025physicseval,
title={Physicseval: Inference-time techniques to improve the reasoning proficiency of large language models on physics problems},
author={Siddique, Oshayer and Alam, JM and Rafy, Md Jobayer Rahman and Raiyan, Syed Rifat and Mahmud, Hasan and Hasan, Md Kamrul},
journal={arXiv preprint arXiv:2508.00079},
year={2025}
}