PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems

This repository contains the code used to run the experiments in the PhysicsEval paper. In the paper, we explore various inference-time techniques for improving the performance of LLMs on physics problems and evaluate the results.

To run our problem-solving pipeline, go to the BASE SOLUTION directory.

To evaluate the solutions generated by our pipeline, go to the EVALUATIONS directory.

PhysicsEval Dataset

To enable large-scale evaluation and training of reasoning-capable language models in physics, we curated a comprehensive dataset of 19,609 annotated problems, sourced from 20 authoritative physics textbooks and verified educational websites.

The dataset spans 19 different categories, including Mechanics, Thermodynamics, Electromagnetism, Waves, Optics, Relativity, and Quantum Physics.

It is available at https://huggingface.co/datasets/IUTVanguard/PhysicsEval
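
As a minimal sketch, the dataset can probably be pulled from the Hugging Face Hub with the `datasets` library. The repository ID is taken from the URL above; the helper names `load_physicseval` and `describe` are hypothetical, and the split names returned by the Hub are not assumed here.

```python
from typing import Dict

# Repository ID taken from the dataset URL above.
DATASET_ID = "IUTVanguard/PhysicsEval"


def load_physicseval():
    """Download the dataset from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the rest of this file works without `datasets` installed.
    from datasets import load_dataset

    # With no `split` argument, load_dataset returns a mapping of split name -> Dataset.
    return load_dataset(DATASET_ID)


def describe(dataset_dict) -> Dict[str, int]:
    """Map each split name to its number of rows (hypothetical helper)."""
    return {name: len(split) for name, split in dataset_dict.items()}


if __name__ == "__main__":
    data = load_physicseval()
    print(describe(data))
```

Running the file directly requires `pip install datasets` and an internet connection; importing it does not trigger a download.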

Construction

Each problem is processed through the following pipeline:

Data Cleaning: Raw content is cleaned to remove noise and inconsistencies.
LaTeX Annotation: All equations are converted into LaTeX for structured mathematical representation.
Step-Wise Elaboration: Using Gemini 2.5 Pro in “Think” mode, solutions are decomposed into logically coherent steps to enhance interpretability for LLMs.
Metadata Tagging: Each problem is annotated with topic category, difficulty level, and key physical principles.

Train-Test Split: We apply a 90:10 split, resulting in 17,647 training and 1,962 test samples, supporting generalization across diverse reasoning tasks.
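
The split sizes above can be sanity-checked with a few lines of arithmetic. This is only a consistency check on the reported numbers, not the procedure used to create the split; the function name `split_fractions` is hypothetical.

```python
# Counts as reported in this README.
TOTAL = 19_609
TRAIN = 17_647
TEST = 1_962


def split_fractions(train: int, test: int) -> tuple:
    """Return the (train, test) shares of the combined sample count."""
    total = train + test
    return train / total, test / total


# The two splits should account for every problem in the dataset.
assert TRAIN + TEST == TOTAL

train_frac, test_frac = split_fractions(TRAIN, TEST)
print(f"train share: {train_frac:.4f}, test share: {test_frac:.4f}")
```

The shares come out very close to, but not exactly, 0.9 and 0.1, which is expected because 19,609 is not divisible by 10.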

Each problem is assigned a difficulty score from 1 to 10. The number of steps in the elaborated solution is stored, and in some cases, alternative solution methods are also suggested.

Data Model

Each entry in the dataset includes the following fields:

Problem_ID: Unique identifier for the problem instance
problem: Original, full problem text from source material
simplified_problem_statement: Paraphrased version, stripped of complexity
category: Topical category (e.g., Mechanics, Optics)
soft_labels: Tags like numerical, conceptual, multi-step, diagram
elaborated_solution_steps: Step-by-step reasoning to the correct answer
alternative_solutions: Different valid solution methods
problem_difficulty: Difficulty rating (1–10)
final_answers_in_brief: Final answer(s) only, no reasoning
steps: Number of steps in main solution
source: The source of the problem
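
The field list above can be turned into a lightweight record check. This is a sketch under assumptions: the field names are copied from the list, but the value types are not specified in this README, so the code only checks for presence and treats `problem_difficulty` as something convertible to a number. The helper names `missing_fields` and `difficulty_in_range` are hypothetical.

```python
from typing import Any, List, Mapping

# Field names copied from the Data Model list; value types are not documented there.
FIELDS = (
    "Problem_ID",
    "problem",
    "simplified_problem_statement",
    "category",
    "soft_labels",
    "elaborated_solution_steps",
    "alternative_solutions",
    "problem_difficulty",
    "final_answers_in_brief",
    "steps",
    "source",
)


def missing_fields(row: Mapping[str, Any]) -> List[str]:
    """Return the documented field names that are absent from a record."""
    return [name for name in FIELDS if name not in row]


def difficulty_in_range(row: Mapping[str, Any]) -> bool:
    """Check that problem_difficulty lies in the documented 1-10 range.

    Assumption: the value is a number or a numeric string.
    """
    try:
        value = float(row["problem_difficulty"])
    except (KeyError, TypeError, ValueError):
        return False
    return 1 <= value <= 10


# Usage with a made-up placeholder record (not real dataset content).
example = {name: "" for name in FIELDS}
example["problem_difficulty"] = 7
print(missing_fields(example), difficulty_in_range(example))
```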

Citation

If you find this work useful, please cite our paper:

@inproceedings{siddique-etal-2025-physicseval,
    title = "{P}hysics{E}val: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems",
    author = "Siddique, Oshayer  and
      Alam, J. M Areeb Uzair  and
      Rafy, Md Jobayer Rahman  and
      Raiyan, Syed Rifat  and
      Mahmud, Hasan  and
      Hasan, Md Kamrul",
    booktitle = "Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics",
    month = dec,
    year = "2025",
    address = "Mumbai, India",
    publisher = "The Asian Federation of Natural Language Processing and The Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-ijcnlp.43/",
    pages = "738--760",
    ISBN = "979-8-89176-303-6",
    abstract = "The discipline of physics stands as a cornerstone of human intellect, driving the evolution of technology and deepening our understanding of the fundamental principles of the cosmos. Contemporary literature includes some works centered on the task of solving physics problems{---}a crucial domain of natural language reasoning. In this paper, we evaluate the performance of frontier LLMs in solving physics problems, both mathematical and descriptive. We also employ a plethora of inference-time techniques and agentic frameworks to improve the performance of the models. This includes the verification of proposed solutions in a cumulative fashion by other, smaller LLM agents, and we perform a comparative analysis of the performance that the techniques entail. There are significant improvements when the multi-agent framework is applied to problems that the models initially perform poorly on. Furthermore, we introduce a new evaluation benchmark for physics problems, PhysicsEval, consisting of 19,609 problems sourced from various physics textbooks and their corresponding correct solutions scraped from physics forums and educational websites. Our code and data are publicly available at https://github.com/areebuzair/PhysicsEval."
}

@article{siddique2025physicseval,
  title={Physicseval: Inference-time techniques to improve the reasoning proficiency of large language models on physics problems},
  author={Siddique, Oshayer and Alam, JM and Rafy, Md Jobayer Rahman and Raiyan, Syed Rifat and Mahmud, Hasan and Hasan, Md Kamrul},
  journal={arXiv preprint arXiv:2508.00079},
  year={2025}
}
