PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems


This repository contains the code used to run the experiments in the PhysicsEval paper. In the paper, we explore several inference-time techniques for improving the performance of LLMs on physics problems and evaluate how well each technique works.

To run our Problem Solving Pipeline, go to the BASE SOLUTION directory.

To evaluate the solutions generated by our pipeline, go to the EVALUATIONS directory.

PhysicsEval Dataset

To enable large-scale evaluation and training of reasoning-capable language models in physics, we curated a comprehensive dataset of 19,609 annotated problems, sourced from 20 authoritative physics textbooks and verified educational websites.

The dataset spans 19 different categories, including Mechanics, Thermodynamics, Electromagnetism, Waves, Optics, Relativity, and Quantum Physics.

It is available at https://huggingface.co/datasets/IUTVanguard/PhysicsEval
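
A minimal sketch of loading the dataset from the Hugging Face Hub, assuming the `datasets` library is installed. The helper name `load_physicseval` is hypothetical, and the split names "train" and "test" are an assumption based on the train-test split described in this README; check the dataset page for the actual configuration.

```python
# Repository ID taken from the dataset URL above.
DATASET_ID = "IUTVanguard/PhysicsEval"


def load_physicseval(split="test"):
    """Download one split of PhysicsEval (hypothetical helper).

    The split names "train" and "test" are assumed, not confirmed here.
    """
    # Imported lazily so this module can be imported without `datasets`.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split)


if __name__ == "__main__":
    test_set = load_physicseval("test")
    print(test_set)
```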

Construction

Each problem is processed through the following pipeline:

  • Data Cleaning: Raw content is cleaned to remove noise and inconsistencies.
  • LaTeX Annotation: All equations are converted into LaTeX for structured mathematical representation.
  • Step-Wise Elaboration: Using Gemini 2.5 Pro in “Think” mode, solutions are decomposed into logically coherent steps to enhance interpretability for LLMs.
  • Metadata Tagging: Each problem is annotated with topic category, difficulty level, and key physical principles.

Train-Test Split: We apply a 90:10 split, resulting in 17,647 training and 1,962 test samples, supporting generalization across diverse reasoning tasks.
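
As a quick sanity check on the numbers above, the following sketch confirms that the two split sizes add up to the full dataset and computes the resulting fractions. The function name `split_fractions` is ours and is not part of the repository.

```python
# Counts as stated in this README.
TOTAL = 19_609
TRAIN = 17_647
TEST = 1_962


def split_fractions(train, test):
    """Return the (train, test) fractions of the combined sample count."""
    total = train + test
    return train / total, test / total


train_frac, test_frac = split_fractions(TRAIN, TEST)

# The two splits should cover the whole dataset with no overlap in count.
assert TRAIN + TEST == TOTAL

print(f"train fraction: {train_frac:.4f}")
print(f"test fraction:  {test_frac:.4f}")
```

The fractions come out very close to, but not exactly, 0.9 and 0.1, which is consistent with a 90:10 split over a total that is not divisible by 10.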

Each problem is assigned a difficulty score from 1 to 10. The number of steps in the elaborated solution is stored, and in some cases, alternative solution methods are also suggested.


Data Model

Each entry in the dataset includes the following fields:

  • Problem_ID: Unique identifier for the problem instance
  • problem: Original, full problem text from source material
  • simplified_problem_statement: Paraphrased version, stripped of complexity
  • category: Topical category (e.g., Mechanics, Optics)
  • soft_labels: Tags like numerical, conceptual, multi-step, diagram
  • elaborated_solution_steps: Step-by-step reasoning to the correct answer
  • alternative_solutions: Different valid solution methods
  • problem_difficulty: Difficulty rating (1–10)
  • final_answers_in_brief: Final answer(s) only, no reasoning
  • steps: Number of steps in main solution
  • source: The source of the problem
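
The fields above can be sketched as a typed record, together with a small filter over category and difficulty. The Python types are assumptions (for example, that `soft_labels` is a list of strings and `problem_difficulty` is an integer), the names `PhysicsEvalEntry` and `select_entries` are hypothetical, and the two sample records are invented for illustration and do not come from the dataset.

```python
from typing import List, Optional, TypedDict


class PhysicsEvalEntry(TypedDict, total=False):
    # Field names follow the README; the types are assumed.
    Problem_ID: str
    problem: str
    simplified_problem_statement: str
    category: str
    soft_labels: List[str]
    elaborated_solution_steps: str
    alternative_solutions: str
    problem_difficulty: int
    final_answers_in_brief: str
    steps: int
    source: str


def select_entries(
    entries: List[PhysicsEvalEntry],
    category: Optional[str] = None,
    min_difficulty: int = 1,
    max_difficulty: int = 10,
) -> List[PhysicsEvalEntry]:
    """Keep entries in the given category and difficulty range (inclusive)."""
    return [
        e
        for e in entries
        if (category is None or e["category"] == category)
        and min_difficulty <= e["problem_difficulty"] <= max_difficulty
    ]


# Invented sample records, for illustration only.
samples: List[PhysicsEvalEntry] = [
    {"Problem_ID": "demo-1", "category": "Mechanics", "problem_difficulty": 3},
    {"Problem_ID": "demo-2", "category": "Optics", "problem_difficulty": 8},
]

hard = select_entries(samples, min_difficulty=7)
mechanics = select_entries(samples, category="Mechanics")
print([e["Problem_ID"] for e in hard])
print([e["Problem_ID"] for e in mechanics])
```

The same predicate could be passed to a dataset library's own filtering method; a plain list is used here to keep the sketch free of external dependencies.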

Citation

If you find this work useful, please cite our paper:

@inproceedings{siddique-etal-2025-physicseval,
    title = "{P}hysics{E}val: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems",
    author = "Siddique, Oshayer  and
      Alam, J. M Areeb Uzair  and
      Rafy, Md Jobayer Rahman  and
      Raiyan, Syed Rifat  and
      Mahmud, Hasan  and
      Hasan, Md Kamrul",
    booktitle = "Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics",
    month = dec,
    year = "2025",
    address = "Mumbai, India",
    publisher = "The Asian Federation of Natural Language Processing and The Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-ijcnlp.43/",
    pages = "738--760",
    ISBN = "979-8-89176-303-6",
    abstract = "The discipline of physics stands as a cornerstone of human intellect, driving the evolution of technology and deepening our understanding of the fundamental principles of the cosmos. Contemporary literature includes some works centered on the task of solving physics problems{---}a crucial domain of natural language reasoning. In this paper, we evaluate the performance of frontier LLMs in solving physics problems, both mathematical and descriptive. We also employ a plethora of inference-time techniques and agentic frameworks to improve the performance of the models. This includes the verification of proposed solutions in a cumulative fashion by other, smaller LLM agents, and we perform a comparative analysis of the performance that the techniques entail. There are significant improvements when the multi-agent framework is applied to problems that the models initially perform poorly on. Furthermore, we introduce a new evaluation benchmark for physics problems, PhysicsEval, consisting of 19,609 problems sourced from various physics textbooks and their corresponding correct solutions scraped from physics forums and educational websites. Our code and data are publicly available at https://github.com/areebuzair/PhysicsEval."
}
@article{siddique2025physicseval,
  title={Physicseval: Inference-time techniques to improve the reasoning proficiency of large language models on physics problems},
  author={Siddique, Oshayer and Alam, JM and Rafy, Md Jobayer Rahman and Raiyan, Syed Rifat and Mahmud, Hasan and Hasan, Md Kamrul},
  journal={arXiv preprint arXiv:2508.00079},
  year={2025}
}
