Abstract
This paper explores the potential of Large Language Models (LLMs), specifically GPT-4, to enhance the precision and effectiveness of Automated Assessment Systems (AAS) for open-ended mathematics problems. While LLMs have demonstrated transformative capabilities across various disciplines, their application in AAS, particularly for mathematical logic and open-ended problem-solving, remains underexplored. Our research addresses this gap by developing and critically evaluating a GPT-4-based AAS. We analyzed 4,180 responses to open-ended mathematics questions from 380 sixth-grade primary school students. Three human experts and the GPT-4 model independently assessed these responses using a pre-established rubric. Our findings reveal high consistency between human and GPT-4 assessments in most instances, highlighting the potential of integrating GPT-4 into AAS. We categorized scoring discrepancies between GPT-4 and human raters by error type and identified specific mathematical content areas where automated assessment faced limitations. We evaluated two strategies for enhancing GPT-4's assessment capabilities: (1) using elaborate prompts and (2) applying advanced prompt engineering techniques such as Chain-of-Thought, Self-Consistency, and Tree-of-Thought. While comprehensive prompts significantly improved assessment quality, directly applying the advanced prompt engineering techniques produced suboptimal results, indicating a need for further refinement. This study contributes to the emerging body of research evaluating GPT-4 in the context of AAS for open-ended mathematics problems, shedding light on both the strengths and limitations of this approach. Our findings provide valuable insights and a foundation for future research on refining the integration of LLMs into AAS, particularly in mathematics education.
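The Self-Consistency strategy mentioned above can be sketched as follows: sample several independent chain-of-thought gradings at nonzero temperature and take the majority-vote score. This is an illustrative sketch only, not the paper's implementation; the `sample_score` callable is a hypothetical stand-in for a single GPT-4 grading call.

```python
from collections import Counter

def self_consistent_score(sample_score, n_samples=5):
    """Majority-vote aggregation over repeated gradings (self-consistency).

    sample_score: zero-argument callable returning one rubric score,
    e.g. a single chain-of-thought GPT-4 grading at temperature > 0
    (hypothetical placeholder; any stochastic grader works).
    """
    scores = [sample_score() for _ in range(n_samples)]
    # most_common(1) returns [(score, count)] for the modal score
    return Counter(scores).most_common(1)[0][0]
```

With five samples, an occasional reasoning slip in one grading is outvoted by the other four, which is why self-consistency is expected to stabilize stochastic scoring.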

Data Availability
The data used in this study were collected after obtaining consent from the participants. The collected data were cleansed of any personally identifiable information before being used for research purposes. However, separate consent was not obtained for public disclosure of the data. Therefore, it is difficult to disclose the data, even in response to a specific request relevant to the research objectives.
Ethics declarations
Conflict of Interest
None.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1. Evaluation Rubric of Each Question by Knowledge Elements
Appendix 2
Appendix 3
Appendix 4. Detailed Analysis
Comparison after modifying the prompt: the graphs show reliability values for human evaluations and GPTs before (left) and after (right) modifying the prompt for question 1_1. The x-axis represents each human rater, and the y-axis shows the reliability value. Circles are point estimates; vertical lines are 95% confidence intervals. (a) and (b) are ICC values (top); (c) and (d) are Krippendorff's alpha values (bottom)
Comparison after modifying the prompt: the graphs show reliability values for human evaluations and GPTs before (left) and after (right) modifying the prompt for question 2_3. The x-axis represents each human rater, and the y-axis shows the reliability value. Circles are point estimates; vertical lines are 95% confidence intervals. (a) and (b) are ICC values (top); (c) and (d) are Krippendorff's alpha values (bottom)
Comparison after modifying the prompt: the graphs show reliability values for human evaluations and GPTs before (left) and after (right) modifying the prompt for question 4_1. The x-axis represents each human rater, and the y-axis shows the reliability value. Circles are point estimates; vertical lines are 95% confidence intervals. (a) and (b) are ICC values (top); (c) and (d) are Krippendorff's alpha values (bottom)
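For reference, the Krippendorff's alpha reported in the panels above can be computed for interval-scale rubric scores with the standard formula (alpha = 1 − observed disagreement / expected disagreement). The sketch below assumes interval scores and is a generic illustration, not the paper's analysis code; names are illustrative. The ICC values in panels (a)–(b) would come from a separate mixed-effects formulation available in standard statistical packages.

```python
from itertools import combinations

def krippendorff_alpha_interval(ratings):
    """Krippendorff's alpha for interval data.

    ratings: one list of scores per unit (e.g. per student response);
    use None for a missing rating. Units with fewer than two ratings
    carry no pairable information and are dropped.
    """
    units = [[v for v in u if v is not None] for u in ratings]
    units = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in units)  # total number of pairable values
    if n <= 1:
        return 1.0
    # observed disagreement: squared differences within each unit,
    # each unit's pair sum normalized by (m_u - 1)
    within = sum(
        sum((a - b) ** 2 for a, b in combinations(u, 2)) / (len(u) - 1)
        for u in units
    )
    # expected disagreement: squared differences across all values
    values = [v for u in units for v in u]
    total = sum((a - b) ** 2 for a, b in combinations(values, 2))
    if total == 0:  # no variation at all => perfect agreement
        return 1.0
    return 1.0 - (n - 1) * within / total
```

Perfect agreement between raters yields alpha = 1; disagreements pull the value toward (and below) zero relative to chance-level disagreement.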
About this article
Cite this article
Lee, U., Kim, Y., Lee, S. et al. Can we Use GPT-4 as a Mathematics Evaluator in Education?: Exploring the Efficacy and Limitation of LLM-based Automatic Assessment System for Open-ended Mathematics Question. Int J Artif Intell Educ 35, 1560–1596 (2025). https://doi.org/10.1007/s40593-024-00448-4


